CH 23: Parallel and Distributed Transaction Processing
Local transactions
• Access/update data at only one database
Global transactions
• Access/update data at more than one database
Key issue: how to ensure ACID properties for transactions in a system
with global transactions spanning multiple databases
Distributed Transactions
System Failure Modes
Commit Protocols
Two-Phase Commit Protocol (2PC)
Execution of the protocol is initiated by the coordinator after the last step of
the transaction has been reached.
The protocol involves all the local sites at which the transaction executed
Protocol has two phases
Let T be a transaction initiated at site Si, and let the transaction
coordinator at Si be Ci
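As a grounding aid, a minimal single-process sketch of the two phases, with participants simulated as in-process objects; the Participant class, the force-logged log list, and the voting details are illustrative assumptions, not the textbook's code.

class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit

    def prepare(self, txn):
        # a real participant force-logs <ready T> before voting ready
        return "ready" if self.will_commit else "abort"

    def finish(self, txn, decision):
        print(f"{self.name}: {decision} {txn}")

def two_phase_commit(txn, participants, log):
    log.append(("prepare", txn))                     # Ci force-logs before Phase 1
    votes = [p.prepare(txn) for p in participants]   # Phase 1: collect votes
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    log.append((decision, txn))                      # decision is final once logged
    for p in participants:                           # Phase 2: record the decision
        p.finish(txn, decision)
    return decision

log = []
sites = [Participant("site1"), Participant("site2"),
         Participant("site3", will_commit=False)]
print(two_phase_commit("T1", sites, log))            # "abort": one site voted abort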
Phase 1: Obtaining a Decision
Phase 2: Recording the Decision
Two-Phase Commit Protocol
Handling of Failures - Site Failure
Handling of Failures - Coordinator Failure
Handling of Failures - Network Partition
If the coordinator and all its participants remain in one partition, the failure
has no effect on the commit protocol.
If the coordinator and its participants belong to several partitions:
• Sites that are not in the partition containing the coordinator think the
coordinator has failed, and execute the protocol to deal with failure of
the coordinator.
No harm results, but sites may still have to wait for a decision from the
coordinator.
• The coordinator and the sites that are in the same partition as the
coordinator think that the sites in the other partition have failed, and
follow the usual commit protocol.
Again, no harm results
Recovery and Concurrency Control
Avoiding Blocking During Consensus
Using Consensus to Avoid Blocking
Distributed Transactions via Persistent Messaging
Persistent Messaging
Error Conditions with Persistent Messaging
Persistent Messaging Implementation
Persistent Messaging (Cont.)
Receiving site may get duplicate messages after a very long delay
• To avoid keeping processed messages indefinitely (see the sketch below):
  Messages are given a timestamp
  Received messages older than some cutoff are ignored
  Stored messages older than the cutoff can be deleted at the receiving site
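A minimal sketch of the timestamp-cutoff rule above; the message format, clock source, and cutoff value are illustrative assumptions.

import time

CUTOFF_SECONDS = 24 * 3600        # assumed retention window

processed = {}                    # message_id -> timestamp of handled messages

def on_receive(message_id, msg_timestamp, payload, handler):
    if time.time() - msg_timestamp > CUTOFF_SECONDS:
        return                    # older than cutoff: ignore outright
    if message_id in processed:
        return                    # duplicate within the window: ignore
    handler(payload)              # process exactly once within the window
    processed[message_id] = msg_timestamp

def garbage_collect():
    # stored messages older than the cutoff can be deleted, since any
    # duplicate arriving later is already rejected by the age check
    now = time.time()
    for mid in [m for m, ts in processed.items() if now - ts > CUTOFF_SECONDS]:
        del processed[mid]

on_receive("m1", time.time(), "hello", print)   # prints "hello"
on_receive("m1", time.time(), "hello", print)   # duplicate: ignored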
Workflows provide a general model of transactional processing involving
multiple sites and possibly human processing of certain steps
• E.g., when a bank receives a loan application, it may need to
Contact external credit-checking agencies
Get approvals of one or more managers
and then respond to the loan application
• Persistent messaging forms the underlying infrastructure for workflows
in a distributed environment
Concurrency Control in Distributed Databases
Concurrency Control
Single-Lock-Manager Approach
Distributed Lock Manager
Deadlock Handling
Consider the following two transactions and history, with item X and
transaction T1 at site 1, and item Y and transaction T2 at site 2:
Deadlock Detection
Local and Global Wait-For Graphs
Example Wait-For Graph for False Cycles
Initial state:
False Cycles (Cont.)
Distributed Deadlocks
Leases
Leases (Cont.)
Coordinator must check that it still holds the lease when performing an action
• Due to delay between check and action, must check that expiry is at
least some time t’ into the future
t’ includes delay in processing and maximum network delay
Old messages must be ignored
Leases depend on clock synchronization
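A minimal sketch of the expiry-margin check above; the margin components and their values are illustrative assumptions.

import time

PROCESSING_DELAY = 0.050          # assumed max local processing delay (seconds)
MAX_NETWORK_DELAY = 0.200         # assumed max network delay (seconds)
T_PRIME = PROCESSING_DELAY + MAX_NETWORK_DELAY

class Lease:
    def __init__(self, expiry_time):
        self.expiry_time = expiry_time    # assumes synchronized clocks

    def safe_to_act(self):
        # the lease must stay valid at least t' beyond now, covering the
        # gap between this check and the action actually taking effect
        return self.expiry_time - time.time() >= T_PRIME

def perform_action(lease, action):
    if lease.safe_to_act():
        action()
    else:
        raise RuntimeError("lease too close to expiry; renew before acting")

lease = Lease(expiry_time=time.time() + 10)
perform_action(lease, lambda: print("action performed under valid lease"))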
Distributed Timestamp-Based Protocols
Distributed Timestamps
Distributed Timestamp Ordering
Distributed Validation
Distributed Validation (Cont.)
Replication
Consistency of Replicas
• Ideally: all replicas should have the same value, with updates performed
at all replicas
But what if a replica is not available (disconnected, or failed)?
• Suffices if reads get correct value, even if some replica is out of date
• Above idea formalized by linearizability: given a set of read and write
operations on a (replicated) data item
There must be a linear ordering of operations such that each read
sees the value written by the most recent preceding write
If o1 finishes before o2 begins (based on external time), then o1
must precede o2 in the linear order
Note that linearizability only addresses a single (replicated) data item;
serializability is orthogonal
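To make the definition concrete, a toy brute-force check for a single replicated register; the operation tuples and example history are illustrative assumptions.

from itertools import permutations

# (name, kind, value, start_time, end_time)
history = [
    ("w1", "write", 5, 0.0, 2.0),
    ("r1", "read",  5, 1.0, 3.0),   # overlaps w1: may be ordered either way
    ("r2", "read",  0, 2.5, 4.0),   # starts after w1 ends: must follow w1
]

def linearizable(ops, initial=0):
    for order in permutations(ops):
        # real-time rule: if a finishes before b begins, a precedes b
        if not all(not (a[4] < b[3] and order.index(a) > order.index(b))
                   for a in order for b in order):
            continue
        value, valid = initial, True
        for _, kind, v, _, _ in order:
            if kind == "write":
                value = v
            elif v != value:        # each read must see the latest preceding write
                valid = False
                break
        if valid:
            return True
    return False

print(linearizable(history))   # False: r2 begins after w1 ended, yet reads 0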
Concurrency Control With Replicas
Concurrency Control With Replicas (Cont.)
Majority protocol:
• Transaction requests locks at multiple/all replicas
• Lock is successfully acquired on the data item only if lock obtained
at a majority of replicas
Benefit: resilient to node failures and network failures
• Processing can continue as long as at least a majority of replicas are
accessible
Overheads
• Higher cost due to multiple messages
• Possibility of deadlock even when locking a single item
How can you avoid such deadlocks? (one approach is sketched below)
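A minimal sketch of majority locking, with in-process replica objects standing in for remote lock managers; requesting locks in a fixed global order of replicas is one standard way to avoid the single-item deadlock asked about above.

class Replica:
    def __init__(self, node_id):
        self.node_id = node_id
        self.locks = {}                       # item -> holding txn_id

    def try_lock(self, item, txn_id):
        if self.locks.get(item) in (None, txn_id):
            self.locks[item] = txn_id
            return True
        return False

    def unlock(self, item, txn_id):
        if self.locks.get(item) == txn_id:
            del self.locks[item]

def lock_item_majority(replicas, item, txn_id):
    granted = []
    # fixed order prevents two transactions from each holding locks
    # at half the replicas and waiting on the other half
    for replica in sorted(replicas, key=lambda r: r.node_id):
        if replica.try_lock(item, txn_id):
            granted.append(replica)
        if len(granted) > len(replicas) // 2:
            return granted                    # majority acquired: lock held
    for replica in granted:                   # no majority reachable: back out
        replica.unlock(item, txn_id)
    return None

replicas = [Replica(i) for i in range(5)]
print(lock_item_majority(replicas, "x", "T1") is not None)   # True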
Concurrency Control With Replicas (Cont.)
Biased protocol
• Shared lock can be obtained on any replica
Reduces overhead on reads
• Exclusive lock must be obtained on all replicas
Blocking if any replica is unavailable
Quorum Consensus Protocol
Dealing with Failures
Read one write all copies protocol assumes all copies are available
• Will block if any site is not available
Read one write all available (ignoring failed sites) is attractive, but
incorrect
• Failed link may come back up, without a disconnected site ever being
aware that it was disconnected
• The site then has old values, and a read from that site would return
an incorrect value
• With network partitioning, sites in each partition may update same
item concurrently
believing sites in other partitions have all failed
Handling Failures with Majority Protocol
Reducing Read Cost
Chain replication:
• Variant of primary copy scheme
• Replicas are organized into a chain
• Writes are done at head of chain, and passed on to subsequent replicas
• Reads performed at tail
  Ensures that reads see only fully replicated values
• Any node failure requires reconfiguration of chain
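A minimal sketch of the chain structure above; the node class and synchronous propagation are illustrative simplifications (real systems propagate asynchronously and acknowledge from the tail).

class ChainNode:
    def __init__(self, name):
        self.name, self.store, self.next = name, {}, None

    def write(self, key, value):
        self.store[key] = value
        if self.next is not None:
            self.next.write(key, value)   # pass the write down the chain

head, middle, tail = ChainNode("head"), ChainNode("middle"), ChainNode("tail")
head.next, middle.next = middle, tail

def client_write(key, value):
    head.write(key, value)                # all writes enter at the head

def client_read(key):
    return tail.store.get(key)            # tail holds only fully replicated values

client_write("x", 42)
print(client_read("x"))                   # 42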
Reconfiguration and Reintegration
Reconfiguration
Reconfiguration:
• Abort all transactions that were active at a failed site
• If replicated data items were at failed site, update system catalog to
remove them from the list of replicas.
This should be reversed when failed site recovers, but additional
care needs to be taken to bring values up to date
• If a failed site was a central server for some subsystem, an election
must be held to determine the new server
E.g., name server, concurrency coordinator, global deadlock
detector
Reconfiguration (Cont.)
Since a network partition may not be distinguishable from site failure, the
following situations must be avoided:
• Two or more central servers elected in distinct partitions
• More than one partition updates a replicated data item
Updates must be able to continue even if some sites are down
Solution: majority based approach
Site Reintegration
When a failed site recovers, it must catch up with all updates that it
missed while it was down
• Problem: updates may be happening to items whose replica is
stored at the site while the site is recovering
• Solution 1: halt all updates on system while reintegrating a site
Unacceptable disruption
• Solution 2: lock all replicas of all data items at the site, update to
latest version, then release locks
Can do this for one partition at a time
Comparison with Remote Backup
Remote backup systems (Section 19.7) are also designed to provide high
availability
Remote backup systems are simpler and have lower overhead
• All actions performed at a single site, and only log records shipped
• No need for distributed concurrency control or two-phase commit
Using distributed databases with replicas of data items can provide higher
availability by having multiple (> 2) replicas and using the majority
protocol
• Also avoids the failure detection and switchover time associated with
remote backup systems
Extended Concurrency Control Protocols
Multiversion 2PL and Globally Consistent Timestamps
Other Concurrency Control Techniques
Replication With Weak Degrees of Consistency
Consistency
Availability
CAP “Theorem”
CAP “Theorem” (Cont.)
Replication with Weak Consistency
Eventual Consistency
When no updates occur for a long period of time, eventually all updates
will propagate through the system and all the nodes will be consistent
For a given accepted update and a given node, eventually either the
update reaches the node or the node is removed from service
Known as BASE (Basically Available, Soft state, Eventual consistency),
as opposed to ACID
• Soft state: copies of a data item may be inconsistent
• Eventually consistent: copies may be allowed to become inconsistent, but
(once partitioning is resolved) eventually all copies become consistent
with each other, provided there are no further updates to that data item
Asynchronous Replication
Asynchronous View Maintenance
Requirements for Asynchronous View Maintenance
Requirements:
1. Updates must be delivered and processed exactly once despite failures
2. Derived data (such as materialized views/indices) must be updated in
such a way that it will be consistent with the underlying data
• Formalized as eventual consistency: if there are no updates for a
while, eventually the derived data will be consistent with the
underlying data
3. Queries should get a transactionally consistent view of derived data
• Potentially a problem with long queries that span multiple nodes
• E.g., without transactional consistency, a scan of a relation may miss
some older updates and see some later updates
• Not supported by many systems, supported via snapshots in some
systems
Detecting Inconsistency
Vector Clocks
Example of Vector Clock in action
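As a grounding aid, a minimal vector clock sketch with illustrative node names; it shows the update, merge, and comparison rules used to detect divergent replica versions.

def vc_tick(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1            # local event at `node`
    return clock

def vc_merge(a, b):
    # pointwise maximum, applied when a node receives a remote version
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def vc_happens_before(a, b):
    # a precedes b iff a <= b pointwise and a != b; incomparable clocks
    # indicate divergent (conflicting) versions that need reconciliation
    return all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b)) and a != b

v1 = vc_tick({}, "n1")                              # {n1: 1}
v2 = vc_tick(v1, "n2")                              # descends from v1
v3 = vc_tick(v1, "n3")                              # also descends from v1
print(vc_happens_before(v1, v2))                    # True
print(vc_happens_before(v2, v3), vc_happens_before(v3, v2))  # False False: conflict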
Extensions for Detecting Inconsistency
Two replicas may diverge, and the divergence is not detected until the
replicas are read
• To detect divergence early, one approach is to scan all replicas of all
items periodically
  But this imposes a lot of network, CPU, and I/O load
Alternative approach based on Merkle trees covered shortly
How to Reconcile Inconsistent Versions?
Order Independent Operations
Detecting Differences Using Merkle Trees
Detecting Differences Using Merkle Trees (Cont.)
Weak Consistency Models for Applications
Read-your-writes
• If a process has performed a write, a subsequent read will reflect the
earlier write operation
Session consistency
• Read-your-writes in the context of a session, where the application
connects to the storage system
Monotonic consistency
• For reads: later reads never return an older version than earlier reads
• For writes: serializes writes by a single process
Minimum requirement: sticky sessions, where all operations from a session
on a data item go to the same node
Can be implemented by specifying a version vector in get() operations
• Result of get guaranteed to be at least as new as specified version
vector
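A minimal sketch of that version-vector guarantee; the store objects and staleness error are illustrative assumptions.

def merge(a, b):
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def dominates(a, b):          # a is at least as new as b, pointwise
    return all(a.get(n, 0) >= b.get(n, 0) for n in set(a) | set(b))

class StoreNode:
    def __init__(self):
        self.data = {}        # key -> (value, version vector)

    def get(self, key, min_version):
        value, version = self.data.get(key, (None, {}))
        if not dominates(version, min_version):
            raise RuntimeError("replica too stale for this session; retry elsewhere")
        return value, version

class Session:
    def __init__(self):
        self.last_seen = {}   # newest version vector this session has observed

    def read(self, node, key):
        value, version = node.get(key, min_version=self.last_seen)
        self.last_seen = merge(self.last_seen, version)
        return value

fresh, stale = StoreNode(), StoreNode()
fresh.data["x"] = ("new", {"a": 2})
stale.data["x"] = ("old", {"a": 1})
s = Session()
print(s.read(fresh, "x"))     # "new"; the session now requires at least {a: 2}
# s.read(stale, "x") would now raise: the stale replica cannot serve this session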
Coordinator Selection
Backup coordinators
• Backup coordinator maintains enough information locally to assume
the role of coordinator if the actual coordinator fails
executes the same algorithms and maintains the same internal
state information as the actual coordinator
• Allows fast recovery from coordinator failure, but involves overhead
during normal processing
Backup coordinator approach vulnerable to two-site failure
• Failure of coordinator and backup leads to non-availability
• Key question: how to choose a new coordinator from a set of
candidates
Choice done by a master: vulnerable to master failure
Election algorithms are key
Coordinator Selection
Coordinator selection using a fault-tolerant lock manager
• Coordinator gets a lease on a coordinator lock, and renews the lease
as long as it is alive
• If coordinator dies or gets disconnected, lease is lost
• Other nodes can detect coordinator failure using heart-beat messages
• Nodes request coordinator lock lease from lock manager; only 1 node
gets the lease, and becomes new coordinator
Fault-tolerant coordination services such as ZooKeeper, Chubby
• Provide fault-tolerant lock management services
• And are widely used for coordinator selection
• Store small amounts of data in files
• Create and delete files
Which can be used as locks/leases
• Coordinator releases lease if it is not renewed in time
• Can watch for changes on a file
• But these services themselves need a coordinator…
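A minimal sketch of lease-based coordinator selection; the in-process LeaseService stands in for a fault-tolerant lock service, and is not the ZooKeeper or Chubby API.

import time

class LeaseService:
    def __init__(self):
        self.holder, self.expiry = None, 0.0

    def try_acquire(self, node_id, ttl):
        now = time.time()
        if self.holder is None or now >= self.expiry:   # expired lease is released
            self.holder, self.expiry = node_id, now + ttl
        return self.holder == node_id

    def renew(self, node_id, ttl):
        if self.holder == node_id and time.time() < self.expiry:
            self.expiry = time.time() + ttl             # coordinator stays alive
            return True
        return False                                    # lease lost: step down

service = LeaseService()
print(service.try_acquire("n1", ttl=10))   # True: n1 becomes coordinator
print(service.try_acquire("n2", ttl=10))   # False: n2 watches and waits
print(service.renew("n1", ttl=10))         # True: renewal keeps the lease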
Election of Coordinator
Election algorithms
• Used to elect a new coordinator in case of failures
Heartbeat messages used to detect failure of coordinator
• One-time election protocol
Proposers: Nodes that propose themselves as coordinator and send
vote requests to other nodes
Acceptors: Nodes that can vote for candidate proposers
Learners: Nodes that ask acceptors who they voted for, to find
winner
• A node can perform all above roles
Problems with this protocol
• What if no one won the election due to split vote?
• If election is rerun, need to identify which election a request is for
• General approach
Candidates make a proposal with a term number
• Term number is 1 more than the term number of the previous election
known to the candidate
Election of Coordinator
Election algorithms (Cont.)
• Stale messages corresponding to old terms can be ignored
If a candidate wins a majority vote, it becomes coordinator
Otherwise the election is rerun with the term number incremented
Minimizing chances of split elections:
• Use node IDs to decide who to vote for
e.g., max node ID (Bully algorithm)
Candidates withdraw if they find another candidate with
higher ID
• Randomized retry: candidates wait for random time intervals
before retrying
High probability that only one node is asking to be elected at
a time
• Special case of distributed consensus
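A minimal sketch of term-numbered majority election; the in-process vote requests are an illustrative stand-in for messages.

class Node:
    def __init__(self, node_id):
        self.node_id, self.current_term, self.voted_for = node_id, 0, {}

    def request_vote(self, term, candidate_id):
        if term < self.current_term:
            return False                         # stale message from old term: ignore
        self.current_term = max(self.current_term, term)
        if term not in self.voted_for:
            self.voted_for[term] = candidate_id  # at most one vote per term
        return self.voted_for[term] == candidate_id

def run_election(candidate, peers):
    candidate.current_term += 1                  # term = 1 + last known term
    term = candidate.current_term
    candidate.voted_for[term] = candidate.node_id
    votes = 1 + sum(p.request_vote(term, candidate.node_id) for p in peers)
    return votes > (len(peers) + 1) // 2         # majority wins

nodes = [Node(i) for i in range(5)]
print(run_election(nodes[0], nodes[1:]))         # True: uncontested election
# on a split vote, each candidate would sleep a random interval
# before rerunning with an incremented term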
Issues with Multiple Coordinators
Distributed Consensus
Distributed Consensus: Overview
Distributed Consensus: Overview (Cont.)
Paxos Consensus Protocol
Paxos Consensus Protocol: Overview
Paxos: Overview
Paxos Made Simple
Phase 1
• Phase 1a: A proposer selects a proposal number n and sends a
prepare request with number n to a majority of acceptors
Number has to be chosen in some unique way
• Phase 1b: If an acceptor receives a prepare request with number n
If n is less than that of any prepare request to which it has already
responded then it ignores the request
Else it remembers n and responds to the request
• If it has already accepted a proposal with number m and value
v, it sends (m, v) with the response
• Otherwise it indicates to the proposer that it has not accepted
any value earlier
• NOTE: responding is NOT the same as accepting
Paxos Made Simple
Phase 2
• Phase 2a: Proposer Algorithm: If the proposer receives a response to
its prepare requests (numbered n) from a majority of acceptors
then it sends an accept request to each of those acceptors for a
proposal numbered n with a value v, where v is
• the value selected by the proposer if none of the acceptors
indicated it had already accepted a value.
• Otherwise v is the value of the highest-numbered proposal
among the responses
i.e., the proposer backs off from its own proposal and votes for
the highest numbered proposal already accepted by at least
one acceptor
If proposer does not hear from a majority it takes no further action
in this round
Paxos Made Simple
Phase 2
• Phase 2b: Acceptor Algorithm: If an acceptor receives an accept
request for a proposal numbered n,
If it has earlier responded to a prepare message with number n1
> n it ignores the message
Otherwise it accepts the proposed value v with number n.
• Note: acceptor may accept different values with increasing n
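Putting the two phases together, a compact single-decree sketch; message passing is simulated by direct method calls, an illustrative simplification.

class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest prepare number responded to
        self.accepted = None      # (m, v) of highest-numbered accepted proposal

    def prepare(self, n):         # Phase 1b
        if n < self.promised_n:
            return None           # already responded to a higher number: ignore
        self.promised_n = n
        return ("promise", self.accepted)   # responding is NOT accepting

    def accept(self, n, v):       # Phase 2b
        if n < self.promised_n:
            return False          # promised to a higher-numbered proposer
        self.promised_n = n
        self.accepted = (n, v)    # may accept different values with increasing n
        return True

def propose(acceptors, n, my_value):
    replies = [a.prepare(n) for a in acceptors]         # Phase 1a
    promises = [r for r in replies if r is not None]
    if len(promises) <= len(acceptors) // 2:
        return None               # no majority responded: stop for this round
    # Phase 2a: adopt the highest-numbered already-accepted value, if any
    prior = [acc for _, acc in promises if acc is not None]
    v = max(prior)[1] if prior else my_value
    acks = sum(1 for a in acceptors if a.accept(n, v))  # a majority suffices
    return v if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, 1, "A"))   # A: chosen
print(propose(acceptors, 2, "B"))   # A again: the chosen value survives higher n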
Paxos Details
Key idea: if a majority of acceptors accept a value v (with number n), then
even if there are further proposals with number n1 > n, the value proposed
will be value v
• Why?:
A value can be accepted with number n only if a majority of nodes
(say P) respond to a prepare message with number n
Any subsequent majority (say A) will have nodes in common with
the first majority P, and at least one of those nodes would have
responded with value v and number n
• If a higher numbered proposal p was accepted earlier by even
one node, a majority would have responded to p, and will ignore
n
Further rounds will use this value v (since highest accepted value
is used in Phase 2a)
Paxos Details (Cont.)
The Raft Consensus Protocol
The Log-Based Consensus Protocols
The Raft Consensus Algorithm
The Raft Leader Election
Example of Raft Logs
Raft Log Replication
Raft AppendEntries Procedure
Raft AppendEntries Procedure (Cont.)
Raft Leader Replacement
Raft protocol ensures any node elected as leader has all committed log
entries
• Candidate must send information about its own log state when
seeking votes
• Node votes for a candidate only if the candidate's log state is at least
as up-to-date as its own (we omit details)
• Since a majority have voted for the new leader, any committed log entry
will be in the new leader's log
Raft forces all other nodes to replicate the leader's log
• Log records at new leader may get committed when log gets replicated
• Leader cannot count the number of replicas with a record from an earlier
term and declare it committed if it is at a majority
Details are subtle, and omitted
Instead, leader must replicate a new log record in its current term
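A minimal sketch of the up-to-date check used when granting votes; it follows the Raft rule (compare terms of the last entries, then log lengths), with illustrative demo values.

def log_is_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    # the candidate's last entry must be from a term at least as new,
    # and, on equal terms, its log must be at least as long
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

print(log_is_up_to_date(3, 5, 2, 9))   # True: newer last term wins, even if shorter
print(log_is_up_to_date(2, 5, 2, 9))   # False: same term, candidate's log is behind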
Raft Protocol
There are many more subtle details that need to be taken care of
• Consistency even in face of multiple failures and restarts
• Maintaining cluster membership, cluster membership changes
Raft has been proven formally correct
See bibliographic notes for more details of above
Fault-Tolerant Services using Replicated State Machines
Key requirement: make a service fault tolerant
• E.g., lock manager, key-value storage system, ….
State machines are a powerful approach to creating such services
A state machine
• Has a stored state, and receives inputs
• Makes state transitions on each input, and may output some results
Transitions and output must be deterministic
A replicated state machine is a state machine that is replicated on multiple
nodes
• All replicas must get exactly the same inputs
Replicated log! State machine processes only committed inputs!
• Even if some of the nodes fail, state and output can be obtained from other
nodes
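A minimal sketch of a deterministic state machine driven by the committed prefix of a replicated log; the key-value commands are illustrative.

class KVStateMachine:
    def __init__(self):
        self.state = {}
        self.applied = 0                # index of last applied log entry

    def apply(self, command):           # deterministic: same input, same result
        op, key, *rest = command
        if op == "put":
            self.state[key] = rest[0]
            return "ok"
        if op == "get":
            return self.state.get(key)

def apply_committed(machine, log, commit_index):
    # every replica applies, in order, exactly the committed log prefix
    outputs = []
    while machine.applied < commit_index:
        outputs.append(machine.apply(log[machine.applied]))
        machine.applied += 1
    return outputs

log = [("put", "x", 1), ("get", "x")]   # replicated by consensus (e.g., Raft)
replicas = [KVStateMachine() for _ in range(3)]
for r in replicas:
    apply_committed(r, log, commit_index=2)
print(all(r.state == {"x": 1} for r in replicas))   # True: identical state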
Replicated State Machine
Uses of Replicated State Machines
Two-Phase Commit Using Consensus
End of Chapter 23
Extra Slides – Material Not in Text
Weak Consistency
Miscellaneous
Dynamo: Basics
Performing Put/Get Operations
How to Reconcile Inconsistent Versions?
Availability vs Latency
Amazon Dynamo
Bully Algorithm Details
Bully Algorithm (Cont.)
If no message is sent within T’, assume the site with a higher number has
failed; Si restarts the algorithm.
After a failed site recovers, it immediately begins execution of the same
algorithm.
If there are no active sites with higher numbers, the recovered site forces
all processes with lower numbers to let it become the coordinator site, even
if there is a currently active coordinator with a lower number.
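A minimal sketch of the election step described above, with in-process objects simulating the messages and a failed site simply not answering.

class Site:
    def __init__(self, site_id, alive=True):
        self.site_id, self.alive = site_id, alive
        self.coordinator = None

    def answer_election(self):
        return self.alive                 # a live higher-numbered site answers

def start_election(me, others):
    higher = [s for s in others if s.site_id > me.site_id]
    if any(s.answer_election() for s in higher):
        return None                       # defer: a higher-numbered site takes over
    for s in others:                      # no answer within T': bully everyone
        s.coordinator = me.site_id
    me.coordinator = me.site_id
    return me.site_id

sites = [Site(1), Site(2), Site(3, alive=False), Site(4, alive=False)]
print(start_election(sites[1], [s for s in sites if s is not sites[1]]))   # 2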