The Raft Consensus Algorithm
Lecture Notes for CS 190
Winter 2020
John Ousterhout
Motivation: Replicated State Machines
- Service that is replicated on multiple machines:
- All instances produce the same behavior
- Fault tolerant: Available for service as long as a majority of
the servers are up
Raft Basics
- Leader based:
- Pick one server to be leader
- It accepts client requests, manages replication
- Pick a new leader if the current one fails
- Server states (see the sketch after this list):
- Leader
- Follower: totally passive
- Candidate: used for electing new leaders
- Time divided into terms:
- Terms have numbers that increase monotonically
- Each term starts with an election
- One or more candidates attempting to become leader
- Winning candidate (if any) serves as leader for the rest of
the term
- Terms allow detection of stale information
- Each server stores current term
- Checked on every request
- Term numbers increase monotonically
- Different servers may observe term transitions at different
times (or miss some terms entirely)
- Request-response protocol between servers (remote procedure calls,
or RPCs); two request types:
- RequestVote
- AppendEntries (also serves as heartbeat)
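
A minimal sketch of this per-server state in Go (the Raft struct, the
observeTerm helper, and the -1 sentinel for votedFor are illustrative
assumptions, not prescribed by the paper):

    package raft

    import "sync"

    // State is the role a server currently plays.
    type State int

    const (
        Follower  State = iota // totally passive
        Candidate              // running an election
        Leader                 // handles client requests, replication
    )

    // LogEntry holds one state-machine command and the term in
    // which it was created.
    type LogEntry struct {
        Term    int
        Command []byte // opaque state-machine command
    }

    // Raft is the per-server state; all fields are guarded by mu.
    type Raft struct {
        mu          sync.Mutex
        me          int   // this server's id
        peers       []int // ids of all servers, including me
        state       State
        currentTerm int // latest term this server has seen
        votedFor    int // candidate voted for in currentTerm, -1 if none
        log         []LogEntry
    }

    // observeTerm implements the stale-term rule: every request and
    // response carries a term, and seeing a newer one forces this
    // server back to follower.
    func (rf *Raft) observeTerm(term int) {
        if term > rf.currentTerm {
            rf.currentTerm = term
            rf.votedFor = -1
            rf.state = Follower
        }
    }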
Leader Election
- All servers start as followers
- No heartbeat (AppendEntries) within the election timeout? Start an
election (sketched in code after this list):
- Increment current term
- Vote for self
- Send RequestVote requests to all other servers
- Election outcomes:
- Receive votes from majority of cluster: become leader,
start sending heartbeats to prevent additional elections
- Receive AppendEntries from some other server in the new term:
revert back to follower
- Timeout: start a new election
- Each server votes for at most one candidate in a given term
- Election Safety: only one server can be elected leader in a given term
- Availability: randomized election timeouts reduce split votes
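
A hedged sketch of the candidate path, continuing the struct above;
the timeout bounds are illustrative, and sendRequestVote and
becomeLeader stand in for the real RPC and leader-setup code (not
defined here):

    import (
        "math/rand"
        "sync"
        "time"
    )

    // electionTimeout returns a randomized timeout; randomization
    // keeps servers from timing out together and splitting the vote.
    func electionTimeout() time.Duration {
        return time.Duration(150+rand.Intn(150)) * time.Millisecond
    }

    func (rf *Raft) startElection() {
        rf.mu.Lock()
        rf.state = Candidate
        rf.currentTerm++    // increment current term
        rf.votedFor = rf.me // vote for self
        term := rf.currentTerm
        rf.mu.Unlock()

        votes := 1 // our own vote
        var mu sync.Mutex
        for _, peer := range rf.peers {
            if peer == rf.me {
                continue
            }
            go func(peer int) {
                if !rf.sendRequestVote(peer, term) {
                    return // vote denied or RPC failed
                }
                mu.Lock()
                defer mu.Unlock()
                votes++
                if votes*2 > len(rf.peers) {
                    // becomeLeader must re-check that we are still a
                    // candidate in this term, then start heartbeats.
                    rf.becomeLeader(term)
                }
            }(peer)
        }
    }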
Log Replication
- Handled by leader
- When client request arrives:
- Append to local log
- Replicate to other servers using AppendEntries requests
- Committed: entry replicated by leader to majority of servers
- Once committed, apply to local state machine
- Return result to client
- Notify other servers of committed entries in future AppendEntries requests
- Log entries: index, term, command
- Logs can become inconsistent after leader crashes
- Raft maintains a high level of coherency between logs (Log Matching Property):
- If entries in different logs have same term and index, then
- They also have the same command
- The two logs are identical up through that entry
- AppendEntries consistency check preserves the above properties
(see the handler sketch after this list).
- Leader forces other logs to match its own:
- Maintains a nextIndex for each follower (initialized to the
leader's log length)
- If AppendEntries fails, reduce nextIndex for that follower
and retry.
- If the follower has conflicting entries but the consistency check
passes, it removes the conflicting entries and all entries after them
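
A sketch of the follower side of this check, continuing the earlier
types; log indexes are 1-based in the protocol but stored 0-based in
the slice, and the commitIndex field is an assumption:

    // AppendEntriesArgs follows Figure 2 of the Raft paper.
    type AppendEntriesArgs struct {
        Term         int        // leader's term
        PrevLogIndex int        // index of entry preceding Entries
        PrevLogTerm  int        // term of that entry
        Entries      []LogEntry // empty for a heartbeat
        LeaderCommit int        // leader's commit index
    }

    func (rf *Raft) handleAppendEntries(args *AppendEntriesArgs) bool {
        // (Stale-term rejection via observeTerm is elided here.)
        // Consistency check: we must already hold the entry at
        // PrevLogIndex with a matching term.
        if args.PrevLogIndex > len(rf.log) ||
            (args.PrevLogIndex > 0 &&
                rf.log[args.PrevLogIndex-1].Term != args.PrevLogTerm) {
            return false // leader will decrement nextIndex and retry
        }
        for i, e := range args.Entries {
            idx := args.PrevLogIndex + i // 0-based slice position
            if idx < len(rf.log) && rf.log[idx].Term != e.Term {
                rf.log = rf.log[:idx] // drop conflicting suffix
            }
            if idx >= len(rf.log) {
                rf.log = append(rf.log, e)
            }
        }
        if args.LeaderCommit > rf.commitIndex {
            // min is the Go 1.21 builtin.
            rf.commitIndex = min(args.LeaderCommit,
                args.PrevLogIndex+len(args.Entries))
        }
        return true
    }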
Safety
- Must ensure that the leader for new term always holds all of the
log entries committed in previous terms (Leader Completeness Property).
- Step 1: restriction on elections: don't vote for a candidate unless
the candidate's log is at least as up-to-date as yours.
- Compare the last log entries: a higher last term wins; if the terms
are equal, the longer log wins (sketched after this list).
- Step 2: be very careful about when an entry is considered committed
- As soon as replicated on majority? Unsafe!
- Committed => entry in current term replicated on majority
- All preceding entries also committed because of Log Matching Property
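
A small sketch of that comparison, under the same assumptions as the
earlier fragments:

    // candidateLogUpToDate reports whether a candidate whose last
    // entry is (lastIndex, lastTerm) is at least as up-to-date as us.
    func (rf *Raft) candidateLogUpToDate(lastIndex, lastTerm int) bool {
        myLastIndex := len(rf.log)
        myLastTerm := 0
        if myLastIndex > 0 {
            myLastTerm = rf.log[myLastIndex-1].Term
        }
        if lastTerm != myLastTerm {
            return lastTerm > myLastTerm // higher last term wins
        }
        return lastIndex >= myLastIndex // equal terms: longer log wins
    }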
Persistent Storage
- Each server stores the following in persistent storage (e.g. disk
or flash; a serialization sketch follows this list):
- Current term
- Most recent vote granted
- Log
- These must be recovered from persistent storage after a crash
- If a server loses its persistent storage, it cannot participate in
the cluster anymore
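
A sketch of serializing exactly these three fields with encoding/gob;
where the bytes go, and making the write atomic and durable (e.g.
fsync), are deliberately left out:

    import (
        "bytes"
        "encoding/gob"
    )

    // persistentState serializes currentTerm, votedFor, and the log;
    // the caller must write the bytes durably before replying to any
    // RPC that changed them.
    func (rf *Raft) persistentState() ([]byte, error) {
        var buf bytes.Buffer
        enc := gob.NewEncoder(&buf)
        for _, v := range []interface{}{rf.currentTerm, rf.votedFor, rf.log} {
            if err := enc.Encode(v); err != nil {
                return nil, err
            }
        }
        return buf.Bytes(), nil
    }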
Implementing Raft
- Follow Figure 2 of the Raft paper exactly! Every tiny detail matters.
Client Interactions
- Clients interact only with the leader
- Initially, a client can send a request to any server
- If that server is not the leader, it rejects the request and
returns info about the most recent leader
- Client retries until it reaches leader
- If leader crashes while executing a client request, the client
retries (with a new randomly-chosen server) until the request succeeds
- This can result in multiple executions of a command: not consistent!
- Goal: linearizability: System behaves as if each operation
is executed exactly once, atomically, sometime between sending of
the request and receipt of the response.
- Solution (sketched in code after this list):
- Client generates unique serial number for each request
(client id, request number)
- When retrying, client reuses the original serial number
- Servers track latest serial number processed for each client,
plus associated response (must be stored persistently)
- When leader detects duplicate, it sends old response without
reexecuting request
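
A sketch of the server-side duplicate filter, applied as log entries
reach the state machine; the (clientID, seq) fields and the execute
helper are illustrative, and persisting the session table is omitted:

    // session records the latest request seen from one client.
    type session struct {
        lastSeq  int    // highest serial number applied
        lastResp []byte // response returned for it
    }

    type stateMachine struct {
        sessions map[int64]session // keyed by client id
    }

    func newStateMachine() *stateMachine {
        return &stateMachine{sessions: make(map[int64]session)}
    }

    // apply runs a command unless it duplicates one already executed,
    // in which case the saved response is returned instead.
    func (sm *stateMachine) apply(clientID int64, seq int, cmd []byte) []byte {
        if s, ok := sm.sessions[clientID]; ok && seq <= s.lastSeq {
            return s.lastResp // duplicate: do not re-execute
        }
        resp := sm.execute(cmd) // stand-in for the real command logic
        sm.sessions[clientID] = session{lastSeq: seq, lastResp: resp}
        return resp
    }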
Other Issues
- Cluster membership
- Log compaction
- See paper for details