A linearizable replicated log
- Raft picks one server to be leader
- clients send Append/Get RPCs to leader
- leader sends each client command to all replicas
- each follower appends to log
- goal is to have identical logs
- log entry is “committed” if a majority put it in their logs; it won’t be forgotten
- majority -> can proceed despite minority of failed servers
- servers execute entry once the leader says it’s committed
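The majority rule above can be sketched in a few lines. This is only an illustration of the simplified rule as stated here (the corner case later in these notes refines it); the function names and the list-of-lists log representation are assumptions made for the example.

```python
def majority(n_servers: int) -> int:
    """Smallest number of servers that forms a majority."""
    return n_servers // 2 + 1


def on_majority(logs: list[list[str]], index: int) -> bool:
    """True if a majority of the logs hold an entry at position `index` (0-based)."""
    holders = sum(1 for log in logs if len(log) > index)
    return holders >= majority(len(logs))


# Three replicas; the third one lags by one entry.
logs = [["x=1", "y=2"], ["x=1", "y=2"], ["x=1"]]
```

With three servers a majority is two, so the second entry counts as replicated even though one replica lags, and the system could still make progress with one server down.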
Will the Raft logs always be exactly the same on all replicas?
- no: some replicas may lag
- no: we’ll see that they can have different entries!
- if a server has executed a command in a given entry number:
- no other server will execute a different command for that entry
- i.e. the servers will agree on the command for each entry
- State Machine Safety (Figure 3)
Two parts
- electing a new leader
- ensuring identical logs after failures
Leader election
- Raft numbers the sequence of leaders
- new leader -> new term
- a term has at most one leader; might have no leader
- the numbering helps servers follow latest leader, not superseded leader
- when does Raft start a leader election?
- other servers don’t hear from current leader for a while
- they increment local currentTerm, become candidates, start election
- how to ensure at most one leader in a term?
- (Figure 2 RequestVote RPC and Rules for Servers)
- leader must get votes from a majority of servers
- each server can cast only one vote per term
- votes for first server that asks (within Figure 2 rules)
- at most one server can get majority of votes for a given term
- -> at most one leader even if network partition
- -> election can succeed even if some servers have failed
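A minimal sketch of the one-vote-per-term rule, assuming a hypothetical `Server` class with `current_term` and `voted_for` fields. It deliberately leaves out the log comparison that Figure 2 also requires, so it shows only why two candidates cannot both collect a majority in the same term.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    current_term: int = 0
    voted_for: Optional[int] = None  # candidate id voted for in current_term

    def request_vote(self, term: int, candidate_id: int) -> bool:
        """Grant or deny a vote (log up-to-date check omitted in this sketch)."""
        if term < self.current_term:
            return False  # candidate is from a superseded term
        if term > self.current_term:
            # Newer term: adopt it and forget any vote cast in the older term.
            self.current_term = term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id  # first asker wins this server's vote
            return True
        return False


s = Server()
results = [
    s.request_vote(1, 10),  # first request in term 1
    s.request_vote(1, 11),  # second candidate, same term
    s.request_vote(2, 11),  # new term, vote is available again
    s.request_vote(1, 10),  # stale term
]
```

Since each server hands out at most one vote per term, and any two majorities share at least one server, two candidates cannot both win the same term.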
- how does a server know that election succeeded?
- winner gets yes votes from majority
- others see the AppendEntries heart-beats from winner
- an election may not succeed
- none gets majority
- even # of live servers, two candidates each get half
- less than a majority of servers are reachable
- what happens after a failed election?
- another timeout, increment currentTerm, become candidate
- higher term takes precedence, candidates for older terms quit
- how does Raft reduce chances of election failure due to split vote?
- each server delays a random amount of time before starting candidacy
- why is the random delay useful?
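One way to see the effect of the random delay is a small sketch. The function names are made up for this example and the 150 to 300 ms range is only a placeholder; the point is that with independent random timeouts one server usually fires first and can ask for votes before the others become candidates.

```python
import random


def election_timeout_ms(rng: random.Random, low: float = 150.0, high: float = 300.0) -> float:
    """Pick a fresh random election timeout (range is an assumed placeholder)."""
    return rng.uniform(low, high)


def first_candidate(timeouts: dict[str, float]) -> str:
    """The server whose timer expires first starts the election."""
    return min(timeouts, key=timeouts.get)


rng = random.Random(42)  # seeded only to make the sketch repeatable
timeouts = {server: election_timeout_ms(rng) for server in ("S1", "S2", "S3")}
winner = first_candidate(timeouts)
```

A server would pick a new random timeout each time it resets its timer, so a split vote in one round is unlikely to repeat in the next.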
- what if the old leader isn’t aware a new one is elected?
- a new leader elected means a majority of servers have incremented currentTerm
- so the old leader (w/ old term) can’t get majority for AppendEntries
- so the old leader won’t commit or execute any new log entries
- thus no split brain despite partition
- but a minority may accept old server’s AppendEntries
- so logs may diverge at the end of the old term
Log synchronization after failures
- how can logs disagree?
- a log might be short – missing entries at end of the term
leader of term 3 crashes before sending all AppendEntries
S1: 3
S2: 3 3
S3: 3 3
- logs might have different commands in same entry!
after a series of leader crashes, e.g.
    10 11 12 13  <- log entry #
S1:  3
S2:  3  3  4
S3:  3  3  5
- new leader will force its log on followers; example:
- S3 is chosen as new leader for term 6
- S3 sends a new command, entry 13, term 6
- AppendEntries, previous entry 12, previous term 5
- S2 replies false (AppendEntries step 2)
- S3 decrements nextIndex[S2] to 12
- S3 sends AppendEntries, prev entry 11, prev term 3
- S2 deletes entry 12 (AppendEntries step 3)
- similar story for S1, but have to go back one farther
- the result of roll-back:
- each live follower deletes tail of log that differs from leader
- thus live followers’ logs are prefixes of leader’s log
- and live followers that keep up will have logs identical to leader’s, except they may be missing the few most recent entries
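The roll-back example with S1, S2 and S3 can be replayed with a small sketch. Logs are modeled as a dict from entry number to term, which is an assumption made for brevity, and the function names are hypothetical. The follower side follows AppendEntries steps 2 and 3 as described above, and the leader side simply decrements nextIndex until the follower accepts.

```python
def append_entries(log: dict[int, int], prev_index: int, prev_term, entries: dict[int, int]) -> bool:
    """Follower side. `log` maps entry number -> term."""
    # Step 2: reject if the entry before the new ones is missing or has another term.
    if log.get(prev_index) != prev_term:
        return False
    for index in sorted(entries):
        term = entries[index]
        # Step 3: on a conflict, delete that entry and everything after it.
        if index in log and log[index] != term:
            for stale in [i for i in log if i >= index]:
                del log[stale]
        log[index] = term
    return True


def sync_follower(leader_log: dict[int, int], follower_log: dict[int, int], next_index: int) -> int:
    """Leader side: back up nextIndex until AppendEntries succeeds; return the final nextIndex."""
    while True:
        prev_index = next_index - 1
        entries = {i: t for i, t in leader_log.items() if i >= next_index}
        # A missing previous entry on both sides is treated as "start of log" (simplification).
        if append_entries(follower_log, prev_index, leader_log.get(prev_index), entries):
            return next_index
        next_index -= 1


leader = {10: 3, 11: 3, 12: 5, 13: 6}  # S3 after appending entry 13 in term 6
s2 = {10: 3, 11: 3, 12: 4}
s1 = {10: 3}
s2_next = sync_follower(leader, s2, 13)
s1_next = sync_follower(leader, s1, 13)
```

S2 accepts once nextIndex reaches 12 and S1 once it reaches 11, after which both logs match the leader's.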
- faster rollback?
- when rejecting AppendEntries, the follower returns the beginning/end index of the conflicting term, so the leader can skip back a whole term at a time
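One possible shape for this optimization, as a sketch only: the follower reports the conflicting term and the first entry it holds for that term, and the leader picks the new nextIndex from that hint. The exact fields and the leader's rule are assumptions for illustration, not something these notes specify.

```python
def reject_hint(log: dict[int, int], prev_index: int):
    """Follower: explain a rejection.

    Returns (conflict_term, first_index_of_that_term), or
    (None, last_index + 1) when the log is too short to contain prev_index.
    """
    if prev_index not in log:
        return None, max(log, default=0) + 1
    term = log[prev_index]
    first = prev_index
    while first - 1 in log and log[first - 1] == term:
        first -= 1
    return term, first


def next_index_after_reject(leader_log: dict[int, int], conflict_term, hint_index: int) -> int:
    """Leader: choose the next nextIndex from the follower's hint."""
    if conflict_term is None:
        return hint_index  # follower's log is short: resume right after its end
    in_term = [i for i, t in leader_log.items() if t == conflict_term]
    if in_term:
        return max(in_term) + 1  # leader has that term: resume after its last entry
    return hint_index  # leader lacks that term: overwrite it from its start


leader = {10: 3, 11: 3, 12: 5, 13: 6}
s2_next = next_index_after_reject(leader, *reject_hint({10: 3, 11: 3, 12: 4}, 12))
s1_next = next_index_after_reject(leader, *reject_hint({10: 3}, 12))
```

In the earlier example this reaches the right nextIndex for both S1 and S2 after a single rejection instead of one rejection per entry.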
New leader must be “up to date”
- could new leader roll back executed entries from end of previous term?
- i.e. could an executed entry be missing from the new leader’s log?
- this would be a disaster – violates State Machine Safety
- solution: Raft won’t elect a leader that might not have an executed entry
- could we choose leader with longest log?
example:
S1: 5 6 7
S2: 5 8
S3: 5 8
- first, could this scenario happen? how?
- S1 leader in term 6; crash+reboot; leader in term 7; crash and stay down
- both times it crashed after only appending to its own log
- S2 leader in term 8, only S2+S3 alive, then crash
- who should be next leader?
- S1 has longest log, but entry 8 could have been executed !!!
- so new leader can only be one of S2 or S3
- i.e. the rule cannot be simply choosing the “longest log”
- end of 5.4.1 explains “at least as up to date” voting rule
- compare last entry – higher term wins
- if equal terms, longer log wins
- so only S2 or S3 can be leader, will force S1 to discard 6,7
- ok since no majority -> not executed -> no client reply
- the point:
- “at least as up to date” rule ensures new leader’s log contains all potentially executed entries
- so new leader won’t roll back any executed operation
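The voting comparison can be written as a small sketch, applied to the S1/S2/S3 example above. Logs are represented as lists of terms and the function names are made up for the example; an empty log is assumed to have last term 0.

```python
def last_term(log: list[int]) -> int:
    """Term of the last entry, or 0 for an empty log (assumption for this sketch)."""
    return log[-1] if log else 0


def at_least_as_up_to_date(candidate_log: list[int], voter_log: list[int]) -> bool:
    """Compare last entries: higher term wins; if terms are equal, the longer log wins."""
    return (last_term(candidate_log), len(candidate_log)) >= (last_term(voter_log), len(voter_log))


s1, s2, s3 = [5, 6, 7], [5, 8], [5, 8]
s1_gets_vote_from_s2 = at_least_as_up_to_date(s1, s2)
s2_gets_vote_from_s1 = at_least_as_up_to_date(s2, s1)
s2_gets_vote_from_s3 = at_least_as_up_to_date(s2, s3)
```

S1 has the longest log but loses the comparison against both S2 and S3, so it can collect only its own vote.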
A corner case
- why “log[N].term==currentTerm” in figure 2’s Rules for Servers?
- why can’t we execute any entry that’s on a majority?
- how could such an entry be discarded?
figure 8 describes an example
S1: 1 2     1 2 4
S2: 1 2     1 2
S3: 1   --> 1 2
S4: 1       1
S5: 1       1 3
- S1 was leader in term 2, sends out two copies of 2
- S5 leader in term 3
- S1 in term 4, sends one more copy of 2 (b/c S3 rejected op 4)
- what if S5 now becomes leader?
- S5 can get a majority (w/o S1)
- S5 will roll back 2 and replace it with 3
- so “present on a majority” != “committed”
- so an entry becomes committed if:
- it reached a majority in the term it was initially sent out, or
- if a subsequent log entry becomes committed.
- entry 2 could have committed if S1 hadn’t lost term=4 leadership
- this is a consequence of:
- the “more up to date” voting rule favoring higher term, and
- the leader imposing its log on followers
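A sketch of how a leader might advance its commit index under the `log[N].term == currentTerm` rule, using the Figure 8 state above. The function name, the 1-based indexing, and the starting commit index of 1 are assumptions for the example.

```python
def advance_commit_index(log_terms: list[int], match_index: list[int],
                         commit_index: int, current_term: int) -> int:
    """Return the new commit index for the leader.

    log_terms[i - 1] is the term of entry i (entries are numbered from 1).
    match_index holds, per server (leader included), the highest entry known to be stored.
    """
    majority = len(match_index) // 2 + 1
    for n in range(len(log_terms), commit_index, -1):
        replicated = sum(1 for m in match_index if m >= n)
        # Only an entry from the leader's own term may be committed by counting replicas.
        if replicated >= majority and log_terms[n - 1] == current_term:
            return n  # commits entry n and, implicitly, everything before it
    return commit_index


s1_log = [1, 2, 4]  # S1 is leader in term 4
# Entry 2 (term 2) is on S1, S2, S3, but the term-4 entry is only on S1.
before = advance_commit_index(s1_log, [3, 2, 2, 1, 1], commit_index=1, current_term=4)
# Once the term-4 entry reaches S2 and S3, it commits, and entry 2 with it.
after = advance_commit_index(s1_log, [3, 3, 3, 1, 1], commit_index=1, current_term=4)
```

In the first call entry 2 sits on a majority but stays uncommitted; in the second the term-4 entry reaches a majority and entry 2 is committed along with it.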
Log compaction and snapshots
- problem:
- log will get to be huge – much larger than state-machine state!
- will use lots of memory
- will take a long time to
- read and re-evaluate after reboot
- send to a newly added replica
- what constrains how a server can discard old parts of log?
- can’t forget un-committed operations
- need to replay if crash and restart
- may be needed to bring other servers up to date
- solution: service periodically creates persistent “snapshot”
- copy of entire state-machine state through a specific log entry
- e.g. k/v table, client duplicate state
- service writes snapshot to persistent storage (disk)
- service tells Raft it is snapshotted through some entry
- Raft discards log before that entry
- a server can create a snapshot and discard log at any time
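The snapshot steps can be sketched with a toy log. `CompactableLog` and its fields are hypothetical names, the state machine is a plain dict standing in for a k/v table, and persistence to disk is left out.

```python
from dataclasses import dataclass, field


@dataclass
class CompactableLog:
    entries: list = field(default_factory=list)   # entries[0] is entry number first_index
    first_index: int = 1
    snapshot: dict = field(default_factory=dict)  # state through entry first_index - 1

    def last_index(self) -> int:
        return self.first_index + len(self.entries) - 1

    def compact(self, through: int, state: dict) -> None:
        """Service reports that `state` reflects all entries up to and including `through`."""
        if through > self.last_index():
            raise ValueError("cannot snapshot past the end of the log")
        if through < self.first_index:
            return  # already covered by an earlier snapshot
        self.entries = self.entries[through - self.first_index + 1:]
        self.first_index = through + 1
        self.snapshot = dict(state)  # a real service would write this to disk first


log = CompactableLog(entries=[("x", 1), ("y", 2), ("x", 3), ("z", 4)])
state = {}
for key, value in log.entries[:3]:  # the service has executed entries 1..3
    state[key] = value
log.compact(through=3, state=state)
```

After compaction only entry 4 remains in the log, and a restart would load the snapshot and replay from entry 4 onward.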
Configuration change
- configuration change (Section 6)
- configuration = set of servers
- every once in a while you might want to
- move to a new set of servers, or
- increase/decrease the number of servers
- human initiates configuration change, Raft manages it
- we’d like Raft to execute correctly across configuration changes
- why doesn’t a straightforward approach work?
- suppose each server has the list of servers in the current config
- change configuration by telling each server the new list
- using some mechanism outside of Raft
- problem: they will learn new configuration at different times
- example: want to replace S3 with S4
S1: 1,2,3  1,2,4
S2: 1,2,3  1,2,3
S3: 1,2,3  1,2,3
S4:        1,2,4
- OOPS! now two leaders could be elected!
- S2 and S3 could elect S2
- S1 and S4 could elect S1
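The two-leader problem is easy to check mechanically. In this sketch `can_elect` is a made-up helper that tests whether a set of voters contains a majority of a given configuration.

```python
def can_elect(voters: set[int], config: set[int]) -> bool:
    """True if `voters` contains a majority of the servers in `config`."""
    return len(voters & config) > len(config) / 2


old_config = {1, 2, 3}
new_config = {1, 2, 4}

# S2 and S3 still believe the old configuration; S1 and S4 already use the new one.
old_side = {2, 3}
new_side = {1, 4}
```

Each side holds a majority of the configuration it believes in, and the two sides share no server, so nothing stops both from electing a leader in the same term.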
- Raft configuration change
- idea: “joint consensus” stage that includes both old and new configuration
- leader of old group logs entry that switches to joint consensus
- Cold,new – contains both configurations
- during joint consensus, leader gets AppendEntries majority in both old and new
- after Cold,new commits, leader sends out Cnew
S1: 1,2,3  1,2,3+1,2,4
S2: 1,2,3
S3: 1,2,3
S4:        1,2,3+1,2,4
- no leader will use Cnew until Cold,new commits in both old and new.
- so there’s no time at which one leader could be using Cold
- and another could be using Cnew
- if crash but new leader didn’t see Cold,new
- then old group will continue, no switch, but that’s OK
- if crash and new leader did see Cold,new,
- it will complete the configuration change
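The joint-consensus requirement, a majority in both old and new, can be sketched as follows. The helper names are assumptions for the example.

```python
def has_majority(acks: set[int], config: set[int]) -> bool:
    """True if `acks` contains a majority of the servers in `config`."""
    return len(acks & config) > len(config) / 2


def joint_quorum(acks: set[int], old: set[int], new: set[int]) -> bool:
    """During Cold,new an entry or election needs a majority of old AND a majority of new."""
    return has_majority(acks, old) and has_majority(acks, new)


old_config = {1, 2, 3}
new_config = {1, 2, 4}
```

Under this rule neither {S2, S3} nor {S1, S4} is enough on its own, which is what rules out the two-leader scenario while the change is in progress.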
An important aside:
- can leader execute read-only operations locally?
- without sending to followers in AppendEntries and waiting for commit?
- very tempting, since r/o ops may dominate, and don’t change state
- why might that be a bad idea?
- how could we make the idea work?