Understanding Raft Consensus: The Algorithm That Keeps Your Database Honest
Key Takeaways
- →Raft tolerates (N-1)/2 failures: 3-node cluster survives 1 failure, 5-node survives 2 — majority voting prevents split brain (two leaders in same term)
- →Up-to-date vote check during elections ensures new leaders inherit all previously committed entries — candidates with stale logs are rejected, preventing log divergence
- →Log matching with term numbers guarantees consistency: followers only append entries if the previous entry matches (same index and term), so mismatches are corrected recursively
- →Persisting state before responding to RPCs is critical — if you respond but crash before persisting, you can vote twice for different candidates, breaking the safety guarantee
The classic etcd-partition-divergence scenario. A production Kubernetes cluster experiences a network partition that isolates two etcd nodes from the other three. The isolated pair correctly downgrades to followers; the three-node majority elects a new leader[Ongaro & Ousterhout, 2014]. All by design.
But what if a misconfigured watch keeps one isolated node from processing the term update promptly? It continues serving reads to a local operator with stale state — secrets that have been rotated are served from a previous version. Recovering from even brief periods of divergent state can take hours of manual reconciliation.
This is why you learn Raft deeply — not to implement it, but to understand every guarantee it makes and every assumption it requires. When something breaks in etcd, CockroachDB, or your own distributed system, knowing which invariant failed tells you exactly where to look.
Raft[Ongaro & Ousterhout, 2014] is a consensus algorithm that tolerates (N-1)/2 node failures in an N-node cluster by electing a leader, replicating a log, and committing entries only when replicated to a majority. Two invariants prevent split brain: only one leader per term, and leaders always have previously committed entries.
- Majority voting prevents multiple leaders in the same term
- Log matching with term checks guarantees consistency across reboots and partitions
- Snapshots and backtracking optimizations make replication fast even after long divergence
Raft in 60 Seconds
Raft abstracts distributed consensus into three roles and one invariant: elect one leader per term, replicate its log, and never lose committed entries.
| Component | Role | Key Rule |
|---|---|---|
| Roles | Every node is Follower, Candidate, or Leader | Only leaders send log entries; candidates become leader with majority votes; followers accept the leader's entries |
| Terms | Logical clock (monotonic integers) | Higher term seen on any RPC → step down to follower; each node votes once per term |
| Log replication | Leader appends entries, followers catch up | Logs match up to the commit index; previous-term entries implicitly committed; only current-term entries advance commit index |
| Election safety | Two majorities must overlap | Any two majorities in an N-node cluster share ≥1 node; committed entry replicated to majority M1, winner elected by majority M2 → M1 ∩ M2 has the entry |
| Quorum size | Majority of nodes alive | N nodes tolerate ⌊(N-1)/2⌋ failures; 3-node cluster tolerates 1, 5-node tolerates 2 |
The hard part isn't roles or quorum — it's persisting state before responding to RPCs (if you crash after responding but before persisting, you break safety). The second-hardest part is the up-to-date vote check during elections: followers refuse to vote for candidates whose logs are stale, which guarantees new leaders inherit all committed entries[Ongaro & Ousterhout, 2014].
Leader Election and Terms
[Ongaro & Ousterhout, 2014]sequenceDiagram
participant F1 as Follower A
participant C as Follower B<br/>(becomes Candidate)
participant F2 as Follower C
Note over C: Heartbeat timeout expires
C->>C: Increment term, vote for self
par RequestVote RPCs
C->>F1: RequestVote(term=2, lastLogIndex, lastLogTerm)
C->>F2: RequestVote(term=2, lastLogIndex, lastLogTerm)
end
F1->>C: VoteGranted=true
F2->>C: VoteGranted=true
Note over C: Majority received → becomes Leader
loop Heartbeat interval
C->>F1: AppendEntries (heartbeat)
C->>F2: AppendEntries (heartbeat)
end
Every Raft node maintains a currentTerm (persisted) and a votedFor state (persisted). When a follower's heartbeat times out, it increments its term, votes for itself, and sends RequestVote to all peers. Peers grant a vote only if:
- The candidate's term is ≥ the peer's current term
- The peer hasn't already voted this term
- The candidate's log is at least as up-to-date as the peer's log
"Up-to-date" means: higher last-term wins; if tied on last-term, longer log wins. This is the safety mechanism — a node with the latest committed entry must win the election (because any two majorities overlap).
func (n *RaftNode) handleRequestVote(args *RequestVoteArgs) RequestVoteReply {
n.mu.Lock()
defer n.mu.Unlock()
// Reject stale terms
if args.Term < n.currentTerm {
return RequestVoteReply{Term: n.currentTerm, VoteGranted: false}
}
if args.Term > n.currentTerm {
n.currentTerm = args.Term
n.state = Follower
n.votedFor = -1
}
// Already voted for someone else this term
if n.votedFor != -1 && n.votedFor != args.CandidateId {
return RequestVoteReply{Term: n.currentTerm, VoteGranted: false}
}
// Critical check: candidate's log must be at least as up-to-date as ours
myLastTerm := 0
myLastIndex := -1
if len(n.log) > 0 {
myLastTerm = n.log[len(n.log)-1].Term
myLastIndex = len(n.log) - 1
}
candidateUpToDate := args.LastLogTerm > myLastTerm ||
(args.LastLogTerm == myLastTerm && args.LastLogIndex >= myLastIndex)
if !candidateUpToDate {
return RequestVoteReply{Term: n.currentTerm, VoteGranted: false}
}
n.votedFor = args.CandidateId
n.persist() // MUST persist before responding
n.resetElectionTimer()
return RequestVoteReply{Term: n.currentTerm, VoteGranted: true}
}Key detail: n.persist() is called before responding. If the node crashes after responding but before persisting, it recovers without knowing it voted, and can vote again for a different candidate. This breaks the guarantee that two candidates can't both receive a majority.
Log Replication and Safety
Once elected, the leader continuously replicates its log via AppendEntries RPCs[Ongaro & Ousterhout, 2014]. A follower appends new entries only if its log contains the previous entry (same index and term). This check, applied recursively, guarantees all logs match on committed entries. The leader retries with lower indices until the consistency check passes; once it does, all subsequent entries are guaranteed to match.
The replication round-trip is the entire commit story in one picture:
sequenceDiagram
participant C as Client
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
C->>L: write(x=5)
L->>L: append to log<br/>(uncommitted)
par Replicate to followers
L->>F1: AppendEntries(prevIdx, prevTerm, [x=5])
F1->>F1: check prev matches?<br/>append x=5
F1-->>L: Success
and
L->>F2: AppendEntries(prevIdx, prevTerm, [x=5])
F2->>F2: check prev matches?<br/>append x=5
F2-->>L: Success
end
Note over L: Quorum reached (2 of 3)<br/>commitIndex advances
L->>L: apply to state machine
L-->>C: Success
Note over L,F2: Next AppendEntries carries new commitIndex<br/>followers apply x=5 to their state machines
The diagram is the entire safety argument: a write commits only when a majority has it durably; followers learn commitIndex from the next AppendEntries. The leader never tells the client "OK" before the quorum is durable.
func (n *RaftNode) handleAppendEntries(args *AppendEntriesArgs) AppendEntriesReply {
n.mu.Lock()
defer n.mu.Unlock()
if args.Term < n.currentTerm {
return AppendEntriesReply{Term: n.currentTerm, Success: false}
}
n.resetElectionTimer() // Valid leader heartbeat
if args.Term > n.currentTerm {
n.currentTerm = args.Term
n.state = Follower
n.votedFor = -1
}
// Log consistency check: must have entry at PrevLogIndex with PrevLogTerm
if args.PrevLogIndex >= 0 {
if args.PrevLogIndex >= len(n.log) {
return AppendEntriesReply{
Term: n.currentTerm,
Success: false,
ConflictIndex: len(n.log),
ConflictTerm: -1,
}
}
if n.log[args.PrevLogIndex].Term != args.PrevLogTerm {
// Mismatch: backtrack to first entry of the conflicting term
conflictTerm := n.log[args.PrevLogIndex].Term
conflictIndex := args.PrevLogIndex
for conflictIndex > 0 && n.log[conflictIndex-1].Term == conflictTerm {
conflictIndex--
}
return AppendEntriesReply{
Term: n.currentTerm,
Success: false,
ConflictTerm: conflictTerm,
ConflictIndex: conflictIndex,
}
}
}
// Append entries, truncating any conflicting suffix
for i, entry := range args.Entries {
idx := args.PrevLogIndex + 1 + i
if idx < len(n.log) && n.log[idx].Term != entry.Term {
n.log = n.log[:idx]
}
if idx >= len(n.log) {
n.log = append(n.log, entry)
}
}
n.persist()
// Update commit index if leader's is higher
if args.LeaderCommit > n.commitIndex {
n.commitIndex = min(args.LeaderCommit, len(n.log)-1)
}
return AppendEntriesReply{Term: n.currentTerm, Success: true}
}ConflictTerm Backtracking Optimization
Naively, when a follower rejects AppendEntries, the leader decrements nextIndex[follower] by 1 and retries. For a follower that's 10,000 entries behind, this requires 10,000 round trips. The ConflictTerm and ConflictIndex fields let the leader skip entire terms at once:
If the follower rejects with ConflictTerm=5 (term 5 contains the conflict), the leader finds the last entry in its own log with term 5 and jumps directly there. A follower 10,000 entries behind now catches up in O(number of diverged terms) round trips instead of O(number of diverged entries).
The Commit Index Rule: Current-Term Constraint
A leader advances its commitIndex only when an entry from the current term is replicated to a majority. This seems overly conservative — why not commit an older entry once it's replicated to a majority?
The answer is Figure 8 of the Raft paper: a previous-term entry can be replicated to a majority and then overwritten by a new leader who doesn't have it. Committing only current-term entries (and relying on Log Matching to implicitly commit all preceding entries) prevents this rare but catastrophic scenario. As a practical benefit, it also simplifies implementation: a leader doesn't need to track which previous-term entries are committed.
func (n *RaftNode) maybeAdvanceCommitIndex() {
// Find the highest index replicated to a majority
for idx := len(n.log) - 1; idx > n.commitIndex; idx-- {
// CRITICAL: Only advance for entries from the current term
if n.log[idx].Term != n.currentTerm {
continue
}
replicationCount := 1 // Leader has it
for _, peer := range n.peers {
if n.matchIndex[peer] >= idx {
replicationCount++
}
}
majority := (len(n.peers)+1)/2 + 1
if replicationCount >= majority {
n.commitIndex = idx
return
}
}
}Snapshots: Each server periodically serializes state, saves the index and term of the last included log entry, and discards earlier entries. When a follower falls far behind, the leader sends the snapshot via InstallSnapshot RPC instead of replaying potentially terabytes of log.
Membership changes: A cluster transitions through an intermediate joint configuration C_old,new that requires majorities from both old and new sets, preventing split brain. Production rule: never change more than one server at a time. etcd enforces this — etcdctl member add adds one member; wait for it to catch up before adding another.
The Heartbeat and Election Timer
The leader maintains authority by sending empty AppendEntries RPCs as heartbeats. Followers reset their election timer on every valid heartbeat[Ongaro & Ousterhout, 2014]:
// Heartbeat sender — only the leader runs this
func (n *RaftNode) heartbeatLoop() {
ticker := time.NewTicker(50 * time.Millisecond)
defer ticker.Stop()
for {
select {
case <-ticker.C:
n.mu.Lock()
if n.state != Leader {
n.mu.Unlock()
return
}
for _, peer := range n.peers {
go n.sendHeartbeat(peer)
}
n.mu.Unlock()
case <-n.shutdown:
return
}
}
}
// Election-timer goroutine on every follower
func (n *RaftNode) electionTicker() {
for {
timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
select {
case <-time.After(timeout):
n.mu.Lock()
if n.state != Leader && time.Since(n.lastHeartbeat) >= timeout {
n.startElection() // becomes Candidate, increments term
}
n.mu.Unlock()
case <-n.shutdown:
return
}
}
}Two production rules visible here: (1) the heartbeat interval (50 ms) must be much smaller than the election timeout (150-300 ms randomised) so a healthy network never triggers a spurious election; (2) the election timeout MUST be randomised across followers, otherwise multiple candidates trigger simultaneously and split the vote.
Picking a Raft Library (Production Checklist)
[Akkoyunlu et al., 1975]- Persistence layer: must write
currentTermandvotedForbefore responding to RPCs, not after. Audit this in any implementation. - Log matching: verify the implementation checks both log term and index, not just length.
- Up-to-date vote check: confirm it compares last-term first, then last-index. Missing this breaks safety.
- ConflictTerm backtracking: recommended for performance (etcd-io/raft includes this; HashiCorp Raft has it as an option).
- Snapshot support: needed for long-running clusters. Check the library handles
InstallSnapshotcorrectly (no regression on log position). - ReadIndex protocol: if you need linearizable reads from followers, confirm the library implements it (HashiCorp Raft has this; etcd-io/raft requires custom work).
- Membership changes: must support joint consensus. Both etcd-io/raft and HashiCorp Raft do; verify your use case doesn't change multiple members at once.
- Testing: look for built-in chaos/failure test modes (etcd-io/raft has a simulator; Dragonboat has comprehensive fuzz testing).
Operating Raft: the runbook code
Picking the library is the easy part. The fragile part is the day a follower's Raft log diverges at 03:00 and the on-call engineer has six minutes of error budget left. Below are the four pieces of plumbing every Raft operator ends up writing — start from these instead of inventing them under pressure.
Cluster status at a glance. The first thing you run on a paging incident is etcdctl endpoint status. The IS LEADER column tells you whether you have quorum, RAFT TERM reveals churn, and RAFT INDEX minus the highest-replicated index across followers is your true replication lag (in entries, not seconds):
etcdctl --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379 \
--cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
endpoint status --write-out=table
# +-----------------+------------------+---------+---------+-----------+------------+-----------+
# | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX|
# +-----------------+------------------+---------+---------+-----------+------------+-----------+
# | 10.0.0.1:2379 | 8e9e05c52164694d | 3.5.12 | 87 MB | true | 42 | 1284901 |
# | 10.0.0.2:2379 | 91bc44ec4d0d6b3a | 3.5.12 | 87 MB | false | 42 | 1284900 |
# | 10.0.0.3:2379 | b3f8c1a7e2d5408f | 3.5.12 | 86 MB | false | 42 | 1284734 |
# +-----------------+------------------+---------+---------+-----------+------------+-----------+Two readings to internalise: matching RAFT TERM across all rows means no recent election storm, and a follower whose RAFT INDEX trails the leader by more than ~10k entries is on the path to needing a snapshot install rather than a tail-replication catch-up.
Alert on leader churn, not just leadership loss. A cluster that re-elects three times in five minutes is unhealthy even if up{job="etcd"} reports 1 for every node. Wire this Prometheus rule into your etcd-rules.yaml group:
groups:
- name: etcd-raft
interval: 30s
rules:
- alert: EtcdHighLeaderChangeRate
expr: |
increase(etcd_server_leader_changes_seen_total[10m]) > 3
for: 2m
labels:
severity: page
team: platform
annotations:
summary: "etcd cluster has elected a new leader more than 3 times in 10m"
description: |
Cluster {{ $labels.cluster }} on {{ $labels.instance }} saw
{{ $value }} leader changes in the last 10 minutes. Investigate
disk fsync latency (etcd_disk_wal_fsync_duration_seconds) and
peer round-trip (etcd_network_peer_round_trip_time_seconds)
before any further writes are issued.
runbook_url: https://runbooks.example.com/etcd-leader-churnThe runbook URL matters — the operator who gets paged at 03:00 is not the operator who tuned this alert, and "investigate disk fsync first" is the difference between a five-minute fix and a five-hour outage.
Apply with the right timeout chain in hashicorp/raft. The most common bug in production Raft applications is calling raft.Apply() with the request's HTTP context as the timeout. That is wrong: Apply returns once the entry is committed and the FSM has consumed it, which can be slow under follower lag. Use a separate, generous Raft timeout and propagate the request context only for the cancellation signal:
package store
import (
"context"
"encoding/json"
"errors"
"fmt"
"time"
"github.com/hashicorp/raft"
)
// FSM is registered with the Raft node and consumes committed log entries.
type FSM struct {
state map[string]string
}
func (f *FSM) Apply(log *raft.Log) interface{} {
var cmd Command
if err := json.Unmarshal(log.Data, &cmd); err != nil {
return fmt.Errorf("decode command: %w", err)
}
switch cmd.Op {
case "set":
f.state[cmd.Key] = cmd.Value
return nil
case "delete":
delete(f.state, cmd.Key)
return nil
default:
return fmt.Errorf("unknown op %q", cmd.Op)
}
}
// Command is the wire format every replicated entry encodes to.
type Command struct {
Op string `json:"op"`
Key string `json:"key"`
Value string `json:"value,omitempty"`
}
// Set replicates a write through the leader. The request context controls
// caller-side cancellation; the Raft timeout (raftTimeout) controls how long
// we wait for quorum + FSM apply before returning ErrLeadershipLost.
func (s *Store) Set(ctx context.Context, key, value string) error {
if s.raft.State() != raft.Leader {
return raft.ErrNotLeader
}
payload, err := json.Marshal(Command{Op: "set", Key: key, Value: value})
if err != nil {
return fmt.Errorf("encode command: %w", err)
}
const raftTimeout = 5 * time.Second
future := s.raft.Apply(payload, raftTimeout)
// Honour caller cancellation without orphaning the in-flight Apply.
done := make(chan error, 1)
go func() { done <- future.Error() }()
select {
case err := <-done:
if errors.Is(err, raft.ErrLeadershipLost) {
return fmt.Errorf("apply: leader stepped down mid-flight: %w", err)
}
return err
case <-ctx.Done():
return ctx.Err()
}
}The two non-obvious lines are raftTimeout = 5 * time.Second (longer than your p99 commit latency, shorter than client patience) and the select on ctx.Done() — without it, an HTTP cancellation leaks a goroutine that still owns a Raft future.
Snapshot, then restore — practise it before you need it. Every Raft operator gets the recovery flow wrong on first try. The order matters: snapshot from a healthy member, scp to the new node, stop the broken etcd, restore into a fresh data dir, then start with --initial-cluster-state=existing. Skipping any step corrupts the WAL:
# 1. Take a consistent snapshot from any current member.
ETCDCTL_API=3 etcdctl --endpoints=https://10.0.0.1:2379 \
--cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
snapshot save /var/backups/etcd-$(date +%Y%m%d-%H%M%S).db
# 2. Verify the snapshot is intact (hash + revision sanity check).
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-20260501-0312.db --write-out=table
# 3. On the recovering node: stop etcd and clear the corrupt data dir.
systemctl stop etcd
mv /var/lib/etcd /var/lib/etcd.broken-$(date +%s)
# 4. Restore — the --name and --initial-cluster MUST match this node's identity.
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-20260501-0312.db \
--name node-3 \
--initial-cluster node-1=https://10.0.0.1:2380,node-2=https://10.0.0.2:2380,node-3=https://10.0.0.3:2380 \
--initial-cluster-token etcd-prod-cluster-1 \
--initial-advertise-peer-urls https://10.0.0.3:2380 \
--data-dir /var/lib/etcd
# 5. Start etcd; it rejoins as a follower and catches up via InstallSnapshot.
systemctl start etcd
journalctl -u etcd -f | grep -E 'raft|snapshot|catch up'The trap on step 4 is --initial-cluster-token: it must match the original cluster token, otherwise the restored node believes it is a brand-new cluster and silently splits brain on rejoin. Run this drill quarterly against a staging cluster — the first time should never be in production.
Driving the same flow inside Kubernetes. Most production etcd lives inside kube-system as static pods. The same status and consistency checks are still one command away — and a post-restore consistency probe is what tells you the cluster is actually healthy, not just running:
# Inspect Raft state from inside a control-plane node, no external endpoint needed.
kubectl -n kube-system exec etcd-cp-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status --cluster --write-out=table
# After any restore, run a linearizable read to confirm the cluster serves
# strongly-consistent traffic and a member-list to confirm membership shape.
kubectl -n kube-system exec etcd-cp-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
get / --prefix --keys-only --consistency=l --limit=1 \
&& etcdctl member list --write-out=tableIf --consistency=l (linearizable) fails but --consistency=s (serializable) succeeds, the local node is up but isolated from quorum — a much more dangerous state than total failure, because clients reading from that node see stale data while believing they have authoritative answers.
Frequently Asked Questions
How does Raft prevent split brain in a distributed system?
Raft prevents split brain with two invariants: only one leader per term (requiring majority vote), and a leader always has all previously committed entries (enforced by the up-to-date vote check during elections).
What is the difference between Raft and Paxos?
Raft was designed for understandability while providing the same safety guarantees as Paxos. Raft uses a strong leader model with sequential log entries, making it easier to reason about, while Paxos allows flexible ordering but is notoriously difficult to implement correctly.
How many nodes can fail in a Raft cluster?
A Raft cluster of N nodes tolerates (N-1)/2 failures. A 3-node cluster tolerates 1 failure, a 5-node cluster tolerates 2. The cluster needs a majority of nodes alive to elect a leader and commit entries.
What systems use Raft consensus in production?
Raft is used in etcd (Kubernetes' backing store), CockroachDB, TiKV, Consul, and HashiCorp Vault. These systems rely on Raft to replicate state consistently across nodes and survive failures without data loss.
Keep Reading
- Building Resilient Distributed Systems with Go — patterns and pitfalls for production consensus systems
- Postgres Query Planner Internals — storage and planning decisions that interact with replicated state machines
- Event-Driven Microservices with Go and Kafka — replicated logs as the simpler alternative to full Raft when ordering is the only requirement
- Consistent Hashing Guide — the partitioning side of the same problem: Raft replicates one shard, consistent hashing decides which shard owns which key
- Idempotency Patterns in Distributed Systems — Raft commits a state-machine-update once per quorum; idempotent clients make retries safe even when commit-acks are lost
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.
Read Next
Consistent Hashing: The Algorithm Behind Every Scalable Distributed System
Adding one cache server shouldn't invalidate every key. Consistent hashing with virtual nodes and bounded loads — full Go and Java implementations.
Distributed Rate Limiting at Scale: The Probabilistic Drop Architecture
Probabilistic drop rate limiting: uncoordinated enforcement bypassing Redis for 1M+ RPS with zero coordination overhead.
Kafka vs RabbitMQ vs NATS vs SQS: Choosing the Right Message Broker
Kafka vs RabbitMQ vs NATS vs SQS: delivery semantics, ordering, throughput, ops complexity, and a decision framework with Go code.