Consensus algorithms are the foundation of every replicated database, distributed lock service, and configuration store. Raft was designed to be understandable -- and that makes it an excellent starting point for learning about distributed consensus.
In a distributed system, you want multiple servers to agree on a sequence of values (the replicated log). Even if some servers crash or network partitions occur, the remaining servers should continue to operate correctly and consistently.
Raft achieves this by electing a leader that coordinates all changes. As long as a majority of servers are available, the system continues to make progress.
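To make the majority requirement concrete, here is a tiny illustrative helper (not from the original text): a cluster of n servers makes progress as long as a strict majority is reachable, so a 3-node cluster tolerates 1 failure and a 5-node cluster tolerates 2.

// quorum returns the number of servers that must agree: a strict majority.
func quorum(clusterSize int) int {
    return clusterSize/2 + 1
}

// faultsTolerated returns how many servers can fail while the cluster
// still has a quorum available.
func faultsTolerated(clusterSize int) int {
    return (clusterSize - 1) / 2
}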
graph TD
    subgraph "Raft State Machine"
        F[Follower]
        C[Candidate]
        L[Leader]
        F -->|Timeout| C
        C -->|Majority Votes| L
        L -->|Discover Higher Term| F
        C -->|Discover Higher Term| F
        C -->|Timeout| C
    end
    style L fill:#fff9c4,stroke:#fbc02d
    style C fill:#ffe0b2,stroke:#f57c00
    style F fill:#e1f5fe,stroke:#0288d1
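The listings below reference a NodeState and a LogEntry type without defining them. A minimal sketch of what they could look like follows; the three states map directly onto the diagram above, and the Command field is an assumption (the original does not show the entry payload).

// NodeState mirrors the three roles in the state machine above.
type NodeState int

const (
    Follower NodeState = iota
    Candidate
    Leader
)

// LogEntry is one slot in the replicated log: the term in which the entry
// was created, plus an opaque command for the state machine.
type LogEntry struct {
    Term    int
    Command interface{}
}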
Raft divides time into terms. Each term begins with an election. A node starts an election by incrementing its term, voting for itself, and requesting votes from peers.
type RaftNode struct {
    mu          sync.Mutex
    id          int
    currentTerm int       // Latest term this node has seen
    votedFor    int       // Candidate voted for in currentTerm (-1 if none)
    state       NodeState // Follower, Candidate, Leader
    log         []LogEntry
    commitIndex int       // Highest log index known to be committed
    peers       []int     // IDs of the other servers in the cluster

    // Leader-only state, used by the replication code further below.
    nextIndex  map[int]int // Next log index to send to each peer
    matchIndex map[int]int // Highest index known to be replicated on each peer
}
func (n *RaftNode) startElection() {
    n.mu.Lock()
    n.currentTerm++
    n.votedFor = n.id
    n.state = Candidate
    term := n.currentTerm
    lastLogIndex := len(n.log) - 1
    lastLogTerm := 0
    if lastLogIndex >= 0 {
        lastLogTerm = n.log[lastLogIndex].Term
    }
    n.mu.Unlock()

    votes := 1 // Self-vote
    var voteMu sync.Mutex
    for _, peer := range n.peers {
        go func(p int) {
            reply := n.sendRequestVote(p, &RequestVoteArgs{
                Term:         term,
                CandidateId:  n.id,
                LastLogIndex: lastLogIndex,
                LastLogTerm:  lastLogTerm,
            })
            // If a peer reports a newer term, this candidacy is stale:
            // adopt the term and revert to follower.
            if reply.Term > term {
                n.mu.Lock()
                if reply.Term > n.currentTerm {
                    n.currentTerm = reply.Term
                    n.state = Follower
                    n.votedFor = -1
                }
                n.mu.Unlock()
                return
            }
            if reply.VoteGranted {
                voteMu.Lock()
                votes++
                // Trigger the transition exactly once, when the majority
                // threshold is crossed. (A full implementation would also
                // re-check that the node is still a candidate in this term.)
                wonElection := votes == (len(n.peers)+1)/2+1
                voteMu.Unlock()
                if wonElection {
                    n.becomeLeader()
                }
            }
        }(peer)
    }
}

A candidate wins the election if it receives votes from a majority of servers. If no candidate wins (a split vote), each candidate waits out a randomized timeout and starts a new election; the randomization makes repeated ties unlikely.
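The randomized timeout can be as simple as the sketch below. The 150-300 ms interval is an assumption for illustration; in practice it just needs to be comfortably larger than a typical round-trip time.

import (
    "math/rand"
    "time"
)

// electionTimeout returns a randomized duration (here 150-300 ms).
// Randomization means two candidates rarely keep timing out in lockstep:
// whoever fires first usually wins the next election.
func electionTimeout() time.Duration {
    return time.Duration(150+rand.Intn(150)) * time.Millisecond
}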
sequenceDiagram
    participant C as Candidate
    participant F as Follower
    Note over C: Term++ (Vote Self)
    C->>F: RequestVote(Term, lastLogIndex)
    alt Term < currentTerm
        F-->>C: VoteDenied (Term outdated)
    else Log ok & Not Voted
        Note over F: VotedFor = C
        F-->>C: VoteGranted
    end
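The follower side of this exchange is not listed above; one way to write it against the RaftNode type is sketched below. The RequestVoteReply fields (Term, VoteGranted) are assumed to mirror the args.

// HandleRequestVote implements the follower side of the diagram: reject stale
// terms, step down on newer terms, and grant at most one vote per term to a
// candidate whose log is at least as up to date as ours.
func (n *RaftNode) HandleRequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
    n.mu.Lock()
    defer n.mu.Unlock()

    if args.Term < n.currentTerm {
        reply.Term = n.currentTerm
        reply.VoteGranted = false // Candidate's term is outdated
        return
    }
    if args.Term > n.currentTerm {
        // Newer term: adopt it and revert to follower before deciding the vote.
        n.currentTerm = args.Term
        n.state = Follower
        n.votedFor = -1
    }

    // Election restriction: the candidate's log must be at least as up to date.
    lastIndex := len(n.log) - 1
    lastTerm := 0
    if lastIndex >= 0 {
        lastTerm = n.log[lastIndex].Term
    }
    logOK := args.LastLogTerm > lastTerm ||
        (args.LastLogTerm == lastTerm && args.LastLogIndex >= lastIndex)

    if logOK && (n.votedFor == -1 || n.votedFor == args.CandidateId) {
        n.votedFor = args.CandidateId
        reply.VoteGranted = true
    }
    reply.Term = n.currentTerm
}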
Once a leader is elected, it accepts client requests, appends entries to its log, and replicates them to followers using AppendEntries RPCs.
func (n *RaftNode) appendEntries(peer int, args *AppendEntriesArgs) {
    reply := n.sendAppendEntries(peer, args)
    // Note: in a complete implementation, nextIndex, matchIndex and
    // commitIndex updates would be guarded by n.mu.
    if reply.Success {
        // The follower accepted the entries: advance our bookkeeping for it.
        n.nextIndex[peer] = args.PrevLogIndex + len(args.Entries) + 1
        n.matchIndex[peer] = n.nextIndex[peer] - 1
        n.advanceCommitIndex()
    } else if n.nextIndex[peer] > 0 {
        // The follower's log diverges: back up one entry and retry next round.
        n.nextIndex[peer]--
    }
}

An entry is committed once a majority of servers have replicated it. Committed entries are safe -- they will never be lost even if servers crash, because any future leader must hold them in its log.
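The advanceCommitIndex helper called above is not shown in the listing. One possible implementation, consistent with the fields defined earlier, scans backwards for the highest index replicated on a majority; only entries from the leader's current term are committed by counting, as the Raft paper requires.

// advanceCommitIndex moves commitIndex forward to the highest log index that
// a majority of the cluster has replicated, restricted to entries from the
// leader's current term.
func (n *RaftNode) advanceCommitIndex() {
    for idx := len(n.log) - 1; idx > n.commitIndex; idx-- {
        if n.log[idx].Term != n.currentTerm {
            break // Entries from earlier terms are never committed by counting
        }
        count := 1 // The leader itself holds the entry
        for _, peer := range n.peers {
            if n.matchIndex[peer] >= idx {
                count++
            }
        }
        if count > (len(n.peers)+1)/2 {
            n.commitIndex = idx
            return
        }
    }
}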
sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F as Follower
    C->>L: Command(X)
    L->>L: Log Append(X)
    par Replicate
        L->>F: AppendEntries(Entries=[X])
        F->>F: Log Append(X)
        F-->>L: Success
    end
    L->>L: Commit(X)
    L-->>C: Response
Raft provides several important safety guarantees: Election Safety ensures at most one leader per term. Leader Append-Only means a leader never overwrites or deletes entries in its log. Log Matching guarantees that if two logs contain an entry with the same index and term, all preceding entries are identical.
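Log Matching is enforced by the consistency check a follower runs on every AppendEntries call; a sketch of that check follows. The Term, PrevLogTerm and LeaderCommit fields on the args, and the AppendEntriesReply type, are assumptions beyond what the earlier listings show.

// HandleAppendEntries sketches the follower-side consistency check behind the
// Log Matching property: entries are appended only if the follower's log
// agrees with the leader's at PrevLogIndex/PrevLogTerm.
func (n *RaftNode) HandleAppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
    n.mu.Lock()
    defer n.mu.Unlock()

    reply.Term = n.currentTerm
    if args.Term < n.currentTerm {
        reply.Success = false // Stale leader
        return
    }
    // A valid AppendEntries from the current leader resets us to follower.
    n.currentTerm = args.Term
    n.state = Follower

    // Consistency check: we must hold an entry at PrevLogIndex whose term
    // matches PrevLogTerm.
    if args.PrevLogIndex >= 0 &&
        (args.PrevLogIndex >= len(n.log) || n.log[args.PrevLogIndex].Term != args.PrevLogTerm) {
        reply.Success = false
        return
    }

    // Replace any conflicting suffix with the leader's entries. (A full
    // implementation truncates only from the first conflicting entry.)
    n.log = append(n.log[:args.PrevLogIndex+1], args.Entries...)
    if args.LeaderCommit > n.commitIndex {
        n.commitIndex = args.LeaderCommit
        if last := len(n.log) - 1; last < n.commitIndex {
            n.commitIndex = last
        }
    }
    reply.Success = true
}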
Raft powers many production systems. etcd, the datastore behind the Kubernetes control plane, replicates its data with Raft. CockroachDB uses multi-Raft, running thousands of independent Raft groups for fine-grained replication. TiKV uses Raft in its distributed key-value storage layer.
The most common mistakes when implementing Raft are: not persisting currentTerm, votedFor, and the log before responding to RPCs; not handling split votes correctly; and not implementing log compaction (snapshotting). Without compaction the log grows without bound and new servers take a very long time to catch up.
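As a sketch of the first pitfall: the durable state must reach stable storage before a node answers a RequestVote or AppendEntries, otherwise a crash can make it vote twice in the same term or drop acknowledged entries. The file format and path below are arbitrary assumptions for illustration.

import (
    "encoding/json"
    "os"
)

// persistentState is the subset of RaftNode that must survive a crash.
type persistentState struct {
    CurrentTerm int
    VotedFor    int
    Log         []LogEntry
}

// persist writes the node's durable state to disk. It must complete before
// the node replies to an RPC that changed any of these fields.
func (n *RaftNode) persist(path string) error {
    data, err := json.Marshal(persistentState{
        CurrentTerm: n.currentTerm,
        VotedFor:    n.votedFor,
        Log:         n.log,
    })
    if err != nil {
        return err
    }
    // os.WriteFile alone is a simplification; a real implementation would
    // fsync and write atomically (temp file, then rename).
    return os.WriteFile(path, data, 0o644)
}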
Raft makes distributed consensus approachable. Understanding its mechanics helps you reason about the behaviour of systems like etcd, CockroachDB, and TiKV. While you may never need to implement Raft yourself, knowing how it works makes you a better distributed systems engineer.