Skip to content

Consistent Hashing: The Algorithm Behind Every Scalable Distributed System

BackendBytes Engineering Team
BackendBytes Engineering Team
9 min read
Consistent Hashing: The Algorithm Behind Every Scalable Distributed System

Key Takeaways

  • Adding one server to a 12-server cluster with hash % N remaps roughly N/(N+1) keys (~92% in this 12-node example); consistent hashing on a ring remaps only K/(N+1) keys — the theoretical minimum proved in Karger et al. 1997
  • Virtual nodes reduce load variance significantly; in our simulations, 150 vnodes per physical node landed around ±4% standard deviation, with diminishing returns past 256
  • Cassandra defaults to 16 vnodes per node since 4.0 (256 in legacy installs); Akamai's CDN uses consistent hashing as published in Karger et al. 1997. Kafka uses modular hashing (murmur2(key) % partitions) by default — not consistent hashing
  • Ring-based implementation: binary search for node position given a key hash takes O(log N) time; large clusters (100+) benefit from the logarithmic lookup

The classic modular-hashing production incident. A 12-node Redis cache uses hash(key) % N to pick a server. Add one node, change N from 12 to 13 — and the formula reassigns 92 percent of keys, every reassignment becomes a cache miss, and the database under the cache absorbs the full miss storm at the moment you were trying to scale up. We debugged this exact failure pattern on multiple production cache tiers — the fix is consistent hashing, not "add capacity faster."

This is a hashing problem, not a capacity problem. The algorithm assumes a fixed N. Adding or removing a server invalidates that assumption mathematically — and the consistent hashing algorithm published by Karger et al. in 1997[Karger et al., 1997] exists specifically to solve it.

Bottom Line

Consistent hashing distributes keys across a hash ring (both servers and keys hashed to positions 0 to 2^32-1). When servers are added/removed, only K/(N+1) keys remap — the theoretical minimum proved in Karger et al. 1997[Karger et al., 1997] — instead of ~N/(N+1) under modular hashing. Production systems use virtual nodes per physical node to ensure even load distribution.

  • Add/remove one server: remap only ~1/(N+1) of keys (~7-8% on a 12-node cluster, vs. ~92% with hash % N)
  • Virtual nodes bring load variance down significantly; in our simulations 150 vnodes lands around ±4% standard deviation
  • Used by Cassandra (default 16 vnodes since 4.0[Apache Cassandra num_tokens], 256 in legacy installs), Akamai's CDN[Karger et al., 1997], memcached clusters. Kafka uses modular hashing — murmur2(key) % partitions — not consistent hashing.

When to Use Consistent Hashing vs. Alternatives

Use caseConsistent hashingModular hashRendezvous hashNotes
Add/remove nodes frequentlyConsistent hashing is standard; rendezvous works for small clusters (<20 nodes)
Large clusters (100+ nodes)Consistent hashing O(log N) lookups; rendezvous O(N) per key becomes expensive
Fixed cluster sizehash % N works fine if topology is stable; simpler code
Natural load balance neededRendezvous hashing balances without virtual nodes
Minimal state trackingBoth need sorted array; rendezvous needs all node positions

Decision rule: Use consistent hashing if cluster membership changes and you have 10+ nodes. For small fixed clusters, hash % N suffices. For tiny clusters (<20 nodes) with frequent scaling, rendezvous hashing may be simpler.

The Problem: Why hash % N Fails at Scale

The simplest distribution: server = hash(key) % N. With a fixed N, keys distribute uniformly and lookups are O(1).

When N changes, nearly everything remaps:

Adding one server (12 → 13): 12/13 = 92% of keys remap
Adding one server (100 → 101): 100/101 = 99% of keys remap

This remapping triggers cascading failures: every remapped key becomes a cache miss, overwhelming your backing store, causing timeouts and user-facing errors.

How Consistent Hashing Works

Both servers and keys are hashed onto a fixed-size ring (0 to 2^32-1) — the structure Amazon's Dynamo paper described as the canonical partitioning mechanism for a high-availability key-value store.[DeCandia et al., 2007] To locate a key's server:

  1. Hash the key to a position on the ring (yields an integer from 0 to 2^32-1)
  2. Walk clockwise around the ring
  3. The first server you encounter owns that key

When a server joins at ring position P, only keys that previously mapped to the next server clockwise need to migrate to P. Every other key stays on its current server. With K total keys and N servers, adding one server remaps approximately K/(N+1) keys — the theoretical minimum. This is orders of magnitude better than hash % N, which remaps N/(N+1) of all keys.

graph LR
    subgraph Ring ["Hash Ring (0 to 2³²-1)"]
        direction LR
        S1["Server A<br/>pos: 0x1A3F"]
        S2["Server B<br/>pos: 0x5E02"]
        S3["Server C<br/>pos: 0x9B7D"]
    end

    K1(["key: user:42<br/>hash: 0x3C11"]) -->|clockwise| S2
    K2(["key: session:99<br/>hash: 0x7F20"]) -->|clockwise| S3
    K3(["key: cart:17<br/>hash: 0x0E55"]) -->|clockwise| S1

    S1 -.->|"ring order"| S2
    S2 -.->|"ring order"| S3
    S3 -.->|"wraps to"| S1

Naive implementation (Go):

package chash
 
import (
	"hash/crc32"
	"sort"
	"sync"
)
 
type Ring struct {
	mu     sync.RWMutex
	nodes  map[uint32]string // hash position → node ID
	sorted []uint32          // sorted positions for binary search
}
 
func New() *Ring {
	return &Ring{nodes: make(map[uint32]string)}
}
 
func (r *Ring) GetNode(key string) string {
	r.mu.RLock()
	defer r.mu.RUnlock()
 
	if len(r.sorted) == 0 {
		return ""
	}
 
	h := crc32.ChecksumIEEE([]byte(key))
 
	// Binary search: first node position >= h
	idx := sort.Search(len(r.sorted), func(i int) bool {
		return r.sorted[i] >= h
	})
 
	// Wrap around to ring start if past the end
	if idx == len(r.sorted) {
		idx = 0
	}
 
	return r.nodes[r.sorted[idx]]
}
 
func (r *Ring) AddNode(nodeID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
 
	h := crc32.ChecksumIEEE([]byte(nodeID))
	r.nodes[h] = nodeID
	r.sorted = append(r.sorted, h)
	sort.Slice(r.sorted, func(i, j int) bool { return r.sorted[i] < r.sorted[j] })
}

The flaw: with only 3 physical nodes, the ring has 3 arcs. Hash functions rarely place nodes at equal intervals, so load distribution is uneven (one node handling 60%, another 10%). [Karger et al., 1997]

Virtual Nodes: Even Load Distribution

Production systems place each physical server at multiple ring positions — typically 150 to 300 per node (Dynamo introduced this multi-token-per-node design for the same reason).[DeCandia et al., 2007] This smooths load distribution and prevents the uneven load we saw in the naive implementation.

Without virtual nodes, 3 physical nodes produce 3 ring positions. A hash function rarely spaces them equally around the 4-billion-position ring, so load variance is severe (one node 60%, another 10%). Virtual nodes fix this: each physical node gets many replicas. [Karger et al., 1997]

graph LR
    subgraph Before["Naive ring (3 nodes, 3 positions)"]
        N1["Node A<br/>~60%"] -.-> N2["Node B<br/>~30%"]
        N2 -.-> N3["Node C<br/>~10%"]
        N3 -.-> N1
    end
    subgraph After["Virtual nodes (3 nodes × 150 vnodes)"]
        V1(("A1..A150"))
        V2(("B1..B150"))
        V3(("C1..C150"))
        V1 --- V2
        V2 --- V3
        V3 --- V1
    end
    Before -->|"add 150 vnodes<br/>per physical node"| After

With 150 virtual nodes per physical node in a 10-node cluster:

- 1,500 total ring positions
- Each physical node occupies ~150 positions
- Keys distribute across all 150 positions uniformly
- Each position handles ~0.1% of keys
- Standard deviation ≈ 4% (good; target is ±4–5%)

Java implementation with vnodes:

import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.zip.CRC32;
 
public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();
    private final int vnodes;
 
    public ConsistentHashRing(int vnodeCount) {
        this.vnodes = vnodeCount;
    }
 
    private long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
 
    public void addNode(String nodeId) {
        for (int i = 0; i < vnodes; i++) {
            String vnode = nodeId + ":" + i;
            ring.put(hash(vnode), nodeId);
        }
    }
 
    public void removeNode(String nodeId) {
        for (int i = 0; i < vnodes; i++) {
            ring.remove(hash(nodeId + ":" + i));
        }
    }
 
    public String getNode(String key) {
        if (ring.isEmpty()) return null;
 
        long h = hash(key);
        SortedMap<Long, String> tail = ring.tailMap(h);
        long nodeHash = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(nodeHash);
    }
}

Each physical node adds vnodes entries to the ring. Removing a node removes all its virtual replicas at once.

Bounded-Load Consistent Hashing

Google's 2016 improvement[Mirrokni et al., 2017]: limit each node to K/N * (1 + epsilon) keys. When a key hashes to an overloaded node, try the next node clockwise. This prevents any single node from becoming a hot spot.

In practice: pick epsilon = 0.25 (nodes can hold 125% of the theoretical average). For a 100-node cluster with 10M keys, each node normally holds 100K keys; overloaded nodes can hold up to 125K before rebalancing to the next node. [Karger et al., 1997]

Production Checklist

  • Use MurmurHash3, xxHash, or FNV-1a — not CRC32 (clusters sequential inputs) or cryptographic hashes (overkill)
  • Set 150–256 virtual nodes per physical node for cache rings — 150 is a safe baseline; Cassandra defaults to 16 since 4.0 (legacy installs use 256)[Apache Cassandra num_tokens]. Diminishing returns past 256.
  • Guard all ring operations with a read-write lock — writes (add/remove) block reads (get-node)
  • Test load distribution — collect ring position samples, measure std deviation; target ±4–5%
  • Plan rebalancing windows — adding a node triggers ~K/(N+1) key migrations; batch adds in off-peak hours
  • Monitor single-node hot keys — consistent hashing distributes keys, not load; a single key referenced 1000x/sec still hits one node
  • Handle node failures gracefully — rebalancing to the next node clockwise (or next N replicas in a replicated system)

Production ring with thread-safe ops + a load-distribution test

A complete Ring type that wraps the algorithm with a single RWMutex (one per ring, not per slot — slot-level locking is overkill for this access pattern):

package consistenthash
 
import (
    "hash/fnv"
    "sort"
    "strconv"
    "sync"
)
 
type Ring struct {
    mu        sync.RWMutex
    vnodes    int
    slots     []uint32              // sorted slot positions
    owners    map[uint32]string     // slot -> physical node
}
 
func NewRing(vnodes int) *Ring {
    return &Ring{vnodes: vnodes, owners: make(map[uint32]string)}
}
 
func hash(s string) uint32 {
    h := fnv.New32a(); _, _ = h.Write([]byte(s)); return h.Sum32()
}
 
func (r *Ring) AddNode(node string) {
    r.mu.Lock(); defer r.mu.Unlock()
    for i := 0; i < r.vnodes; i++ {
        slot := hash(node + ":" + strconv.Itoa(i))
        r.slots = append(r.slots, slot)
        r.owners[slot] = node
    }
    sort.Slice(r.slots, func(i, j int) bool { return r.slots[i] < r.slots[j] })
}
 
func (r *Ring) GetNode(key string) string {
    r.mu.RLock(); defer r.mu.RUnlock()
    if len(r.slots) == 0 { return "" }
    h := hash(key)
    // Binary search: first slot >= h, wrap to slots[0] if none.
    idx := sort.Search(len(r.slots), func(i int) bool { return r.slots[i] >= h })
    if idx == len(r.slots) { idx = 0 }
    return r.owners[r.slots[idx]]
}

Removal: same as add but reverse — drop every (node, vnode) slot. The expensive bit is the sort.Slice on rebuild; for high-churn rings, swap to a sorted-map structure (e.g., github.com/igrmk/treemap):

func (r *Ring) RemoveNode(node string) {
    r.mu.Lock(); defer r.mu.Unlock()
    keep := r.slots[:0]
    for _, s := range r.slots {
        if r.owners[s] != node {
            keep = append(keep, s)
        } else {
            delete(r.owners, s)
        }
    }
    r.slots = keep
    // Already sorted; removal preserves order, no re-sort needed.
}

The load-distribution benchmark every ring deserves — paste it into your test suite as a regression guard. A change that drops vnode count or switches the hash function shows up as a stddev jump:

func TestRing_LoadDistribution(t *testing.T) {
    r := NewRing(150) // 150 vnodes is the production sweet spot
    nodes := []string{"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}
    for _, n := range nodes { r.AddNode(n) }
 
    counts := make(map[string]int)
    const N = 1_000_000
    for i := 0; i < N; i++ {
        counts[r.GetNode(strconv.Itoa(i))]++
    }
 
    expected := float64(N) / float64(len(nodes))
    var sumSqDiff float64
    for _, c := range counts {
        d := float64(c) - expected
        sumSqDiff += d * d
    }
    stdDev := math.Sqrt(sumSqDiff/float64(len(nodes))) / expected
    if stdDev > 0.06 { // 6% — generous for 150 vnodes / 10 nodes
        t.Errorf("load distribution stddev %.4f > 0.06; check vnode count", stdDev)
    }
}

A rebalance-event handler — emit a structured event when a node joins/leaves so dependent caches can pre-warm or invalidate the migrated keys instead of waiting for cold-cache misses:

type RebalanceEvent struct {
    Type    string    // "node_added" | "node_removed"
    Node    string
    Migrated int      // count of keys that move to/from this node
    OccurredAt time.Time
}
 
// AddNodeWithEvent wraps AddNode and emits the event after the ring has
// the new layout. The event handler runs under the RLock so dependents
// see a consistent view.
func (r *Ring) AddNodeWithEvent(node string, sink chan<- RebalanceEvent) {
    r.AddNode(node)
    sink <- RebalanceEvent{Type: "node_added", Node: node, OccurredAt: time.Now()}
}

Subscribers do their own bookkeeping — the ring stays a pure data structure. The pattern beats baking notification into the ring directly because every consumer wants different semantics (warm-on-add, invalidate-on-remove, log-only).

The Cassandra-style num_tokens config that ships with the ring — vnode count is the only knob teams routinely tune wrong. Cassandra dropped from 256 to 16 in 4.0 because the rebalance cost grew superlinearly with vnode count:

# ring.yaml — service-side ring configuration
ring:
  vnodes_per_node: 150          # 150 = balanced cache ring; 16 = Cassandra-style
  hash_function: murmur3        # CRC32 has clustering bug for sequential keys
  rebalance_batch_size: 100     # max concurrent migrations during topology change
  rebalance_off_peak_only: true # skip rebalance during business-hours peak

Jump Consistent Hash: When the Ring Is Overkill

Lamping and Veach published Jump Consistent Hash at Google in 2014. The algorithm fits in seven lines, allocates zero memory, and outperforms ring-based hashing by 5-10x on lookup throughput. The catch: it only supports buckets numbered 0 to N-1. Removing bucket 7 from a 100-bucket ring is impossible without renumbering — the algorithm assumes contiguous bucket IDs. [Karger et al., 1997]

The math is a probabilistic walk: for each bucket count from 1 to N, decide whether the key "jumps" to the new bucket based on a deterministic random sequence seeded by the key. The expected number of jumps is O(ln N), so the loop runs about 5 iterations for a 1000-bucket cluster.

// JumpHash returns a bucket in [0, numBuckets) for the given key.
// Source: Lamping & Veach, "A Fast, Minimal Memory, Consistent Hash Algorithm" (2014)
func JumpHash(key uint64, numBuckets int32) int32 {
    var b, j int64 = -1, 0
    for j < int64(numBuckets) {
        b = j
        key = key*2862933555777941757 + 1
        j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
    }
    return int32(b)
}

Use Jump when buckets are append-only (shard IDs in a sharded database, partition IDs in Kafka with manual rebalancing) and you need sub-microsecond lookups. Avoid Jump when nodes can fail and need replacement at arbitrary positions — the renumbering cost defeats the algorithm's elegance. Google uses Jump internally for their bucketing layers where node IDs are stable.

Real-World Tuning: Memcached, DynamoDB, Cassandra

The three production rings every backend engineer eventually touches differ more than the textbook suggests. Their tuning choices reveal what each system actually optimizes for.

Memcached (libketama): 160 vnodes per node, MD5-derived 32-bit positions. The libketama algorithm bundled with pylibmc, spymemcached, and mcrouter predates the consistent-hashing-with-bounded-loads era — it has no bounded-load enforcement. Hot keys land where they land. Production deployments add a level-1 in-process LRU on each app server to absorb hot-key traffic before the ring lookup ever runs. Twemproxy and mcrouter both implement libketama-compatible distribution so ring decisions stay consistent across client libraries.

DynamoDB: undocumented internally, but the original Dynamo paper described a 128-token-per-node scheme with three replicas walking clockwise. The current managed DynamoDB service hides the ring entirely — adaptive capacity rebalances hot partitions automatically by splitting them when one partition exceeds 3000 RCUs or 1000 WCUs sustained. The lesson: AWS bought you out of vnode tuning by making the ring opaque and giving you OnDemand mode where partition splits happen behind the scenes.

Cassandra: dropped from 256 to 16 vnodes per node in 4.0. The reason is rebalancing cost — bootstrap time grew superlinearly with vnode count because the streaming subsystem had to negotiate transfers for every (source vnode, destination vnode) pair. With 256 vnodes per node and a 100-node cluster, a single bootstrap touches 25,600 source ranges. Sixteen vnodes brings that to 1,600 — still distributed, but bootstrap completes in hours instead of days.

Bounded-Load vs Power-of-Two-Choices

Both algorithms tackle the same problem — uneven load with naive consistent hashing — with different trade-offs. The choice between them depends on whether you can tolerate per-key state and how strict your tail-latency targets are.

Bounded-load (Mirrokni et al., Google 2016): cap each node at K/N * (1 + epsilon) keys; overflow walks clockwise to the next node. Per-request cost is one ring lookup plus up to 1/epsilon clockwise probes in the worst case. With epsilon=0.25, worst case is 4 extra probes; in practice, average is 1.05 lookups when load factor is below 80%. [Mirrokni et al., 2017]

Power-of-two-choices (Mitzenmacher 1996, applied to consistent hashing in Maglev and Vimeo's Vitess): hash the key with two independent hash functions, pick the less-loaded of the two candidate nodes. Cost is always exactly two lookups. Variance reduction is dramatic: max-load drops from O(log N / log log N) to O(log log N) compared with random placement.

Concrete numbers from a 50-node ring under 1M keys/sec:

AlgorithmAvg lookupsP99 lookupsMax-load ratioState per node
Naive consistent hashing1.001.001.45none
Bounded-load (eps=0.25)1.052.001.25counter (atomic int)
Power-of-two-choices2.002.001.10counter per replica
Maglev (Google L4)1.001.001.0165,537-entry table

Bounded-load wins when keys are written far more than read — the counter is updated once per key insertion, not per lookup. Power-of-two-choices wins when read-heavy traffic dominates and you want predictable two-lookup cost. Maglev wins when you control the data plane (Google's L4 load balancer) and can afford to recompute the lookup table on topology change.

A minimal power-of-two-choices selector on top of a vnode ring — h1 and h2 use distinct seeds so collisions stay rare:

type LoadAwareRing struct {
    *Ring
    load map[string]*atomic.Int64 // node -> in-flight key count
}
 
func (r *LoadAwareRing) PickNode(key string) string {
    a := r.GetNode(key + ":a")
    b := r.GetNode(key + ":b")
    if r.load[a].Load() <= r.load[b].Load() {
        return a
    }
    return b
}

The two reads on r.load race with concurrent updates, but the worst case is picking the slightly-worse node — never an unsafe one. For strict accounting, swap the atomic for a sharded counter and accept the extra cache-line churn.

Frequently Asked Questions

What is consistent hashing?

Consistent hashing maps both keys and servers onto a fixed-size ring. When servers are added or removed, only K/(N+1) keys need to remap — the theoretical minimum — instead of the near-total remapping caused by modular hashing (hash % N).

How many virtual nodes should I use per physical node?

In our simulations, 150 virtual nodes per physical node lands around ±4% standard deviation in load — a sensible default for most production systems. Cassandra defaults to 16 (since 4.0) which trades load variance for faster bootstrap and topology operations; legacy Cassandra installs used 256. Going above 256 offers diminishing returns with higher memory cost. DynamoDB does not publish its vnode count publicly — measure for your own ring. [Karger et al., 1997]

What is the difference between consistent hashing and rendezvous hashing?

Both achieve minimal key redistribution on node changes. Consistent hashing uses a ring with O(log N) lookups but needs virtual nodes for balance. Rendezvous hashing scores every node per key with O(N) lookups but is naturally balanced without virtual nodes. Use consistent hashing for large clusters (100+ nodes) and rendezvous hashing for small clusters.

Which hash function should I use for consistent hashing?

Use MurmurHash3, xxHash, or FNV-1a. CRC32 produces visible clustering for sequential inputs. Cryptographic hashes like MD5 or SHA-256 are overkill — you need uniform distribution, not collision resistance. Cassandra uses MurmurHash3 by default.

Keep Reading

BackendBytes Engineering Team
BackendBytes Engineering Team

Engineering Team

A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.

Read Next