Before we shard, we need to understand why a single instance fails. It's usually not memory: the bottleneck is either CPU, because Redis executes commands on a single-threaded event loop, or saturated network bandwidth.
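A quick way to see which resource is pegged is to sample the INFO counters. This is a minimal sketch with redis-py; the endpoint name is illustrative, but the INFO field names are the real ones:

```python
import redis

# Endpoint is illustrative; point this at the instance under investigation.
client = redis.Redis(host='redis-primary', port=6379)

stats = client.info('stats')
cpu = client.info('cpu')

# A saturated event loop shows high ops/sec with climbing CPU counters;
# bandwidth saturation shows high instantaneous_*_kbps at modest CPU.
print('ops/sec:  ', stats['instantaneous_ops_per_sec'])
print('net in:   ', stats['instantaneous_input_kbps'], 'kbps')
print('net out:  ', stats['instantaneous_output_kbps'], 'kbps')
print('cpu sys:  ', cpu['used_cpu_sys'])
print('cpu user: ', cpu['used_cpu_user'])
```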
```conf
# redis.conf - Basic Redis Cluster Configuration
port 6379
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
appendonly yes

# Performance Tuning
maxmemory 4gb
maxmemory-policy allkeys-lru
```

We use a 3-primary, 3-replica setup to ensure high availability. Hash slots (16,384 total) are distributed evenly across the primary nodes.
```mermaid
graph TD
    Client[Application Layer]
    subgraph RC [Redis Cluster]
        P1["Primary 1<br/>Slots 0-5460"]
        P2["Primary 2<br/>Slots 5461-10922"]
        P3["Primary 3<br/>Slots 10923-16383"]
        R1[Replica 1] -.-> P1
        R2[Replica 2] -.-> P2
        R3[Replica 3] -.-> P3
    end
    Client -->|"CRC16(key) % 16384"| Hash{Slot Lookup}
    Hash --> P1
    Hash --> P2
    Hash --> P3
    style P1 fill:#ffcdd2,stroke:#b71c1c,color:#000
    style P2 fill:#ffcdd2,stroke:#b71c1c,color:#000
    style P3 fill:#ffcdd2,stroke:#b71c1c,color:#000
```
Each primary has a dedicated replica for automatic failover.
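The slot math in the diagram is easy to reproduce. Here is a minimal sketch of the routing computation, assuming the CRC16 variant the cluster spec uses (XMODEM: polynomial 0x1021, zero initial value) plus hash-tag handling; `crc16_xmodem` and `key_slot` are illustrative names, not a library API:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM: polynomial 0x1021, init 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of the 16,384 cluster hash slots."""
    # Hash tags: if the key contains a non-empty {...} section, only
    # that substring is hashed, forcing related keys into one slot.
    start = key.find('{')
    if start != -1:
        end = key.find('}', start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384
```

Because only the tag is hashed, `user:{42}:profile` and `user:{42}:settings` land in the same slot, which is what makes multi-key operations on related keys possible in a cluster.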
Cache invalidation is famously the hardest problem in computer science. We opted for a hybrid approach.
When a popular cache key expires, thousands of concurrent requests might simultaneously miss the cache and hit the database. This is known as the "Thundering Herd" problem.
To mitigate this, always add jitter to your TTLs:
```python
import random

def set_with_jitter(client, key, value, ttl_seconds):
    """
    Sets a key with a random jitter to prevent thundering herds.
    """
    # Add ±15% jitter to the TTL
    jitter = ttl_seconds * 0.15
    actual_ttl = ttl_seconds + random.uniform(-jitter, jitter)
    # Floor at 1s so small TTLs can't round down to an invalid expiry
    client.setex(key, max(1, int(actual_ttl)), value)
```
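Jitter spreads expirations out, but it does not stop a stampede once a single hot key has already expired. A complementary pattern is to let exactly one caller recompute while the rest wait briefly. This sketch uses a plain SET NX EX lock; `get_with_lock` and the `recompute` callback are illustrative names, and it builds on `set_with_jitter` above:

```python
import time

def get_with_lock(client, key, recompute, ttl_seconds, lock_ttl=10):
    while True:
        # Fast path: serve from cache when possible.
        value = client.get(key)
        if value is not None:
            return value

        # SET NX EX: exactly one caller wins the right to recompute;
        # the lock expires on its own if that caller dies mid-flight.
        if client.set(f"lock:{key}", "1", nx=True, ex=lock_ttl):
            try:
                value = recompute()
                set_with_jitter(client, key, value, ttl_seconds)
                return value
            finally:
                client.delete(f"lock:{key}")

        # Everyone else polls the cache instead of hitting the database.
        time.sleep(0.05)
```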
Connection management matters just as much. Create one pool per process and share it, with explicit timeouts and periodic health checks so dead connections are detected before a request lands on them:

```python
import redis

pool = redis.ConnectionPool(
    host='redis-cluster',
    port=6379,
    max_connections=50,
    socket_connect_timeout=2,
    socket_timeout=2,
    retry_on_timeout=True,
    health_check_interval=30,
)
client = redis.Redis(connection_pool=pool)
```

Note that the plain `redis.Redis` client does not follow cluster MOVED/ASK redirects; when talking directly to cluster nodes rather than through a proxy, use redis-py's `RedisCluster` client, which maintains per-node pools itself.

Before going to production, you must validate your cluster's performance. Use redis-benchmark to simulate load:
```bash
# Simulate 100k requests with 50 concurrent clients
redis-benchmark -h redis-cluster -p 6379 -t set,get -n 100000 -c 50 -q

# Test pipeline performance (16 commands per pipeline)
redis-benchmark -h redis-cluster -p 6379 -t set,get -n 100000 -P 16 -q
```

Key targets depend on your workload; at minimum, confirm that throughput comfortably exceeds your projected peak and that latencies stay in the low single-digit milliseconds.
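On the application side, the same batching the `-P 16` test measures is available through redis-py pipelines. A brief sketch, reusing the pooled `client` from above:

```python
# Batch commands client-side to cut round trips, mirroring the -P 16 test.
with client.pipeline(transaction=False) as pipe:
    for i in range(16):
        pipe.set(f'key:{i}', i)
    results = pipe.execute()  # one round trip for all 16 commands
```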
Essential metrics to track: the keyspace hit ratio, eviction rate, connected clients, memory usage, and command latency.
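As a sketch of how to pull these, the dictionary keys read below are the actual INFO field names, while `cache_health` itself is an illustrative helper, not a library function:

```python
def cache_health(client):
    """Snapshot the INFO fields that matter most for a cache tier."""
    info = client.info()  # merges the default INFO sections into one dict
    hits = info['keyspace_hits']
    misses = info['keyspace_misses']
    lookups = hits + misses
    return {
        'hit_ratio': hits / lookups if lookups else None,
        'evicted_keys': info['evicted_keys'],
        'connected_clients': info['connected_clients'],
        'used_memory_human': info['used_memory_human'],
    }
```

A hit ratio that drifts down alongside rising evicted_keys usually means maxmemory is too small for the working set.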
Scaling isn't just about adding more nodes; it's about understanding how data flows through them. With proper sharding and connection management, a Redis cluster can comfortably sustain millions of operations per second.