Skip to content

Distributed Rate Limiting at Scale: The Probabilistic Drop Architecture

BackendBytes Engineering Team
BackendBytes Engineering Team
6 min read
Distributed Rate Limiting at Scale: The Probabilistic Drop Architecture

Key Takeaways

  • Redis rate limiting fails at >50K RPS — each request needs a 0.5-2ms round trip; at 80K RPS, you need 80K Redis commands per second, and Redis becomes the bottleneck it was meant to prevent
  • Probabilistic drop decouples enforcement from coordination: Allow() is a pure in-memory local decision using a cached drop_ratio; directives pushed asynchronously every 1-2 seconds, no per-request cost
  • Drop ratio formula: (actual - limit) / actual; each instance flips a biased coin independently — no coordination between instances, graceful degradation when coordinator fails
  • Coordinator failure: instances continue enforcing with the last-known ratio, never fail open (allow all) or closed (deny all), preventing cascading outages

The classic high-RPS rate-limiter production incident. A team starts with Redis-based token bucket rate limiting. At ~50K RPS, the Redis cluster saturates — p99 latency climbs into the tens of ms, CPU hits 100%, the rate limiter becomes the bottleneck it was supposed to prevent. We debugged this incident on multiple production teams. [Google Doorman]

Coordinator-based rate limiting fails at scale because every request needs a round trip[Beyer et al., 2016]. The solution, popularised by Google's Doorman[Google Doorman] and similar designs: eliminate the coordinator from the hot path. Move enforcement to the edge with an asynchronously-pushed directive — clients enforce locally with a cached drop ratio, the controller never touches the request path.

graph LR
    subgraph T1["Tier 1: Client (per service instance)"]
        C1[Allow] -->|local in-memory<br/>~100ns| Coin[Biased coin flip]
        Coin -->|rand &lt; ratio| Drop[Drop]
        Coin -->|else| Pass[Pass]
    end
    subgraph T2["Tier 2: Aggregator (zonal)"]
        A1[Per-bucket counts] --> A2[Roll up + push directive]
    end
    subgraph T3["Tier 3: Controller"]
        Ctrl["Compute drop_ratio<br/>= (actual - limit) / actual"]
    end
    Pass -.->|count| A1
    Drop -.->|count| A1
    A2 -->|every 1–2s| Ctrl
    Ctrl -->|directive push| C1
    style Drop fill:#fee
    style Pass fill:#efe

The hot path (left subgraph) is local-only. The control loop (centre + right) runs asynchronously — coordinator latency or failure can't stall request handling.

TL;DR

Probabilistic drop decouples enforcement from coordination[Google Doorman]. Each instance makes local allow/deny decisions using a drop ratio computed asynchronously by a central controller. The Allow() call is a pure in-memory operation — zero network overhead, graceful degradation when the controller fails.

  • Drop ratio formula: (actual - limit) / actual; every instance flips a biased coin independently
  • Directives pushed asynchronously every 1–2 seconds; lag is acceptable for overload protection
  • Coordinator failure: instances keep enforcing with the last-known ratio, never fail open or closed

Rate-Limiting Algorithm Comparison

[Redis Docs]
ApproachThroughput ceilingCoordinationPer-request costAccuracyBest for
In-memory token bucket (per-pod)Pod CPU boundNone~50 nsPer-pod onlySingle-replica services
Redis token bucket (Lua atomic)~50K RPSHigh — Redis-bound0.5-2 ms RTTExact globalBilling, exact counting, low concurrency
Redis sliding window~30K RPSHigh1-3 ms RTTExact globalPer-tenant SLA enforcement
Probabilistic drop (this article)over 1 million RPSNone on hot path~100 ns local flipStatistical (~1% drift)Overload protection at scale
Doorman / centralised quotadepends on coordinatorHeavy1-5 msExact within rebalance windowLong-lived clients with quota leases
Client-side adaptive (e.g. circuit-breaker hybrid)over 1 million RPSNone~50 nsServer-driven, lossyBrowser / mobile clients

Redis scales linearly with load: at 80K RPS, you need 80K commands/second — already past Redis's ~50K practical ceiling for Lua-based limiting. At high concurrency, the Redis connector becomes the bottleneck. Probabilistic drop removes the coordinator from the request path: Allow() is a pure in-memory decision using a cached directive.

How the Drop Ratio Works

[Google Doorman]

The three-tier feedback loop converges global throughput to the limit without coordinating per-request:

graph LR
    Req[Incoming request] --> C[Tier 1 — Client<br/>biased coin flip<br/>local in-memory]
    C -->|allowed| App[Application]
    C -->|dropped| Reject[429]
    App -->|sample stats async| A[Tier 2 — Aggregator<br/>async loop<br/>1-2s window]
    A -->|global RPS| Ctrl[Tier 3 — Controller<br/>computes drop_ratio]
    Ctrl -->|push directive<br/>every 1-2s| C
    style Reject fill:#fdd
    style App fill:#dfd
    style Ctrl fill:#ffd

The controller computes: drop_ratio = (actual - limit) / actual

Example: limit = 1,000 RPS, actual = 1,200 RPS → drop_ratio = 0.167 (drop 16.7% of requests). [Google Doorman]

Each instance independently flips a biased coin: if rand() < drop_ratio { drop() }. Because all instances use the same ratio, the aggregate effect brings global throughput to the limit:

global_actual × (1 - drop_ratio) ≈ limit

Uncoordinated: no instance talks to another during a request. Directives are pushed asynchronously every 1–2 seconds, which is acceptable for overload protection (not billing or compliance).

Tier 1: The Client (Allow)

package ratelimit
 
import (
	"math/rand/v2"
	"sync"
	"sync/atomic"
	"time"
)
 
type Bucket string
 
type Directive struct {
	DropRatio float64
	UpdatedAt time.Time
}
 
type Client struct {
	mu         sync.RWMutex
	directives map[Bucket]Directive
	counters   map[Bucket]*atomic.Int64
	reportCh   chan<- map[Bucket]int64
}
 
func NewClient(reportCh chan<- map[Bucket]int64) *Client {
	return &Client{
		directives: make(map[Bucket]Directive),
		counters:   make(map[Bucket]*atomic.Int64),
		reportCh:   reportCh,
	}
}
 
// Allow returns true if the request should be processed.
// Safe for concurrent use, never blocks.
func (c *Client) Allow(b Bucket) bool {
	c.counter(b).Add(1)
 
	c.mu.RLock()
	d, ok := c.directives[b]
	c.mu.RUnlock()
 
	if !ok || d.DropRatio <= 0 {
		return true
	}
	if d.DropRatio >= 1.0 {
		return false
	}
	return rand.Float64() >= d.DropRatio
}
 
// UpdateDirective installs a new directive from the aggregator.
func (c *Client) UpdateDirective(b Bucket, d Directive) {
	c.mu.Lock()
	c.directives[b] = d
	c.mu.Unlock()
}
 
func (c *Client) counter(b Bucket) *atomic.Int64 {
	c.mu.RLock()
	counter, ok := c.counters[b]
	c.mu.RUnlock()
	if ok {
		return counter
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if counter, ok = c.counters[b]; ok {
		return counter
	}
	counter = &atomic.Int64{}
	c.counters[b] = counter
	return counter
}
 
// StartReporting flushes counters every interval.
func (c *Client) StartReporting(ctx interface{ Done() <-chan struct{} }, interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				c.flush()
			}
		}
	}()
}
 
func (c *Client) flush() {
	c.mu.Lock()
	batch := make(map[Bucket]int64)
	for b, counter := range c.counters {
		if count := counter.Swap(0); count > 0 {
			batch[b] = count
		}
	}
	c.mu.Unlock()
 
	if len(batch) == 0 {
		return
	}
	select {
	case c.reportCh <- batch:
	default:
		// Don't block if aggregator is slow
	}
}

This is all you need: Allow() checks directives in-memory, never touches the network. UpdateDirective() is called asynchronously by the aggregator.

Common Gotchas

  • Directive lag: Directives are 1–2 seconds stale. For overload protection this is fine; for billing or compliance, use Redis-based exact counting instead.
  • Uneven drops per instance: Expected. One instance may drop 16%, another 18%. The aggregate converges to the limit.
  • New buckets: A new bucket with no controller history won't have a directive. The client allows all requests until the first directive arrives.
  • Coordinator failure: The last-known directive stays active. Enforcement becomes approximate but never fails open. Use shadow mode to validate limits before enabling.
  • Counter overflow: At >1B RPS per bucket, atomic counters wrap. Use uint64 or reset counters more frequently. [Google Doorman]

Tier 2: The Aggregator (async loop)

The aggregator receives counter batches from clients, computes global rates, and pushes directives back:

type Controller interface {
	ReportStats(ctx context.Context, zone string, counts map[Bucket]int64) (map[Bucket]int64, error)
}
 
type Aggregator struct {
	zone       string
	controller Controller
	reportCh   <-chan map[Bucket]int64
	clients    []*Client
	mu         sync.RWMutex
	window     map[Bucket]int64 // rolling 1-second count per bucket
}
 
func (a *Aggregator) Run(ctx context.Context) {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case batch := <-a.reportCh:
			a.mu.Lock()
			for b, count := range batch {
				a.window[b] += count
			}
			a.mu.Unlock()
		case <-ticker.C:
			a.mu.RLock()
			counts := a.window
			a.window = make(map[Bucket]int64)
			a.mu.RUnlock()
 
			quotas, err := a.controller.ReportStats(ctx, a.zone, counts)
			if err != nil {
				continue // keep existing directives
			}
 
			// Compute and push directives
			for b, limit := range quotas {
				actual := counts[b]
				var dropRatio float64
				if actual > limit && actual > 0 {
					dropRatio = float64(actual-limit) / float64(actual)
				}
				for _, c := range a.clients {
					c.UpdateDirective(b, Directive{DropRatio: dropRatio, UpdatedAt: time.Now()})
				}
			}
		}
	}
}

Runs in a sidecar or separate goroutine. The controller is your rate limit source-of-truth: it returns per-bucket limits based on SLOs or historical baseline.

Tier 3: The Controller (limit computation)

[Beyer et al., 2016]
// ComputeLimit: measure baseline over 7 days, set limit at 120% of p95 daily peak.
func (c *Controller) ComputeLimit(b Bucket) int64 {
	samples := c.history[b] // []float64, last 7 days
	if len(samples) < 3 {
		return 0 // no limit until history available
	}
	sort.Float64s(samples)
	idx := int(float64(len(samples)) * 0.95)
	if idx >= len(samples) {
		idx = len(samples) - 1
	}
	return int64(samples[idx] * 1.20)
}

This is optional. You can hardcode limits if your SLOs are stable.

Production Checklist

  • Allow() microbenchmark: should be <200ns per request (in-memory only)
  • Directive update latency: < 100ms from controller decision to all clients
  • Shadow mode: run 48 hours, monitor shadow_drop metric; if >5% of traffic, limits are too tight
  • Coordinator failure: verify clients keep enforcing with the last-known directive
  • Counter reset: ensure counters don't wrap at high RPS (use uint64, reset every 1-2 seconds)
  • Middleware integration: wire Allow() into your HTTP/gRPC stack, return 429 + Retry-After [Google Doorman]

When to use probabilistic drop: overload protection (cascade prevention), high throughput (>50K RPS), acceptable eventual consistency (1–2s lag). When to use Redis: billing, compliance, exact request counting.

Microbenchmark, shadow mode, and the 429 contract

The Allow() microbenchmark — set this as a regression gate so a refactor that adds a syscall to the hot path fails CI:

func BenchmarkClient_Allow(b *testing.B) {
    bucket := Bucket("tenant:t:resource:r")
    c := NewClient(make(chan map[Bucket]int64, 1))
    c.UpdateDirective(bucket, Directive{DropRatio: 0.3, UpdatedAt: time.Now()})
 
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = c.Allow(bucket)
    }
    // Acceptable: <200ns/op, 0 allocs/op. Anything above 500ns means a
    // syscall or allocation crept into the hot path.
}

Shadow-mode middleware — runs the limiter and emits the metric but never rejects. Use this to measure what production traffic would be rejected before you flip the kill switch:

type Mode int
 
const (
    ModeShadow Mode = iota   // observe-only; never returns 429
    ModeEnforce              // production
)
 
func RateLimitMiddleware(c *Client, mode Mode, m *prometheus.CounterVec, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        bucket := bucketFor(r) // e.g. Bucket("tenant:" + tenantID(r))
        if !c.Allow(bucket) {
            m.WithLabelValues("shadow_drop").Inc()
            if mode == ModeEnforce {
                w.Header().Set("Retry-After", "1")
                w.Header().Set("X-RateLimit-Reason", "tenant_quota_exceeded")
                http.Error(w, "rate limited", http.StatusTooManyRequests)
                return
            }
        }
        next.ServeHTTP(w, r)
    })
}

A 48-hour shadow run before enforcement is non-negotiable — if shadow_drop exceeds 5% of traffic with limits you thought were generous, your limits are too tight and would have caused a customer incident on day one. Tune against the shadow metric, then flip to ModeEnforce. [Google Doorman]

The controller circuit breaker — when the directive feed stalls, freeze enforcement at the last-known directive instead of failing open or closed. This is the canonical UpdateDirective from Tier 1, hardened with a staleness guard (same sync.RWMutex-protected directives map, plus a metrics counter on the struct):

func (c *Client) UpdateDirective(b Bucket, d Directive) {
    if time.Since(d.UpdatedAt) > 30*time.Second {
        // Reject obviously stale directives — controller is misbehaving.
        // Keep the last good directive; the in-memory enforcement is
        // approximate but never fails open or closed.
        c.staleDirectives.Add(1)
        return
    }
    c.mu.Lock()
    c.directives[b] = d
    c.mu.Unlock()
}

The 429 response contract every limiter rejection should honour — Retry-After (seconds-or-HTTP-date) plus the per-bucket counters so the client can self-throttle without retrying blindly:

HTTP/1.1 429 Too Many Requests
Content-Type: application/problem+json
Retry-After: 1
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714579200
X-RateLimit-Reason: tenant_quota_exceeded
 
{
  "type": "https://api.example.com/errors/rate-limited",
  "title": "Rate limit exceeded",
  "detail": "Tenant 't_42' exceeded 1000 requests/sec; retry in 1s.",
  "status": 429
}

The single most-skipped header is X-RateLimit-Reset — without it, well-behaved SDKs cannot back off precisely and end up either polling too aggressively (turning the 429 into sustained load) or sleeping too long (exaggerating the customer-visible latency).

A client SDK snippet that consumes the headers correctly — sleep until X-RateLimit-Reset, not a fixed exponential backoff that ignores the server's hint:

func (c *Client) Do(req *http.Request) (*http.Response, error) {
    resp, err := c.http.Do(req)
    if err != nil { return nil, err }
    if resp.StatusCode == 429 {
        resp.Body.Close()
        if reset := resp.Header.Get("X-RateLimit-Reset"); reset != "" {
            ts, _ := strconv.ParseInt(reset, 10, 64)
            time.Sleep(time.Until(time.Unix(ts, 0)))
        } else {
            time.Sleep(parseRetryAfter(resp.Header.Get("Retry-After")))
        }
        return c.http.Do(req)
    }
    return resp, nil
}

Composing Probabilistic Drop With Redis-Based Exact Counting

Probabilistic drop is the right primitive for overload protection — it cannot be the only primitive in a serious production system. Billing meters, free-tier quotas, fraud thresholds, and abuse mitigation all require exact counting that converges on a single number rather than a statistical mean. The composition pattern that consistently survives audit is a two-layer pipeline: probabilistic drop in front, Redis-based token bucket behind it, with each layer optimised for the question it actually answers.

The front layer answers "is the cluster overloaded right now". The back layer answers "has tenant X consumed their hourly quota". The front layer fires on every request at ~100ns; the back layer only fires on the survivors of the front layer, which is a fraction of nominal traffic and almost zero during incidents. This inverts the cost curve of a naive Redis-only setup: when traffic spikes, Redis sees less load, not more, because the probabilistic layer absorbs the surplus before it ever reaches the coordinator.

type LayeredLimiter struct {
    fast   *Client                   // tier-1 probabilistic drop
    exact  *RedisTokenBucket         // tier-2 exact counting
    metrics *prometheus.CounterVec
}
 
func (l *LayeredLimiter) Allow(ctx context.Context, b Bucket) (bool, string) {
    if !l.fast.Allow(b) {
        l.metrics.WithLabelValues("drop_fast").Inc()
        return false, "cluster_overload"
    }
    ok, err := l.exact.Allow(ctx, b)
    if err != nil {
        // Redis unavailable — fall back to fast-layer decision only.
        // We already passed the cluster overload check, so allow.
        l.metrics.WithLabelValues("redis_degraded").Inc()
        return true, "redis_degraded_passthrough"
    }
    if !ok {
        l.metrics.WithLabelValues("drop_quota").Inc()
        return false, "tenant_quota_exceeded"
    }
    return true, ""
}

The reason string is what the 429 contract returns in X-RateLimit-Reason — clients can tell the difference between "you hit your billing quota" (don't retry, talk to sales) and "the cluster is on fire" (retry with backoff). Lumping them together is the most common mistake teams make when bolting Redis onto an existing probabilistic drop deployment.

Multi-Tenancy Fairness With Weighted Fair Queueing

Probabilistic drop with a single global drop ratio treats every tenant as interchangeable, which is fine for a closed internal system and a disaster for a paid SaaS. A noisy free-tier tenant can crowd out a paying enterprise tenant when they collide in the same overloaded zone, and the only signal the limiter gives back is "everyone gets dropped at 18%". The fix is per-tenant weights pushed alongside the drop ratio, so each tenant's effective drop rate becomes a function of its share of the overage rather than a flat slice across the cluster. [Google Doorman]

type TenantDirective struct {
    DropRatio float64
    Weight    float64 // 1.0 = fair share, 4.0 = enterprise, 0.25 = free
    UpdatedAt time.Time
}
 
func (c *Client) AllowTenant(b Bucket, tenant string) bool {
    c.counter(b).Add(1)
    d := c.tenantDirective(b, tenant)
    if d.DropRatio <= 0 {
        return true
    }
    // Weighted drop: enterprise tenants drop less, free tenants drop more.
    effective := d.DropRatio / d.Weight
    if effective >= 1.0 {
        return false
    }
    return rand.Float64() >= effective
}

The controller computes weights from a tenant catalogue — typically billing tier, contract SLA, and a manually-set boost flag for tenants currently being courted by sales. Surface the weight table as a YAML resource the controller hot-reloads, so support engineers can adjust it during an incident without a code deploy:

tenants:
  default:
    weight: 1.0
    burst_multiplier: 1.5
  enterprise_acme:
    weight: 4.0           # 4x fair share during overload
    burst_multiplier: 3.0 # tolerate 3x sustained rate before drop
    notes: "Q2 contract, escalations to alex@"
  free_tier:
    weight: 0.25          # 4x more aggressive drop than fair
    burst_multiplier: 1.0
  partner_internal:
    weight: 2.0
    burst_multiplier: 2.0
fairness:
  algorithm: deficit_round_robin
  reweight_interval_s: 5
  min_floor_rps: 5        # never drop a tenant below 5 rps

The min_floor_rps floor is non-obvious but matters: without it, a free-tier tenant making one health-check call per minute can be dropped at 99% during overload because their rate is statistically indistinguishable from noise. A floor below which probabilistic drop is disabled keeps customer-visible signals like login flows working through the worst-case incident. [Google Doorman]

Real Production Deployment: Shadow, Canary, Enforce

Rolling probabilistic drop into a production system that previously had no global rate limiter — or one based on Redis that you are replacing — is the kind of change that breaks customers if you sequence it wrong. The deployment pattern that consistently lands without a customer incident is a three-stage rollout with explicit metric gates between each stage and a documented rollback signal.

Stage one is shadow mode for 7 days. The limiter runs end-to-end — directives flow, counters increment, the middleware emits shadow_drop — but the rejection branch is dead code. The gate to advance is shadow_drop / total_requests < 0.005 sustained over a 24-hour window; anything above 0.5% means the limit table is wrong and would have caused a measurable customer impact on day one. [Google Doorman]

Stage two is canary enforcement on 5% of traffic for 48 hours. The same directive feed, but a fraction of pods flip to ModeEnforce. The gates are tighter: 429 rate must stay below 0.1% of total requests on the canary cohort, p99 latency on the canary must not exceed control by more than 5ms, and the customer support ticket volume on the limited tenants must not increase week-over-week. Any breach reverts the canary cohort to shadow mode and pages the on-call. [Google Doorman]

# Shadow → canary gate: shadow drop rate must be under 0.5%
sum(rate(ratelimit_decision_total{result="shadow_drop"}[1h]))
  /
sum(rate(ratelimit_decision_total[1h]))
< 0.005
 
# Canary → full enforce gate: canary 429s under 0.1% with no latency regression
(
  sum by (cohort) (rate(http_requests_total{cohort="canary",code="429"}[15m]))
    /
  sum by (cohort) (rate(http_requests_total{cohort="canary"}[15m]))
) < 0.001
and
(
  histogram_quantile(0.99, sum by (le, cohort) (rate(http_request_duration_seconds_bucket{cohort="canary"}[15m])))
    -
  histogram_quantile(0.99, sum by (le, cohort) (rate(http_request_duration_seconds_bucket{cohort="control"}[15m])))
) < 0.005
 
# Rollback signal: directive feed staleness > 60s on any zone
max by (zone) (time() - ratelimit_directive_last_update_timestamp_seconds) > 60

Stage three is full enforcement, but with a kill switch. The kill switch is a single boolean in the controller config that flips every client back to ModeShadow within one directive cycle (~2s). On the team that ran the original incident this article opens with, the kill switch was hit twice in the first month: once during a marketing campaign that legitimately tripled traffic (limits were too low), once during a Redis incident on a downstream service that caused upstream retry storms. Both times it bought the on-call enough breathing room to fix the underlying issue without the rate limiter becoming the visible failure.

The metric every team forgets to wire up is directive feed staleness. A frozen controller that keeps shipping the same directive looks healthy from the client's perspective — the last-known directive is still valid — until traffic shape shifts and the stale ratio is wildly wrong for current load. Alert on time_since_last_directive_update > 60s per zone, not just on controller error rate, and the on-call will catch the silent failure mode before it turns into an incident.

Frequently Asked Questions

Why does Redis-based rate limiting fail at high scale?

At 80K RPS across 10 services, you need 80K Redis commands per second. Network latency (0.5–2ms per check) and Redis CPU saturation turn the rate limiter into the bottleneck it was supposed to prevent.

What is probabilistic drop?

Each service instance makes local allow/deny decisions by randomly dropping a percentage of requests based on a drop_ratio. The ratio is computed by a central controller and pushed asynchronously — the Allow() call is purely local.

How does the three-tier architecture work?

Tier 1 (instances) make local decisions using cached drop_ratio. Tier 2 (zone aggregators) collect counts and push directives. Tier 3 (controller) computes global rates and distributes quota. No tier blocks the request path.

What happens if the coordinator fails?

Instances continue using the last-known drop_ratio. Enforcement becomes approximate but never fails open (allow all) or closed (deny all), unlike Redis-based designs.

Keep Reading

BackendBytes Engineering Team
BackendBytes Engineering Team

Engineering Team

A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.

Read Next