Distributed Rate Limiting at Scale: The Probabilistic Drop Architecture
Key Takeaways
- →Redis rate limiting fails at >50K RPS — each request needs a 0.5-2ms round trip; at 80K RPS, you need 80K Redis commands per second, and Redis becomes the bottleneck it was meant to prevent
- →Probabilistic drop decouples enforcement from coordination: Allow() is a pure in-memory local decision using a cached drop_ratio; directives pushed asynchronously every 1-2 seconds, no per-request cost
- →Drop ratio formula: (actual - limit) / actual; each instance flips a biased coin independently — no coordination between instances, graceful degradation when coordinator fails
- →Coordinator failure: instances continue enforcing with the last-known ratio, never fail open (allow all) or closed (deny all), preventing cascading outages
The classic high-RPS rate-limiter production incident. A team starts with Redis-based token bucket rate limiting. At ~50K RPS, the Redis cluster saturates — p99 latency climbs into the tens of ms, CPU hits 100%, the rate limiter becomes the bottleneck it was supposed to prevent. We debugged this incident on multiple production teams. [Google Doorman]
Coordinator-based rate limiting fails at scale because every request needs a round trip[Beyer et al., 2016]. The solution, popularised by Google's Doorman[Google Doorman] and similar designs: eliminate the coordinator from the hot path. Move enforcement to the edge with an asynchronously-pushed directive — clients enforce locally with a cached drop ratio, the controller never touches the request path.
graph LR
subgraph T1["Tier 1: Client (per service instance)"]
C1[Allow] -->|local in-memory<br/>~100ns| Coin[Biased coin flip]
Coin -->|rand < ratio| Drop[Drop]
Coin -->|else| Pass[Pass]
end
subgraph T2["Tier 2: Aggregator (zonal)"]
A1[Per-bucket counts] --> A2[Roll up + push directive]
end
subgraph T3["Tier 3: Controller"]
Ctrl["Compute drop_ratio<br/>= (actual - limit) / actual"]
end
Pass -.->|count| A1
Drop -.->|count| A1
A2 -->|every 1–2s| Ctrl
Ctrl -->|directive push| C1
style Drop fill:#fee
style Pass fill:#efe
The hot path (left subgraph) is local-only. The control loop (centre + right) runs asynchronously — coordinator latency or failure can't stall request handling.
Probabilistic drop decouples enforcement from coordination[Google Doorman]. Each instance makes local allow/deny decisions using a drop ratio computed asynchronously by a central controller. The Allow() call is a pure in-memory operation — zero network overhead, graceful degradation when the controller fails.
- Drop ratio formula: (actual - limit) / actual; every instance flips a biased coin independently
- Directives pushed asynchronously every 1–2 seconds; lag is acceptable for overload protection
- Coordinator failure: instances keep enforcing with the last-known ratio, never fail open or closed
Rate-Limiting Algorithm Comparison
[Redis Docs]| Approach | Throughput ceiling | Coordination | Per-request cost | Accuracy | Best for |
|---|---|---|---|---|---|
| In-memory token bucket (per-pod) | Pod CPU bound | None | ~50 ns | Per-pod only | Single-replica services |
| Redis token bucket (Lua atomic) | ~50K RPS | High — Redis-bound | 0.5-2 ms RTT | Exact global | Billing, exact counting, low concurrency |
| Redis sliding window | ~30K RPS | High | 1-3 ms RTT | Exact global | Per-tenant SLA enforcement |
| Probabilistic drop (this article) | over 1 million RPS | None on hot path | ~100 ns local flip | Statistical (~1% drift) | Overload protection at scale |
| Doorman / centralised quota | depends on coordinator | Heavy | 1-5 ms | Exact within rebalance window | Long-lived clients with quota leases |
| Client-side adaptive (e.g. circuit-breaker hybrid) | over 1 million RPS | None | ~50 ns | Server-driven, lossy | Browser / mobile clients |
Redis scales linearly with load: at 80K RPS, you need 80K commands/second — already past Redis's ~50K practical ceiling for Lua-based limiting. At high concurrency, the Redis connector becomes the bottleneck. Probabilistic drop removes the coordinator from the request path: Allow() is a pure in-memory decision using a cached directive.
How the Drop Ratio Works
[Google Doorman]The three-tier feedback loop converges global throughput to the limit without coordinating per-request:
graph LR
Req[Incoming request] --> C[Tier 1 — Client<br/>biased coin flip<br/>local in-memory]
C -->|allowed| App[Application]
C -->|dropped| Reject[429]
App -->|sample stats async| A[Tier 2 — Aggregator<br/>async loop<br/>1-2s window]
A -->|global RPS| Ctrl[Tier 3 — Controller<br/>computes drop_ratio]
Ctrl -->|push directive<br/>every 1-2s| C
style Reject fill:#fdd
style App fill:#dfd
style Ctrl fill:#ffd
The controller computes: drop_ratio = (actual - limit) / actual
Example: limit = 1,000 RPS, actual = 1,200 RPS → drop_ratio = 0.167 (drop 16.7% of requests). [Google Doorman]
Each instance independently flips a biased coin: if rand() < drop_ratio { drop() }. Because all instances use the same ratio, the aggregate effect brings global throughput to the limit:
global_actual × (1 - drop_ratio) ≈ limit
Uncoordinated: no instance talks to another during a request. Directives are pushed asynchronously every 1–2 seconds, which is acceptable for overload protection (not billing or compliance).
Tier 1: The Client (Allow)
package ratelimit
import (
"math/rand/v2"
"sync"
"sync/atomic"
"time"
)
type Bucket string
type Directive struct {
DropRatio float64
UpdatedAt time.Time
}
type Client struct {
mu sync.RWMutex
directives map[Bucket]Directive
counters map[Bucket]*atomic.Int64
reportCh chan<- map[Bucket]int64
}
func NewClient(reportCh chan<- map[Bucket]int64) *Client {
return &Client{
directives: make(map[Bucket]Directive),
counters: make(map[Bucket]*atomic.Int64),
reportCh: reportCh,
}
}
// Allow returns true if the request should be processed.
// Safe for concurrent use, never blocks.
func (c *Client) Allow(b Bucket) bool {
c.counter(b).Add(1)
c.mu.RLock()
d, ok := c.directives[b]
c.mu.RUnlock()
if !ok || d.DropRatio <= 0 {
return true
}
if d.DropRatio >= 1.0 {
return false
}
return rand.Float64() >= d.DropRatio
}
// UpdateDirective installs a new directive from the aggregator.
func (c *Client) UpdateDirective(b Bucket, d Directive) {
c.mu.Lock()
c.directives[b] = d
c.mu.Unlock()
}
func (c *Client) counter(b Bucket) *atomic.Int64 {
c.mu.RLock()
counter, ok := c.counters[b]
c.mu.RUnlock()
if ok {
return counter
}
c.mu.Lock()
defer c.mu.Unlock()
if counter, ok = c.counters[b]; ok {
return counter
}
counter = &atomic.Int64{}
c.counters[b] = counter
return counter
}
// StartReporting flushes counters every interval.
func (c *Client) StartReporting(ctx interface{ Done() <-chan struct{} }, interval time.Duration) {
go func() {
ticker := time.NewTicker(interval)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
c.flush()
}
}
}()
}
func (c *Client) flush() {
c.mu.Lock()
batch := make(map[Bucket]int64)
for b, counter := range c.counters {
if count := counter.Swap(0); count > 0 {
batch[b] = count
}
}
c.mu.Unlock()
if len(batch) == 0 {
return
}
select {
case c.reportCh <- batch:
default:
// Don't block if aggregator is slow
}
}This is all you need: Allow() checks directives in-memory, never touches the network. UpdateDirective() is called asynchronously by the aggregator.
Common Gotchas
- Directive lag: Directives are 1–2 seconds stale. For overload protection this is fine; for billing or compliance, use Redis-based exact counting instead.
- Uneven drops per instance: Expected. One instance may drop 16%, another 18%. The aggregate converges to the limit.
- New buckets: A new bucket with no controller history won't have a directive. The client allows all requests until the first directive arrives.
- Coordinator failure: The last-known directive stays active. Enforcement becomes approximate but never fails open. Use shadow mode to validate limits before enabling.
- Counter overflow: At >1B RPS per bucket, atomic counters wrap. Use uint64 or reset counters more frequently. [Google Doorman]
Tier 2: The Aggregator (async loop)
The aggregator receives counter batches from clients, computes global rates, and pushes directives back:
type Controller interface {
ReportStats(ctx context.Context, zone string, counts map[Bucket]int64) (map[Bucket]int64, error)
}
type Aggregator struct {
zone string
controller Controller
reportCh <-chan map[Bucket]int64
clients []*Client
mu sync.RWMutex
window map[Bucket]int64 // rolling 1-second count per bucket
}
func (a *Aggregator) Run(ctx context.Context) {
ticker := time.NewTicker(500 * time.Millisecond)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case batch := <-a.reportCh:
a.mu.Lock()
for b, count := range batch {
a.window[b] += count
}
a.mu.Unlock()
case <-ticker.C:
a.mu.RLock()
counts := a.window
a.window = make(map[Bucket]int64)
a.mu.RUnlock()
quotas, err := a.controller.ReportStats(ctx, a.zone, counts)
if err != nil {
continue // keep existing directives
}
// Compute and push directives
for b, limit := range quotas {
actual := counts[b]
var dropRatio float64
if actual > limit && actual > 0 {
dropRatio = float64(actual-limit) / float64(actual)
}
for _, c := range a.clients {
c.UpdateDirective(b, Directive{DropRatio: dropRatio, UpdatedAt: time.Now()})
}
}
}
}
}Runs in a sidecar or separate goroutine. The controller is your rate limit source-of-truth: it returns per-bucket limits based on SLOs or historical baseline.
Tier 3: The Controller (limit computation)
[Beyer et al., 2016]// ComputeLimit: measure baseline over 7 days, set limit at 120% of p95 daily peak.
func (c *Controller) ComputeLimit(b Bucket) int64 {
samples := c.history[b] // []float64, last 7 days
if len(samples) < 3 {
return 0 // no limit until history available
}
sort.Float64s(samples)
idx := int(float64(len(samples)) * 0.95)
if idx >= len(samples) {
idx = len(samples) - 1
}
return int64(samples[idx] * 1.20)
}This is optional. You can hardcode limits if your SLOs are stable.
Production Checklist
- Allow() microbenchmark: should be
<200nsper request (in-memory only) - Directive update latency: < 100ms from controller decision to all clients
- Shadow mode: run 48 hours, monitor shadow_drop metric; if >5% of traffic, limits are too tight
- Coordinator failure: verify clients keep enforcing with the last-known directive
- Counter reset: ensure counters don't wrap at high RPS (use uint64, reset every 1-2 seconds)
- Middleware integration: wire Allow() into your HTTP/gRPC stack, return 429 + Retry-After [Google Doorman]
When to use probabilistic drop: overload protection (cascade prevention), high throughput (>50K RPS), acceptable eventual consistency (1–2s lag). When to use Redis: billing, compliance, exact request counting.
Microbenchmark, shadow mode, and the 429 contract
The Allow() microbenchmark — set this as a regression gate so a refactor that adds a syscall to the hot path fails CI:
func BenchmarkClient_Allow(b *testing.B) {
bucket := Bucket("tenant:t:resource:r")
c := NewClient(make(chan map[Bucket]int64, 1))
c.UpdateDirective(bucket, Directive{DropRatio: 0.3, UpdatedAt: time.Now()})
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = c.Allow(bucket)
}
// Acceptable: <200ns/op, 0 allocs/op. Anything above 500ns means a
// syscall or allocation crept into the hot path.
}Shadow-mode middleware — runs the limiter and emits the metric but never rejects. Use this to measure what production traffic would be rejected before you flip the kill switch:
type Mode int
const (
ModeShadow Mode = iota // observe-only; never returns 429
ModeEnforce // production
)
func RateLimitMiddleware(c *Client, mode Mode, m *prometheus.CounterVec, next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
bucket := bucketFor(r) // e.g. Bucket("tenant:" + tenantID(r))
if !c.Allow(bucket) {
m.WithLabelValues("shadow_drop").Inc()
if mode == ModeEnforce {
w.Header().Set("Retry-After", "1")
w.Header().Set("X-RateLimit-Reason", "tenant_quota_exceeded")
http.Error(w, "rate limited", http.StatusTooManyRequests)
return
}
}
next.ServeHTTP(w, r)
})
}A 48-hour shadow run before enforcement is non-negotiable — if shadow_drop exceeds 5% of traffic with limits you thought were generous, your limits are too tight and would have caused a customer incident on day one. Tune against the shadow metric, then flip to ModeEnforce. [Google Doorman]
The controller circuit breaker — when the directive feed stalls, freeze enforcement at the last-known directive instead of failing open or closed. This is the canonical UpdateDirective from Tier 1, hardened with a staleness guard (same sync.RWMutex-protected directives map, plus a metrics counter on the struct):
func (c *Client) UpdateDirective(b Bucket, d Directive) {
if time.Since(d.UpdatedAt) > 30*time.Second {
// Reject obviously stale directives — controller is misbehaving.
// Keep the last good directive; the in-memory enforcement is
// approximate but never fails open or closed.
c.staleDirectives.Add(1)
return
}
c.mu.Lock()
c.directives[b] = d
c.mu.Unlock()
}The 429 response contract every limiter rejection should honour — Retry-After (seconds-or-HTTP-date) plus the per-bucket counters so the client can self-throttle without retrying blindly:
HTTP/1.1 429 Too Many Requests
Content-Type: application/problem+json
Retry-After: 1
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714579200
X-RateLimit-Reason: tenant_quota_exceeded
{
"type": "https://api.example.com/errors/rate-limited",
"title": "Rate limit exceeded",
"detail": "Tenant 't_42' exceeded 1000 requests/sec; retry in 1s.",
"status": 429
}The single most-skipped header is X-RateLimit-Reset — without it, well-behaved SDKs cannot back off precisely and end up either polling too aggressively (turning the 429 into sustained load) or sleeping too long (exaggerating the customer-visible latency).
A client SDK snippet that consumes the headers correctly — sleep until X-RateLimit-Reset, not a fixed exponential backoff that ignores the server's hint:
func (c *Client) Do(req *http.Request) (*http.Response, error) {
resp, err := c.http.Do(req)
if err != nil { return nil, err }
if resp.StatusCode == 429 {
resp.Body.Close()
if reset := resp.Header.Get("X-RateLimit-Reset"); reset != "" {
ts, _ := strconv.ParseInt(reset, 10, 64)
time.Sleep(time.Until(time.Unix(ts, 0)))
} else {
time.Sleep(parseRetryAfter(resp.Header.Get("Retry-After")))
}
return c.http.Do(req)
}
return resp, nil
}Composing Probabilistic Drop With Redis-Based Exact Counting
Probabilistic drop is the right primitive for overload protection — it cannot be the only primitive in a serious production system. Billing meters, free-tier quotas, fraud thresholds, and abuse mitigation all require exact counting that converges on a single number rather than a statistical mean. The composition pattern that consistently survives audit is a two-layer pipeline: probabilistic drop in front, Redis-based token bucket behind it, with each layer optimised for the question it actually answers.
The front layer answers "is the cluster overloaded right now". The back layer answers "has tenant X consumed their hourly quota". The front layer fires on every request at ~100ns; the back layer only fires on the survivors of the front layer, which is a fraction of nominal traffic and almost zero during incidents. This inverts the cost curve of a naive Redis-only setup: when traffic spikes, Redis sees less load, not more, because the probabilistic layer absorbs the surplus before it ever reaches the coordinator.
type LayeredLimiter struct {
fast *Client // tier-1 probabilistic drop
exact *RedisTokenBucket // tier-2 exact counting
metrics *prometheus.CounterVec
}
func (l *LayeredLimiter) Allow(ctx context.Context, b Bucket) (bool, string) {
if !l.fast.Allow(b) {
l.metrics.WithLabelValues("drop_fast").Inc()
return false, "cluster_overload"
}
ok, err := l.exact.Allow(ctx, b)
if err != nil {
// Redis unavailable — fall back to fast-layer decision only.
// We already passed the cluster overload check, so allow.
l.metrics.WithLabelValues("redis_degraded").Inc()
return true, "redis_degraded_passthrough"
}
if !ok {
l.metrics.WithLabelValues("drop_quota").Inc()
return false, "tenant_quota_exceeded"
}
return true, ""
}The reason string is what the 429 contract returns in X-RateLimit-Reason — clients can tell the difference between "you hit your billing quota" (don't retry, talk to sales) and "the cluster is on fire" (retry with backoff). Lumping them together is the most common mistake teams make when bolting Redis onto an existing probabilistic drop deployment.
Multi-Tenancy Fairness With Weighted Fair Queueing
Probabilistic drop with a single global drop ratio treats every tenant as interchangeable, which is fine for a closed internal system and a disaster for a paid SaaS. A noisy free-tier tenant can crowd out a paying enterprise tenant when they collide in the same overloaded zone, and the only signal the limiter gives back is "everyone gets dropped at 18%". The fix is per-tenant weights pushed alongside the drop ratio, so each tenant's effective drop rate becomes a function of its share of the overage rather than a flat slice across the cluster. [Google Doorman]
type TenantDirective struct {
DropRatio float64
Weight float64 // 1.0 = fair share, 4.0 = enterprise, 0.25 = free
UpdatedAt time.Time
}
func (c *Client) AllowTenant(b Bucket, tenant string) bool {
c.counter(b).Add(1)
d := c.tenantDirective(b, tenant)
if d.DropRatio <= 0 {
return true
}
// Weighted drop: enterprise tenants drop less, free tenants drop more.
effective := d.DropRatio / d.Weight
if effective >= 1.0 {
return false
}
return rand.Float64() >= effective
}The controller computes weights from a tenant catalogue — typically billing tier, contract SLA, and a manually-set boost flag for tenants currently being courted by sales. Surface the weight table as a YAML resource the controller hot-reloads, so support engineers can adjust it during an incident without a code deploy:
tenants:
default:
weight: 1.0
burst_multiplier: 1.5
enterprise_acme:
weight: 4.0 # 4x fair share during overload
burst_multiplier: 3.0 # tolerate 3x sustained rate before drop
notes: "Q2 contract, escalations to alex@"
free_tier:
weight: 0.25 # 4x more aggressive drop than fair
burst_multiplier: 1.0
partner_internal:
weight: 2.0
burst_multiplier: 2.0
fairness:
algorithm: deficit_round_robin
reweight_interval_s: 5
min_floor_rps: 5 # never drop a tenant below 5 rpsThe min_floor_rps floor is non-obvious but matters: without it, a free-tier tenant making one health-check call per minute can be dropped at 99% during overload because their rate is statistically indistinguishable from noise. A floor below which probabilistic drop is disabled keeps customer-visible signals like login flows working through the worst-case incident. [Google Doorman]
Real Production Deployment: Shadow, Canary, Enforce
Rolling probabilistic drop into a production system that previously had no global rate limiter — or one based on Redis that you are replacing — is the kind of change that breaks customers if you sequence it wrong. The deployment pattern that consistently lands without a customer incident is a three-stage rollout with explicit metric gates between each stage and a documented rollback signal.
Stage one is shadow mode for 7 days. The limiter runs end-to-end — directives flow, counters increment, the middleware emits shadow_drop — but the rejection branch is dead code. The gate to advance is shadow_drop / total_requests < 0.005 sustained over a 24-hour window; anything above 0.5% means the limit table is wrong and would have caused a measurable customer impact on day one. [Google Doorman]
Stage two is canary enforcement on 5% of traffic for 48 hours. The same directive feed, but a fraction of pods flip to ModeEnforce. The gates are tighter: 429 rate must stay below 0.1% of total requests on the canary cohort, p99 latency on the canary must not exceed control by more than 5ms, and the customer support ticket volume on the limited tenants must not increase week-over-week. Any breach reverts the canary cohort to shadow mode and pages the on-call. [Google Doorman]
# Shadow → canary gate: shadow drop rate must be under 0.5%
sum(rate(ratelimit_decision_total{result="shadow_drop"}[1h]))
/
sum(rate(ratelimit_decision_total[1h]))
< 0.005
# Canary → full enforce gate: canary 429s under 0.1% with no latency regression
(
sum by (cohort) (rate(http_requests_total{cohort="canary",code="429"}[15m]))
/
sum by (cohort) (rate(http_requests_total{cohort="canary"}[15m]))
) < 0.001
and
(
histogram_quantile(0.99, sum by (le, cohort) (rate(http_request_duration_seconds_bucket{cohort="canary"}[15m])))
-
histogram_quantile(0.99, sum by (le, cohort) (rate(http_request_duration_seconds_bucket{cohort="control"}[15m])))
) < 0.005
# Rollback signal: directive feed staleness > 60s on any zone
max by (zone) (time() - ratelimit_directive_last_update_timestamp_seconds) > 60Stage three is full enforcement, but with a kill switch. The kill switch is a single boolean in the controller config that flips every client back to ModeShadow within one directive cycle (~2s). On the team that ran the original incident this article opens with, the kill switch was hit twice in the first month: once during a marketing campaign that legitimately tripled traffic (limits were too low), once during a Redis incident on a downstream service that caused upstream retry storms. Both times it bought the on-call enough breathing room to fix the underlying issue without the rate limiter becoming the visible failure.
The metric every team forgets to wire up is directive feed staleness. A frozen controller that keeps shipping the same directive looks healthy from the client's perspective — the last-known directive is still valid — until traffic shape shifts and the stale ratio is wildly wrong for current load. Alert on time_since_last_directive_update > 60s per zone, not just on controller error rate, and the on-call will catch the silent failure mode before it turns into an incident.
Frequently Asked Questions
Why does Redis-based rate limiting fail at high scale?
At 80K RPS across 10 services, you need 80K Redis commands per second. Network latency (0.5–2ms per check) and Redis CPU saturation turn the rate limiter into the bottleneck it was supposed to prevent.
What is probabilistic drop?
Each service instance makes local allow/deny decisions by randomly dropping a percentage of requests based on a drop_ratio. The ratio is computed by a central controller and pushed asynchronously — the Allow() call is purely local.
How does the three-tier architecture work?
Tier 1 (instances) make local decisions using cached drop_ratio. Tier 2 (zone aggregators) collect counts and push directives. Tier 3 (controller) computes global rates and distributes quota. No tier blocks the request path.
What happens if the coordinator fails?
Instances continue using the last-known drop_ratio. Enforcement becomes approximate but never fails open (allow all) or closed (deny all), unlike Redis-based designs.
Keep Reading
- Rate Limiter Algorithms: Token Bucket vs Sliding Window — Single-node algorithms: token bucket, sliding window, and the Redis Lua scripts that power coordinator-based limiting
- Building Resilient Distributed Systems with Go — Circuit breakers and bulkheads complement rate limiting; together they form the full overload protection stack
- Go Worker Pool Pattern: Production-Ready Concurrency Control — The concurrency primitives behind the client's lock-free Allow() call
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.
Read Next
Consistent Hashing: The Algorithm Behind Every Scalable Distributed System
Adding one cache server shouldn't invalidate every key. Consistent hashing with virtual nodes and bounded loads — full Go and Java implementations.
Rate Limiter Algorithms: Token Bucket vs Sliding Window
Five rate limiting algorithms, their trade-offs, how to distribute them across a fleet, and client-side backoff that works.
Kafka vs RabbitMQ vs NATS vs SQS: Choosing the Right Message Broker
Kafka vs RabbitMQ vs NATS vs SQS: delivery semantics, ordering, throughput, ops complexity, and a decision framework with Go code.