Rate Limiter Algorithms: Token Bucket vs Sliding Window
Key Takeaways
- →Token bucket allows bursts up to capacity while enforcing sustained rate — ideal for bursty clients (mobile apps waking up); fixed window has sharp boundary spikes
- →Sliding window counter uses O(1) memory with 1-2% error (approximates logs); sliding window log uses O(n) timestamps but is precise — choose based on memory vs precision trade-off
- →IP-based rate limiting fails: enterprise users behind NAT share single IP (false positives), attackers rotate 1000s of IPs (false negatives) — use authenticated user ID or API key instead
- →HTTP 429 + Retry-After + X-RateLimit headers let well-behaved clients implement exponential backoff; without these, clients keep hammering, making attacks worse
At 2 AM on a Tuesday, an authentication service received 50,000 login attempts in 60 seconds from 2,000 rotating IP addresses. Credential stuffing — an attacker running leaked username/password pairs from a previous breach. No rate limiting was in place. The database absorbed the full load. Response times climbed from 40ms to 12 seconds. Legitimate users got timeout errors.
The first rate limiter failed: a naive fixed-window counter at 100 requests per minute per IP. The attacker adapted in hours — spread load across 1,000 IPs, kept each under the limit. The second attempt used sliding window per user ID plus IP limits for unauthenticated endpoints. That stopped the credential stuffing but broke enterprise customers behind NAT sharing a single IP.
Rate limiting is three problems: algorithm choice, key selection, and distributed coordination. Get any wrong and you block legitimate traffic or let attackers through. Below: the five algorithms with exact math, production code for each, the key-selection trap that breaks real deployments, and how to keep your rate limiter alive when Redis fails.
Choose your algorithm by answering two questions: Do legitimate clients burst? (yes → token bucket; no → sliding window counter). Rate limit by user ID or API key, not IP. Distribute rate limiting via Redis + Lua scripts, but fail back to local token buckets when the store goes down.
- Use token bucket for burst-tolerant APIs; sliding window counter for precise enforcement
- Rate limit by authenticated identity first, IP only for unauthenticated endpoints
- Store state in Redis with atomic Lua scripts; local in-memory buckets as failover
The Quick Start
| Algorithm | Burst allowed | Memory | Precision | Use case |
|---|---|---|---|---|
| Token Bucket | Yes (up to capacity) | O(1) per key | Approximate (smooths input) | Stripe, AWS, most public APIs — accommodates bursty traffic |
| Leaky Bucket | No (queue, then drain) | O(n) for queue depth | Precise (smooths output) | SMS/email providers — need constant downstream rate |
| Fixed Window | Yes (window boundary) | O(1) per key | Low (resets sharply) | Internal APIs, non-critical paths — simplest, but has edge-case spike |
| Sliding Window Log | Yes | O(n) timestamps | Very high (second precision) | Payment gateways, auth endpoints — no false negatives |
| Sliding Window Counter | No | O(1) per key | High (1-2% error) | Large-scale APIs (billions of keys) — approximates log without memory cost |
Algorithm 1: Token Bucket
The token bucket[Google Doorman] refills at a fixed rate and allows clients to burst up to the bucket capacity. Each request costs one token. A client idle for 30 seconds can immediately send 30 seconds worth of requests, then settles into the sustained rate. This is the most permissive algorithm — clients with legitimate bursts (mobile apps waking up, batch processors, connection-pooled clients) can absorb their spikes without hitting the limit.
The mental model: imagine a bucket that fills up to a maximum level. A worker empties tokens at a constant rate. Every request removes one token. Empty bucket = request rejected.
graph TD
Req["Incoming Request"] --> Refill["Refill tokens<br/>elapsed × rate"]
Refill --> Check{"tokens ≥ 1?"}
Check -->|Yes| Decr["Decrement token"]
Decr --> Allow["✓ Allow request"]
Check -->|No| Reject["✗ Reject<br/>429 + Retry-After"]
type TokenBucket struct {
tokens float64
capacity float64
refillRate float64 // tokens per second
lastRefill time.Time
mu sync.Mutex
}
func NewTokenBucket(capacity, refillRate float64) *TokenBucket {
return &TokenBucket{
tokens: capacity,
capacity: capacity,
refillRate: refillRate,
lastRefill: time.Now(),
}
}
func (tb *TokenBucket) Allow(n float64) (bool, time.Duration) {
tb.mu.Lock()
defer tb.mu.Unlock()
now := time.Now()
elapsed := now.Sub(tb.lastRefill).Seconds()
tb.tokens = min(tb.capacity, tb.tokens+elapsed*tb.refillRate)
tb.lastRefill = now
if tb.tokens >= n {
tb.tokens -= n
return true, 0
}
// Return time until n tokens available for Retry-After header
needed := n - tb.tokens
return false, time.Duration(needed/tb.refillRate) * time.Second
}Key math: A bucket with capacity 100 and refill rate 10/s allows a burst of 100 requests instantly, then sustains 10 req/s. Stripe uses token bucket for their payment API rate limiting. AWS API Gateway defaults to token bucket. The key insight: set capacity based on the burst you'll tolerate, not the sustained rate. A 100ms spike of 50 requests? Set capacity to 50. A 1-second spike? Set capacity to your req/s rate. This algorithm wins because real traffic is bursty, and token bucket acknowledges that instead of punishing it.
Algorithm 2: Leaky Bucket
[Redis Docs]The leaky bucket queues requests and drains them at a constant rate. Unlike token bucket, bursts are delayed (queued), not rejected. When the queue fills, new requests are rejected. Use only when output smoothing is mandatory: SMS providers can't handle 1,000 messages/sec, payment gateways need smooth load for PCI compliance. For most APIs, leaky bucket's forced client-side latency is overly restrictive.
type LeakyBucket struct {
queue chan struct{}
ratePerSec int
}
func (lb *LeakyBucket) Allow() bool {
select {
case lb.queue <- struct{}{}:
return true
default:
return false // Queue full
}
}Algorithm 3: Fixed Window Counter
Count requests in a fixed time window. When the window resets, the counter resets. Simplest to implement: one integer per key. But it has a sharp edge case called the boundary spike: if a client sends all their requests at the window boundary, they can exceed the intended rate.
type FixedWindow struct {
limit int
window time.Duration
counter int
windowEnd time.Time
mu sync.Mutex
}
func NewFixedWindow(limit int, window time.Duration) *FixedWindow {
return &FixedWindow{
limit: limit,
window: window,
windowEnd: time.Now().Add(window),
}
}
func (fw *FixedWindow) Allow() bool {
fw.mu.Lock()
defer fw.mu.Unlock()
now := time.Now()
if now.After(fw.windowEnd) {
fw.counter = 0
fw.windowEnd = now.Add(fw.window)
}
if fw.counter < fw.limit {
fw.counter++
return true
}
return false
}The spike problem: With a 100 req/min limit in a 60-second window, a client can send 100 requests at second 59 (last second of window), then 100 more at second 60 (first second of new window), effectively doubling the rate to 200 req/min for that 2-second window. This spike can overload backends that expect steady traffic. Acceptable for non-critical internal APIs; risky for user-facing endpoints or auth/payment flows where you need stronger guarantees.
Algorithm 4: Sliding Window Log
Store the exact timestamp of each request in a sorted set (Redis ZSET[Redis Docs]). To check if a request is allowed, remove timestamps older than the window, count remaining requests, and compare to the limit. Very accurate but O(n) memory per key.
// Redis Lua script for atomic sliding window log check
const slidingWindowLogScript = `
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window_ms = tonumber(ARGV[2])
local now_ms = tonumber(ARGV[3])
local window_start = now_ms - window_ms
redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
local current = redis.call('ZCARD', key)
if current < limit then
redis.call('ZADD', key, now_ms, now_ms)
redis.call('EXPIRE', key, math.ceil(window_ms / 1000))
return 1 -- Allow
else
return 0 -- Reject
end
`Advantages: No false negatives — if a request is allowed in the sliding window, it will remain allowed. Zero enforcement error. Perfect for payment processing (no illegitimate transactions slip through) and strict regulatory endpoints. Disadvantage: memory scales with request volume. For a key handling 1,000 req/min, you store 1,000 timestamps. For billions of keys, this is expensive. Only use when the precision is worth the cost.
Algorithm 5: Sliding Window Counter
Approximate sliding window log by tracking two fixed-window counters: current and previous. Weight the previous window's count by the fraction of the current window that overlaps it. This gives you ~99% accuracy of sliding window log with O(1) memory per key.
The math: at time t inside the current window, the observed rate is
curr + prev * ((windowSize - elapsed) / windowSize). When the clock crosses the window boundary, curr becomes prev and a new curr = 0 starts.
graph LR
subgraph PW["Previous window"]
direction TB
P["count = prev<br/>(e.g. 80)"]
end
subgraph CW["Current window<br/>(elapsed since currStart)"]
direction TB
C["count = curr<br/>(e.g. 25)"]
end
PW -.weighted<br/>overlap.-> Rate["Observed rate<br/>= curr + prev × (1 − elapsed/window)<br/>= 25 + 80 × 0.6 = 73"]
CW --> Rate
Rate --> Decide{"rate < limit?"}
Decide -->|Yes| Allow["✓ Allow"]
Decide -->|No| Reject["✗ Reject"]
Two integers per key replace a potentially-unbounded timestamp log. Trade: the weighting assumes the previous window's load was uniform; bursty traffic near the boundary can under- or over-count by a few percent.
type SlidingCounter struct {
limit int
window time.Duration
curr int
prev int
currStart time.Time
mu sync.Mutex
}
func NewSlidingCounter(limit int, window time.Duration) *SlidingCounter {
return &SlidingCounter{
limit: limit,
window: window,
currStart: time.Now(),
}
}
func (sc *SlidingCounter) Allow() bool {
sc.mu.Lock()
defer sc.mu.Unlock()
now := time.Now()
elapsed := now.Sub(sc.currStart)
if elapsed > sc.window {
sc.prev = sc.curr
sc.curr = 0
sc.currStart = now
elapsed = 0
}
// Weight previous window by overlap fraction
fraction := 1.0 - (float64(elapsed) / float64(sc.window))
weighted := float64(sc.prev) * fraction
if weighted+float64(sc.curr) < float64(sc.limit) {
sc.curr++
return true
}
return false
}The math: if 70% of the current 60-second window has elapsed and the previous window had 80 requests with a 100 req/min limit, we weight the previous count as 80 × 0.30 = 24 (30% of the 60s overlaps the previous window). Combined with current count, the decision is precise to within ~1-2%. This ~1% error is acceptable for most APIs and buys you O(1) memory. Large-scale CDNs and API gateways use this approach for rate limiting billions of keys because memory is constant regardless of traffic volume. [Google Doorman]
Distributed Rate Limiting
Single-machine rate limiting breaks at scale when 50 pods each enforce locally — attackers hit each independently. You need shared state via Redis with atomic Lua scripts[Redis Docs], plus a fallback to conservative local token buckets when Redis is down. For latency-critical APIs, use local in-memory limiters with periodic async sync to Redis, accepting ~1% enforcement error to avoid adding 5ms per request. For deeper patterns, see Scaling Redis for High-Throughput Workloads and the probabilistic-drop variant.
What to Rate Limit By
[OWASP Top 10, 2021]The key selection problem breaks more deployments than the algorithm choice. Choose the wrong key and you'll either block legitimate traffic or let attackers through.
Primary key: authenticated identity (user ID, API key). This is always available for authenticated endpoints and is the only reliable limit. Set per-user limits based on their subscription tier: free tier 100 req/min, professional 1,000 req/min, enterprise 10,000 req/min. This way, if one user is compromised, the damage is limited to that user's quota, not the entire service.
Secondary key: IP address (only for unauthenticated endpoints). Understand its failure modes:
- False positives: Enterprise users behind NAT share a single IP. A company with 500 employees on the same network can hit an IP-based limit of 10 req/min in seconds, all legitimate traffic.
- False negatives: Attackers rotate IPs. Our credential stuffing attack came from 2,000 rotating IPs — each one stayed under the per-IP limit.
Use IP limits defensively: catch obvious attack patterns (scanning 1,000 endpoints in 10 seconds, or 50 login attempts in 5 seconds) but don't rely on them as primary enforcement.
Tiered limiting (the working model): Combine both. Limit per-user generously (e.g., 1,000 req/min for authenticated users) and per-IP tightly (e.g., 10 req/min for unauthenticated). When a user hits the per-IP limit on /login, they simply authenticate first (switching to the higher per-user limit). When an attacker rotates IPs, each IP hits a low limit. This trades some false positives (enterprises) for false negatives (attackers), and the false positives have a workaround (authentication).
Client-Side Rate Limiting
[Beyer et al., 2016]When your client hits a 429 response, implement exponential backoff with full jitter to spread retry attempts across the window:
func exponentialBackoffWithJitter(attempt int) time.Duration {
// Cap at 60 seconds
maxBackoff := time.Duration(math.Min(60, math.Pow(2, float64(attempt)))) * time.Second
// Full jitter: random delay between 0 and maxBackoff
jitter := time.Duration(rand.Int63n(maxBackoff.Milliseconds())) * time.Millisecond
return jitter
}Why full jitter matters: Without jitter, all clients retrying at the same time create a thundering herd. Full jitter spreads retries uniformly, smoothing the load spike. Use the Retry-After header if provided; otherwise calculate your own.
For internal APIs you control, implement a client-side token bucket that prevents sending requests you know will be rejected:
func (c *Client) Do(ctx context.Context, req *http.Request) (*http.Response, error) {
// Block until a token is available; respects context cancellation
for !c.bucket.Allow(1) {
waitTime := c.bucket.TimeUntilAllowed(1)
select {
case <-time.After(waitTime):
case <-ctx.Done():
return nil, ctx.Err()
}
}
return http.DefaultClient.Do(req)
}This eliminates round-trip costs for requests that would be rejected anyway.
Production Checklist
- Choose algorithm based on burst tolerance: token bucket for bursty clients (mobile, batch jobs), sliding window counter for precise enforcement at scale
- Rate limit by authenticated identity (user ID, API key) first; use IP limits defensively only for unauthenticated endpoints
- Store counters in Redis with atomic Lua scripts; distribute via shared state, not local counts
- Return HTTP 429 with Retry-After, X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers
- Implement exponential backoff with full jitter (0 to 2^attempt seconds) on the client side
- Deploy a local token bucket failover when Redis is unreachable — fail open (reduce limits) rather than fail closed (reject everything)
- Monitor token bucket age to detect when rate limit keys have expired; replenish if needed
- Set alerts when 429 rate exceeds 1% of traffic (early sign of abuse or misconfiguration) [Google Doorman]
The Redis Lua script every distributed limiter uses
Token-bucket-in-Redis only works if the read-modify-write is atomic. The naïve GET + math + SET loses the race; EVAL with the Lua script below executes inside the Redis single-thread, so concurrent consumers cannot double-spend tokens:
-- token_bucket.lua — atomic refill + consume
-- KEYS[1] = bucket key (e.g. "rl:user:42")
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = refill_rate (tokens per second)
-- ARGV[3] = now (current unix time, ms)
-- ARGV[4] = cost (tokens to consume; usually 1)
-- Returns: { allowed (1/0), tokens_remaining, retry_after_ms }
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now_ms = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local data = redis.call("HMGET", KEYS[1], "tokens", "last_refill_ms")
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now_ms
local elapsed_ms = math.max(0, now_ms - last_refill)
local refill = (elapsed_ms / 1000.0) * refill_rate
tokens = math.min(capacity, tokens + refill)
local allowed = tokens >= cost and 1 or 0
local retry_ms = 0
if allowed == 1 then
tokens = tokens - cost
else
retry_ms = math.ceil(((cost - tokens) / refill_rate) * 1000.0)
end
redis.call("HMSET", KEYS[1], "tokens", tokens, "last_refill_ms", now_ms)
redis.call("PEXPIRE", KEYS[1], math.ceil((capacity / refill_rate) * 1000) * 2)
return { allowed, math.floor(tokens), retry_ms }The matching Prometheus alerts that catch limiter-driven incidents — abuse spikes, Redis outages forcing a fail-open, and key-cardinality drift that explodes memory:
# alerts/rate-limiter.yml
groups:
- name: rate-limiter
rules:
- alert: HighRejectionRate
expr: |
sum(rate(http_requests_total{status="429"}[5m]))
/ sum(rate(http_requests_total[5m])) > 0.01
for: 3m
labels: { severity: warning }
annotations:
summary: "429 rate >1% of traffic"
description: "Either abuse or limiter misconfiguration. Check top-N rate-limited identities."
- alert: LimiterFailedOpen
expr: rate(rate_limiter_redis_errors_total[5m]) > 0
for: 1m
labels: { severity: critical }
annotations:
summary: "Rate limiter cannot reach Redis — failing open"
description: "Local-token-bucket fallback is active. Restore Redis before quota abuse compounds."
- alert: LimiterKeyCardinalityHigh
expr: |
(count(count by (limiter_key) (rate_limiter_requests_total)) > 500000)
for: 15m
labels: { severity: warning }
annotations:
summary: "Rate limiter has >500k unique keys"
description: "Probably keying on a high-cardinality field (per-request ID instead of per-user). Memory will OOM Redis at scale."Set the LimiterKeyCardinalityHigh threshold against your actual user count — the failure mode it catches is "we accidentally keyed by request ID" which inflates Redis memory linearly with traffic.
Anti-Patterns We've Seen in Production
Three failure modes show up repeatedly in postmortems. Each looks fine in code review and unit tests; each surfaces only under real traffic.
Anti-pattern 1: Keying on the request URL with path parameters baked in
A team rolled out per-endpoint rate limiting using the raw request path as part of the key. The intent was reasonable — different limits for /api/v1/orders vs /api/v1/search. The implementation wasn't:
// Broken: every order ID becomes its own rate-limit key
key := fmt.Sprintf("rl:%s:%s", userID, r.URL.Path)
// → rl:user:42:/api/v1/orders/9f8e7d6c-...By Wednesday, Redis had 18 million keys, memory was at 11 GB, and the limiter was effectively no-op because every key saw exactly one request. The fix is to canonicalise the route before keying — most routers expose the matched template, not the resolved path:
// Fixed: use the route template, not the resolved URL
template := chi.RouteContext(r.Context()).RoutePattern()
// → /api/v1/orders/{id}
key := fmt.Sprintf("rl:%s:%s", userID, template)If your framework doesn't expose the template, hash the route definition at registration time and tag requests with it. Never let user-controlled URL segments enter the key.
Anti-pattern 2: Failing closed when the limiter store is unreachable
Redis goes down for 90 seconds during a failover. The limiter throws on every EVAL call. The middleware returns 500. The entire API is offline — not because the upstream is broken, but because the limiter is. We've seen this exact incident at three different companies.
The fix is fail-open with a conservative local fallback, plus a circuit breaker so you don't hammer a dead Redis:
func (l *Limiter) Allow(ctx context.Context, key string) (bool, error) {
if l.breaker.State() == StateOpen {
return l.localFallback.Allow(key), nil // permissive but bounded
}
allowed, err := l.redisCheck(ctx, key)
if err != nil {
l.breaker.RecordFailure()
metrics.LimiterFailedOpen.Inc()
return l.localFallback.Allow(key), nil
}
l.breaker.RecordSuccess()
return allowed, nil
}The local fallback should be tighter than the Redis-backed limit (e.g. 50% of nominal) to bound abuse during the outage. Page on LimiterFailedOpen so the on-call sees fail-open is active even when user-facing latency looks normal. [Google Doorman]
Anti-pattern 3: Shipping a new algorithm without shadow mode
Switching from fixed window to sliding window counter is a one-line config change. It's also a behaviour change — the boundary tolerance disappears, so clients that previously sneaked through right at the window edge now get 429s. We watched a team do this at noon on a weekday; their iOS app's retry logic wasn't compatible with Retry-After and went into a tight loop, tripling the very traffic the new limiter was supposed to control.
Always run the new limiter in shadow mode for 24-48 hours first. Compute the decision both ways, log the divergence, but only enforce the old one:
oldDecision := l.fixedWindow.Allow(key)
newDecision := l.slidingCounter.Allow(key)
if oldDecision != newDecision {
metrics.LimiterShadowDivergence.WithLabelValues(
boolToStr(oldDecision), boolToStr(newDecision),
).Inc()
}
return oldDecision // still authoritativeThe divergence metric tells you exactly which clients will be newly rejected before you flip the switch. If divergence is above ~2% of traffic, the cutover needs a coordinated client release first. [Google Doorman]
Frequently Asked Questions
What is the difference between token bucket and sliding window rate limiting?
Token bucket allows burst traffic up to the bucket capacity while enforcing an average rate, making it ideal for bursty clients like mobile apps. Sliding window enforces a strict count over a rolling time window, providing more precise rate enforcement but no burst tolerance.
Why is IP-based rate limiting unreliable?
IP-based rate limiting fails because enterprise users behind NAT share a single IP (causing false positives) and attackers rotate across thousands of IPs (causing false negatives). Use authenticated user ID or API key as the primary rate limit key, with IP limits only for unauthenticated endpoints.
How do you implement distributed rate limiting across multiple servers?
Use a shared store like Redis with atomic Lua scripts for centralized counting. For higher scale, use local in-memory rate limiters synchronized asynchronously with a central coordinator, accepting approximate enforcement to avoid adding network latency to every request.
What should a rate limiter return when a request is rejected?
Return HTTP 429 Too Many Requests with Retry-After, X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers. These headers let well-behaved clients implement exponential backoff and avoid hammering the API.
Keep Reading
- Scaling Redis for High-Throughput Workloads — Redis is your rate limit store; understand its latency profile and failure modes
- Caching Strategies at Scale — rate limiting often pairs with caching; understand the interaction
- Understanding Raft Consensus — rate limiting is a coordination problem; understanding consensus helps with distributed decisions
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.
Read Next
Consistent Hashing: The Algorithm Behind Every Scalable Distributed System
Adding one cache server shouldn't invalidate every key. Consistent hashing with virtual nodes and bounded loads — full Go and Java implementations.
Distributed Rate Limiting at Scale: The Probabilistic Drop Architecture
Probabilistic drop rate limiting: uncoordinated enforcement bypassing Redis for 1M+ RPS with zero coordination overhead.
Kafka vs RabbitMQ vs NATS vs SQS: Choosing the Right Message Broker
Kafka vs RabbitMQ vs NATS vs SQS: delivery semantics, ordering, throughput, ops complexity, and a decision framework with Go code.