Building Resilient Distributed Systems with Go
Key Takeaways
- →Slow dependencies cause more cascading damage than crashed ones — a 4.8s query held connections open, exhausting the pool; a crash would have freed resources immediately
- →Circuit breaker prevents thundering herd: after N failures, open the circuit (fail fast); after timeout, probe; on success, close and resume — blocks cascade at the source
- →Bulkhead isolates resource pools per dependency — one slow service exhausts only its own connections, leaving healthy dependencies untouched and serviceable
- →Timeout propagation via context.WithTimeout down the call chain ensures the client's deadline is respected; without it, a 5-second client timeout becomes 10+ seconds via nested timeouts
The classic slow-dependency production cascade. A database query degrades from 5 ms to 2 seconds. A downstream product-page service calling it on every request sees its own latency multiply, burns its connection pool, and fails. Every dependent service follows. We debugged this exact failure pattern on multiple production teams — slow is worse than crashed because crashes free resources immediately, while slow holds them for the full timeout.
When One Slow Dependency Takes Down Your Whole System
The classic slow-dependency-cascade pattern: a database query — not a crash — paralyses multiple services within minutes. A recommendation service's query plan degrades from milliseconds to seconds. A downstream product page service, calling it on every request, sees its own latency multiply. At sustained traffic, the product page burns through orders of magnitude more connections than it allocated. The pool exhausts. Everything depending on it fails. We've debugged variants of this on multiple services. Slow is worse than crashed because crashes free resources immediately; slow holds them for the full timeout[Beyer et al., 2016].
Slow dependencies cause more damage than crashes because they hold resources open for timeout durations, cascading failure through callers[Beyer et al., 2016]. Stop failures from propagating with circuit breakers (fail fast on broken deps), bulkheads (isolate resource pools), timeouts (enforce deadline chains), and idempotent retries (safe second attempts).
- Circuit breaker: Open after N failures, probe after timeout, close on success — blocks cascade at the source
- Bulkhead: Separate connection/goroutine limits per dependency — prevents one from starving others
- Timeout propagation: Pass context deadlines down the call chain — respect the client's timeout budget
The Quick Start: Pattern Comparison
| Pattern | Purpose | When | Cost |
|---|---|---|---|
| Timeout | Enforce deadline chains via context.WithTimeout(ctx, duration)[Go Language Specification] | Every downstream call | Zero overhead; prevents hangs |
| Circuit Breaker | Open after N failures, probe after timeout | Each external dependency | Low; prevents thundering herd |
| Bulkhead | Separate concurrency limits per dependency | Per-service or per-endpoint | Low; prevents resource starvation |
| Idempotent Retry | Unique key per operation; retry on transient errors only | Mutating operations (orders, payments) | Moderate; requires request deduplication |
Circuit Breaker: Stop Hitting a Failing Dependency
[Beyer et al., 2016]The circuit breaker is the most important resilience pattern. When a dependency is failing, stop calling it — fail fast instead of waiting for timeouts to expire. It has three states: Closed (normal operation), Open (dependency is failing, reject calls immediately), and HalfOpen (dependency may be recovering, allow probe requests).
stateDiagram-v2
[*] --> Closed
Closed --> Open: failure rate ≥ threshold
Open --> HalfOpen: reset timeout elapsed
HalfOpen --> Closed: probe(s) succeed
HalfOpen --> Open: any probe fails
note right of Closed
Normal ops.
Count failures.
end note
note right of Open
Reject fast.
Wait resetTimeout.
end note
note right of HalfOpen
Allow N probes.
Decide next state.
end note
Three tunables drive behaviour: the failure threshold (percentage over a window that flips Closed → Open), the reset timeout (how long Open waits before probing), and the probe count in HalfOpen (too few misdiagnoses transient blips as recovered; too many hammer a still-failing dependency).
The Race Condition Most Implementations Get Wrong
The common mistake is upgrading from a read lock to a write lock between reads. Between releasing the read lock and acquiring the write lock, multiple goroutines can slip through and all try to transition to HalfOpen:
// WRONG: race between RUnlock and Lock
func (cb *CircuitBreaker) Execute(fn func() error) error {
cb.mu.RLock()
if cb.state == StateOpen {
cb.mu.RUnlock() // release read lock
cb.mu.Lock() // ← goroutines can slip between here
cb.state = StateHalfOpen // ← multiple can reach this
cb.mu.Unlock()
} else {
cb.mu.RUnlock()
}
// ...
}Result: multiple probe requests hit the recovering service simultaneously, causing it to fail again. The fix: use a single mutex for all state transitions, never upgrading locks.
type CircuitBreaker struct {
mu sync.Mutex
state int // 0=Closed, 1=Open, 2=HalfOpen
failCount int
timeout time.Duration
openedAt time.Time
}
func (cb *CircuitBreaker) Execute(fn func() error) error {
cb.mu.Lock()
defer cb.mu.Unlock()
// Check if circuit is open and timeout hasn't expired
if cb.state == StateOpen && time.Since(cb.openedAt) <= cb.timeout {
return fmt.Errorf("circuit breaker open: dependency unavailable")
}
// Transition from Open to HalfOpen when timeout expires
if cb.state == StateOpen {
cb.state = StateHalfOpen // only inside lock
}
cb.mu.Unlock()
err := fn() // execute outside lock to avoid deadlock
cb.mu.Lock()
// Record result
if err != nil {
cb.failCount++
if cb.failCount >= 5 { // threshold: 5 failures
cb.state = StateOpen
cb.openedAt = time.Now()
}
} else {
cb.failCount = 0
cb.state = StateClosed // success closes the circuit
}
return err
}This ensures only one transition happens at a time, preventing the thundering herd of probes on recovery.
Bulkhead: Separate Resource Pools Per Dependency
The cascade failure in our opening story happened because all outbound calls — to recommendations, inventory, pricing — shared the same goroutine pool and connection pool. When the recommendation service slowed, it burned all available connections. Inventory and pricing calls starved.
The shared-pool failure mode versus the bulkhead-isolated fix:
graph TB
subgraph SharedPool[BEFORE — shared pool]
S1[100 connections] --> R1[Recommendations<br/>slow — holds 95 conns]
S1 --> I1[Inventory<br/>cannot get conn — starves]
S1 --> P1[Pricing<br/>cannot get conn — starves]
end
subgraph Bulkheaded[AFTER — bulkhead isolation]
B1[40 conns —<br/>recommendations] --> RB[Recommendations<br/>slow — only this pool exhausted]
B2[30 conns —<br/>inventory] --> IB[Inventory<br/>still healthy]
B3[30 conns —<br/>pricing] --> PB[Pricing<br/>still healthy]
end
style R1 fill:#fdd
style I1 fill:#fdd
style P1 fill:#fdd
style RB fill:#fdd
style IB fill:#dfd
style PB fill:#dfd
Bulkhead isolation fixes this: each dependency gets its own resource limit. One slow dependency exhausts only its own allocation, leaving others untouched.
type Bulkhead struct {
sem chan struct{} // buffered channel acts as semaphore
}
func NewBulkhead(maxConcurrent int) *Bulkhead {
return &Bulkhead{sem: make(chan struct{}, maxConcurrent)}
}
func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {
select {
case b.sem <- struct{}{}: // acquire one slot
defer func() { <-b.sem }() // release on exit
return fn()
case <-ctx.Done():
return ctx.Err() // timeout waiting for slot
default:
return fmt.Errorf("bulkhead full: %d concurrent limit exceeded", cap(b.sem))
}
}Wire separate bulkheads per dependency:
// Recommendations: generous (50 concurrent, less critical)
recommendations := NewBulkhead(50)
// Inventory: tight (100 concurrent, critical for checkout)
inventory := NewBulkhead(100)
// Pricing: medium (75 concurrent)
pricing := NewBulkhead(75)
// In your handler, each call is isolated
err := recommendations.Execute(ctx, func() error {
return svc.getRecommendations(ctx, productID)
})
// If recommendations is at 50/50, inventory calls don't wait — they have their own poolEven if recommendations burns all 50 slots with slow calls, inventory and pricing proceed independently.
Timeout Propagation: Respect the Parent Context Deadline
[Go context]Every downstream call must respect the incoming context's deadline. If the client's timeout is 500ms total and you make two 500ms calls sequentially, you've already violated the contract.
// WRONG: creates new context, loses parent deadline
func (svc *Service) CallDownstream(ctx context.Context) error {
newCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
return svc.downstream.Call(newCtx)
}
// RIGHT: derive from parent, add per-call buffer as safety net
func (svc *Service) CallDownstream(ctx context.Context) error {
callCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()
return svc.downstream.Call(callCtx)
}Budget timeout across multiple calls:
// Handler has 500ms total budget
ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
defer cancel()
// First call: 200ms
productCtx, _ := context.WithTimeout(ctx, 200*time.Millisecond)
product, _ := svc.getProduct(productCtx, id)
// Second call: uses remaining time from parent deadline (auto-enforced)
recs, _ := svc.getRecommendations(ctx, product.ID)WithTimeout(parent, duration) never exceeds the parent's deadline — the tighter constraint wins.
Idempotent Retries: Safe Retries Without Duplicate Orders
[Stripe idempotency]Retries are essential for transient failures (network hiccups, 502s). But retrying non-idempotent operations — creating orders, charging cards — produces duplicates. The fix: send the same idempotency key on retry, and the server deduplicates.
// Generate a deterministic key once; use on all retry attempts
func generateIdempotencyKey(userID, productID string, amount float64) string {
h := sha256.New()
fmt.Fprintf(h, "%s:%s:%.2f", userID, productID, amount)
return hex.EncodeToString(h.Sum(nil))
}
// On retry, server sees same key and returns original response, not a new order
idempotencyKey := generateIdempotencyKey(userID, productID, amount)
for attempt := 0; attempt < 3; attempt++ {
req, _ := http.NewRequestWithContext(ctx, "POST", orderServiceURL+"/orders", body)
req.Header.Set("Idempotency-Key", idempotencyKey) // same key every attempt
req.Header.Set("Content-Type", "application/json")
resp, err := client.Do(req)
if err != nil && isRetryable(err) {
continue // retry on transient errors only
}
if resp.StatusCode >= 500 {
continue // retry on 5xx
}
return resp // success or permanent failure (4xx)
}See Idempotency Patterns: Building Retry-Safe Distributed Systems for server-side deduplication strategies.
The Opening Incident: How These Patterns Prevent Cascade Failure
Recall the slow recommendation service: 4.8 second queries instead of 10ms. With these patterns in place, here's what happens:
-
Circuit breaker detects failure — after 5 consecutive timeouts (25 seconds total), it opens. Subsequent calls fail immediately with "circuit open" instead of waiting 5 seconds for timeout.
-
Bulkhead protects inventory and pricing — even though recommendation's 50-slot bulkhead fills with slow calls, inventory and pricing each have separate limits. They proceed unaffected.
-
Product page degrades gracefully — recommendation calls fail fast, returning an empty list fallback. Product page still renders in 80ms instead of 5 seconds.
-
Connection pool never exhausts — goroutines are released milliseconds after failing, not held open for full timeout duration.
-
Recommendation service recovers — DBA fixes the query plan. After 30 seconds, circuit transitions to HalfOpen and probes recovery with a single request. If successful, circuit closes and recommendations resume.
Result: users see product pages without recommendations for 25 seconds. Zero 500 errors. No cascade to dependent services. Every failure is isolated and degrades gracefully.
Production Checklist
For each external dependency:
[ ] Circuit breaker with single mutex (no RLock/Lock races)
[ ] Separate bulkhead (concurrency limit) per dependency
[ ] Idempotency key on all mutating operations (orders, payments)
[ ] Per-call timeout derived from parent context (never context.Background())
[ ] Retry only on retryable errors (503, 502, timeouts — not 400, 404)
For your service:
[ ] /healthz (liveness): checks process only, not dependencies
[ ] /readyz (readiness): checks critical dependencies with 3s timeout
[ ] Global request timeout on all handlers (e.g., 500ms for APIs)
[ ] Structured logging for circuit breaker state changes
[ ] Prometheus metrics: circuit open rate, bulkhead rejections, retry attemptsAlerting rules: catch resilience failures before users do
The Prometheus metrics above are useless without alert rules. The three rules below catch the cascade-precursor symptoms — circuit-breaker thrashing, bulkhead saturation, and runaway retry storms — at thresholds that fire minutes before user-visible latency:
# alerts/resilience.yml — paste alongside your service's existing PrometheusRule
groups:
- name: resilience
interval: 30s
rules:
- alert: CircuitBreakerOpenRateHigh
expr: |
(
sum by (service, dependency) (
rate(circuit_breaker_state_changes_total{to_state="open"}[5m])
)
) > 0.1
for: 2m
labels:
severity: warning
annotations:
summary: "{{ $labels.dependency }} breaker opening repeatedly from {{ $labels.service }}"
description: |
Circuit breaker for {{ $labels.dependency }} has opened {{ $value }} times/sec
over the last 5 minutes — usually means the dependency is degraded, not crashed.
Check downstream latency p99 and connection-pool utilisation.
- alert: BulkheadSaturated
expr: |
(
bulkhead_active_calls{service=~".+"}
/ bulkhead_max_concurrent
) > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "Bulkhead {{ $labels.dependency }} > 85% saturated"
description: |
{{ $value | humanizePercentage }} of the bulkhead slots are in use. Either traffic
is exceeding capacity or the dependency is slow. Check {{ $labels.dependency }}
latency before raising the bulkhead size.
- alert: RetryStormDetected
expr: |
(
rate(retry_attempts_total{outcome="retry"}[5m])
/ rate(retry_attempts_total[5m])
) > 0.3
for: 3m
labels:
severity: critical
annotations:
summary: "{{ $labels.service }} is in a retry storm against {{ $labels.dependency }}"
description: |
>30% of calls are being retried. Combined with an open circuit breaker this
is the classic cascade-failure signature — the retry budget is amplifying load
on a degraded dependency. Verify the breaker is firing and consider tightening
retry budgets or shortening per-call timeouts.Tune the 0.1, 0.85, and 0.3 thresholds against your steady-state baselines — every service has a different healthy retry rate. The pattern matters more than the numbers: rate-of-state-change for breakers, ratio-saturation for bulkheads, ratio-of-retries for retry budgets.
Hedged Requests: Trading a Little Load for a Lot of Tail Latency
Circuit breakers and bulkheads protect you from broken dependencies; hedged requests protect you from the long tail of healthy ones. The idea, popularised by Google's "Tail at Scale" paper[Dean & Barroso, 2013], is simple: if a request hasn't returned by the dependency's p95 latency, fire a duplicate to a different replica and take whichever response arrives first. The slow request gets cancelled. You pay a few percent extra load and crush p99 latency by an order of magnitude — useful when downstream replicas are independent (different shards, different pods) and the request is read-only or idempotent.
The two failure modes to design out are duplicate writes and runaway hedging on a globally degraded dependency. The first is solved by hedging only idempotent reads, or pairing the hedge with the same idempotency key from the retry section so the server deduplicates. The second is solved by capping hedges per parent request and gating them through the same circuit breaker: when the breaker is open, hedging is pointless and only adds load.
// HedgedCall fires a primary request immediately and a backup after `delay`.
// Whichever wins is returned; the loser is cancelled via context propagation.
func HedgedCall[T any](
ctx context.Context,
delay time.Duration,
fn func(context.Context) (T, error),
) (T, error) {
type result struct {
val T
err error
}
out := make(chan result, 2)
callCtx, cancel := context.WithCancel(ctx)
defer cancel() // cancels the loser as soon as we return
go func() {
v, err := fn(callCtx)
out <- result{val: v, err: err}
}()
timer := time.NewTimer(delay)
defer timer.Stop()
select {
case r := <-out:
return r.val, r.err // primary returned before hedge fired
case <-timer.C:
go func() {
v, err := fn(callCtx) // fire the duplicate
out <- result{val: v, err: err}
}()
r := <-out
return r.val, r.err
case <-ctx.Done():
var zero T
return zero, ctx.Err()
}
}Pick delay from histograms — your dependency's measured p95, not a guess. Hedging at p50 doubles load for marginal gains; hedging at p99 fires too late to help. The cancellation is doing real work: when the hedge wins, the primary's callCtx is cancelled and any in-flight HTTP client honouring req.WithContext will close the connection mid-flight, freeing it for the next caller. Skip that, and a "winning" hedge still ties up resources until the original call drains.
Adaptive Concurrency Limits: Self-Tuning Bulkheads
Static bulkhead sizes are a guess. The right answer changes with traffic mix, GC pauses, downstream capacity, and the time of day. Netflix's concurrency-limits[Netflix concurrency-limits] library treats the limit as a control variable: probe gently, increase when latency is healthy, halve when it isn't. The algorithm is AIMD — additive increase, multiplicative decrease — borrowed straight from TCP congestion control. It works because it's the same problem: many independent senders sharing a finite-capacity bottleneck, with no out-of-band signal of remaining headroom beyond the latency you observe.
The signal is queueing: if measured latency starts climbing above the lowest-observed baseline (a Little's Law proxy for a saturated server), shrink. If it stays close to baseline under load, grow. You never need a hand-tuned value because the limit converges on whatever the dependency can actually serve.
// AdaptiveLimit implements an AIMD controller over the bulkhead semaphore.
// Wrap each call with Acquire/Release; the limit retunes itself.
type AdaptiveLimit struct {
mu sync.Mutex
limit int // current concurrency cap
minLimit int // floor — never starve below this
maxLimit int // ceiling — protect downstream
inFlight int
minRTT time.Duration // best observed latency
sampleCount int
}
func (a *AdaptiveLimit) Acquire(ctx context.Context) (release func(time.Duration, error), err error) {
a.mu.Lock()
if a.inFlight >= a.limit {
a.mu.Unlock()
return nil, fmt.Errorf("adaptive limit reached: %d", a.limit)
}
a.inFlight++
a.mu.Unlock()
return func(rtt time.Duration, callErr error) {
a.mu.Lock()
defer a.mu.Unlock()
a.inFlight--
a.sampleCount++
// Track best-case RTT as the no-queue baseline.
if a.minRTT == 0 || rtt < a.minRTT {
a.minRTT = rtt
}
// Multiplicative decrease on error or queueing latency (>2x baseline).
if callErr != nil || rtt > 2*a.minRTT {
a.limit = max(a.minLimit, a.limit/2)
return
}
// Additive increase only when fully utilised — avoid drifting up on idle traffic.
if a.inFlight+1 >= a.limit && a.limit < a.maxLimit {
a.limit++
}
}, nil
}In production this looks like a sawtooth that tracks the real bottleneck — peak hours grow the limit toward the ceiling, GC pauses or downstream incidents shrink it within seconds. It removes the worst part of static bulkheads: the team meeting where someone argues for a number that everyone forgets to revisit when traffic doubles.
A 30-Minute Outage Timeline: Where Each Pattern Earns Its Keep
To make the patterns concrete, here's a sketch of a familiar shape — one slow Postgres replica during a checkout-heavy hour. The point isn't the exact wall-clock; it's which patterns flip from "passive observation" to "active load shedding" at each stage.
T+00:00 — degradation begins. A query plan flips on the inventory replica after an autovacuum reshapes statistics. p99 climbs from 12 ms to 480 ms within ninety seconds. No alert fires yet — error rates are zero, the database is "healthy" by every traditional metric. This is the window where bad architectures lose. The adaptive concurrency limiter notices first: rtt has tripled relative to its observed minimum, so the controller halves the inventory bulkhead from ~80 to ~40. Healthy services keep their full quota.
T+02:30 — saturation. Sustained load pushes inventory rtt past the 2-second timeout. The circuit breaker tracks consecutive failures and opens at five — checkout calls now fail in microseconds with circuit breaker open: dependency unavailable, freeing goroutines that would otherwise wait the full timeout. The retry-storm alert (> 30% retried calls) fires; the breaker-open alert (> 0.1 openings/sec) fires a beat later. PagerDuty wakes someone.
T+05:00 — graceful degradation. With the breaker open, the cart-rendering path falls back to the cached inventory snapshot from Redis. Hedged requests against the secondary replica return in 14 ms; users see a banner ("Stock counts may lag by up to 2 minutes") instead of a 500. The bulkhead, now sized at ~40, protects pricing and recommendations from inheriting the failure — both stay green.
T+12:00 — operator action. The on-call engineer runs ANALYZE inventory_items; on the affected replica. The plan reverts. Replica latency drops to 18 ms. Nothing in the code path changes yet — the breaker is still open, so traffic is still flowing the cached path.
T+12:30 — controlled recovery. The breaker's reset timeout elapses. It transitions to half-open and admits a single probe; the probe succeeds; on the next request it closes. The adaptive limiter sees healthy rtt and begins additive growth, walking the inventory bulkhead from 40 back toward 80 over the next four minutes. No human flips a flag — the system retunes itself.
T+30:00 — postmortem. Total user-visible impact: a banner for twelve minutes. Zero 5xx. Zero abandoned carts above baseline. The retro item isn't "build more resilience patterns" — it's "tighten the bulkhead floor and add an alert on adaptive-limit collapse so we get woken at T+00:30 instead of T+02:30."
The pattern that mattered most isn't any single one; it's the layering. Adaptive limits found the degradation before humans did. The breaker turned slow failures into fast ones, releasing resources. Hedging carried the read path on the surviving replica. Idempotency keys meant no duplicate orders during retries. Timeout propagation meant the client's 500 ms budget was respected at every hop. Take any one of these out and the timeline ends with cascade failure and a 30-minute full outage instead of a banner.
Frequently Asked Questions
What is the circuit breaker pattern in distributed systems?
A circuit breaker monitors calls to a dependency and stops making requests when the failure rate exceeds a threshold (open state). After a timeout, it allows a single probe request (half-open state). If the probe succeeds, it closes and resumes normal traffic. This prevents cascade failures caused by slow or failing dependencies.
What is bulkhead isolation and why does it matter?
Bulkhead isolation limits the resources (connections, goroutines) dedicated to each dependency, so a failing dependency can only exhaust its own allocation. Without bulkheads, one slow service can consume all available connections and starve healthy dependencies.
How do you implement idempotent retries in Go?
Assign a unique idempotency key to each operation and store it with the result. On retry, check if the key already exists and return the stored result instead of re-executing. This ensures retries are safe even for non-idempotent operations like payments or order creation.
Why do slow dependencies cause more damage than crashed ones?
A crashed dependency fails fast, freeing resources immediately. A slow dependency holds connections open for the full timeout duration, consuming connection pool capacity and goroutines. At high concurrency, this exhausts resources far faster than outright failures.
Keep Reading
- Idempotency Patterns: Building Retry-Safe Distributed Systems — The full idempotency key pattern, database constraints, and outbox pattern that make the retry logic in this article safe for payments and orders
- Rate Limiter Algorithms: Token Bucket vs Sliding Window — The load shedding section here uses adaptive limits; this covers the five algorithms for request-level rate limiting and distributed coordination
- Event-Driven Microservices in Go: Kafka, Sagas, and the Outbox Pattern — When your resilient services need to coordinate multi-step workflows across boundaries, sagas and the outbox pattern handle the distributed transaction problem
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.
Read Next
Consistent Hashing: The Algorithm Behind Every Scalable Distributed System
Adding one cache server shouldn't invalidate every key. Consistent hashing with virtual nodes and bounded loads — full Go and Java implementations.
Distributed Rate Limiting at Scale: The Probabilistic Drop Architecture
Probabilistic drop rate limiting: uncoordinated enforcement bypassing Redis for 1M+ RPS with zero coordination overhead.
Kafka vs RabbitMQ vs NATS vs SQS: Choosing the Right Message Broker
Kafka vs RabbitMQ vs NATS vs SQS: delivery semantics, ordering, throughput, ops complexity, and a decision framework with Go code.