Distributed systems are inherently complex. Network partitions, node failures, and race conditions are not edge cases -- they are the norm. In this article, we will explore how Go's concurrency model makes it uniquely suited for building resilient backend services.
The circuit breaker pattern prevents cascading failures by wrapping calls to external services in a stateful proxy that monitors for failures and short-circuits requests when the failure rate exceeds a threshold.
Pro Tip: Always set the request timeout that feeds your circuit breaker well above the dependency's P99 latency. Otherwise, normally slow responses during load spikes get counted as failures and trip the breaker prematurely.
stateDiagram-v2
[*] --> Closed
Closed --> Open: Error Threshold Exceeded
Open --> HalfOpen: Sleep Window Expired
HalfOpen --> Closed: Success (Probe OK)
HalfOpen --> Open: Failure (Probe Failed)
note right of Closed: Normal State (Traffic flows freely)
note right of Open: Fail Fast (Requests rejected immediately)
note right of HalfOpen: Testing Recovery (Limited functionality)
type CircuitBreaker struct {
	mu          sync.RWMutex
	state       State
	failCount   int
	threshold   int
	timeout     time.Duration
	lastFailure time.Time
}

type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

var ErrCircuitOpen = errors.New("circuit breaker is open")

func (cb *CircuitBreaker) Execute(fn func() error) error {
	cb.mu.RLock()
	state, lastFailure := cb.state, cb.lastFailure
	cb.mu.RUnlock()
	if state == StateOpen {
		if time.Since(lastFailure) <= cb.timeout {
			return ErrCircuitOpen
		}
		// Re-check under the write lock: another goroutine may have
		// transitioned the state after we released the read lock.
		cb.mu.Lock()
		if cb.state == StateOpen {
			cb.state = StateHalfOpen
		}
		cb.mu.Unlock()
	}
	err := fn()
	if err != nil {
		cb.recordFailure()
		return err
	}
	cb.recordSuccess()
	return nil
}

// recordFailure trips the breaker at the threshold; a failed half-open probe re-opens it.
func (cb *CircuitBreaker) recordFailure() {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	cb.failCount++
	cb.lastFailure = time.Now()
	if cb.state == StateHalfOpen || cb.failCount >= cb.threshold {
		cb.state = StateOpen
	}
}

// recordSuccess closes the breaker and resets the failure counter.
func (cb *CircuitBreaker) recordSuccess() {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	cb.failCount = 0
	cb.state = StateClosed
}

When a dependency fails, your service should degrade gracefully rather than failing entirely. This means returning cached data, default values, or partial responses instead of errors. The key insight is that partial availability is almost always better than total unavailability.
Consider a product page that fetches data from multiple microservices: product details, reviews, recommendations, and pricing. If the recommendations service is down, you should still show the product with its reviews and pricing, perhaps with a generic "popular items" fallback for recommendations.
Implementing retries with exponential backoff and jitter prevents thundering herd problems when services recover from failures.
func RetryWithBackoff(ctx context.Context, maxRetries int, fn func() error) error {
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		if lastErr = fn(); lastErr == nil {
			return nil
		}
		if i == maxRetries-1 {
			break // the final attempt failed; no point sleeping again
		}
		// Exponential backoff: 100ms, 200ms, 400ms, ... plus up to
		// 50% random jitter on top.
		base := time.Duration(1<<uint(i)) * 100 * time.Millisecond
		jitter := time.Duration(rand.Int63n(int64(base / 2)))
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(base + jitter):
		}
	}
	return fmt.Errorf("max retries (%d) exceeded: %w", maxRetries, lastErr)
}

The jitter component is crucial. Without it, all clients retry at exactly the same time, creating a synchronized burst that can overwhelm the recovering service. Jitter spreads retries across time, giving the service a chance to recover gradually.
sequenceDiagram
participant Client
participant Service
Client->>Service: Request 1 (Fail)
Service-->>Client: 503 Service Unavailable
Note over Client: Backoff 100ms
Client->>Service: Retry 1 (Fail)
Service-->>Client: 503 Service Unavailable
Note over Client: Backoff 200ms + Jitter
Client->>Service: Retry 2 (Success)
Service-->>Client: 200 OK
Proper health checks allow your orchestrator (Kubernetes, Nomad, etc.) to make informed decisions about routing traffic and restarting unhealthy instances. A liveness probe checks whether the process is alive, while a readiness probe checks whether it is ready to accept traffic.
func (s *Server) healthHandler(w http.ResponseWriter, r *http.Request) {
	checks := map[string]error{
		"database": s.db.Ping(r.Context()),
		"cache":    s.cache.Ping(r.Context()),
		"queue":    s.queue.Ping(r.Context()),
	}

	// error values don't marshal to useful JSON, so report strings.
	status := make(map[string]string, len(checks))
	healthy := true
	for name, err := range checks {
		if err != nil {
			healthy = false
			status[name] = err.Error()
		} else {
			status[name] = "ok"
		}
	}

	// Headers must be set before the status code is written.
	w.Header().Set("Content-Type", "application/json")
	if healthy {
		w.WriteHeader(http.StatusOK)
	} else {
		w.WriteHeader(http.StatusServiceUnavailable)
	}
	json.NewEncoder(w).Encode(status)
}

Building resilient distributed systems requires thinking about failure from the start. Go provides excellent primitives for concurrency and error handling that make it a natural fit for this domain. Combine circuit breakers, graceful degradation, and smart retry strategies to build services that withstand real-world conditions.
System Architecture Group
Experts in distributed systems, scalability, and high-performance computing.