#go #distributed-systems #resilience #circuit-breaker #bulkhead #idempotency

Building Resilient Distributed Systems with Go

BackendBytes Engineering Team

Jan 15, 2025

16 min read

Building Resilient Distributed Systems with Go

Part of Series: Distributed Systems Mastery

Lesson 4 of 6

Prev Next

Key Takeaways

→Slow dependencies cause more cascading damage than crashed ones — a 4.8s query held connections open, exhausting the pool; a crash would have freed resources immediately
→Circuit breaker prevents thundering herd: after N failures, open the circuit (fail fast); after timeout, probe; on success, close and resume — blocks cascade at the source
→Bulkhead isolates resource pools per dependency — one slow service exhausts only its own connections, leaving healthy dependencies untouched and serviceable
→Timeout propagation via context.WithTimeout down the call chain ensures the client's deadline is respected; without it, a 5-second client timeout becomes 10+ seconds via nested timeouts

A database query slows from 5 ms to 2 seconds. Nothing crashes — and that's exactly why it takes everything down. A product-page service calls it on every request, watches its own latency multiply, and burns through its connection pool until it fails. Every dependent service follows. We debugged this exact pattern on multiple production teams — slow is worse than crashed, because a crash frees resources immediately while slow holds them for the full timeout.

When One Slow Dependency Takes Down Your Whole System

A database query — not a crash — paralyses multiple services within minutes. A recommendation service's query plan degrades from milliseconds to seconds. A downstream product page service, calling it on every request, sees its own latency multiply. At sustained traffic, the product page burns through orders of magnitude more connections than it allocated. The pool exhausts. Everything depending on it fails. We've debugged variants of this on multiple services. Slow is worse than crashed because crashes free resources immediately; slow holds them for the full timeout^{[Beyer et al., 2016]}.

Bottom Line

Slow dependencies cause more damage than crashes because they hold resources open for timeout durations, cascading failure through callers^{[Beyer et al., 2016]}. Stop failures from propagating with circuit breakers (fail fast on broken deps), bulkheads (isolate resource pools), timeouts (enforce deadline chains), and idempotent retries (safe second attempts).

Circuit breaker: Open after N failures, probe after timeout, close on success — blocks cascade at the source
Bulkhead: Separate connection/goroutine limits per dependency — prevents one from starving others
Timeout propagation: Pass context deadlines down the call chain — respect the client's timeout budget

The Quick Start: Pattern Comparison

Pattern	Purpose	When	Cost
Timeout	Enforce deadline chains via `context.WithTimeout(ctx, duration)`^{[Go Language Specification]}	Every downstream call	Zero overhead; prevents hangs
Circuit Breaker	Open after N failures, probe after timeout	Each external dependency	Low; prevents thundering herd
Bulkhead	Separate concurrency limits per dependency	Per-service or per-endpoint	Low; prevents resource starvation
Idempotent Retry	Unique key per operation; retry on transient errors only	Mutating operations (orders, payments)	Moderate; requires request deduplication

Circuit Breaker: Stop Hitting a Failing Dependency

^{[Beyer et al., 2016]}

The circuit breaker is the most important resilience pattern. When a dependency is failing, stop calling it — fail fast instead of waiting for timeouts to expire. It has three states: Closed (normal operation), Open (dependency is failing, reject calls immediately), and HalfOpen (dependency may be recovering, allow probe requests).

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure rate ≥ threshold
    Open --> HalfOpen: reset timeout elapsed
    HalfOpen --> Closed: probe(s) succeed
    HalfOpen --> Open: any probe fails

    note right of Closed
        Normal ops.
        Count failures.
    end note
    note right of Open
        Reject fast.
        Wait resetTimeout.
    end note
    note right of HalfOpen
        Allow N probes.
        Decide next state.
    end note

Three tunables drive behaviour: the failure threshold (percentage over a window that flips Closed → Open), the reset timeout (how long Open waits before probing), and the probe count in HalfOpen (too few misdiagnoses transient blips as recovered; too many hammer a still-failing dependency).

The Race Condition Most Implementations Get Wrong

The common mistake is upgrading from a read lock to a write lock between reads. Between releasing the read lock and acquiring the write lock, multiple goroutines can slip through and all try to transition to HalfOpen:

// WRONG: race between RUnlock and Lock
func (cb *CircuitBreaker) Execute(fn func() error) error {
    cb.mu.RLock()
    if cb.state == StateOpen {
        cb.mu.RUnlock()           // release read lock
        cb.mu.Lock()              // ← goroutines can slip between here
        cb.state = StateHalfOpen  // ← multiple can reach this
        cb.mu.Unlock()
    } else {
        cb.mu.RUnlock()
    }
    // ...
}

Result: multiple probe requests hit the recovering service simultaneously, causing it to fail again. The fix: use a single mutex for all state transitions, never upgrading locks.

type CircuitBreaker struct {
	mu        sync.Mutex
	state     int // 0=Closed, 1=Open, 2=HalfOpen
	failCount int
	timeout   time.Duration
	openedAt  time.Time
}
 
func (cb *CircuitBreaker) Execute(fn func() error) error {
	cb.mu.Lock()
 
	switch {
	case cb.state == StateOpen && time.Since(cb.openedAt) <= cb.timeout:
		// Open and still cooling down → fail fast.
		cb.mu.Unlock()
		return fmt.Errorf("circuit breaker open: dependency unavailable")
	case cb.state == StateOpen:
		// Timeout expired: promote to HalfOpen and let THIS request be the probe.
		cb.state = StateHalfOpen
	case cb.state == StateHalfOpen:
		// A probe is already in flight — admit only one at a time, so recovery
		// doesn't send a thundering herd at a dependency that's still fragile.
		cb.mu.Unlock()
		return fmt.Errorf("circuit breaker half-open: probe in progress")
	}
 
	cb.mu.Unlock()
	err := fn() // execute outside the lock to avoid holding it across I/O
	cb.mu.Lock()
	defer cb.mu.Unlock()
 
	if err != nil {
		cb.failCount++
		// A failed probe reopens immediately; otherwise trip at the threshold.
		if cb.state == StateHalfOpen || cb.failCount >= 5 {
			cb.state = StateOpen
			cb.openedAt = time.Now()
		}
	} else {
		cb.failCount = 0
		cb.state = StateClosed // success closes the circuit
	}
	return err
}

This ensures only one transition happens at a time, preventing the thundering herd of probes on recovery.

Bulkhead: Separate Resource Pools Per Dependency

The cascade failure in our opening story happened because all outbound calls — to recommendations, inventory, pricing — shared the same goroutine pool and connection pool. When the recommendation service slowed, it burned all available connections. Inventory and pricing calls starved.

The shared-pool failure mode versus the bulkhead-isolated fix:

graph TB
    subgraph SharedPool[BEFORE — shared pool]
        S1[100 connections] --> R1[Recommendations<br/>slow — holds 95 conns]
        S1 --> I1[Inventory<br/>cannot get conn — starves]
        S1 --> P1[Pricing<br/>cannot get conn — starves]
    end
    subgraph Bulkheaded[AFTER — bulkhead isolation]
        B1[40 conns —<br/>recommendations] --> RB[Recommendations<br/>slow — only this pool exhausted]
        B2[30 conns —<br/>inventory] --> IB[Inventory<br/>still healthy]
        B3[30 conns —<br/>pricing] --> PB[Pricing<br/>still healthy]
    end
    style R1 fill:#fdd
    style I1 fill:#fdd
    style P1 fill:#fdd
    style RB fill:#fdd
    style IB fill:#dfd
    style PB fill:#dfd

Bulkhead isolation fixes this: each dependency gets its own resource limit. One slow dependency exhausts only its own allocation, leaving others untouched.

type Bulkhead struct {
	sem chan struct{} // buffered channel acts as semaphore
}
 
func NewBulkhead(maxConcurrent int) *Bulkhead {
	return &Bulkhead{sem: make(chan struct{}, maxConcurrent)}
}
 
func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {
	select {
	case b.sem <- struct{}{}: // acquire one slot
		defer func() { <-b.sem }() // release on exit
		return fn()
	case <-ctx.Done():
		return ctx.Err() // timeout waiting for slot
	default:
		return fmt.Errorf("bulkhead full: %d concurrent limit exceeded", cap(b.sem))
	}
}

Wire separate bulkheads per dependency:

// Recommendations: generous (50 concurrent, less critical)
recommendations := NewBulkhead(50)
 
// Inventory: tight (100 concurrent, critical for checkout)
inventory := NewBulkhead(100)
 
// Pricing: medium (75 concurrent)
pricing := NewBulkhead(75)
 
// In your handler, each call is isolated
err := recommendations.Execute(ctx, func() error {
	return svc.getRecommendations(ctx, productID)
})
// If recommendations is at 50/50, inventory calls don't wait — they have their own pool

Even if recommendations burns all 50 slots with slow calls, inventory and pricing proceed independently.

Timeout Propagation: Respect the Parent Context Deadline

^{[Go context]}

Every downstream call must respect the incoming context's deadline. If the client's timeout is 500ms total and you make two 500ms calls sequentially, you've already violated the contract.

// WRONG: creates new context, loses parent deadline
func (svc *Service) CallDownstream(ctx context.Context) error {
	newCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return svc.downstream.Call(newCtx)
}
 
// RIGHT: derive from parent, add per-call buffer as safety net
func (svc *Service) CallDownstream(ctx context.Context) error {
	callCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	return svc.downstream.Call(callCtx)
}

Budget timeout across multiple calls:

// Handler has 500ms total budget
ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
defer cancel()
 
// First call: 200ms
productCtx, _ := context.WithTimeout(ctx, 200*time.Millisecond)
product, _ := svc.getProduct(productCtx, id)
 
// Second call: uses remaining time from parent deadline (auto-enforced)
recs, _ := svc.getRecommendations(ctx, product.ID)

WithTimeout(parent, duration) never exceeds the parent's deadline — the tighter constraint wins.

Idempotent Retries: Safe Retries Without Duplicate Orders

^{[Stripe idempotency]}

Retries are essential for transient failures (network hiccups, 502s). But retrying non-idempotent operations — creating orders, charging cards — produces duplicates. The fix: send the same idempotency key on retry, and the server deduplicates.

// Generate a deterministic key once; use on all retry attempts
func generateIdempotencyKey(userID, productID string, amount float64) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s:%s:%.2f", userID, productID, amount)
	return hex.EncodeToString(h.Sum(nil))
}
 
// On retry, server sees same key and returns original response, not a new order
idempotencyKey := generateIdempotencyKey(userID, productID, amount)
 
for attempt := 0; attempt < 3; attempt++ {
	req, err := http.NewRequestWithContext(ctx, "POST", orderServiceURL+"/orders", body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Idempotency-Key", idempotencyKey) // same key every attempt
	req.Header.Set("Content-Type", "application/json")
 
	resp, err := client.Do(req)
	if err != nil {
		if isRetryable(err) {
			continue // retry transient transport errors only
		}
		return nil, err // permanent error — resp is nil here, never dereference it
	}
	if resp.StatusCode >= 500 {
		resp.Body.Close() // release the connection before retrying
		continue // retry on 5xx
	}
	return resp, nil // success or permanent failure (4xx)
}

See Idempotency Patterns: Building Retry-Safe Distributed Systems for server-side deduplication strategies.

The Opening Incident: How These Patterns Prevent Cascade Failure

Recall the slow recommendation service: 4.8 second queries instead of 10ms. With these patterns in place, here's what happens:

Circuit breaker detects failure — after 5 consecutive timeouts (25 seconds total), it opens. Subsequent calls fail immediately with "circuit open" instead of waiting 5 seconds for timeout.
Bulkhead protects inventory and pricing — even though recommendation's 50-slot bulkhead fills with slow calls, inventory and pricing each have separate limits. They proceed unaffected.
Product page degrades gracefully — recommendation calls fail fast, returning an empty list fallback. Product page still renders in 80ms instead of 5 seconds.
Connection pool never exhausts — goroutines are released milliseconds after failing, not held open for full timeout duration.
Recommendation service recovers — DBA fixes the query plan. After 30 seconds, circuit transitions to HalfOpen and probes recovery with a single request. If successful, circuit closes and recommendations resume.

Result: users see product pages without recommendations for 25 seconds. Zero 500 errors. No cascade to dependent services. Every failure is isolated and degrades gracefully.

Production Checklist

For each external dependency:
[ ] Circuit breaker with single mutex (no RLock/Lock races)
[ ] Separate bulkhead (concurrency limit) per dependency
[ ] Idempotency key on all mutating operations (orders, payments)
[ ] Per-call timeout derived from parent context (never context.Background())
[ ] Retry only on retryable errors (503, 502, timeouts — not 400, 404)
 
For your service:
[ ] /healthz (liveness): checks process only, not dependencies
[ ] /readyz (readiness): checks critical dependencies with 3s timeout
[ ] Global request timeout on all handlers (e.g., 500ms for APIs)
[ ] Structured logging for circuit breaker state changes
[ ] Prometheus metrics: circuit open rate, bulkhead rejections, retry attempts

Alerting rules: catch resilience failures before users do

The Prometheus metrics above are useless without alert rules. The three rules below catch the cascade-precursor symptoms — circuit-breaker thrashing, bulkhead saturation, and runaway retry storms — at thresholds that fire minutes before user-visible latency:

# alerts/resilience.yml — paste alongside your service's existing PrometheusRule
groups:
  - name: resilience
    interval: 30s
    rules:
      - alert: CircuitBreakerOpenRateHigh
        expr: |
          (
            sum by (service, dependency) (
              rate(circuit_breaker_state_changes_total{to_state="open"}[5m])
            )
          ) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.dependency }} breaker opening repeatedly from {{ $labels.service }}"
          description: |
            Circuit breaker for {{ $labels.dependency }} has opened {{ $value }} times/sec
            over the last 5 minutes — usually means the dependency is degraded, not crashed.
            Check downstream latency p99 and connection-pool utilisation.
 
      - alert: BulkheadSaturated
        expr: |
          (
            bulkhead_active_calls{service=~".+"}
            / bulkhead_max_concurrent
          ) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Bulkhead {{ $labels.dependency }} > 85% saturated"
          description: |
            {{ $value | humanizePercentage }} of the bulkhead slots are in use. Either traffic
            is exceeding capacity or the dependency is slow. Check {{ $labels.dependency }}
            latency before raising the bulkhead size.
 
      - alert: RetryStormDetected
        expr: |
          (
            rate(retry_attempts_total{outcome="retry"}[5m])
            / rate(retry_attempts_total[5m])
          ) > 0.3
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service }} is in a retry storm against {{ $labels.dependency }}"
          description: |
            >30% of calls are being retried. Combined with an open circuit breaker this
            is the classic cascade-failure signature — the retry budget is amplifying load
            on a degraded dependency. Verify the breaker is firing and consider tightening
            retry budgets or shortening per-call timeouts.

Tune the 0.1, 0.85, and 0.3 thresholds against your steady-state baselines — every service has a different healthy retry rate. The pattern matters more than the numbers: rate-of-state-change for breakers, ratio-saturation for bulkheads, ratio-of-retries for retry budgets.

Hedged Requests: Trading a Little Load for a Lot of Tail Latency

Circuit breakers and bulkheads protect you from broken dependencies; hedged requests protect you from the long tail of healthy ones. The idea, popularised by Google's "Tail at Scale" paper^{[Dean & Barroso, 2013]}, is simple: if a request hasn't returned by the dependency's p95 latency, fire a duplicate to a different replica and take whichever response arrives first. The slow request gets cancelled. You pay a few percent extra load and crush p99 latency by an order of magnitude — useful when downstream replicas are independent (different shards, different pods) and the request is read-only or idempotent.

The two failure modes to design out are duplicate writes and runaway hedging on a globally degraded dependency. The first is solved by hedging only idempotent reads, or pairing the hedge with the same idempotency key from the retry section so the server deduplicates. The second is solved by capping hedges per parent request and gating them through the same circuit breaker: when the breaker is open, hedging is pointless and only adds load.

// HedgedCall fires a primary request immediately and a backup after `delay`.
// Whichever wins is returned; the loser is cancelled via context propagation.
func HedgedCall[T any](
    ctx context.Context,
    delay time.Duration,
    fn func(context.Context) (T, error),
) (T, error) {
    type result struct {
        val T
        err error
    }
    out := make(chan result, 2)
    callCtx, cancel := context.WithCancel(ctx)
    defer cancel() // cancels the loser as soon as we return
 
    go func() {
        v, err := fn(callCtx)
        out <- result{val: v, err: err}
    }()
 
    timer := time.NewTimer(delay)
    defer timer.Stop()
 
    select {
    case r := <-out:
        return r.val, r.err // primary returned before hedge fired
    case <-timer.C:
        go func() {
            v, err := fn(callCtx) // fire the duplicate
            out <- result{val: v, err: err}
        }()
        r := <-out
        return r.val, r.err
    case <-ctx.Done():
        var zero T
        return zero, ctx.Err()
    }
}

Pick delay from histograms — your dependency's measured p95, not a guess. Hedging at p50 doubles load for marginal gains; hedging at p99 fires too late to help. The cancellation is doing real work: when the hedge wins, the primary's callCtx is cancelled and any in-flight HTTP client honouring req.WithContext will close the connection mid-flight, freeing it for the next caller. Skip that, and a "winning" hedge still ties up resources until the original call drains.

Adaptive Concurrency Limits: Self-Tuning Bulkheads

Static bulkhead sizes are a guess. The right answer changes with traffic mix, GC pauses, downstream capacity, and the time of day. Netflix's concurrency-limits^{[Netflix concurrency-limits]} library treats the limit as a control variable: probe gently, increase when latency is healthy, halve when it isn't. The algorithm is AIMD — additive increase, multiplicative decrease — borrowed straight from TCP congestion control. It works because it's the same problem: many independent senders sharing a finite-capacity bottleneck, with no out-of-band signal of remaining headroom beyond the latency you observe.

The signal is queueing: if measured latency starts climbing above the lowest-observed baseline (a Little's Law proxy for a saturated server), shrink. If it stays close to baseline under load, grow. You never need a hand-tuned value because the limit converges on whatever the dependency can actually serve.

// AdaptiveLimit implements an AIMD controller over the bulkhead semaphore.
// Wrap each call with Acquire/Release; the limit retunes itself.
type AdaptiveLimit struct {
    mu          sync.Mutex
    limit       int           // current concurrency cap
    minLimit    int           // floor — never starve below this
    maxLimit    int           // ceiling — protect downstream
    inFlight    int
    minRTT      time.Duration // best observed latency
    sampleCount int
}
 
func (a *AdaptiveLimit) Acquire(ctx context.Context) (release func(time.Duration, error), err error) {
    a.mu.Lock()
    if a.inFlight >= a.limit {
        a.mu.Unlock()
        return nil, fmt.Errorf("adaptive limit reached: %d", a.limit)
    }
    a.inFlight++
    a.mu.Unlock()
 
    return func(rtt time.Duration, callErr error) {
        a.mu.Lock()
        defer a.mu.Unlock()
        a.inFlight--
        a.sampleCount++
 
        // Track best-case RTT as the no-queue baseline.
        if a.minRTT == 0 || rtt < a.minRTT {
            a.minRTT = rtt
        }
        // Multiplicative decrease on error or queueing latency (>2x baseline).
        if callErr != nil || rtt > 2*a.minRTT {
            a.limit = max(a.minLimit, a.limit/2)
            return
        }
        // Additive increase only when fully utilised — avoid drifting up on idle traffic.
        if a.inFlight+1 >= a.limit && a.limit < a.maxLimit {
            a.limit++
        }
    }, nil
}

In production this looks like a sawtooth that tracks the real bottleneck — peak hours grow the limit toward the ceiling, GC pauses or downstream incidents shrink it within seconds. It removes the worst part of static bulkheads: the team meeting where someone argues for a number that everyone forgets to revisit when traffic doubles.

A 30-Minute Outage Timeline: Where Each Pattern Earns Its Keep

To make the patterns concrete, here's a sketch of a familiar shape — one slow Postgres replica during a checkout-heavy hour. The point isn't the exact wall-clock; it's which patterns flip from "passive observation" to "active load shedding" at each stage.

T+00:00 — degradation begins. A query plan flips on the inventory replica after an autovacuum reshapes statistics. p99 climbs from 12 ms to 480 ms within ninety seconds. No alert fires yet — error rates are zero, the database is "healthy" by every traditional metric. This is the window where bad architectures lose. The adaptive concurrency limiter notices first: rtt has tripled relative to its observed minimum, so the controller halves the inventory bulkhead from ~80 to ~40. Healthy services keep their full quota.

T+02:30 — saturation. Sustained load pushes inventory rtt past the 2-second timeout. The circuit breaker tracks consecutive failures and opens at five — checkout calls now fail in microseconds with circuit breaker open: dependency unavailable, freeing goroutines that would otherwise wait the full timeout. The retry-storm alert (> 30% retried calls) fires; the breaker-open alert (> 0.1 openings/sec) fires a beat later. PagerDuty wakes someone.

T+05:00 — graceful degradation. With the breaker open, the cart-rendering path falls back to the cached inventory snapshot from Redis. Hedged requests against the secondary replica return in 14 ms; users see a banner ("Stock counts may lag by up to 2 minutes") instead of a 500. The bulkhead, now sized at ~40, protects pricing and recommendations from inheriting the failure — both stay green.

T+12:00 — operator action. The on-call engineer runs ANALYZE inventory_items; on the affected replica. The plan reverts. Replica latency drops to 18 ms. Nothing in the code path changes yet — the breaker is still open, so traffic is still flowing the cached path.

T+12:30 — controlled recovery. The breaker's reset timeout elapses. It transitions to half-open and admits a single probe; the probe succeeds; on the next request it closes. The adaptive limiter sees healthy rtt and begins additive growth, walking the inventory bulkhead from 40 back toward 80 over the next four minutes. No human flips a flag — the system retunes itself.

T+30:00 — postmortem. Total user-visible impact: a banner for twelve minutes. Zero 5xx. Zero abandoned carts above baseline. The retro item isn't "build more resilience patterns" — it's "tighten the bulkhead floor and add an alert on adaptive-limit collapse so we get woken at T+00:30 instead of T+02:30."

The pattern that mattered most isn't any single one; it's the layering. Adaptive limits found the degradation before humans did. The breaker turned slow failures into fast ones, releasing resources. Hedging carried the read path on the surviving replica. Idempotency keys meant no duplicate orders during retries. Timeout propagation meant the client's 500 ms budget was respected at every hop. Take any one of these out and the timeline ends with cascade failure and a 30-minute full outage instead of a banner.

Frequently Asked Questions

What is the circuit breaker pattern in distributed systems?

A circuit breaker monitors calls to a dependency and stops making requests when the failure rate exceeds a threshold (open state). After a timeout, it allows a single probe request (half-open state). If the probe succeeds, it closes and resumes normal traffic. This prevents cascade failures caused by slow or failing dependencies.

What is bulkhead isolation and why does it matter?

Bulkhead isolation limits the resources (connections, goroutines) dedicated to each dependency, so a failing dependency can only exhaust its own allocation. Without bulkheads, one slow service can consume all available connections and starve healthy dependencies.

How do you implement idempotent retries in Go?

Assign a unique idempotency key to each operation and store it with the result. On retry, check if the key already exists and return the stored result instead of re-executing. This ensures retries are safe even for non-idempotent operations like payments or order creation.

Why do slow dependencies cause more damage than crashed ones?

A crashed dependency fails fast, freeing resources immediately. A slow dependency holds connections open for the full timeout duration, consuming connection pool capacity and goroutines. At high concurrency, this exhausts resources far faster than outright failures.

Keep Reading

Idempotency Patterns: Building Retry-Safe Distributed Systems — The full idempotency key pattern, database constraints, and outbox pattern that make the retry logic in this article safe for payments and orders
Rate Limiter Algorithms: Token Bucket vs Sliding Window — The load shedding section here uses adaptive limits; this covers the five algorithms for request-level rate limiting and distributed coordination
Event-Driven Microservices in Go: Kafka, Sagas, and the Outbox Pattern — When your resilient services need to coordinate multi-step workflows across boundaries, sagas and the outbox pattern handle the distributed transaction problem

Was this article helpful?

Your feedback directly shapes our editorial depth and technical accuracy.