Go context.Context Cheat Sheet: Cancellation, Timeouts & Gotchas
Key Takeaways
- Every WithCancel/WithTimeout without defer cancel() leaks the context (and its timer) until the parent is canceled or the deadline fires, and any goroutine parked on its Done() leaks with it; missed cancellations compound at scale until the service OOMs
- Child contexts inherit the parent deadline and can never extend it: a 5-second WithTimeout on a 1-second parent budget always gets 1 second, not 5
- Request-scoped values require private unexported key types, not strings: string (and other built-in) keys collide across packages, creating silent data corruption bugs
- Never create a fresh Background() with WithTimeout inside a request handler: you orphan http.Request.Context() and lose client disconnect signals
The classic Go context production leak: a handler spawns a goroutine, returns a 200, and the goroutine outlives the request because nobody called defer cancel(). Multiply by 10,000 requests per second and the process leaks goroutines until it is OOMKilled. We debugged this exact pattern on multiple production services; in our experience it is the single most common shape of goroutine leak.
The six constructors you need
Every context.Context tree starts with one of two roots, Background() or TODO() [Go context]. Everything else is a child wrapper.
Background() is your tree root. WithCancel, WithTimeout, and WithDeadline leak without defer cancel(). Never pass nil, store context in a struct, or use Value for function arguments.
- Always pair WithCancel/WithTimeout/WithDeadline with defer cancel() to prevent leaks
- Use WithTimeout for RPC calls; WithDeadline for splitting parent budgets
- Pass context as the first parameter; never cache it in a struct
graph TD
BG["context.Background()"] --> WC["WithCancel(bg)"]
BG --> WT["WithTimeout(bg, 5s)"]
WC --> WV["WithValue(wc, key, val)"]
WC -->|"spawns"| G1["goroutine A<br/>← ctx.Done()"]
WT -->|"spawns"| G2["goroutine B<br/>← ctx.Done()"]
WV -->|"spawns"| G3["goroutine C<br/>ctx.Value(key)"]
Constructor reference
ctx := context.Background() // root, never cancelled
ctx := context.TODO() // placeholder, don't ship
ctx, cancel := context.WithCancel(parent) // manual cancellation
ctx, cancel := context.WithTimeout(parent, 2*time.Second) // relative deadline
ctx, cancel := context.WithDeadline(parent, absoluteTime) // absolute deadline
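ctx := context.WithValue(parent, key, val) // request-scoped value
Background() is the root in main, tests, and background workers. TODO() marks a spot where the right context has not been threaded yet; never deploy it. Every cancel-returning constructor needs defer cancel() on the next line. A missed cancel() keeps the child context (and its timer) alive until the parent is canceled or the deadline fires; go vet's lostcancel check catches the obvious cases.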
How cancellation propagates
A context is a tree. When a parent is canceled, every descendant's Done() channel closes too, so one signal fans out across every goroutine that branched off.
graph TD
A["handler<br/>Background()"] --> B["WithTimeout 2s<br/>client.Get()"]
A --> C["WithCancel<br/>spawner"]
C --> D["db.Query()"]
C --> E["cache.Set()"]
C --> F["publish()"]
B -. deadline fires .-> X(("Done() closes<br/>this subtree"))
C -. cancel() called .-> Y(("Done() closes<br/>all 3 goroutines"))
Every leaf goroutine must watch <-ctx.Done() alongside its real work. Otherwise cancellation can't reach it and the goroutine leaks.
Cancellation and timeouts
Use WithCancel for explicit cancellation, WithTimeout for RPC calls (for example, a 2s budget), and WithDeadline when splitting a parent budget across sequential calls. A typical RPC call with a timeout:
ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
defer cancel()
resp, err := client.Get(ctx, "/orders/42") // client: any ctx-aware RPC client
if errors.Is(err, context.DeadlineExceeded) {
return http.StatusGatewayTimeout, nil
}
We've debugged this mistake in production: never create a fresh context.Background() with WithTimeout inside a request handler. You orphan the request context and lose client disconnect signals.
Request-scoped values
WithValue stores per-request data (request IDs, trace spans, user) across API boundaries. Not for function arguments.
type ctxKey int
const requestIDKey ctxKey = iota
func WithRequestID(ctx context.Context, id string) context.Context {
return context.WithValue(ctx, requestIDKey, id)
}
func RequestID(ctx context.Context) string {
v, _ := ctx.Value(requestIDKey).(string)
return v
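}
Always use a private, unexported key type, never string or another built-in type: built-in keys collide across packages, while an unexported type means no other package can construct your key. Type-assert with the comma-ok form; never assume the value is present.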
HTTP and gRPC integration
HTTP handlers get a request context via r.Context() [Go net/http]. Wrap it with WithTimeout for your I/O budget:
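func handler(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
defer cancel()
userID := r.URL.Query().Get("user_id") // illustrative input; validate in real code
// db is assumed to be a package-level *sql.DB; QueryContext aborts when ctx is done.
rows, err := db.QueryContext(ctx, "SELECT id FROM orders WHERE user_id=$1", userID)
if err != nil {
http.Error(w, "query failed", http.StatusInternalServerError) // don't echo internal errors to clients
return
}
defer rows.Close()
// ... iterate rows, then check rows.Err()
}
For outbound HTTP, use http.NewRequestWithContext(ctx, ...). For gRPC, methods take ctx as the first parameter, and grpc-go sends the ctx deadline to the server as the grpc-timeout header automatically.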
Common gotchas
- Storing context in a struct: contexts live per call, not per object [Go context]. Pass ctx explicitly to every method.
- Passing a nil context: causes a panic. Use context.TODO() as a placeholder.
- Forgetting defer cancel(): WithCancel, WithTimeout, and WithDeadline leak if not cleaned up [Go context].
- Using Value for optional arguments: unreadable and untyped. Pass them as regular function parameters.
- Ignoring ctx.Err() after <-ctx.Done(): Done() fires for both cancellation and timeout. Call ctx.Err() to distinguish them: context.Canceled → 499 (client gone), context.DeadlineExceeded → 504 (too slow).
Detecting context leaks in tests and production
The runtime.NumGoroutine() count is the cheap canary. Sample it before and after a test; a delta that persists after a short settle period points to a leak. The goleak package automates this:
import "go.uber.org/goleak"
func TestMain(m *testing.M) {
goleak.VerifyTestMain(m)
}
// per-test guard — fails the test if any goroutine outlives this function
func TestProcessOrder(t *testing.T) {
defer goleak.VerifyNone(t)
// ... test body
}
In production, client_golang's default Go collector already exports go_goroutines for free, so scrape that. To expose the count yourself (e.g. on a custom registry without that collector), use a GaugeFunc with a name that can't collide with the built-in:
var goroutines = promauto.NewGaugeFunc( // custom registry: promauto.With(reg).NewGaugeFunc(...)
prometheus.GaugeOpts{
Name: "app_goroutines_live",
Help: "Number of live goroutines, sampled at scrape time.",
},
func() float64 { return float64(runtime.NumGoroutine()) },
)
Alert when the goroutine count grows monotonically over an hour: that pattern is almost always a forgotten defer cancel() or a goroutine waiting on a channel that nobody closes.
Cancellation propagation rules
A few rules that catch most real-world bugs [Go context]:
- Always pass ctx as the first parameter of any function that does I/O, blocks on a channel, or calls another ctx-aware function.
- Never wrap ctx inside another struct to "save keystrokes." The containedctx linter flags a context.Context held in a struct field. (staticcheck SA1029 is a different check: it flags built-in types used as WithValue keys, the rule behind the unexported-key-type advice above.)
- Use select { case <-ctx.Done(): } everywhere you would otherwise block. A for loop reading from a channel becomes:
for {
select {
case <-ctx.Done():
return ctx.Err()
case msg, ok := <-msgs:
if !ok {
return nil // channel closed: no more work
}
if err := process(ctx, msg); err != nil {
return err
}
}
}
- Cancel-on-parent: WithCancel(parent) cancels when parent cancels. Wire your tree so one root cancellation drains the whole stack.
- Don't recreate roots inside the request path. context.Background() belongs in main() and a few cron entry points; everything downstream derives from the request ctx.
What ctx.Err() returns and why it matters
After <-ctx.Done() fires, call ctx.Err() to know why the context cancelled:
| Return value | Trigger | HTTP mapping | gRPC mapping |
|---|---|---|---|
| nil | Context still live: Done() not yet closed (a closed Done() always yields a non-nil Err()) | n/a | n/a |
| context.Canceled | Caller invoked cancel() or a parent was canceled | 499 (client closed request) | codes.Canceled |
| context.DeadlineExceeded | Deadline reached (WithTimeout, WithDeadline, or an inherited parent deadline) | 504 (gateway timeout) | codes.DeadlineExceeded |
The HTTP 499 status code is non-standard; it originated in Nginx ("client closed request") and is widely recognized by observability tools [Go context]. Use it to distinguish "client gave up" (499) from "we ran out of budget" (504): both metrics matter for SLO tracking, and conflating them masks real availability problems.
For Go 1.20+, WithCancelCause(parent) and context.Cause(ctx) let you attach an error to the cancellation, which propagates through the chain. Use it sparingly — most code paths only care about the canceled-vs-timed-out distinction above.
Pick the right constructor for the symptom
When a context bug bites in production, the question is "what kind of cancellation do I want." Route by intent, not by API name:
graph TD
Need[I need to control<br/>a downstream operation] --> Q{Why am I<br/>cancelling?}
Q -->|Caller might give up| Cancel[WithCancel<br/>+ defer cancel<br/>good for fan-out]
Q -->|Operation has SLA| Timeout[WithTimeout<br/>+ defer cancel<br/>500ms RPC budget]
Q -->|Need budget split<br/>across multiple steps| Deadline[WithDeadline<br/>+ defer cancel<br/>parent budget minus elapsed]
Q -->|Carrying request data<br/>not cancellation| Value[WithValue<br/>request ID, user ID,<br/>trace span]
Q -->|Multiple cancellation<br/>causes| Cause[WithCancelCause<br/>Go 1.20 plus<br/>read with context.Cause]
Cancel --> Always[Always call cancel<br/>even on success path]
Timeout --> Always
Deadline --> Always
Cause --> Always
Value -.->|cancel propagates| Cancel
style Always fill:#dfd
style Value fill:#ffd
The diagram is the cheat sheet for context bugs: most "why is my goroutine leaking" traces land on a missing defer cancel() or a stored-in-struct context that outlived its request [Go context].
Production gotchas the linter will not catch
Static analysis flags the obvious leaks. The bugs that survive code review are subtler: a value lookup that quietly returns the zero value, a parent context replaced halfway through a request, a select that races with a closed channel. The next three patterns show up in real incidents and are worth memorising.
Type-assert with the comma-ok form, every time
The single-value form of a context value assertion compiles fine and then panics at runtime when the key is absent or the value was stored under a different type. The compiler cannot help you here because Value returns the empty interface. The comma-ok form avoids the panic but hands you the zero value, which is just as dangerous when a helper swallows the bool. The fix is the comma-ok assertion plus an explicit fallback path, so a missing trace identifier becomes an obvious error, not a crash or a hidden empty string in three downstream services.
type ctxKey int
const (
requestIDKey ctxKey = iota
tenantIDKey
)
// Bad: the single-value assertion panics if the key is missing or
// the value has a different type, crashing the process unless recovered.
func badLookup(ctx context.Context) string {
return ctx.Value(requestIDKey).(string)
}
// Good: explicit comma-ok branch and a sentinel so a missing value
// is loud, not silent. Production telemetry can alert on the sentinel.
func RequestID(ctx context.Context) (string, bool) {
id, ok := ctx.Value(requestIDKey).(string)
if !ok || id == "" {
return "", false
}
return id, true
}
func handleOrder(ctx context.Context, orderID string) error {
rid, ok := RequestID(ctx)
if !ok {
// Loud failure rather than logs full of empty request IDs.
return fmt.Errorf("missing request id in ctx for order %s", orderID)
}
log.Printf("rid=%s order=%s", rid, orderID)
return nil
}
The same rule applies to tenant identifiers, user identities, locale tags, and feature-flag bundles. Any helper that hides the bool return value behind a single string will eventually mask a wiring bug across a service boundary.
Wire the signal handler into the cancel chain, not next to it
A common shutdown bug is two parallel cancellation paths that never meet. The HTTP server takes one context, the signal handler takes another, and the worker pool reads from a third. When SIGTERM fires, the signal handler returns from main while in-flight handlers keep going on the original Background() and are killed mid-write when the process exits. The fix is to root every long-lived goroutine in a single signal.NotifyContext so one Ctrl-C drains the entire tree.
func main() {
ctx, stop := signal.NotifyContext(context.Background(),
os.Interrupt, syscall.SIGTERM)
defer stop()
srv := &http.Server{
Addr: ":8080",
Handler: router(),
// BaseContext returns the parent for every incoming request,
// so SIGTERM cancels in-flight handler ctxs as well.
BaseContext: func(_ net.Listener) context.Context { return ctx },
}
go func() {
if err := srv.ListenAndServe(); err != nil &&
!errors.Is(err, http.ErrServerClosed) {
log.Fatalf("listen: %v", err)
}
}()
<-ctx.Done()
shutdownCtx, cancel := context.WithTimeout(
context.Background(), 25*time.Second)
defer cancel()
if err := srv.Shutdown(shutdownCtx); err != nil {
log.Printf("graceful shutdown failed: %v", err)
}
}
signal.NotifyContext was added in Go 1.16 and is the simplest correct entry point for production servers. Pre-1.16 code that uses a manual signal.Notify channel plus a separate cancel function is a common source of orphaned goroutines on shutdown: the channel fires, the handler exits, but no descendant context ever sees Done() close.
Propagate ctx through OpenTelemetry, not around it
Trace propagation needs the same context that cancels the request. Two patterns leak in production: starting a span with context.Background() so the trace stays disconnected from the parent, and storing a span pointer outside the context so cancellation cannot reach the export pipeline. The otelhttp middleware does the first half automatically; the second half is a discipline of always reading the span from the request context, not a struct field.
import (
"net/http"

"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
"go.opentelemetry.io/otel"
)
func wireRouter(h http.Handler) http.Handler {
return otelhttp.NewHandler(h, "http.server")
}
func handleCheckout(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
tracer := otel.Tracer("checkout")
ctx, span := tracer.Start(ctx, "checkout.validate")
defer span.End()
// Pass the derived ctx everywhere — never start a child span
// from Background() or you orphan the trace.
if err := validate(ctx, r); err != nil {
span.RecordError(err)
http.Error(w, "invalid", http.StatusBadRequest)
return
}
if err := charge(ctx, r); err != nil {
span.RecordError(err)
http.Error(w, "payment failed", http.StatusBadGateway)
return
}
w.WriteHeader(http.StatusOK)
}
Two follow-on rules. First, never call trace.SpanFromContext and store the result in a struct that outlives the request: once the handler's deferred span.End() runs, the span stops recording and later writes are silently dropped. Second, when crossing into a goroutine that survives the request (a worker that sends a confirmation email asynchronously, for example), build a fresh root context from Background(), attach a span link with trace.WithLinks, and let the original request context die. That preserves the trace topology without binding the worker to the inbound HTTP deadline.
Choosing context.Context vs sync.WaitGroup
Both primitives coordinate goroutines, but they answer different questions. WaitGroup answers "is everyone done?" and is purely a counter. Context answers "should you stop?" and is purely a cancellation signal. Real services need both, and the rule of thumb is: use context to broadcast cancel, use WaitGroup to wait for clean exit.
| Question | Reach for | Why |
|---|---|---|
| When should this goroutine give up? | context.Context | Carries deadline, cancel signal, and request values |
| How do I know all spawned goroutines drained? | sync.WaitGroup | Counts up on Add, blocks on Wait until all Done |
| One goroutine errors — do the rest abort? | errgroup.Group | Wraps WaitGroup with first-error short-circuit and shared ctx |
| Bound concurrency to N workers | errgroup.SetLimit or buffered channel | Context alone has no concurrency limit |
| Long-running fan-out with per-task deadline | Context + errgroup per task | Parent ctx for global cancel, child ctx per task budget |
The most common confusion is using a WaitGroup to "wait for cancel," which it cannot do, or using only context for fan-in, which leaks because no goroutine actually reports completion. The errgroup.WithContext constructor merges both into a single API and is the right default for new fan-out code.
import (
"context"
"fmt"
"io"
"net/http"

"golang.org/x/sync/errgroup"
)
func fetchAll(parent context.Context, urls []string) ([][]byte, error) {
g, ctx := errgroup.WithContext(parent)
g.SetLimit(8) // bounded concurrency
out := make([][]byte, len(urls))
for i, u := range urls {
i, u := i, u // pin for closure (redundant since Go 1.22's per-iteration loop variables)
g.Go(func() error {
// Inherit the group's ctx so the first error cancels siblings.
req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
if err != nil {
return fmt.Errorf("build request %s: %w", u, err)
}
resp, err := http.DefaultClient.Do(req)
if err != nil {
return fmt.Errorf("fetch %s: %w", u, err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return err
}
out[i] = body
return nil
})
}
if err := g.Wait(); err != nil {
return nil, err
}
return out, nil
}
The errgroup.WithContext ctx is canceled automatically when the first goroutine returns a non-nil error (or when Wait returns, whichever comes first). Sibling goroutines see ctx.Done() close and abort their HTTP calls cleanly. Without it you would have to write a manual WaitGroup, a shared context.WithCancel, and a sync.Once around the error capture: every line of that machinery is a place for a leak to hide.
Incident retrospective: the ten-minute payment outage
A real outage from a payments gateway, redacted. The symptom was a sudden p99 latency spike on /charge from 180 ms to 9 s, followed by a wave of 504s from the upstream load balancer. CPU was flat, memory was rising, and the goroutine count climbed from 2,000 to 140,000 in nine minutes before the on-call rolled back.
The root cause was a four-line change to a fraud-check helper. The previous version derived a five-second timeout from the request context. The change moved the timeout into a "shared" helper that took its own context root for "cleanliness," intending to prevent fraud checks from being killed when clients gave up. What the author missed: the helper now had no upstream cancel signal, so when the downstream fraud service stalled at three seconds per call, every in-flight request piled goroutines up forever, each holding an open Postgres transaction.
Three things made the incident recoverable. First, the goroutine gauge was already on the on-call dashboard and tripped the rate-of-change alert at 90 seconds. Second, the rollback path was a single revert because the change touched one file. Third, the post-incident test added a goleak.VerifyTestMain to the fraud package, which would have caught the orphaned context in CI before merge.
The four lessons that went into the team runbook:
- Helpers that own their own context root are forbidden in the request path. Every helper takes the caller's ctx and derives a child with WithTimeout only.
- The goroutine count gauge with a rate-of-change alert is the cheapest leading indicator of a context leak. Add it to every Go service before adding any other custom metric.
- Every package that spawns goroutines runs goleak.VerifyTestMain in tests. The two-line addition catches at least one bug per quarter on a busy codebase.
- Code review checklist gains one bullet: any new context.Background() outside main, tests, or named cron entry points needs an explicit comment justifying it. No comment, no merge.
The incident took ten minutes to detect, two minutes to roll back, and the postmortem took an hour. The cost of the missing defer cancel() discipline is paid in those ten minutes every time it ships.
Frequently Asked Questions
When should I call cancel() if I'm deriving a context in a loop?
Call cancel() as soon as each iteration's work finishes, before the next iteration starts. Don't defer cancel() inside the loop: deferred calls run only when the function returns, so every iteration's context and timer stay alive until then.
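for _, url := range urls {
ctx, cancel := context.WithTimeout(parent, 1*time.Second)
body, err := fetch(ctx, url) // fetch: your ctx-aware helper
cancel() // release the timer now, not at function exit
if err != nil { return err }
handle(body) // hypothetical consumer of the response body
}
Is context.TODO() safe to ship?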
Not as a final state. At runtime it behaves exactly like Background(), but it signals to reviewers that a real context was never threaded through, and go vet will not flag it. Replace it before merging to main.
Why shouldn't I store context in a struct like Handler.ctx?
Because the context is tied to a single request's lifetime. If you cache it in a struct that lives longer (like a handler instance), it will be cancelled after the first request, and all subsequent requests will use a dead context.
What's the difference between DeadlineExceeded and Canceled?
DeadlineExceeded means the deadline passed (timeout). Canceled means someone called cancel() or the parent context was canceled. Map them to different HTTP status codes: 504 and 499, respectively.
Constructor reference table
A quick comparison of the context constructors and when to reach for each:
| Constructor | When to use | Cancellation source | Must defer cancel()? |
|---|---|---|---|
| Background() | Top of main, server startup | Never | No |
| TODO() | Placeholder during refactor | Never | No |
| WithCancel(parent) | Caller might give up early | Explicit cancel() call | Yes |
| WithTimeout(parent, d) | RPC budget, request SLA | Timer expiry or cancel() | Yes |
| WithDeadline(parent, t) | Splitting a parent budget | Deadline reached or cancel() | Yes |
| WithValue(parent, k, v) | Carrying request-scoped data | Inherits parent | No |
| WithCancelCause(parent) | Need cause attribution (Go 1.20+) | cancel(err) with cause | Yes |
Engineering Team
Backend engineers writing production-grade references for Go, Java, and distributed systems.