
Go Worker Pool Pattern: Production-Ready Concurrency Control

BackendBytes Engineering Team

Key Takeaways

  • Unbounded goroutine spawning OOMs under burst traffic — goroutines pile up faster than they drain, stacks grow, memory crosses the container limit. A fixed-size worker pool eliminates it.
  • Size workers to the downstream bottleneck: runtime.NumCPU() for CPU-bound, DB pool size for I/O-bound, API rate limit for external calls
  • Use context.Context + select on ctx.Done() for graceful shutdown — workers finish current jobs then exit within a timeout

The classic Go production OOM incident. An image-processing service spawned one goroutine per upload request. A burst of 200 requests per second pushed concurrent goroutines into the hundreds of thousands. Memory crossed the container limit, the service OOMed, restarted, OOMed again in a tight loop. We debugged variants of this on multiple production services — the fix is always a bounded worker pool, never "buy more memory."

The Problem with Unbounded Goroutine Spawning

That incident is one instance of a general pattern. Under burst traffic (a marketing campaign, a flash sale, a fan-out from another service), one goroutine per request means the concurrent goroutine count tracks the backlog rather than the capacity of the machine. Once requests arrive faster than they complete, the count climbs until memory crosses the container limit.

The root cause: unbounded concurrency. Goroutines are cheap to start (a small initial stack, grown on demand)[Go Runtime GC], but cheap is not free. Under sustained load, spawning one per task means goroutine count grows as fast as arrival_rate − completion_rate. Tasks block on I/O — database queries, HTTP calls — and goroutines pile up faster than they drain. Their stacks grow, per-goroutine bookkeeping accumulates, and memory climbs until the service crashes.

Bottom Line

Fix unbounded concurrency with a fixed-size worker pool: a set of long-lived goroutines reading from a shared channel. Size workers to the downstream bottleneck, buffer the job channel for backpressure, and gracefully shut down with context cancellation.

  • Size to bottleneck: runtime.NumCPU() for CPU-bound, DB pool size for I/O-bound, API rate limit for external calls
  • Buffer the queue: 50–1000 jobs depending on task size and burstiness
  • Graceful shutdown: close job channel, wait on WaitGroup, apply context timeout
graph LR
    P1[Producer 1] --> Q[(Buffered<br/>job channel)]
    P2[Producer 2] --> Q
    P3[Producer N] --> Q
    Q -->|range| W1[Worker 1]
    Q -->|range| W2[Worker 2]
    Q -->|range| W3[Worker N]
    W1 --> R[(Results<br/>channel)]
    W2 --> R
    W3 --> R
    Ctx[ctx.Done] -.->|fan-out cancel| W1
    Ctx -.-> W2
    Ctx -.-> W3
    style Q fill:#eef
    style R fill:#efe
    style Ctx fill:#fee

The buffered job channel is the backpressure boundary: producers block when it's full instead of spawning unbounded goroutines. The worker count is the concurrency cap — sized to the downstream bottleneck, never to the producer rate. ctx.Done fans out cancel to every worker so shutdown is bounded and graceful.

The Quick Start: Sizing and Configuration

| Scenario | Worker Count | Buffer Size | Pattern | Rationale |
|---|---|---|---|---|
| Image resizing | runtime.NumCPU() | 100–500 | CPU-bound | All workers should run; excess queue jobs wait |
| API aggregation | CPU * 4 | 500–2000 | I/O-bound | Workers block on HTTP; higher concurrency hides latency |
| Database batch | DB pool size | 50–200 | Connection-limited | Each worker holds one DB connection |
| Email sending | Provider rate limit | 1000+ | Rate-limited | Decouple producer from SMTP throttle |
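
For the rate-limited and I/O-bound rows, one way to turn a rate limit into a worker count is Little's law: required concurrency is roughly throughput multiplied by per-call latency. The sketch below is a starting-point estimate, not a rule from the table. The helper name workersForRate and the example numbers are assumptions for illustration, and the result should still be capped by any hard limit such as the DB pool size.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// workersForRate estimates the concurrency needed to sustain ratePerSec calls
// when each call takes about latency: concurrency = rate x latency.
// Both inputs are measurements you supply; the result is only a starting point.
func workersForRate(ratePerSec float64, latency time.Duration) int {
	n := int(math.Ceil(ratePerSec * latency.Seconds()))
	if n < 1 {
		return 1 // always keep at least one worker
	}
	return n
}

func main() {
	// Hypothetical provider limit: 100 calls/s, each taking about 250ms.
	fmt.Println(workersForRate(100, 250*time.Millisecond)) // 25
}
```

Fewer workers than this leave the rate limit unused; more workers only add goroutines that wait on the limiter. Load testing should confirm the final number.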

Basic Worker Pool: The Core Pattern

The producer / channel / N-worker / WaitGroup topology — the entire pattern in one picture:

graph LR
    Producer[Producer<br/>spawns jobs] -->|jobs<-| JobChan[(jobs channel<br/>buffered)]
    JobChan --> W1[Worker 1<br/>goroutine]
    JobChan --> W2[Worker 2<br/>goroutine]
    JobChan --> W3[Worker N<br/>goroutine]
    W1 -->|results<-| ResultChan[(results channel)]
    W2 -->|results<-| ResultChan
    W3 -->|results<-| ResultChan
    Close[Producer:<br/>close jobs<br/>after last submit] -.->|signals<br/>shutdown| W1
    Close -.->|signals<br/>shutdown| W2
    Close -.->|signals<br/>shutdown| W3
    W1 -->|wg.Done| WG[sync.WaitGroup]
    W2 -->|wg.Done| WG
    W3 -->|wg.Done| WG
    WG -->|all done| CloseR[close results<br/>signal collector]
    style JobChan fill:#dfd
    style WG fill:#ffd

Three production rules visible in the topology: (1) the buffered jobs channel applies natural backpressure — producers block when workers cannot keep up; (2) close(jobs) is the workers' EOF signal — every worker exits its for-range loop when the channel drains; (3) the WaitGroup-then-close-results pattern ensures the collector sees all results before its for-range terminates.

package main
 
import (
	"context"
	"fmt"
	"log"
	"runtime"
	"sync"
)
 
type Job struct {
	ID    int
	Value string
}
 
type Pool struct {
	workers int
	jobs    chan Job
	results chan string
	wg      sync.WaitGroup
	ctx     context.Context
	cancel  context.CancelFunc
}
 
func NewPool(workers, queueSize int) *Pool {
	ctx, cancel := context.WithCancel(context.Background())
	return &Pool{
		workers: workers,
		jobs:    make(chan Job, queueSize),
		results: make(chan string, queueSize),
		ctx:     ctx,
		cancel:  cancel,
	}
}
 
func (p *Pool) Start() {
	for i := 1; i <= p.workers; i++ {
		p.wg.Add(1)
		go p.worker(i)
	}
}
 
func (p *Pool) worker(id int) {
	defer p.wg.Done()
	for job := range p.jobs {
		// Placeholder work: build a result string (no timeout is applied here)
		result := fmt.Sprintf("worker %d completed job %d: %s", id, job.ID, job.Value)
		p.results <- result
	}
}
 
func (p *Pool) Submit(job Job) {
	p.jobs <- job
}
 
func (p *Pool) Close() {
	close(p.jobs)
	p.wg.Wait()
	close(p.results)
}
 
func main() {
	pool := NewPool(runtime.NumCPU(), 100)
	pool.Start()
 
	// Submit jobs
	go func() {
		for i := 1; i <= 50; i++ {
			pool.Submit(Job{ID: i, Value: fmt.Sprintf("task_%d", i)})
		}
		pool.Close()
	}()
 
	// Collect results
	for result := range pool.results {
		log.Println(result)
	}
}

A worker pool has four key pieces:

  • Jobs channel — buffered queue decoupling producers from consumers. When full, producers block (backpressure)[Go Language Specification].
  • Worker goroutine — reads from jobs in a for-range loop, processes each, sends results. Loop exits when jobs channel closes (the canonical "range over channel until closed" idiom)[Go Language Specification].
  • Results channel — aggregates output back to the main goroutine or collector. Send/receive synchronisation gives a happens-before relationship[Go Memory Model] — a value sent on a channel is observable to the receiver without additional locking.
  • WaitGroup — ensures main doesn't exit before all workers finish. wg.Wait() blocks after closing jobs.

Production-Ready: Context and Graceful Shutdown

A production pool needs context cancellation for timeouts and graceful shutdown[Go context]:

func (p *Pool) worker(id int) {
	defer p.wg.Done()
	for {
		select {
		case job, ok := <-p.jobs:
			if !ok {
				return // jobs channel closed, exit gracefully
			}
			// Create job-specific timeout from global context
			jobCtx, cancel := context.WithTimeout(p.ctx, 30*time.Second)
			_ = p.processJob(jobCtx, job)
			cancel()
		case <-p.ctx.Done():
			// Forced shutdown (e.g., sigterm)
			return
		}
	}
}
 
func (p *Pool) Shutdown(shutdownTimeout time.Duration) {
	close(p.jobs) // no more jobs accepted
	done := make(chan struct{})
	go func() {
		p.wg.Wait()
		close(done)
	}()
	// Wait for graceful drain or timeout
	select {
	case <-done:
		// All workers finished
	case <-time.After(shutdownTimeout):
		// Timeout: force cancel remaining workers
		p.cancel()
		<-done
	}
}

Key patterns: close jobs channel (signals "no more work"), use WaitGroup to drain, apply context cancellation on timeout.
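
The forced-shutdown path only works if the job body actually watches its context; otherwise the final wait in Shutdown blocks until the job finishes on its own. The worker above calls p.processJob without showing it, so here is a minimal sketch of what a cancellation-aware body could look like. The function bodies, the timer standing in for real work, and the helper runWithDeadline are all assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// processJob is a hypothetical job body. The timer stands in for a slow
// downstream call; the select lets cancellation cut the wait short.
func processJob(ctx context.Context, work time.Duration) error {
	timer := time.NewTimer(work)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runWithDeadline wraps processJob in a per-job timeout, as the worker does.
func runWithDeadline(deadline, work time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), deadline)
	defer cancel()
	return processJob(ctx, work)
}

func main() {
	fmt.Println(runWithDeadline(20*time.Millisecond, time.Second)) // context deadline exceeded
	fmt.Println(runWithDeadline(time.Second, 20*time.Millisecond)) // <nil>
}
```

In real code the select is usually replaced by passing ctx into the database or HTTP call, which performs the same check internally.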

Choosing Your Concurrency Pattern

When to use a worker pool vs alternatives:

  • go func() per task: Only for low-volume, short-lived tasks. Will OOM under sustained load.
  • errgroup: Best for one-shot batch operations (e.g., "run these 5 API calls in parallel"). Not for continuous streaming.
  • Worker pool: Long-lived, high-throughput streaming. Kafka consumers, rate-limited API integrations, batch database work.

errgroup is a pool for batches. A custom pool is for streams. Pick the right size tool.

The errgroup pattern for fan-out/fan-in batch jobs:

import "golang.org/x/sync/errgroup"
 
// Process N independent items with a hard concurrency cap;
// fail fast if any one fails.
func processBatch(ctx context.Context, items []Item) error {
    g, ctx := errgroup.WithContext(ctx)
    g.SetLimit(10) // bound concurrency — cap the in-flight work
 
    for _, item := range items {
        item := item // capture loop var (Go 1.22 makes this implicit)
        g.Go(func() error {
            select {
            case <-ctx.Done():
                return ctx.Err()
            default:
            }
            return processItem(ctx, item)
        })
    }
 
    // First error cancels the group's context; remaining workers
    // observe ctx.Done() and exit early.
    return g.Wait()
}

SetLimit is the critical knob — without it, errgroup happily spawns one goroutine per item and OOMs on a million-item batch. With it, errgroup IS a worker pool sized for one batch.

Decision rule: continuous stream → custom pool with channel + WaitGroup; bounded batch with hard parallelism cap → errgroup; "run these 3 calls in parallel" → errgroup; "drain a Kafka topic forever" → custom pool.

Handling Backpressure and Non-Blocking Submission

When the job queue fills, submissions block. For non-blocking submission with explicit backpressure:

func (p *Pool) SubmitNonBlocking(job Job) error {
	select {
	case p.jobs <- job:
		return nil
	default:
		return fmt.Errorf("job queue full (%d/%d)", len(p.jobs), cap(p.jobs))
	}
}

Callers then decide: retry, drop, apply circuit breaker, or slow down the producer.
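
As one example of the "retry" choice, a caller can wrap the non-blocking submit in a bounded retry loop with doubling backoff and hand the error back once the budget is spent. This is a sketch under assumptions: the job type is a plain int, and the names trySubmit, submitWithRetry and errQueueFull are illustrative rather than part of the pool above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errQueueFull = errors.New("job queue full")

// trySubmit mirrors the non-blocking send for a plain int job.
func trySubmit(jobs chan<- int, job int) error {
	select {
	case jobs <- job:
		return nil
	default:
		return errQueueFull
	}
}

// submitWithRetry retries a rejected job with doubling backoff, then gives up
// and returns the error so the caller can drop the job or shed load.
func submitWithRetry(jobs chan<- int, job, attempts int, backoff time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = trySubmit(jobs, job); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	jobs := make(chan int, 1) // no workers are draining this queue
	fmt.Println(submitWithRetry(jobs, 1, 3, time.Millisecond)) // <nil>
	fmt.Println(submitWithRetry(jobs, 2, 3, time.Millisecond)) // gave up after 3 attempts: job queue full
}
```

Keep the retry budget small. A long retry loop is just a hidden second queue in front of the real one.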

Observability: Pool Metrics That Actually Help

A production worker pool needs three Prometheus metrics[Prometheus Best Practices]: queue depth, worker utilisation, and job duration. Without them, you cannot tell whether the pool is sized right or whether the bottleneck is upstream or downstream:

import (
    "log/slog"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)
 
var (
    queueDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "worker_pool_queue_depth",
        Help: "Current number of jobs in the queue.",
    })
 
    activeWorkers = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "worker_pool_active_workers",
        Help: "Number of workers currently processing a job.",
    })
 
    jobDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "worker_pool_job_duration_seconds",
        Help:    "Job-handling duration distribution.",
        Buckets: prometheus.DefBuckets,
    })
)
 
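func (p *Pool) worker(id int) {
    defer p.wg.Done()
    for job := range p.jobs {
        activeWorkers.Inc()
        timer := prometheus.NewTimer(jobDuration)
        // job.Handle is assumed for illustration: the Job struct shown earlier
        // has no methods, so call your own processing function here.
        if err := job.Handle(p.ctx); err != nil {
            slog.Error("job failed", "worker", id, "error", err)
        }
        timer.ObserveDuration()
        activeWorkers.Dec()
        // Depth is refreshed only when a job finishes, so the gauge goes
        // stale while every worker is blocked.
        queueDepth.Set(float64(len(p.jobs)))
    }
}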

Alert rules to wire up:

| Metric | Alert condition | Meaning |
|---|---|---|
| queue_depth / queue_capacity | > 0.8 for 2 minutes | Pool is undersized — raise worker count or scale horizontally |
| active_workers | == worker_count for 5 minutes | Saturation — every worker is busy, requests will start backpressuring |
| histogram_quantile(0.99, job_duration) | > SLO | Downstream slowness leaking into pool latency |

The pattern: tune worker count by watching queue depth, not by guessing. If queue depth hovers near zero under normal load, the pool is probably oversized and burning goroutine memory. If it hovers near capacity, the pool is undersized: the buffer has no slack left, so the next traffic burst blocks producers (or triggers rejections) instead of being absorbed.
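
A gauge that is updated only when a job completes stops moving exactly when you most need it, that is, when every worker is stuck. One possible complement is a small sampler goroutine that reads len(jobs) on a ticker. The sketch below uses only the standard library; report is a stand-in for a gauge setter such as queueDepth.Set, and the names sampleQueueDepth and firstSample are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sampleQueueDepth reports len(jobs) on every tick until ctx is cancelled.
// report is a stand-in for a metrics call such as a gauge's Set method.
func sampleQueueDepth(ctx context.Context, jobs <-chan int, every time.Duration, report func(depth int)) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			report(len(jobs))
		case <-ctx.Done():
			return
		}
	}
}

// firstSample queues n jobs with no workers running and returns the first
// depth the sampler reports.
func firstSample(n int) int {
	jobs := make(chan int, n+1)
	for i := 0; i < n; i++ {
		jobs <- i
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // stops the sampler goroutine on return

	samples := make(chan int, 1)
	go sampleQueueDepth(ctx, jobs, time.Millisecond, func(depth int) {
		select {
		case samples <- depth:
		default: // drop samples nobody is waiting for
		}
	})
	return <-samples
}

func main() {
	fmt.Println(firstSample(3)) // 3
}
```

A sampling interval of a second or so is usually enough for alerting; the 1ms interval here only keeps the demo fast.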


Production Checklist

  • Set worker count to match downstream bottleneck (CPU, DB pool, API rate limit)
  • Buffer job queue large enough to absorb bursts without blocking producers
  • Implement graceful shutdown: close jobs channel, wait on WaitGroup, apply timeout
  • Use context.Context for per-job timeouts and cancellation
  • Handle backpressure explicitly; don't let queue fill silently
  • Expose metrics: queue depth, jobs/second, error rate
  • Test pool under latency spikes (simulate slow downstream)
  • Load test to find real worker count sweet spot

Production worker pool with backpressure, metrics, and graceful drain

The pieces a worker-pool article usually leaves implicit but every production deployment needs — a typed Job interface, a bounded queue with rejection on overload (instead of unbounded buffering that turns latency into OOM), Prometheus metrics for queue depth and worker utilisation, and a drain path that respects context deadlines:

package workerpool
 
import (
    "context"
    "errors"
    "sync"
    "sync/atomic"
 
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)
 
// Job is the unit of work. Process must return on ctx cancellation.
type Job interface {
    Process(ctx context.Context) error
}
 
// Pool is a bounded worker pool with rejection-on-overload semantics.
type Pool struct {
    workers int
    jobs    chan Job
    wg      sync.WaitGroup
    mu      sync.RWMutex // serialises Submit's send against Shutdown's close
    closed  atomic.Bool
 
    queued   prometheus.Counter
    rejected prometheus.Counter
    inflight prometheus.Gauge
    duration prometheus.Histogram
}
 
func New(workers, queueSize int, ns string) *Pool {
    return &Pool{
        workers: workers,
        jobs:    make(chan Job, queueSize),
        queued:   promauto.NewCounter(prometheus.CounterOpts{Namespace: ns, Name: "wp_jobs_queued_total"}),
        rejected: promauto.NewCounter(prometheus.CounterOpts{Namespace: ns, Name: "wp_jobs_rejected_total"}),
        inflight: promauto.NewGauge(prometheus.GaugeOpts{Namespace: ns, Name: "wp_jobs_inflight"}),
        duration: promauto.NewHistogram(prometheus.HistogramOpts{
            Namespace: ns, Name: "wp_job_duration_seconds",
            Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
        }),
    }
}

The Submit method — the core production decision is "what do you do when the queue is full?" Buffer-and-block hides the latency until OOM; reject-and-fail surfaces it immediately with a typed error the caller can act on:

var ErrPoolFull = errors.New("worker pool queue full; rejecting")
var ErrPoolClosed = errors.New("worker pool closed; not accepting new jobs")
 
// Submit returns ErrPoolFull immediately if the queue is at capacity.
// Callers can then increment a metric, drop the job, or apply backpressure
// upstream — all of which are valid choices, but they MUST be the caller's
// choice. Hiding the rejection inside the pool removes the lever the caller
// needs to react.
func (p *Pool) Submit(j Job) error {
    // RLock pairs with Shutdown's Lock so a send can never race close(p.jobs).
    // The closed flag alone is a TOCTOU trap: it can flip between the check and
    // the send, panicking with "send on closed channel".
    p.mu.RLock()
    defer p.mu.RUnlock()
    if p.closed.Load() {
        return ErrPoolClosed
    }
    select {
    case p.jobs <- j:
        p.queued.Inc()
        return nil
    default:
        p.rejected.Inc()
        return ErrPoolFull
    }
}

Workers and graceful drain — the channel-close + WaitGroup pattern that drains in-flight jobs while refusing new ones:

func (p *Pool) Start(ctx context.Context) {
    for i := 0; i < p.workers; i++ {
        p.wg.Add(1)
        go p.worker(ctx)
    }
}
 
func (p *Pool) worker(ctx context.Context) {
    defer p.wg.Done()
    for job := range p.jobs {
        p.inflight.Inc()
        timer := prometheus.NewTimer(p.duration)
        _ = job.Process(ctx)   // log err in real code
        timer.ObserveDuration()
        p.inflight.Dec()
    }
}
 
// Shutdown closes the input channel, waits for workers, and respects ctx.
// Use a context.WithTimeout(parent, 30*time.Second) to bound the wait.
func (p *Pool) Shutdown(ctx context.Context) error {
    if !p.closed.CompareAndSwap(false, true) {
        return ErrPoolClosed
    }
    // Take the write lock so no Submit is mid-send when we close the channel.
    p.mu.Lock()
    close(p.jobs)
    p.mu.Unlock()
    done := make(chan struct{})
    go func() { p.wg.Wait(); close(done) }()
    select {
    case <-done:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

A test that proves the rejection-on-overload semantics — without it, a refactor that buffer-and-blocks would silently regress your service's latency contract:

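// slowJob blocks in Process until block is closed; started reports pickup.
type slowJob struct{ started, block chan struct{} }

func (j *slowJob) Process(ctx context.Context) error {
    j.started <- struct{}{}
    <-j.block
    return nil
}

func TestPool_RejectsWhenFull(t *testing.T) {
    p := New(2, 4, "test") // 2 workers, queue size 4
    p.Start(context.Background())
    defer p.Shutdown(context.Background())

    blocker := &slowJob{started: make(chan struct{}, 6), block: make(chan struct{})}
    defer close(blocker.block) // runs before Shutdown, so the drain cannot hang

    // Submit never blocks, so park both workers first; otherwise the queue
    // can fill before any worker has dequeued a job and the test flakes.
    for i := 0; i < 6; i++ {
        if err := p.Submit(blocker); err != nil {
            t.Fatalf("submit %d should succeed: %v", i, err)
        }
        if i < 2 {
            <-blocker.started // wait until a worker holds this job
        }
    }
    // 7th submit MUST reject, not block.
    if err := p.Submit(blocker); !errors.Is(err, ErrPoolFull) {
        t.Fatalf("expected ErrPoolFull, got %v", err)
    }
}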

Comparing Go's Concurrency Primitives: Pool vs sync.Pool vs errgroup vs Library

Engineers reach for the wrong tool because the names overlap. sync.Pool is not a worker pool — it is a per-P object cache for reducing allocator pressure on hot objects (think: reusing 4KB byte buffers across HTTP handlers). It does not bound goroutine count, schedule jobs, or apply backpressure. Mixing the two in the same paragraph is the most common source of confusion in Go code reviews.
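
To make the distinction concrete, here is a minimal sketch of what sync.Pool is for: reusing a buffer across calls. Nothing in it limits how many goroutines may call render at once. The function name render and the greeting payload are illustrative assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool caches *bytes.Buffer values so hot paths avoid reallocating them.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// render borrows a buffer, uses it, and hands it back.
// Reset before Put so the next borrower never sees stale bytes.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()
	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so the result outlives the buffer
}

func main() {
	fmt.Println(render("gopher")) // hello, gopher
	fmt.Println(render("again"))  // hello, again
}
```

The pool may discard cached buffers at any garbage collection, which is why it suits scratch space and never state you need to keep.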

Here is how the five options compare:

| Option | Bounds goroutines? | Backpressure? | Best for | Failure mode if misused |
|---|---|---|---|---|
| go func() per task | No | No | Short-lived, low-volume work | OOM under sustained load |
| sync.Pool | No (object reuse only) | No | Hot allocation paths (buffers, structs) | Subtle data races if reset is wrong |
| errgroup.WithContext + SetLimit(n) | Yes (per group) | Implicit (g.Go blocks) | Bounded batch with fail-fast | Unbounded if SetLimit is forgotten |
| Hand-rolled channel + N workers | Yes | Yes (buffered chan) | Continuous streaming workloads | Manual lifecycle plumbing |
| panjf2000/ants and similar libraries | Yes | Yes (configurable) | Multi-tenant pools, dynamic resizing | Hides queue depth from your metrics |

Decision rule: continuous stream of jobs that arrive at unpredictable rates → hand-rolled pool. Bounded batch with a known item count where one failure should cancel siblings → errgroup with SetLimit. Multiple independent tenants sharing a single host → a third-party library (ants) so you do not reinvent dynamic resizing. Reusing allocations on a hot path → sync.Pool orthogonal to whichever scheduling primitive you picked.

When NOT to Use a Worker Pool

A worker pool is plumbing, and plumbing has cost: channel sends, scheduler context switches, lifecycle code, and extra long-lived goroutines to reason about. Three workloads where the simpler alternative is correct:

Pure CPU-bound work on a single tenant with a known input size. Wrap a parallel-for with errgroup.SetLimit(runtime.GOMAXPROCS(0)) and stop. No pool, no channels, no shutdown drain — just N goroutines that finish.

func resizeAll(ctx context.Context, paths []string) error {
    g, ctx := errgroup.WithContext(ctx)
    g.SetLimit(runtime.GOMAXPROCS(0))
    for _, p := range paths {
        p := p
        g.Go(func() error { return resize(ctx, p) })
    }
    return g.Wait()
}

Fan-out under a request context where every child must complete before the parent returns. A worker pool decouples lifetimes; here, you want them coupled. If the request is cancelled, every child should die immediately — and that is exactly what errgroup.WithContext gives you for free. Pushing this through a long-lived pool means orphan jobs continue executing after the HTTP handler has already returned 499.

Workloads with strict ordering or per-key serialisation. A single-worker channel (or one goroutine per key) is correct; a pool reorders work by definition. If two updates to the same user must apply in submission order, a 16-worker pool will race them.
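
If you need both parallelism and per-key ordering, a common compromise is to shard by key: hash each key to one of N single-consumer channels so that updates for the same key always reach the same worker. The sketch below illustrates that idea under assumptions; the update type, shardFor and applyInOrder are hypothetical names, and FNV-1a is just one reasonable hash choice.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

type update struct {
	Key string
	Seq int
}

// shardFor maps a key to one of n shards; a given key always maps to the
// same shard.
func shardFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

// applyInOrder routes each update to the single worker that owns its key
// and returns, per key, the order in which updates were applied.
func applyInOrder(updates []update, workers int) map[string][]int {
	shards := make([]chan update, workers)
	applied := make(map[string][]int)
	var mu sync.Mutex // guards applied, which all workers write to
	var wg sync.WaitGroup

	for i := range shards {
		shards[i] = make(chan update, 16)
		wg.Add(1)
		go func(in <-chan update) {
			defer wg.Done()
			for u := range in {
				mu.Lock()
				applied[u.Key] = append(applied[u.Key], u.Seq)
				mu.Unlock()
			}
		}(shards[i])
	}
	for _, u := range updates {
		shards[shardFor(u.Key, workers)] <- u
	}
	for _, ch := range shards {
		close(ch)
	}
	wg.Wait()
	return applied
}

func main() {
	updates := []update{{"alice", 1}, {"bob", 1}, {"alice", 2}, {"bob", 2}, {"alice", 3}}
	fmt.Println(applyInOrder(updates, 4)["alice"]) // [1 2 3]
}
```

Ordering holds only within a key. Updates for different keys can still interleave, and one hot key can saturate its shard while the others sit idle.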

Production Case Study: The Pool That Made the OOM Worse

A payments team I worked with hit a familiar incident: their notification worker pool was OOMing every 40 minutes under peak load. The instinct was to "give the pool more headroom" — they raised the buffered channel from 1,000 to 50,000 jobs. The OOMs got worse and arrived faster.

The trap: a larger queue does not add capacity. It adds latency you cannot see until memory runs out. With 50,000 queued jobs each holding a 4KB payload plus referenced context state, the queue alone weighed 200MB+, while the actual processing rate was unchanged. When the producer was a Kafka consumer reading at 20k msg/s and the workers drained at 8k msg/s, the queue filled in 4 seconds and stayed pinned at the new ceiling — except now the ceiling was 50× higher, so OOM came 50× more violently.
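
The arithmetic behind those numbers is easy to check. The sketch below recomputes the fill time and the payload memory from the figures in the incident, taking 4KB as 4,000 bytes; the helper names secondsToFill and queueBytes are illustrative.

```go
package main

import "fmt"

// secondsToFill returns how long a queue of the given capacity takes to fill
// when producers outpace workers. It returns -1 if the queue never fills.
func secondsToFill(capacity, producePerSec, drainPerSec float64) float64 {
	net := producePerSec - drainPerSec
	if net <= 0 {
		return -1
	}
	return capacity / net
}

// queueBytes is the payload memory pinned by a full queue.
func queueBytes(capacity, payloadBytes int) int {
	return capacity * payloadBytes
}

func main() {
	// 50,000 slots, 20k msg/s in, 8k msg/s out.
	fmt.Printf("fill time: %.1fs\n", secondsToFill(50_000, 20_000, 8_000)) // fill time: 4.2s
	fmt.Println("old queue bytes:", queueBytes(1_000, 4_000))              // old queue bytes: 4000000
	fmt.Println("new queue bytes:", queueBytes(50_000, 4_000))             // new queue bytes: 200000000
}
```

The bigger queue bought about four seconds of delay and cost 50 times the pinned memory; throughput did not change at all.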

The fix was the opposite direction: shrink the queue and reject on full. Replace silent buffering with a typed ErrPoolFull, surface the rejection rate as a Prometheus counter, and let the upstream Kafka consumer pause partition reads when rejections climb:

func (p *Pool) SubmitOrReject(ctx context.Context, j Job) error {
    select {
    case p.jobs <- j:
        p.queued.Inc()
        return nil
    case <-ctx.Done():
        return ctx.Err()
    default:
        p.rejected.Inc()
        return ErrPoolFull
    }
}
 
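// Upstream consumer reacts to rejection by pausing the partition
// instead of accumulating more pressure on the pool.
// c.Messages() stands in for your client's read loop; adapt it to the
// Kafka library you use.
func consumeLoop(ctx context.Context, c *kafka.Consumer, p *Pool) {
    for msg := range c.Messages() {
        job := jobFor(msg)
        // Retry the same message until the pool accepts it; moving on after
        // ErrPoolFull would silently drop the message.
        for {
            err := p.SubmitOrReject(ctx, job)
            if err == nil {
                break
            }
            if !errors.Is(err, ErrPoolFull) {
                return // ctx cancelled: stop consuming
            }
            c.Pause([]kafka.TopicPartition{msg.TopicPartition})
            time.Sleep(50 * time.Millisecond)
            c.Resume([]kafka.TopicPartition{msg.TopicPartition})
        }
    }
}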

After the change, the pool stayed at a steady 1k queue depth, rejection rate became the operational signal (alert at >1% sustained), and the OOMs disappeared. The lesson generalises: a buffered channel hides the latency-to-OOM curve; a bounded queue with rejection forces the contract into the open. [Netflix concurrency-limits]

Frequently Asked Questions

How many workers should a Go worker pool have?

Size the pool to match the downstream bottleneck: use runtime.NumCPU() for CPU-bound work, the database connection pool size for DB-bound work, or the external API rate limit for network-bound work.

What is the difference between a worker pool and a semaphore in Go?

A worker pool uses long-lived goroutines reading from a shared channel, decoupling producers from consumers. A semaphore (a buffered chan struct{}) bounds concurrency but spawns a new goroutine per task. That is simpler, but there is no job queue: producers block on acquire as soon as the limit is reached, and there is no buffer to absorb bursts or to monitor.
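
For comparison, here is a minimal sketch of the semaphore style using a buffered channel of empty structs. It assumes trivial tasks and records the peak concurrency it observes so the bound is visible; runWithSemaphore is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runWithSemaphore runs tasks with at most limit in flight and returns the
// highest concurrency it observed.
func runWithSemaphore(tasks, limit int) int {
	sem := make(chan struct{}, limit)
	var inFlight, peak atomic.Int64
	var wg sync.WaitGroup

	for i := 0; i < tasks; i++ {
		sem <- struct{}{} // acquire: blocks the producer once limit tasks run
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when the task ends

			n := inFlight.Add(1)
			for { // record the new peak if this task raised it
				old := peak.Load()
				if n <= old || peak.CompareAndSwap(old, n) {
					break
				}
			}
			inFlight.Add(-1)
		}()
	}
	wg.Wait()
	return int(peak.Load())
}

func main() {
	fmt.Println(runWithSemaphore(100, 4) <= 4) // true
}
```

Each task still pays for a goroutine start, which is fine for occasional batches and wasteful for a continuous stream.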

How do you gracefully shut down a Go worker pool?

Close the jobs channel to signal no more work, then use sync.WaitGroup to wait for all workers to finish their current tasks. For immediate shutdown, pass a context.Context and select on ctx.Done() alongside the jobs channel.

When should you use errgroup instead of a worker pool in Go?

Use errgroup for fan-out/fan-in where you run N tasks concurrently and wait for all to complete. Use a worker pool when tasks arrive continuously and you need sustained concurrency control with backpressure.
