#ai #llm #go #api-design #observability #performance #production #prompt-engineering

LLM API Integration Patterns for Backend Engineers

BackendBytes Engineering Team

Mar 1, 2026

14 min read

Part of Series: AI Engineering in Production

Lesson 3 of 6

Key Takeaways

→Building a provider abstraction from day one lets you swap models and layer retry/fallback/cost tracking without touching business logic
→Streaming responses directly to clients with `X-Accel-Buffering: no` prevents reverse-proxy buffering from turning incremental delivery into buffered waits
→Pre-count tokens before API calls to catch context-window overflows client-side before paying for 400 errors; a token counter costs microseconds, a failed API call costs milliseconds
→Exponential backoff respecting `Retry-After` headers beats fixed-interval retries for handling provider rate limits and transient failures
→Cost circuit breaker on monthly spend prevents a single errant loop from turning $100/mo into $10K/mo — set the kill switch at 90% of budget

Every LLM API tutorial is five lines of code that will fail in production in at least six different ways.

The easy part is the API call. The hard part is everything around it: retrying on rate limits, streaming responses without buffering, validating function-call arguments, budgeting tokens to avoid truncation, tracking cost to prevent runaway spend (the OWASP LLM10 unbounded-consumption risk^{[OWASP LLM Top 10]}), and adding enough observability to diagnose failures.

TL;DR

Build a provider abstraction from day one, wrap it with retry and fallback logic, count tokens before every call, stream responses instead of buffering, track costs against a monthly budget, and instrument with structured logs and metrics. Real production code in Go — all patterns apply equally to Anthropic, OpenAI, and any LLM provider.

Provider interface: Single seam for swapping models and implementing cross-cutting concerns (retry, fallback, cost tracking, tracing)
Streaming + SSE: Deliver responses incrementally; flush to clients immediately with X-Accel-Buffering: no to prevent reverse-proxy buffering
Token budgets: Pre-count tokens before calling the API; catch context-window overflows client-side before paying for 400 errors
Cost circuit breaker: Record spend per model via Prometheus^{[Prometheus Best Practices]}; trigger kill switch at 90% monthly threshold to prevent surprise bills
Retry + fallback: Exponential backoff respecting Retry-After headers; fall back to cheaper models when primary is rate-limited or slow

graph LR
    BL[Business logic:<br/>tools, prompts, orchestration] --> Obs[Observable wrapper:<br/>traces, metrics, cost]
    Obs --> Fb[Fallback chain:<br/>GPT-4o → Sonnet → mini]
    Fb --> Rt[Retry wrapper:<br/>exp backoff, Retry-After]
    Rt --> P[Provider interface:<br/>Complete + Stream]
    P --> Imp1[OpenAI impl]
    P --> Imp2[Anthropic impl]
    P --> Imp3[Ollama impl]
    Cost[Cost circuit breaker<br/>$/month kill switch] -.->|cuts off| Obs
    Tok[Token pre-counter] -.->|gates| BL
    style BL fill:#eef
    style Cost fill:#fee
    style Tok fill:#fee

The diagram is the layered architecture in one picture: business logic talks to a single Provider interface, with observable / fallback / retry wrappers stacked between. Cost circuit-breaker and token pre-counter are the kill switches that protect the wallet from runaway loops — they're not part of the call path; they cut into it from outside.

The Quick Start: Integration Pattern Architecture

Every production LLM integration needs five layers. Build them in order; each is independently useful.

Layer	Purpose	Example Pattern
Base provider	HTTP client, connection pooling, provider-specific marshaling	OpenAI, Anthropic, Ollama
Retry wrapper	Exponential backoff with jitter; respects `Retry-After` headers	429, 500–503, network timeouts
Fallback chain	Try primary model, then cheaper alternatives on failure	GPT-4o → Claude Sonnet → GPT-4o-mini
Observable wrapper	Traces, metrics, cost tracking, structured logs	OTel spans, Prometheus counters, JSON logs
Business logic	Tool calling, prompt templates, streaming orchestration	Handler, Orchestrator, tool registry

Build the base provider and retry logic first. Add streaming when latency matters. Layer cost tracking before production. Build prompts as a registry as volume grows.

Provider Abstraction

The first mistake teams make is scattering openai.NewClient() calls across their codebase. When you later need to add retry logic, fail over to Claude during an OpenAI rate-limit incident, or track costs, you end up editing dozens of files.

Start with a single interface that abstracts provider differences. This seam becomes the place where you compose retry, fallback, observability, and cost tracking without touching business logic:

package llm
 
import (
	"context"
	"io"
)
 
type Provider interface {
	Complete(ctx context.Context, req *CompletionRequest) (*CompletionResponse, error)
	Stream(ctx context.Context, req *CompletionRequest) (StreamReader, error)
}
 
type StreamReader interface {
	Next() (StreamChunk, error)
	Close() error
}
 
type CompletionRequest struct {
	// JSON tags are required: without them json.Marshal emits "Model"/"MaxTokens"
	// and the API 400s. These field names (model, messages, max_tokens, stream …)
	// are shared by OpenAI and Anthropic, so the type stays provider-neutral;
	// provider-specific quirks belong in each provider's request mapping.
	Model       string           `json:"model"`
	Messages    []Message        `json:"messages"`
	Tools       []ToolDefinition `json:"tools,omitempty"`
	MaxTokens   int              `json:"max_tokens,omitempty"`
	Temperature float64          `json:"temperature,omitempty"`
	Stream      bool             `json:"stream,omitempty"`
}
 
type CompletionResponse struct {
	Content      string
	ToolCalls    []ToolCall
	Usage        TokenUsage
	Model        string
	FinishReason string
}
 
type TokenUsage struct {
	PromptTokens     int
	CompletionTokens int
	TotalTokens      int
}

Concrete OpenAI implementation with HTTP/2 pooling:

type OpenAIProvider struct {
	apiKey     string
	baseURL    string
	httpClient *http.Client
}
 
func NewOpenAIProvider(apiKey string) *OpenAIProvider {
	return &OpenAIProvider{
		apiKey:  apiKey,
		baseURL: "https://api.openai.com/v1",
		httpClient: &http.Client{
			Timeout: 120 * time.Second,
			Transport: &http.Transport{
				MaxIdleConns:        100,
				MaxIdleConnsPerHost: 20,
				IdleConnTimeout:     90 * time.Second,
				ForceAttemptHTTP2:   true, // HTTP/2 multiplexing
			},
		},
	}
}
 
func (p *OpenAIProvider) Complete(ctx context.Context, req *CompletionRequest) (*CompletionResponse, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return nil, fmt.Errorf("marshal request: %w", err)
	}
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost,
		p.baseURL+"/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	httpReq.Header.Set("Authorization", "Bearer "+p.apiKey)
	httpReq.Header.Set("Content-Type", "application/json")
 
	resp, err := p.httpClient.Do(httpReq)
	if err != nil {
		return nil, fmt.Errorf("execute: %w", err)
	}
	defer resp.Body.Close()
 
	if resp.StatusCode != http.StatusOK {
		return nil, parseAPIError(resp)
	}
 
	var result openAIResponse
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	return result.toCompletionResponse(), nil
}

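Set ForceAttemptHTTP2: true explicitly. Go's net/http conservatively disables HTTP/2 on a custom Transport once you configure a custom dialer or TLSClientConfig, and over HTTP/1.1 every in-flight request needs its own connection, which means extra TLS handshakes and pool exhaustion under load.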

Streaming via SSE

Non-streaming completions buffer the entire response in memory, making users wait 10–30 seconds watching a spinner. Streaming via Server-Sent Events (SSE) is the difference between "thinking…" and watching tokens arrive in real time.

Implement a StreamReader that yields chunks as they arrive, then proxy each chunk directly to the client with an immediate flush. This pattern works the same way for OpenAI, Anthropic, or any provider that supports streaming:

func (p *OpenAIProvider) Stream(ctx context.Context, req *CompletionRequest) (StreamReader, error) {
	// Copy the request so the caller's struct is not mutated.
	streamReq := *req
	streamReq.Stream = true
	body, err := json.Marshal(&streamReq)
	if err != nil {
		return nil, fmt.Errorf("marshal request: %w", err)
	}
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost,
		p.baseURL+"/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	httpReq.Header.Set("Authorization", "Bearer "+p.apiKey)
	httpReq.Header.Set("Content-Type", "application/json")
	httpReq.Header.Set("Accept", "text/event-stream")
 
	resp, err := p.httpClient.Do(httpReq)
	if err != nil {
		return nil, fmt.Errorf("execute: %w", err)
	}
	if resp.StatusCode != http.StatusOK {
		resp.Body.Close()
		return nil, parseAPIError(resp)
	}
 
	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	return &openAIStreamReader{scanner: scanner, resp: resp}, nil
}
 
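func (r *openAIStreamReader) Next() (StreamChunk, error) {
	for r.scanner.Scan() {
		line := r.scanner.Text()
		// Skip blank separators, comments, and non-data fields.
		if !strings.HasPrefix(line, "data: ") {
			continue
		}
		data := strings.TrimPrefix(line, "data: ")
		if data == "[DONE]" {
			return StreamChunk{}, io.EOF
		}
		var chunk openAIStreamChunk
		if err := json.Unmarshal([]byte(data), &chunk); err != nil {
			return StreamChunk{}, fmt.Errorf("decode chunk: %w", err)
		}
		return chunk.toStreamChunk(), nil
	}
	// Scan stops on EOF or on a read error; surface the latter.
	if err := r.scanner.Err(); err != nil {
		return StreamChunk{}, fmt.Errorf("read stream: %w", err)
	}
	return StreamChunk{}, io.EOF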
}
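
The parsing loop can be exercised without a network by feeding it a canned SSE body. This is a sketch under assumptions: `sseChunk` mirrors only the `choices[].delta.content` part of an OpenAI-style chunk, and `readSSE` and `collectSSE` are hypothetical helpers, not part of the provider above.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// sseChunk mirrors only the fields this example reads (assumed shape).
type sseChunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// readSSE concatenates the content deltas of an SSE stream until [DONE].
func readSSE(r io.Reader) (string, error) {
	sc := bufio.NewScanner(r)
	var out strings.Builder
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "data: ") {
			continue // blank separators, comments, other fields
		}
		data := strings.TrimPrefix(line, "data: ")
		if data == "[DONE]" {
			return out.String(), nil
		}
		var c sseChunk
		if err := json.Unmarshal([]byte(data), &c); err != nil {
			return out.String(), fmt.Errorf("decode chunk: %w", err)
		}
		for _, ch := range c.Choices {
			out.WriteString(ch.Delta.Content)
		}
	}
	if err := sc.Err(); err != nil {
		return out.String(), err
	}
	// The body ended without the [DONE] sentinel: treat it as truncated.
	return out.String(), io.ErrUnexpectedEOF
}

// collectSSE is a convenience wrapper for string input.
func collectSSE(body string) (string, error) {
	return readSSE(strings.NewReader(body))
}

func main() {
	body := "data: {\"choices\":[{\"delta\":{\"content\":\"Hel\"}}]}\n\n" +
		"data: {\"choices\":[{\"delta\":{\"content\":\"lo\"}}]}\n\n" +
		"data: [DONE]\n\n"
	text, err := collectSSE(body)
	fmt.Printf("%q %v\n", text, err) // prints "Hello" <nil>
}
```

Treating a missing [DONE] as an error is a design choice in this sketch; it lets callers tell a truncated stream apart from a clean finish.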

Proxy to HTTP clients with immediate flush:

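func (h *ChatHandler) StreamCompletion(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	// Open the upstream stream before writing headers so a failure can
	// still be reported with a proper status code.
	stream, err := h.provider.Stream(r.Context(), parseRequest(r))
	if err != nil {
		http.Error(w, "upstream failed", http.StatusBadGateway)
		return
	}
	defer stream.Close()

	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	w.Header().Set("X-Accel-Buffering", "no") // disable nginx buffering

	for {
		chunk, err := stream.Next()
		if errors.Is(err, io.EOF) {
			fmt.Fprint(w, "data: [DONE]\n\n")
			flusher.Flush()
			return
		}
		if err != nil {
			return // upstream read failed mid-stream; headers are already sent
		}
		data, err := json.Marshal(chunk)
		if err != nil {
			return
		}
		fmt.Fprintf(w, "data: %s\n\n", data)
		flusher.Flush()
	}
}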

The X-Accel-Buffering: no header matters whenever nginx sits in front of the service. With proxy buffering on (the default), nginx holds upstream output in its buffers and forwards it in batches, so the client sees long pauses instead of a token stream, and nothing logs an error.

Retry with Exponential Backoff

The retry decision tree — route by error type, never blindly:

graph TD
    Err[LLM API error] --> Type{HTTP status?}
    Type -->|429 Rate limit| Header{Retry-After<br/>header set?}
    Header -->|Yes| Wait[Sleep header value<br/>then retry]
    Header -->|No| Backoff[Exponential backoff<br/>+ full jitter<br/>cap at max_delay]
    Type -->|500, 502, 503, 504| Backoff
    Type -->|Network timeout<br/>or connection reset| Backoff
    Type -->|400 with content_filter| NoRetry[NEVER retry<br/>same prompt fails again<br/>log + return error]
    Type -->|400 invalid_request| NoRetry
    Type -->|401 / 403 auth| NoRetry
    Backoff --> Counter{Attempts<br/>under max?}
    Counter -->|Yes| Try[Retry call]
    Counter -->|No — exhausted| Final[Return final error<br/>open circuit breaker<br/>fall back to cache or<br/>graceful degraded response]
    Wait --> Counter
    Try --> Type
    style NoRetry fill:#fdd
    style Final fill:#fdd
    style Wait fill:#dfd
    style Backoff fill:#ffd

LLM APIs fail in well-understood ways: 429 (rate limit), 500–504 (transient server errors), network timeouts, and occasional malformed responses. The retry strategy is not to blindly retry everything, but to distinguish retryable failures from permanent ones.

Rate-limit responses (429) usually carry a Retry-After header telling you how long to wait; when it is missing, fall back to exponential backoff. Server errors (500, 502, 503, 504) are usually transient, so wait a few seconds and try again. Invalid requests (400), auth failures (401/403), and content-filter rejections (a 400 with a specific error message) should not be retried, because the same request will fail the same way.

Wrap the provider with retry logic that respects these distinctions:

type RetryProvider struct {
	inner  Provider
	config RetryConfig
	logger *slog.Logger
}
 
type RetryConfig struct {
	MaxRetries     int
	InitialBackoff time.Duration
	MaxBackoff     time.Duration
}
 
func (r *RetryProvider) Complete(ctx context.Context, req *CompletionRequest) (*CompletionResponse, error) {
	var lastErr error
 
	for attempt := 0; attempt <= r.config.MaxRetries; attempt++ {
		if attempt > 0 {
			backoff := r.calculateBackoff(attempt, lastErr)
			r.logger.Info("retrying", "attempt", attempt, "backoff", backoff)
			select {
			case <-ctx.Done():
				return nil, ctx.Err()
			case <-time.After(backoff):
			}
		}
 
		resp, err := r.inner.Complete(ctx, req)
		if err == nil {
			return resp, nil
		}
 
		lastErr = err
		if !isRetryable(err) {
			return nil, err // fail immediately on non-retryable errors
		}
	}
 
	return nil, fmt.Errorf("exhausted %d retries: %w", r.config.MaxRetries, lastErr)
}
 
func (r *RetryProvider) calculateBackoff(attempt int, err error) time.Duration {
	// Respect Retry-After header. errors.As also matches wrapped errors,
	// which a plain type assertion would miss.
	var apiErr *APIError
	if errors.As(err, &apiErr) && apiErr.RetryAfter > 0 {
		return apiErr.RetryAfter
	}
 
	// Exponential backoff with full jitter
	base := float64(r.config.InitialBackoff) * math.Pow(2, float64(attempt-1))
	if base > float64(r.config.MaxBackoff) {
		base = float64(r.config.MaxBackoff)
	}
	return time.Duration(rand.Float64() * base)
}
 
func isRetryable(err error) bool {
	var apiErr *APIError
	if errors.As(err, &apiErr) {
		switch apiErr.StatusCode {
		case 429, 500, 502, 503, 504:
			return true
		}
		return false
	}
	return !errors.Is(err, context.Canceled)
}
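
The retry code relies on an APIError that carries a status code and a parsed Retry-After value, but the article does not show how it is built. Here is one way it could look, as a sketch: the `APIError` fields beyond StatusCode and RetryAfter, and the helper `parseRetryAfter`, are assumptions. Retry-After may be either a number of seconds or an HTTP date, so both forms are handled.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// APIError is an assumed shape; the article only uses StatusCode and RetryAfter.
type APIError struct {
	StatusCode int
	RetryAfter time.Duration
	Body       string
}

func (e *APIError) Error() string {
	return fmt.Sprintf("llm api: status %d: %s", e.StatusCode, e.Body)
}

// parseRetryAfter accepts delay-seconds or an HTTP date. It returns 0 when
// the header is absent, malformed, negative, or already in the past.
func parseRetryAfter(h string) time.Duration {
	h = strings.TrimSpace(h)
	if h == "" {
		return 0
	}
	if secs, err := strconv.Atoi(h); err == nil {
		if secs < 0 {
			return 0
		}
		return time.Duration(secs) * time.Second
	}
	if t, err := http.ParseTime(h); err == nil {
		if d := time.Until(t); d > 0 {
			return d
		}
	}
	return 0
}

// parseAPIError turns a non-200 response into an *APIError.
func parseAPIError(resp *http.Response) *APIError {
	// Cap how much of the error body is kept in memory.
	body, _ := io.ReadAll(io.LimitReader(resp.Body, 4<<10))
	return &APIError{
		StatusCode: resp.StatusCode,
		RetryAfter: parseRetryAfter(resp.Header.Get("Retry-After")),
		Body:       strings.TrimSpace(string(body)),
	}
}

func main() {
	resp := &http.Response{
		StatusCode: http.StatusTooManyRequests,
		Header:     http.Header{"Retry-After": []string{"7"}},
		Body:       io.NopCloser(strings.NewReader(`{"error":"rate_limit"}`)),
	}
	apiErr := parseAPIError(resp)
	fmt.Println(apiErr)
	fmt.Println(apiErr.RetryAfter) // prints 7s
}
```

In practice it is worth capping the returned delay at MaxBackoff as well, so a provider that asks for a very long wait cannot stall a request indefinitely.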

When the primary model exhausts retries (or is consistently slow), fall back to a cheaper alternative. Don't fail the request—degrade gracefully:

type FallbackProvider struct {
	chain []ProviderWithModel
}
 
type ProviderWithModel struct {
	Provider Provider
	Model    string
}
 
func (f *FallbackProvider) Complete(ctx context.Context, req *CompletionRequest) (*CompletionResponse, error) {
	var lastErr error
	for _, pm := range f.chain {
		reqCopy := *req
		reqCopy.Model = pm.Model
		resp, err := pm.Provider.Complete(ctx, &reqCopy)
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("all providers failed: %w", lastErr)
}

Typical fallback chain: GPT-4o (best) → Claude Sonnet (if OpenAI is rate-limited) → GPT-4o-mini (cheapest, last resort). Each provider is wrapped with its own retry logic; the fallback chain lets you survive outages by switching models transparently.

Token Budgets

LLM context windows are finite. GPT-4o has 128K; Claude has 200K. But add RAG context, conversation history, tool definitions, and system prompts—the tokens add up fast. Exceeding the limit silently truncates input or returns a 400 error.

Count tokens before making the API call. Catch context-window overflows client-side before paying for 400 errors:

import "github.com/pkoukk/tiktoken-go"
 
type TokenCounter struct {
	encodings map[string]*tiktoken.Tiktoken
}
 
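func (tc *TokenCounter) CountMessages(model string, messages []Message) (int, error) {
	// Approximate framing overhead for OpenAI chat models; varies by model.
	const perMessage = 4
	const replyPriming = 2

	// Resolve the encoding once, not per message. tiktoken only knows OpenAI
	// models, so unknown models return an error instead of a nil encoder.
	enc, err := tiktoken.EncodingForModel(model)
	if err != nil {
		return 0, fmt.Errorf("encoding for %q: %w", model, err)
	}

	total := replyPriming
	for _, msg := range messages {
		total += len(enc.Encode(msg.Content, nil, nil)) + perMessage
	}
	return total, nil
}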

When a conversation exceeds the budget, use a sliding-window strategy: always keep the system prompt, drop the oldest user/assistant turns first, and preserve recent context:

func SlidingWindow(messages []Message, maxTokens int, counter *TokenCounter, model string) []Message {
	var system, conversation []Message
	for _, m := range messages {
		if m.Role == "system" {
			system = append(system, m)
		} else {
			conversation = append(conversation, m)
		}
	}
 
	result := append([]Message(nil), system...)
	budget := maxTokens
	systemTokens, _ := counter.CountMessages(model, system)
	budget -= systemTokens
 
	// Walk newest-to-oldest, keeping each turn that still fits the budget.
	// Prepend kept turns into `kept` so chronological order is preserved;
	// never mutate `conversation` while iterating it.
	var kept []Message
	for i := len(conversation) - 1; i >= 0 && budget > 0; i-- {
		msgTokens, _ := counter.CountMessages(model, []Message{conversation[i]})
		if msgTokens > budget {
			break
		}
		kept = append([]Message{conversation[i]}, kept...)
		budget -= msgTokens
	}
	return append(result, kept...)
}

Simple and predictable—works well for chat where recent context matters most. The downside: important information from early in the conversation is lost. For tool-calling workflows, you can also use prioritization (system prompt + tool results are critical; older assistant messages are lowest priority).
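
The prioritization idea can be sketched as follows. Everything here is an assumption made for illustration: the role ranking, the name `TrimByPriority`, and `estimateTokens`, which is a crude stand-in of about four characters per token so the example runs without a tokenizer.

```go
package main

import "fmt"

type Message struct {
	Role    string
	Content string
}

// estimateTokens is a rough placeholder; use a real token counter in practice.
func estimateTokens(m Message) int {
	return len(m.Content)/4 + 4
}

// priority ranks roles: higher values survive longer (assumed ranking).
func priority(role string) int {
	switch role {
	case "system":
		return 3
	case "tool":
		return 2
	case "user":
		return 1
	default: // assistant and anything unknown
		return 0
	}
}

// TrimByPriority drops the lowest-priority messages first, oldest first
// within a level, until the estimate fits maxTokens. System messages are
// never dropped, and survivors keep their original order.
func TrimByPriority(messages []Message, maxTokens int) []Message {
	total := 0
	for _, m := range messages {
		total += estimateTokens(m)
	}
	dropped := make([]bool, len(messages))
	for level := 0; level < 3 && total > maxTokens; level++ {
		for i, m := range messages {
			if total <= maxTokens {
				break
			}
			if priority(m.Role) == level {
				dropped[i] = true
				total -= estimateTokens(m)
			}
		}
	}
	var kept []Message
	for i, m := range messages {
		if !dropped[i] {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	history := []Message{
		{Role: "system", Content: "You answer order questions."},
		{Role: "assistant", Content: "Earlier small talk that no longer matters."},
		{Role: "user", Content: "Where is my order?"},
		{Role: "tool", Content: `{"status":"shipped"}`},
		{Role: "assistant", Content: "It shipped yesterday."},
	}
	for _, m := range TrimByPriority(history, 30) {
		fmt.Println(m.Role)
	}
}
```

One caveat: chat APIs generally expect a tool result to follow the assistant message that requested it, so a production version would drop such pairs together rather than one message at a time.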

Cost Tracking

Without cost tracking, a runaway loop (a model endlessly retrying, oversized RAG context, ever-growing conversation histories) can run up a $2,000 bill before anyone notices. Record token usage per model and trip a kill switch at 90% of the monthly budget: ^{[OWASP LLM Top 10]}

var (
	tokenUsage = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "llm_tokens_total",
	}, []string{"model", "type"})
 
	requestCost = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "llm_cost_dollars_total",
	}, []string{"model"})
)
 
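// USD per million tokens (input, output). Prices change; check each
// provider's current price list before relying on these numbers.
var pricing = map[string]struct{ Input, Output float64 }{
	"gpt-4o":        {2.50, 10.00},
	"gpt-4o-mini":   {0.15, 0.60},
	"claude-sonnet": {3.00, 15.00},
	"claude-haiku":  {1.00, 5.00},
}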
 
type CostTracker struct {
	monthlyBudget float64
	currentSpend  float64
	periodMonth   time.Month // resets currentSpend when the calendar month rolls
	mu            sync.Mutex
	killSwitch    func()
}
 
func (ct *CostTracker) Record(model string, usage TokenUsage) {
	price, ok := pricing[model]
	if !ok {
		// An unknown model would otherwise be billed at $0 and hide real spend.
		slog.Warn("no pricing for model; cost not tracked", "model", model)
	}
	cost := (float64(usage.PromptTokens)/1_000_000)*price.Input +
		(float64(usage.CompletionTokens)/1_000_000)*price.Output

	tokenUsage.WithLabelValues(model, "prompt").Add(float64(usage.PromptTokens))
	tokenUsage.WithLabelValues(model, "completion").Add(float64(usage.CompletionTokens))
	requestCost.WithLabelValues(model).Add(cost)

	ct.mu.Lock()
	defer ct.mu.Unlock()

	// Roll the window each calendar month so "monthly budget" is real. Without
	// this, currentSpend only ever grows and the kill switch stays tripped after
	// month one. This counter is per-replica; the authoritative cross-fleet
	// monthly figure is sum(requestCost) in Prometheus, so drive fleet-wide
	// enforcement off that (or a shared store) rather than this local total.
	if m := time.Now().UTC().Month(); m != ct.periodMonth {
		ct.periodMonth = m
		ct.currentSpend = 0
	}

	ct.currentSpend += cost
	// Trip at 90% of budget to leave headroom for in-flight requests.
	// killSwitch runs on every call past the threshold, so keep it idempotent.
	if ct.currentSpend >= 0.9*ct.monthlyBudget && ct.killSwitch != nil {
		slog.Error("LLM budget threshold reached",
			"spend", ct.currentSpend, "budget", ct.monthlyBudget)
		ct.killSwitch()
	}
}
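
What the kill switch actually does is left open above. One option, sketched here under assumptions, is a gate that every LLM call checks before spending tokens. `BudgetGate`, `ErrBudgetExhausted`, and `costUSD` are hypothetical names, and the rates in `main` are the gpt-4o figures from the pricing table.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var ErrBudgetExhausted = errors.New("llm budget exhausted")

// BudgetGate (hypothetical) is the flag a killSwitch callback could flip.
type BudgetGate struct {
	tripped atomic.Bool
}

func (g *BudgetGate) Trip()  { g.tripped.Store(true) }
func (g *BudgetGate) Reset() { g.tripped.Store(false) }

// Allow fails fast once the gate is tripped, before any tokens are spent.
func (g *BudgetGate) Allow() error {
	if g.tripped.Load() {
		return ErrBudgetExhausted
	}
	return nil
}

// costUSD prices one call; rates are USD per million tokens.
func costUSD(promptTokens, completionTokens int, inRate, outRate float64) float64 {
	return float64(promptTokens)/1e6*inRate + float64(completionTokens)/1e6*outRate
}

func main() {
	const budget = 100.0 // USD per month
	gate := &BudgetGate{}

	// A runaway loop: 12,000 prompt tokens and 800 completion tokens per call.
	perCall := costUSD(12000, 800, 2.50, 10.00)
	spend, calls := 0.0, 0
	for gate.Allow() == nil {
		spend += perCall
		calls++
		if spend >= 0.9*budget {
			gate.Trip() // what CostTracker.killSwitch would do
		}
	}
	fmt.Printf("per call $%.3f, gate tripped after %d calls at $%.2f\n", perCall, calls, spend)
}
```

Because the gate is a single atomic flag, checking it on the hot path is cheap, and resetting it at the start of a new billing period is a one-line operation.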

Function Calling

Modern LLM APIs let models request tool execution instead of just generating text. The model says "I need to look up an order" and you execute order_lookup, then feed the result back to the model so it can generate a final response.

The key principle: always validate tool arguments against a schema before execution. The model generates structured JSON, but it can hallucinate invalid order IDs, SQL injection attempts, or out-of-range values. Never trust the model's JSON—parse it, validate it, execute it safely.

Define a registry mapping tool names to handlers and schemas:

type ToolDefinition struct {
	Name        string          `json:"name"`
	Description string          `json:"description"`
	Parameters  json.RawMessage `json:"parameters"`
}
 
type ToolCall struct {
	ID        string
	Name      string
	Arguments string
}
 
type Registry struct {
	tools    map[string]ToolDefinition
	handlers map[string]func(context.Context, json.RawMessage) (string, error)
}
 
func (r *Registry) Execute(ctx context.Context, call ToolCall) string {
	handler, ok := r.handlers[call.Name]
	if !ok {
		return "unknown tool"
	}
 
	var args json.RawMessage
	if err := json.Unmarshal([]byte(call.Arguments), &args); err != nil {
		return "invalid arguments"
	}
 
	// 30-second timeout on tool execution
	toolCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
 
	result, err := handler(toolCtx, args)
	if err != nil {
		return "execution failed: " + err.Error()
	}
	return result
}
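
Execute above only checks that the arguments are syntactically valid JSON; the schema check itself belongs in front of each handler. As a sketch, hand-written checks can stand in for a JSON Schema validator. The `order_lookup` argument shape, the `ORD-` ID format, and the limit range below are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
)

// orderLookupArgs is a hypothetical argument struct for an order_lookup tool.
type orderLookupArgs struct {
	OrderID string `json:"order_id"`
	Limit   int    `json:"limit"`
}

// Assumed ID format; anything else (including injection attempts) is rejected.
var orderIDPattern = regexp.MustCompile(`^ORD-[0-9]{6}$`)

// parseOrderLookupArgs decodes and validates model-generated arguments.
func parseOrderLookupArgs(raw json.RawMessage) (orderLookupArgs, error) {
	var args orderLookupArgs
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields() // reject fields the tool does not declare
	if err := dec.Decode(&args); err != nil {
		return args, fmt.Errorf("decode arguments: %w", err)
	}
	if !orderIDPattern.MatchString(args.OrderID) {
		return args, errors.New("order_id must look like ORD-123456")
	}
	if args.Limit < 1 || args.Limit > 50 {
		return args, errors.New("limit must be between 1 and 50")
	}
	return args, nil
}

func main() {
	inputs := []string{
		`{"order_id":"ORD-123456","limit":5}`,
		`{"order_id":"1; DROP TABLE orders","limit":5}`,
		`{"order_id":"ORD-123456","limit":5000}`,
		`{"order_id":"ORD-123456","limit":5,"admin":true}`,
	}
	for _, in := range inputs {
		_, err := parseOrderLookupArgs(json.RawMessage(in))
		fmt.Printf("%s -> %v\n", in, err)
	}
}
```

Returning the validation message to the model as the tool result, rather than failing the whole request, often lets it correct the arguments on the next turn.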

Production Checklist

Provider abstraction: All calls via a single interface, not scattered openai.NewClient() calls
HTTP/2 enabled: ForceAttemptHTTP2: true on your transport
Retries with backoff: Exponential backoff respecting Retry-After headers; max 3 retries
Fallback chain: Primary model + 2–3 cheaper alternatives
Token counting: Pre-count before every call; catch context-window overflows client-side
Cost tracking: Prometheus metrics per model; monthly budget with kill switch at 90%
Streaming with flush: SSE chunks flushed immediately; X-Accel-Buffering: no on proxies
Tool validation: Validate tool arguments against a schema before execution
Observability: OTel spans capturing model, tokens, duration, cost; structured logs with sampling
Timeout safety: 120s on HTTP client, 30s on tool execution
Prompt templates: Store as versioned templates, not hardcoded strings; snapshot-test golden files

Streaming with Backpressure

The naive SSE handler shown earlier is correct for fast clients. It breaks the moment a client is slow: a mobile network on a train, a tab in a backgrounded browser, a curl reader piping through less. The provider stream arrives at hundreds of tokens per second; the client drains at twenty. With no write deadline, the handler's writes block once the socket's send buffer fills, and a single stuck client can hold the goroutine, its buffers, and the upstream connection open for the full provider timeout. Multiply by a few thousand connections and the pod risks being OOM-killed.

The fix is to honour the slowest party in the chain. Wrap the writer in a context-aware bounded channel, drop the request if the client cannot keep up, and always release the upstream provider connection on disconnect. Three rules, in order: detect the disconnect early, bound the buffer, and abort the upstream read when the client goes away.

type backpressuredWriter struct {
	w        http.ResponseWriter
	flusher  http.Flusher
	deadline time.Duration
}
 
func (bw *backpressuredWriter) writeChunk(ctx context.Context, chunk StreamChunk) error {
	data, err := json.Marshal(chunk)
	if err != nil {
		return fmt.Errorf("marshal chunk: %w", err)
	}
 
	// Per-write deadline detects slow clients without blocking forever.
	// SetWriteDeadline is exposed via the http.ResponseController in Go 1.20+.
	rc := http.NewResponseController(bw.w)
	if err := rc.SetWriteDeadline(time.Now().Add(bw.deadline)); err != nil {
		return fmt.Errorf("set deadline: %w", err)
	}
 
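	if _, err := fmt.Fprintf(bw.w, "data: %s\n\n", data); err != nil {
		return fmt.Errorf("write: %w", err) // client disconnected or stalled
	}
	// Flush through the controller: unlike http.Flusher, it reports the
	// error when the deadline expires or the client has gone away.
	if err := rc.Flush(); err != nil {
		return fmt.Errorf("flush: %w", err)
	}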
 
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
		return nil
	}
}

A 10-second per-chunk deadline catches a wedged TCP socket without making the happy path slower. When the deadline fires, the write fails, the handler unwinds, and the deferred stream.Close() aborts the upstream HTTP request to OpenAI or Anthropic. That releases the file descriptor and the API quota immediately rather than letting them idle until the 120-second client timeout.

The handler that drives this writer needs to run the read and write on separate goroutines so that an upstream pause cannot starve the disconnect detector. Use a bounded channel as the queue between them — when it fills, the producer goroutine blocks, and the provider's TCP receive window naturally throttles. That pushes backpressure all the way to the LLM provider's edge, which is exactly where you want it.

func (h *ChatHandler) StreamWithBackpressure(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	w.Header().Set("X-Accel-Buffering", "no")
 
	ctx, cancel := context.WithCancel(r.Context())
	defer cancel()
 
	stream, err := h.provider.Stream(ctx, parseRequest(r))
	if err != nil {
		http.Error(w, "upstream failed", http.StatusBadGateway)
		return
	}
	defer stream.Close()
 
	bw := &backpressuredWriter{w: w, flusher: flusher, deadline: 10 * time.Second}
	queue := make(chan StreamChunk, 16) // bounded; fills if client is slow
 
	go func() {
		defer close(queue)
		for {
			chunk, err := stream.Next()
			if errors.Is(err, io.EOF) {
				return
			}
			if err != nil {
				h.logger.Warn("upstream read failed", "err", err)
				return
			}
			select {
			case queue <- chunk:
			case <-ctx.Done():
				return // client disconnected — stop reading from provider
			}
		}
	}()
 
	for chunk := range queue {
		if err := bw.writeChunk(ctx, chunk); err != nil {
			h.logger.Info("client disconnected", "err", err)
			cancel() // unblock the producer goroutine
			return
		}
	}
	fmt.Fprintf(w, "data: [DONE]\n\n")
	flusher.Flush()
}

The buffer size of 16 chunks is deliberate. Smaller and a brief client pause stalls the producer; larger and the heap grows with every stuck connection. At sixteen chunks of roughly 200 bytes, a thousand parallel slow clients cost about 3 MB of buffered data — bounded and predictable. Track the queue depth as a Prometheus histogram and alert when the p99 sits at capacity, which means clients are systematically slower than the provider and you should consider an admission gate or a reduced concurrency limit. The same pattern applies to Anthropic's Messages stream and any provider that returns SSE.

Frequently Asked Questions

How do you handle LLM API rate limits in production?

Implement exponential backoff with jitter on 429 responses, use a circuit breaker to fall back to a cheaper model when the primary provider is rate-limited or degraded, and track token usage per request to stay within provider quotas. Pre-count tokens before sending to avoid silent context window truncation.

How do you stream LLM responses without buffering the full completion?

Use the provider's streaming API (SSE) and implement a StreamReader interface that delivers tokens incrementally. Pipe the stream directly through your HTTP response using chunked transfer encoding so neither your server nor the client waits for the full completion to finish.

How do you manage LLM API costs in production?

Track token usage per request with Prometheus metrics, set per-user and per-tenant token budgets, use cheaper models for simple tasks and reserve expensive models for complex ones, and implement a cost circuit breaker that alerts or throttles when spend exceeds thresholds.

Should you use one LLM provider or abstract across multiple?

Abstract behind a common Provider interface from day one. This lets you swap models without changing business logic, implement automatic fallback from an expensive model to a cheaper one during outages, and A/B test providers. The interface overhead is minimal compared to the operational flexibility.

Keep Reading

Building Production RAG Pipelines — Chunking, embeddings, and retrieval strategies for feeding context into the LLM calls this article orchestrates
Spring AI in Production — The Java/Spring equivalent: RAG pipelines, circuit breakers, and observability for LLM-powered backends
The 3 Pillars of Observability — Prometheus metrics, structured logging, and distributed tracing for the LLM observability layer
