The 3 Pillars of Observability: Metrics, Logs, and Traces in Production

BackendBytes Engineering Team

Key Takeaways

  • The classic peak-traffic incident pattern — three observability blind spots converge: error rate counts only HTTP 500s (not application rejections like declined payments), traces sampled success-only, logs sampled 50%
  • Instrument HTTP at handler boundaries (not inside business logic) with router-normalized labels (`/users/{id}` not `/users/123`) — unbounded label cardinality OOMs Prometheus
  • Histogram buckets must include your SLO threshold as an explicit boundary — without `le="0.2"` as a bucket, Prometheus interpolation makes latency imprecise
  • Never sample errors or slow requests — tail-based sampling keeps 100% of errors and requests >1s latency, then samples 1-5% of successful ones at 95% cost reduction
  • Trace ID in every log line enables handoff: metrics alert (error spike) → traces (which service) → logs (specific error) — without correlation, triage is 10x slower

The classic three-pillars-blind production incident looks like this: dashboards stay green during peak traffic (error rate steady, p50 fast, four-nines uptime) while customer support phones light up. Metrics count only HTTP 500s, traces are sampled at 10 percent and exclude errors, and logs are sampled at 50 percent. We have debugged this incident shape on multiple payment platforms: every pillar working as configured, every pillar blind to the same failure.

When All Three Pillars Fail Together

The breakdown below is a composite of several incidents we've debugged on payment platforms, not a single event. In each one the dashboards were green (low single-digit error rate, fast p50, four-nines uptime) while customer support phones were lighting up:

  1. Metrics blind spot: The error rate metric counts only HTTP 500s. Actual failures are HTTP 200s with {"status": "declined"} in the body — application-level failures invisible to infrastructure monitoring.

  2. Traces blind spot: Distributed traces sampled at 10%, and the sample filtered for duration > 5s AND status = success. Payment gateway timeouts eventually return errors, so they never appear in the "slow traces" view. [Beyer et al., 2016]

  3. Logs blind spot: Logs sampled at 50% per endpoint. Half the payment decline logs never make it to disk. [Beyer et al., 2016]

The gateway is rejecting connections under peak load. Three pillars. Three failures. One hour to root cause when the failure modes compound. The rest of this article is about preventing that.
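One way to close the metrics blind spot is to derive the error signal from the response body as well as the status code. The sketch below assumes the payment response carries a top-level "status" field as in the example above; the classifyOutcome helper and its outcome names are hypothetical, not part of any library:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// classifyOutcome maps an HTTP status code plus response body to a small,
// bounded set of outcome values suitable for a metric label.
// Hypothetical helper: the "status" field name follows the article's example.
func classifyOutcome(statusCode int, body []byte) string {
	if statusCode >= 500 {
		return "server_error"
	}
	if statusCode >= 400 {
		return "client_error"
	}
	var payload struct {
		Status string `json:"status"`
	}
	// A 2xx with an unparseable body still counts as success here;
	// only an explicit "declined" status flips the outcome.
	if err := json.Unmarshal(body, &payload); err == nil && payload.Status == "declined" {
		return "declined"
	}
	return "success"
}

func main() {
	fmt.Println(classifyOutcome(200, []byte(`{"status": "declined"}`)))
	fmt.Println(classifyOutcome(200, []byte(`{"status": "approved"}`)))
	fmt.Println(classifyOutcome(502, nil))
}
```

An alert on the combined share of declined and server_error outcomes would move during the incident described above, where a 5xx-only error rate stays flat.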

Quick Take

Observability requires correlating metrics (what broke), traces (where), and logs (why). But tools alone don't prevent incidents: you must instrument at service boundaries, never sample errors, and route from metrics → traces → logs with trace ID correlation. Tail-based sampling[OpenTelemetry Sampling] keeps 100% of error traces and slow requests and, in our experience, cuts trace storage cost by an order of magnitude.

  • Instrument HTTP handlers with Prometheus histograms[Prometheus Best Practices]; label by router pattern, not raw path, to keep cardinality bounded
  • Structure logs as JSON with trace_id propagation; sample INFO at 10-50%, never sample WARN/ERROR
  • Use OpenTelemetry for traces; tail-sample errors and latency > 1s, 1-5% of successes

Triage Flow: Metrics → Traces → Logs

Each pillar answers a different question and costs different amounts. The discipline during an incident is: never start at logs, never alert on traces, never explain "why" with metrics. Route by what you need:

graph LR
    Page[Pager fires] --> M[Metrics<br/>Prometheus, Grafana]
    M -->|What broke?<br/>error rate spike,<br/>latency p99 climb| MetricFound[Identify the SLO<br/>that is breaching]
    MetricFound -->|Need: where<br/>in the request path?| T[Traces<br/>OpenTelemetry, Tempo]
    T -->|Drill into the<br/>slow or failing span| TraceFound[Identify the<br/>service or DB call]
    TraceFound -->|Need: why<br/>exactly?| L[Logs<br/>structured JSON,<br/>trace_id-correlated]
    L -->|Filter by trace_id| Root[Root cause<br/>specific error message]
    Root -.->|Fix or page<br/>the right team| Resolve[Resolve incident]
    style M fill:#dfd
    style T fill:#ffd
    style L fill:#fdd
    style Root fill:#dfd

The diagram is the entire on-call discipline in one picture[OpenTelemetry Sampling]: metrics alert (cheap, always-on), traces locate (sampled but always keep errors), logs explain (most expensive — only correlated by trace_id).

The Quick Start: Which Pillar For What

The table below pairs each pillar with the question it answers and its rough cost. Use them in sequence during incidents:

| Signal | Question | Cost | Use case |
|---|---|---|---|
| Metrics | What broke? | ~$0.10/million points | Alert on error rate, latency spikes. First signal to fire. |
| Traces | Where in the request path? | ~$1-3/million spans (with tail sampling) | Identify which service/database call is slow or failing. |
| Logs | Why exactly? | $0.50-1.50/GB ingested (varies by vendor; sample carefully) | Get the specific error message and context. |

The handoff during the incident: metrics fired the alarm (decline rate up) → traces showed gateway.charge timing out at 29.8s → logs revealed connection refused on one IP.

Metrics: Alert on RED (Rate, Errors, Duration)

Prometheus histograms give you percentile latencies and error rates[Prometheus Best Practices]. Instrument at HTTP handler boundaries, not inside business logic. The critical rule: use router-normalized paths for labels, never raw paths.

package metrics
 
import (
	"net/http"
	"strconv"
	"time"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)
 
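var httpDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "http_request_duration_seconds",
		Help: "HTTP request latency by method, router pattern, and status code.",
		// Include the SLO threshold as an explicit boundary (0.2s assumed here).
		Buckets: []float64{0.005, 0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.5},
	},
	[]string{"method", "endpoint", "status"},
)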
 
// InstrumentedHandler: instrument at the boundary
func InstrumentedHandler(pattern string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rw := &responseWriter{ResponseWriter: w, statusCode: 200}
		next.ServeHTTP(rw, r)
 
		duration := time.Since(start).Seconds()
		status := strconv.Itoa(rw.statusCode)
		// Use router pattern (/users/{id}), NOT raw path (/users/123)
		httpDuration.WithLabelValues(r.Method, pattern, status).Observe(duration)
	})
}
 
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}
 
func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

Key rule: Never use unbounded label values (like raw user IDs or request paths). Each unique combination becomes a separate time series[Prometheus Best Practices]. At 10K requests/second with high cardinality, you OOM Prometheus. Use router patterns instead: /users/{id} not /users/123, /users/456, etc.
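To make the multiplication concrete, the sketch below estimates a worst-case series count for one histogram. The label counts (4 methods, 6 status codes, 7 finite buckets, 20 routes versus 100k user IDs) are assumed numbers for illustration, and seriesCount is a hypothetical helper:

```go
package main

import "fmt"

// seriesCount gives an upper bound on active series for one histogram:
// every label combination produces one series per finite bucket, plus the
// le="+Inf" bucket, _sum, and _count.
func seriesCount(methods, endpoints, statuses, finiteBuckets int) int {
	perCombination := finiteBuckets + 3
	return methods * endpoints * statuses * perCombination
}

func main() {
	// Assumed numbers for illustration only.
	fmt.Println("router patterns (20 routes):", seriesCount(4, 20, 6, 7))
	fmt.Println("raw paths (100k user IDs):  ", seriesCount(4, 100_000, 6, 7))
}
```

Under these assumptions the router-pattern version tops out at 4,800 series, while the raw-path version can reach 24 million from a single metric.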

# p99 latency per endpoint
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (endpoint, le))
# 5xx error ratio, read from the histogram's _count series
sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

Logs: Structured JSON with Trace ID Correlation

Unstructured printf logs are hard to search and nearly impossible to join across services. Structured JSON logs let you query by field and correlate by trace_id. Add trace_id to every log line so you can jump from metrics alert → trace → specific log lines.

package logging
 
import (
	"context"
	"fmt"
	"log/slog"
	"os"
	"go.opentelemetry.io/otel/trace"
)
 
func InitLogger() *slog.Logger {
	return slog.New(slog.NewJSONHandler(os.Stdout, nil))
}
 
// FromContext adds trace_id to logger for correlation
func FromContext(ctx context.Context, base *slog.Logger) *slog.Logger {
	span := trace.SpanFromContext(ctx)
	if !span.SpanContext().IsValid() {
		return base
	}
	return base.With(
		slog.String("trace_id", span.SpanContext().TraceID().String()),
	)
}
 
// Usage from a calling package (shown inline for brevity): every method starts with FromContext
func (s *OrderService) CreateOrder(ctx context.Context, req OrderRequest) (*Order, error) {
	log := logging.FromContext(ctx, s.logger)
 
	log.Info("creating order",
		slog.String("user_id", req.UserID),
		slog.Float64("amount", req.Amount),
	)
 
	order, err := s.repo.Insert(ctx, req)
	if err != nil {
		log.Error("order insertion failed", slog.Any("error", err))
		return nil, fmt.Errorf("insert: %w", err)
	}
	return order, nil
}

Critical sampling rules: Never sample WARN or ERROR logs — you need 100% of failures for root cause investigation. INFO logs can be sampled based on endpoint: [Beyer et al., 2016]

  • ERROR: 100% (always)
  • WARN: 100% (always)
  • INFO (business operations): 50-100%
  • INFO (/health, /metrics endpoints): 1-5%
  • DEBUG: 0% in production [Beyer et al., 2016]
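The rules above can be sketched as a wrapper around any slog.Handler. This is a minimal illustration under stated assumptions: samplingHandler is a hypothetical type (log/slog has no built-in sampler), it applies one INFO rate globally rather than per endpoint, and the random source is injectable so the behavior can be tested deterministically:

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"math/rand"
	"strings"
)

// samplingHandler wraps another slog.Handler. WARN and ERROR always pass,
// INFO passes with probability infoRate, and DEBUG is dropped.
type samplingHandler struct {
	next     slog.Handler
	infoRate float64        // 0.0 to 1.0
	random   func() float64 // rand.Float64 in production, fixed in tests
}

func (h *samplingHandler) Enabled(ctx context.Context, level slog.Level) bool {
	return level >= slog.LevelInfo && h.next.Enabled(ctx, level)
}

func (h *samplingHandler) Handle(ctx context.Context, r slog.Record) error {
	if r.Level < slog.LevelWarn && h.random() >= h.infoRate {
		return nil // sampled out
	}
	return h.next.Handle(ctx, r)
}

func (h *samplingHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return &samplingHandler{next: h.next.WithAttrs(attrs), infoRate: h.infoRate, random: h.random}
}

func (h *samplingHandler) WithGroup(name string) slog.Handler {
	return &samplingHandler{next: h.next.WithGroup(name), infoRate: h.infoRate, random: h.random}
}

// countLines logs n INFO and n ERROR records through a sampler and reports
// how many lines of each level reached the underlying JSON handler.
func countLines(n int, infoRate float64, random func() float64) (info, errs int) {
	var buf bytes.Buffer
	logger := slog.New(&samplingHandler{
		next:     slog.NewJSONHandler(&buf, nil),
		infoRate: infoRate,
		random:   random,
	})
	for i := 0; i < n; i++ {
		logger.Info("order created")
		logger.Error("order insertion failed")
	}
	out := buf.String()
	return strings.Count(out, `"level":"INFO"`), strings.Count(out, `"level":"ERROR"`)
}

func main() {
	info, errs := countLines(1000, 0.5, rand.Float64)
	fmt.Printf("kept %d of 1000 INFO, %d of 1000 ERROR\n", info, errs)
}
```

Per-endpoint rates would need a lookup keyed on a route attribute carried in the record or the context; that part is left out of the sketch.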

Traces: Show Which Service Is Slow

OpenTelemetry creates spans for each service call, letting you see exactly where a request gets slow. Start a span at handler boundaries and auto-propagate trace context to downstream services.

The span tree is the latency budget made visible — one bar per operation, the wide ones tell you where the time went:

gantt
    title Trace timeline — checkout request, p99 = 480ms
    dateFormat X
    axisFormat %L
    section Edge
    HTTP handler (checkout)        :active, 0, 480
    section Service
    AuthZ check                    :crit, 5, 35
    Inventory.Reserve              :40, 80
    Payment.Charge                 :120, 380
    Order.Persist                  :385, 420
    Notification.Enqueue           :425, 470
    section Detail
    Stripe API (in Payment)        :crit, 130, 360
    DB INSERT (in Order.Persist)   :390, 420

The crit-coloured bars are where the time went: Stripe took 230ms inside Payment.Charge; AuthZ surprised the team with a 30ms hit. Without spans, the trace looks like "checkout slow" — with spans, the third-party API jumps out as the obvious p99 driver.

package tracing
 
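import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)
 
// InitTracer wires an OTLP/HTTP exporter into the global tracer provider and
// returns the provider's Shutdown func so callers can flush spans on exit.
func InitTracer(ctx context.Context, svc string) (func(context.Context) error, error) {
	// WithEndpoint uses HTTPS by default; add otlptracehttp.WithInsecure() for a plaintext collector.
	exporter, err := otlptracehttp.New(ctx, otlptracehttp.WithEndpoint("otel-collector:4318"))
	if err != nil {
		return nil, fmt.Errorf("create OTLP exporter: %w", err)
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		// Tag every span with the service name so backends can group by service.
		sdktrace.WithResource(resource.NewSchemaless(attribute.String("service.name", svc))),
	)
	otel.SetTracerProvider(tp)
	// Without a propagator, trace context is not injected into outgoing requests.
	otel.SetTextMapPropagator(propagation.TraceContext{})
	return tp.Shutdown, nil
}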
 
// CreateOrder shows correct span pattern: one span per logical operation
func (s *OrderService) CreateOrder(ctx context.Context, req OrderRequest) (*Order, error) {
	ctx, span := s.tracer.Start(ctx, "CreateOrder",
		trace.WithAttributes(attribute.String("user_id", req.UserID)),
	)
	defer span.End()
 
	// Database call: child span
	order, err := s.repo.Insert(ctx, req)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "insert failed")
		return nil, fmt.Errorf("insert: %w", err)
	}
	return order, nil
}

Trace propagation: Use OpenTelemetry HTTP middleware to auto-inject/extract trace context headers. This connects logs and traces across all services in the request path.
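With the W3C Trace Context propagator, the header carried between services is traceparent. The stdlib-only sketch below shows what gets extracted from it, assuming version 00 of the format; parseTraceparent and traceContext are hypothetical names, and real services should rely on the OpenTelemetry propagator rather than hand-rolled parsing:

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceContext holds the fields of a W3C traceparent header.
type traceContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool
}

// parseTraceparent parses "version-traceid-parentid-flags", for example
// 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01.
func parseTraceparent(header string) (traceContext, error) {
	malformed := errors.New("malformed traceparent header")
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" ||
		len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return traceContext{}, malformed
	}
	if header != strings.ToLower(header) {
		return traceContext{}, malformed // the spec requires lowercase hex
	}
	for _, field := range parts[1:] {
		if _, err := hex.DecodeString(field); err != nil {
			return traceContext{}, malformed
		}
	}
	if parts[1] == strings.Repeat("0", 32) || parts[2] == strings.Repeat("0", 16) {
		return traceContext{}, malformed // all-zero IDs are invalid
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	return traceContext{
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&0x01 == 1,
	}, nil
}

func main() {
	tc, err := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tc.TraceID, tc.SpanID, tc.Sampled, err)
}
```

The TraceID field here is the same value that the logging helper writes as trace_id, which is what makes the log and trace views joinable.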

# Tail-based sampling: keep 100% of errors and slow requests, 1-5% of fast successes
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow_traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 2 }

The OpenTelemetry Collector's tail_sampling processor implements this directly[OpenTelemetry Sampling]. As an order-of-magnitude example: at 10K req/s with full sampling and ~3KB/span, raw trace volume runs into terabytes per day; the policy above typically reduces that to tens of GB/day in our experience while keeping every trace you'd actually investigate.
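The order-of-magnitude claim can be checked with a few lines of arithmetic. The sketch below assumes one 3 KB span per request (a lower bound, since real traces carry several spans), that 1% of traces are errors or slower than 1s, and a 2% baseline for the rest; both helper functions are hypothetical:

```go
package main

import "fmt"

// dailyTraceBytes returns stored trace bytes per day for a given request
// rate, spans per request, average span size, and kept fraction of traces.
func dailyTraceBytes(reqPerSec, spansPerReq, bytesPerSpan, kept float64) float64 {
	const secondsPerDay = 86_400
	return reqPerSec * secondsPerDay * spansPerReq * bytesPerSpan * kept
}

// keptFraction combines the always-kept share (errors and slow traces)
// with the baseline sampling rate applied to everything else.
func keptFraction(alwaysKept, baseline float64) float64 {
	return alwaysKept + (1-alwaysKept)*baseline
}

func main() {
	full := dailyTraceBytes(10_000, 1, 3_000, 1)
	sampled := dailyTraceBytes(10_000, 1, 3_000, keptFraction(0.01, 0.02))
	fmt.Printf("full sampling: %.2f TB/day\n", full/1e12)
	fmt.Printf("tail sampling: %.1f GB/day\n", sampled/1e9)
}
```

Under these assumptions full sampling stores about 2.59 TB/day and the tail-sampled pipeline about 77 GB/day, consistent with the terabytes-to-tens-of-GB range quoted above.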

Production Checklist

When deploying observability for the first time, follow this order:

  • Metrics: Add Prometheus histograms to HTTP handlers (method, endpoint, status labels). Test cardinality: keep total series < 10 million per instance.
  • Logs: Structure logs as JSON with trace_id field. Never sample ERROR/WARN. Sample INFO at 10-50% depending on endpoint volume.
  • Traces: Initialize OpenTelemetry with OTLP exporter. Wrap HTTP handlers with otelhttp middleware for auto-propagation.
  • Sampling: Enable tail-based sampling on the OTel Collector to keep 100% of errors and slow (>1s) requests, 1-5% of fast successes.
  • Alerting: Create 2-3 initial alerts on p99 latency and error rate. Every alert must link to a runbook. Route P1 alerts to PagerDuty, P2 to Slack.
  • Correlation: Verify that error rate spike in metrics links to spans in traces, spans link back to logs via trace_id. [Prometheus Best Practices]

The entire setup takes 2-3 days for a small service. The cost reduction from tail-based sampling shows up immediately.

From metrics to traces to logs in 60 seconds — the unified runbook

The pillars only pay off if your on-call can pivot between them without reading documentation. The four artefacts below — a Grafana panel with exemplars, a Loki LogQL query keyed on trace_id, a Tempo lookup, and a wrapper shell script — are the paste-ready glue that turns three disconnected dashboards into a single triage flow.

The first artefact is a Grafana panel that renders the latency histogram and the error rate side by side, with "exemplar": true set on the histogram query so data points expose the trace IDs of requests that contributed to them (this assumes the service attaches exemplars when it observes the histogram). Click an exemplar dot on the p99 line and Grafana opens the exact trace in Tempo:

{
  "title": "Checkout — p99 latency + error rate (with exemplars)",
  "type": "timeseries",
  "datasource": "Prometheus",
  "targets": [
    {
      "refId": "A",
      "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{endpoint=\"/checkout\"}[5m])))",
      "legendFormat": "p99 latency",
      "exemplar": true
    },
    {
      "refId": "B",
      "expr": "sum(rate(http_request_duration_seconds_count{endpoint=\"/checkout\",status=~\"5..\"}[5m])) / sum(rate(http_request_duration_seconds_count{endpoint=\"/checkout\"}[5m]))",
      "legendFormat": "error rate"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "custom": { "drawStyle": "line", "lineWidth": 2 },
      "links": [
        {
          "title": "Open trace in Tempo",
          "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"${__value.raw}\"}]}"
        }
      ]
    }
  },
  "options": { "tooltip": { "mode": "multi" } }
}

Once an exemplar gives you a trace_id, the next pivot is logs. This LogQL query pulls every log line matching the trace_id and prefixes each line with its service, so the on-call sees in one view which service emitted which error message:

{service=~"checkout|payment|inventory"}
  | json
  | trace_id = "4a7c9e2f1d8b3a6e5f0c1b2d3e4f5a6b"
  | line_format "{{.timestamp}} [{{.service}}] {{.level}} {{.msg}} ({{.error}})"

Going the other direction — logs telling you "something is wrong" before metrics flag it — Tempo accepts the same trace_id and returns the full distributed span tree so you can see exactly which downstream call exploded:

{ trace:id = "4a7c9e2f1d8b3a6e5f0c1b2d3e4f5a6b" }
  | select(name, duration, status, resource.service.name)
  | by(resource.service.name)

Wrap all three pivots in a single shell script and your on-call's first reaction to a page is to paste the trace_id from the alert payload and have the answer in under a minute:

#!/usr/bin/env bash
# runbook.sh — metrics alert → trace → logs in under 60 seconds
# Usage: ./runbook.sh <trace_id>
set -euo pipefail
 
TRACE_ID="${1:?trace_id required (copy from PagerDuty exemplar link)}"
TEMPO_URL="${TEMPO_URL:-http://tempo:3200}"
LOKI_URL="${LOKI_URL:-http://loki:3100}"
WINDOW="${WINDOW:-15m}"
 
echo "==> 1. Span tree (which service is slow / failing?)"
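# OTLP JSON encodes nanosecond timestamps as strings, so convert before subtracting.
curl -sf "${TEMPO_URL}/api/traces/${TRACE_ID}" \
  | jq -r '.batches[].scopeSpans[].spans[]
           | "\(.name)\t\(.status.code // "OK")\t\(((.endTimeUnixNano | tonumber) - (.startTimeUnixNano | tonumber)) / 1e6 | floor)ms"' \
  | column -t -s $'\t'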
 
echo
echo "==> 2. Correlated logs (why exactly?)"
QUERY="{service=~\".+\"} | json | trace_id = \"${TRACE_ID}\""
# Each entry in .values is a [timestamp, line] pair; only the line is JSON.
curl -sf -G "${LOKI_URL}/loki/api/v1/query_range" \
  --data-urlencode "query=${QUERY}" \
  --data-urlencode "since=${WINDOW}" \
  | jq -r '.data.result[].values[][1] | (fromjson? // {msg: .})
           | "\(.timestamp // "?") [\(.service // "?")] \(.level // "?") \(.msg // "?")"'
 
echo
echo "==> 3. Next steps: page service owner from span tree, paste trace link in incident channel"

If you cannot answer "what failed, where, and why" from a single trace_id in under a minute, your three pillars are not actually correlated — they are three separate dashboards with the same login.


Cardinality control: the silent killer of Prometheus reliability

The most common Prometheus outage we have debugged is not query load; it is unbounded label cardinality. A single rogue label with one value per request multiplies the active series count linearly with traffic, blows past the head chunk allocation, triggers the OOM killer mid-scrape, and leaves a fifteen-minute gap in every dashboard that mattered. The defence is cheap and mechanical: normalise high-cardinality labels at the instrumentation boundary, and reject the rest at the registry. Three patterns belong in every Go service that exposes Prometheus metrics:

  1. Route templates, not raw paths: /users/{id} rather than /users/123.

  2. Status classes, not raw codes, when the granularity is not actionable: 2xx for the SLO numerator, raw codes only on the error counter.

  3. Hash-bucketing for any label whose source you do not control, such as a user-supplied tenant header during a noisy onboarding storm.

The wrapper below collapses each of those rules into a single instrumentation helper. The router template and status code come from the framework; the tenant label is hashed to a fixed sixteen-bucket space, which keeps the active series count bounded even if a customer accidentally fans out one tenant ID per request. Because Observe is the only entry point and WithLabelValues panics when the number of label values does not match the declared labels, a refactor that adds a fifth label without updating the call fails the unit test rather than the prod cluster:

package metrics
 
import (
	"hash/fnv"
	"strconv"
 
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)
 
// tenantBuckets bounds the cardinality of any caller-supplied tenant label.
// 16 buckets is plenty for hot-spot detection and cannot OOM the TSDB.
const tenantBuckets = 16
 
func tenantBucket(tenant string) string {
	if tenant == "" {
		return "unknown"
	}
	h := fnv.New32a()
	_, _ = h.Write([]byte(tenant))
	return "b" + strconv.FormatUint(uint64(h.Sum32()%tenantBuckets), 10)
}
 
// statusClass collapses 2xx/3xx/4xx/5xx for SLO numerators.
// Keep the raw status code only on the error counter where granularity matters.
func statusClass(code int) string {
	switch {
	case code >= 500:
		return "5xx"
	case code >= 400:
		return "4xx"
	case code >= 300:
		return "3xx"
	default:
		return "2xx"
	}
}
 
var requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "HTTP request latency by route template, method, status class, tenant bucket.",
	Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1, 2, 5},
}, []string{"route", "method", "status_class", "tenant_bucket"})
 
// Observe is the only public entry point. Callers cannot bypass normalisation.
func Observe(routeTemplate, method string, statusCode int, tenant string, durationSec float64) {
	requestDuration.WithLabelValues(
		routeTemplate,
		method,
		statusClass(statusCode),
		tenantBucket(tenant),
	).Observe(durationSec)
}

A unit test that asserts the series count stays below a threshold after a synthetic traffic burst (for example with testutil.CollectAndCount(requestDuration) from client_golang) catches cardinality regressions before they ship. Pair the helper with an alerting rule that fires when prometheus_tsdb_head_series grows faster than the rolling baseline; by the time the dashboard turns red, the OOM is already in flight.
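The bound such a test protects can also be checked without a registry. The stdlib-only sketch below replays a synthetic burst of unique tenant IDs through tenantBucket (copied from the helper above) and counts the distinct label values; distinctTenantLabels and the 100k burst size are illustrative assumptions:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
)

const tenantBuckets = 16

// tenantBucket is copied from the instrumentation helper above.
func tenantBucket(tenant string) string {
	if tenant == "" {
		return "unknown"
	}
	h := fnv.New32a()
	_, _ = h.Write([]byte(tenant))
	return "b" + strconv.FormatUint(uint64(h.Sum32()%tenantBuckets), 10)
}

// distinctTenantLabels simulates a burst of n unique tenant IDs and returns
// how many distinct tenant_bucket label values the burst produces.
func distinctTenantLabels(n int) int {
	seen := make(map[string]struct{})
	for i := 0; i < n; i++ {
		seen[tenantBucket("tenant-"+strconv.Itoa(i))] = struct{}{}
	}
	return len(seen)
}

func main() {
	got := distinctTenantLabels(100_000)
	fmt.Println("distinct tenant_bucket values:", got)
	if got > tenantBuckets {
		panic("tenant label cardinality regression")
	}
}
```

However many tenants appear, the label contributes at most sixteen hashed values plus "unknown" to the series count.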

Tail-based sampling pipelines that survive a real outage

Head-based sampling — the SDK flips a coin per trace and either keeps everything or drops everything — discards exactly the traces you need during an incident. Tail sampling moves the decision to the OpenTelemetry Collector, which buffers spans for a short window (ten to thirty seconds), assembles each trace, and applies policy after it has seen the whole tree. The two non-negotiable policies are: keep every trace that contains an error span, and keep every trace whose root duration exceeds the SLO threshold. Everything else gets a low base rate, typically one to five percent, which still gives statistically useful baselines without paying for the long tail of healthy traffic.

The Collector configuration below pairs tail_sampling with a groupbytrace processor so spans from a single trace assemble before the policy fires. It also routes sampled traces through batch for backend efficiency and emits the sampling decision as a span attribute, which makes it trivial to debug "why did this trace get dropped" weeks later:

receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
 
processors:
  groupbytrace:
    wait_duration: 15s
    num_traces: 200000
 
  tail_sampling:
    decision_wait: 20s
    num_traces: 200000
    expected_new_traces_per_sec: 5000
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow-roots
        type: latency
        latency: { threshold_ms: 1000 }
      - name: keep-priority-tenants
        type: string_attribute
        string_attribute:
          key: tenant.tier
          values: [enterprise, on-call-customer]
      - name: probabilistic-baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 2 }
 
  batch:
    send_batch_size: 8192
    timeout: 5s
 
exporters:
  otlphttp/tempo:
    endpoint: https://tempo.internal:4318
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [groupbytrace, tail_sampling, batch]
      exporters: [otlphttp/tempo]

Run two Collector replicas behind a load balancer that hashes on trace_id so every span from a given trace lands on the same instance. Without that, the assembly window misses cross-replica spans and you get phantom partial traces. Pin decision_wait longer than your slowest p99 root span; if the SLO threshold is one second, twenty seconds of buffering catches the entire long tail without backing memory pressure into the receiver. Finally, plot otelcol_processor_tail_sampling_sampling_trace_dropped_too_early on a Grafana panel. When that counter starts growing, your buffer is undersized and you are losing the exact slow traces tail sampling was supposed to preserve.
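The routing requirement is simple to state in code: the replica must be a pure function of the trace ID. The sketch below uses plain modulo hashing for illustration; replicaFor and the replica names are hypothetical, and in practice this job is usually delegated to a trace-aware load-balancing layer rather than written by hand:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// replicaFor picks a Collector replica from the trace ID alone, so every
// span of a trace is sent to the same instance regardless of which service
// emitted it.
func replicaFor(traceID string, replicas []string) string {
	h := fnv.New32a()
	_, _ = h.Write([]byte(traceID))
	return replicas[h.Sum32()%uint32(len(replicas))]
}

func main() {
	replicas := []string{"collector-0", "collector-1"} // hypothetical names
	traceID := "4a7c9e2f1d8b3a6e5f0c1b2d3e4f5a6b"

	// Spans from different services carry the same trace ID, so they agree.
	fmt.Println("checkout span ->", replicaFor(traceID, replicas))
	fmt.Println("payment span  ->", replicaFor(traceID, replicas))
}
```

Modulo hashing reshuffles most traces when the replica count changes, so a consistent-hashing scheme is the safer choice if replicas scale up and down.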


Frequently Asked Questions

What is the difference between monitoring and observability?

Monitoring uses predefined checks to alert when something breaks. Observability lets you investigate why by correlating metrics, logs, and traces to understand failures you didn't anticipate.

How do metrics, logs, and traces work together during an incident?

Metrics alert you (error rate spike). Traces show where the failure occurs (which service). Logs give you the specific error details. This handoff enables fast root cause analysis.

What is OpenTelemetry and why use it?

OpenTelemetry is a vendor-neutral standard for collecting metrics, logs, and traces. One SDK works with any observability backend that accepts its data (Datadog, Grafana, Jaeger), avoiding vendor lock-in.

How should you sample traces in production?

Use tail-based sampling instead of fixed-rate head sampling. Tail-based keeps 100% of errors and slow requests, then samples 1-5% of successful ones — you get all the traces worth investigating at lower cost. [Beyer et al., 2016]
