SRE Guide to SLOs, SLIs, and Error Budgets: A Production Playbook
Key Takeaways
- Four nines (99.99%) leaves about 4.3 minutes of error budget per month, so a single 4-minute outage consumes nearly all of it. Most services should start at 99.5-99.9%, chosen from business impact rather than current performance
- Error budgets turn "we should focus on reliability" from an engineering opinion into an organizational policy: when the budget hits zero, deployments stop automatically, and teams routinely save budget ahead of peak seasons
- Multi-window burn-rate alerting (1h + 6h windows) catches real incidents while ignoring spikes. A 14.4× burn rate exhausts a month's budget in 50 hours and pages immediately; single-threshold alerts fire too often or too late
- Your histogram buckets must include your SLO threshold as an explicit boundary. Without `le="0.2"` as a bucket, Prometheus has to interpolate, which makes the latency SLI imprecise
- SLI definition matters more than SLO value: measure what users experience (requests succeeded, under latency threshold, data correctness), not infrastructure (CPU, memory); application-level failures like `{status: declined}` are invisible to HTTP metrics
The pattern error budgets enable — illustrated with rounded numbers from the kind of incident postmortems you'll read on engineering blogs:
- A team's quarterly error budget is mostly gone by November after a string of small incidents. None of the incidents triggered a P0 alert; CPU and latency dashboards stayed green. But the SLO burn rate doesn't lie — keep the same pace through December and the budget is exhausted before holiday traffic peaks.
- Policy triggers automatically: feature freeze, reliability squad, work resumes when the budget recovers. Features that get delayed ship in January. Incident frequency drops the following quarter.
The real value: error budgets convert "we should focus on reliability" from an engineering opinion into an organizational fact. When the budget is gone, the data makes the decision — not the loudest VP. This is the framework Google's SRE practice formalised[Beyer et al., 2016].
Error budgets = the inverse of your SLO. A 99.9% SLO gives you 43 minutes of monthly downtime to spend on incidents or risky deploys[Beyer et al., 2016]. Multi-window burn-rate alerts (combining 1h + 6h windows) catch real incidents while ignoring spikes. When budget hits zero, deployment policy stops feature work.
- SLI = what users experience (requests succeeded, under 200ms latency, data correct)
- SLO = internal target (99.9% of requests succeed over 30 days)[Beyer et al., 2016]
- Burn rate = how fast you're consuming budget; 14.4× = exhausts entire month in 50 hours
The quick start: SLIs, SLOs, and error budgets
The relationship between SLI, SLO, and error budget in one picture — measure → target → spend:
graph LR
Users[Users send requests] --> Service[Service handles them]
Service -->|measure user experience| SLI[SLI<br/>Service Level Indicator<br/>e.g. p99 latency, success rate]
SLI -->|compare to target| SLO[SLO<br/>Service Level Objective<br/>e.g. 99.9% under 200 ms]
SLO -->|inverse| Budget[Error Budget<br/>0.1% = 43 min/month<br/>downtime allowance]
Budget -->|spent on| Risk[Risky deploys<br/>infra changes<br/>incidents]
Budget -.->|burn rate alert<br/>14.4x = page immediately| Page[On-call page]
Risk -->|drains| Budget
Page --> Triage[Triage + halt<br/>risky deploys]
style SLI fill:#dfd
style SLO fill:#ffd
style Budget fill:#ffd
style Page fill:#fdd
The SLI/SLO/Error Budget terminology in this section follows Google's SRE Book[Beyer et al., 2016]. The minute-level downtime conversions come from the standard "nines" math (e.g. 99.9% over 30 days = 43.2 minutes of allowed downtime).
| Concept [Beyer et al., 2016] | What it measures | Example |
|---|---|---|
| SLI (Service Level Indicator) | Quantitative metric users experience — success rate, latency, quality. Not CPU, not memory. | "94% of requests completed under 200ms this hour" |
| SLO (Service Level Objective) | Internal reliability target over a time window (usually 30 days). | "95% of requests must complete under 200ms" |
| Error Budget | The failure allowance: the inverse of your SLO. A 99.9% SLO = 0.1% error budget = 43 minutes of downtime per month. | "Our 99.9% SLO gives us 43 min/month to spend on incidents or risky deploys." |
| Burn Rate | How fast you're consuming your error budget relative to sustainable speed (1.0× = on-track for month-end). | "A 14.4× burn rate exhausts a 30-day budget in ~50 hours. Page immediately." |
SLO targets depend on service criticality. Tier examples we've seen in production: checkout and payments at 99.95% (21.6 min/month allowed downtime), feed and search at 99.9% (43 min/month), recommendations and analytics at 99.5% (3.6 hours/month). The rule: an SLO that merely mirrors current performance is meaningless, because it says nothing about what users need. Pick the target from business impact, and tighten it only after a quarter of good performance data[Beyer et al., 2016].
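The downtime figures quoted for each tier come from one multiplication. A minimal sketch, assuming a 30-day window and treating the budget as minutes of complete outage; `AllowedDowntimeMinutes` is an illustrative name, not part of any library:

```go
package main

import "fmt"

// AllowedDowntimeMinutes converts an SLO target (e.g. 0.999) into the minutes
// of full outage the error budget allows over a window of windowDays days.
func AllowedDowntimeMinutes(slo, windowDays float64) float64 {
	return (1 - slo) * windowDays * 24 * 60
}

func main() {
	// The tiers discussed above, plus four nines for comparison.
	for _, slo := range []float64{0.995, 0.999, 0.9995, 0.9999} {
		fmt.Printf("%.2f%% SLO -> %.1f min per 30 days\n",
			slo*100, AllowedDowntimeMinutes(slo, 30))
	}
}
```

The same function works for a quarterly window by passing 90 days instead of 30.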
Step 1: Instrument your service with SLI metrics
Wrap your HTTP router with a middleware that captures what users experience: success (2xx/3xx), latency, and errors (5xx). Record these as Prometheus counters and histograms[Prometheus Best Practices].
package sli
import (
"net/http"
"strconv"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
requestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total HTTP requests by status class",
},
[]string{"service", "method", "status_class"},
)
requestDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request latency in seconds",
// Critical: include your SLO threshold (e.g., 0.2s) as an explicit bucket.
// histogram_quantile() can only interpolate within bucket boundaries.
Buckets: []float64{0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.5, 5.0},
},
[]string{"service", "method"},
)
)
// Middleware for HTTP router
func SLIMiddleware(service string, next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
next.ServeHTTP(rec, r)
duration := time.Since(start).Seconds()
statusClass := strconv.Itoa(rec.status/100) + "xx"
requestsTotal.WithLabelValues(service, r.Method, statusClass).Inc()
requestDuration.WithLabelValues(service, r.Method).Observe(duration)
})
}
type statusRecorder struct {
http.ResponseWriter
status int
}
func (r *statusRecorder) WriteHeader(code int) {
r.status = code
r.ResponseWriter.WriteHeader(code)
}
Then compute your SLIs directly from these metrics in PromQL:
# Availability: % of requests that succeeded (2xx/3xx)
sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[30d]))
/ sum(rate(http_requests_total{service="checkout"}[30d]))
# Latency: % of requests under 200ms threshold
sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.2"}[30d]))
/ sum(rate(http_request_duration_seconds_count{service="checkout"}[30d]))
Key rule: Your histogram buckets must include your SLO threshold as an explicit boundary. Without `le="0.2"` as a bucket, Prometheus interpolation makes latency measurement imprecise. See The 3 Pillars of Observability for the full metrics + logging + traces stack.
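To see why the explicit boundary matters, here is a small sketch of reading a latency SLI from cumulative bucket counts. It is stdlib-only and does not use the Prometheus client; the `Bucket` type and `LatencySLI` function are hypothetical names for illustration. When the threshold is not a bucket boundary, the function refuses to answer rather than guess:

```go
package main

import "fmt"

// Bucket mirrors one cumulative histogram bucket:
// Count is the number of observations with latency <= UpperBound seconds.
type Bucket struct {
	UpperBound float64
	Count      uint64
}

// LatencySLI returns the fraction of requests at or under threshold.
// ok is false when threshold is not an explicit bucket boundary (or there
// is no traffic), because the exact fraction cannot be read from the data.
func LatencySLI(buckets []Bucket, total uint64, threshold float64) (sli float64, ok bool) {
	if total == 0 {
		return 0, false
	}
	for _, b := range buckets {
		if b.UpperBound == threshold {
			return float64(b.Count) / float64(total), true
		}
	}
	return 0, false
}

func main() {
	buckets := []Bucket{{0.1, 800}, {0.2, 950}, {0.5, 990}}

	sli, ok := LatencySLI(buckets, 1000, 0.2)
	fmt.Printf("threshold 0.2s: sli=%.3f exact=%v\n", sli, ok)

	// 0.3s falls between the 0.2 and 0.5 buckets, so only an estimate is possible.
	_, ok = LatencySLI(buckets, 1000, 0.3)
	fmt.Printf("threshold 0.3s: exact=%v\n", ok)
}
```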
Step 2: Define SLO targets and error budget policy
Write down your SLO and the actions that follow when budget is spent. Without a written policy, reliability discussions default to hierarchy and shouting.
A 99.95% SLO (checkout, payments) gives you 21.6 minutes of monthly error budget. A 99.9% SLO (feed, search) gives you 43 minutes, and a 99.5% SLO (recommendations, analytics) gives you 3.6 hours. Choose based on user impact, not on what your current system achieves[Beyer et al., 2016].
The SLO Document — one page, signed off by engineering and product leadership:
### Checkout Service SLO — Q1 2026
### SLO Targets (30-day rolling window)
- Availability: 99.95% → 21.6 min/month error budget
- Latency: 95% of requests < 200ms → 5% slow budget
- Quality: 99.9% error-free responses
### Error Budget Policy
- > 75% remaining: Normal deployment velocity
- 50–75%: Cautious deploys; staging validation required
- 25–50%: Reliability focus; defer risky changes
- < 25%: Feature freeze; reliability work only
- 0%: Emergency fixes only; VP Engineering notified
### Review Cadence
Monthly SLO review. Quarterly target adjustment.
Store this in your wiki or runbook. The policy removes politics: when a product manager asks "why can't we ship feature X?", the answer is "our error budget is at 18%, and our own policy says no non-critical deploys below 25%." Agreed in advance, enforced by data, not opinions.
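The policy tiers are simple enough to encode, which is what lets a pipeline or chat bot answer the "can we ship?" question without a meeting. A minimal sketch, assuming the thresholds from the example SLO document; `PolicyFor` and the tier strings are illustrative, and the value 50 (which the written ranges list twice) is assigned to the more permissive tier here:

```go
package main

import "fmt"

// PolicyFor maps remaining error budget (in percent) to a tier of the
// example policy. Range edges follow the written ranges: exactly 75 is
// "cautious deploys", exactly 25 is "reliability focus".
func PolicyFor(remainingPct float64) string {
	switch {
	case remainingPct <= 0:
		return "emergency fixes only"
	case remainingPct < 25:
		return "feature freeze"
	case remainingPct < 50:
		return "reliability focus"
	case remainingPct <= 75:
		return "cautious deploys"
	default:
		return "normal velocity"
	}
}

func main() {
	for _, pct := range []float64{90, 60, 30, 18, 0} {
		fmt.Printf("%3.0f%% remaining -> %s\n", pct, PolicyFor(pct))
	}
}
```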
Step 3: Multi-window burn-rate alerting
A single-threshold alert ("alert when error rate > 1% for 5 min") fires too often or too late. A 5-minute spike from a transient failure looks identical to the start of a real incident[Beyer et al., 2016].
Burn rate = (current error rate) / (tolerable error rate). For a 99.9% SLO, tolerable error rate = 0.1%. If you're actually erroring at 1.4%, burn rate = 1.4 / 0.1 = 14×. The Google SRE Workbook formalises 14.4× as the "1-hour window" alert threshold: sustained, it exhausts a full 30-day budget in ~50 hours, so page immediately[Beyer et al., 2016]. A 6× burn rate (monthly budget exhausted in 5 days) warrants investigation but not paging at 2am.
Multi-window alerting requires both a short window (to detect incidents quickly) and a long window (to confirm they're sustained, not spikes). A 5-minute error spike lights up the 1-hour window but barely registers in a 6-hour window — so you don't alert.
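The burn-rate arithmetic can be checked in a few lines. This is a sketch under the assumption of a 30-day window and a constant error rate; the function names are illustrative. It also shows where 14.4× comes from: burning at that rate for one hour spends 2% of the monthly budget.

```go
package main

import "fmt"

// BurnRate is the observed error rate divided by the rate the SLO tolerates.
func BurnRate(errorRate, slo float64) float64 {
	return errorRate / (1 - slo)
}

// HoursToExhaustion is how long a full budget lasts at a constant burn rate.
func HoursToExhaustion(burnRate, windowDays float64) float64 {
	return windowDays * 24 / burnRate
}

// BudgetConsumed is the fraction of the whole budget spent by burning at
// burnRate for the given number of hours.
func BudgetConsumed(burnRate, hours, windowDays float64) float64 {
	return burnRate * hours / (windowDays * 24)
}

func main() {
	fmt.Printf("1.4%% errors on a 99.9%% SLO: %.1fx burn\n", BurnRate(0.014, 0.999))
	fmt.Printf("14.4x lasts %.0f hours\n", HoursToExhaustion(14.4, 30))
	fmt.Printf("6x lasts %.0f hours\n", HoursToExhaustion(6, 30))
	fmt.Printf("14.4x for 1h spends %.1f%% of the budget\n", BudgetConsumed(14.4, 1, 30)*100)
}
```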
graph TD
ErrRate["Current error rate"] --> Burn["burn rate<br/>= error rate / SLO threshold"]
Burn --> Short{"Short window<br/>(1h) burn ≥ 14.4×?"}
Burn --> Long{"Long window<br/>(6h) burn ≥ 14.4×?"}
Short -->|yes| AndGate{"AND"}
Long -->|yes| AndGate
AndGate -->|both| Page["PAGE<br/>(budget dies in ~50h)"]
Short -->|yes, long=no| Ignore["suppress<br/>(transient spike)"]
Long -->|yes, short=no| Ignore
The AND gate is what turns raw burn-rate into actionable paging. A single-window check fires on every 5-minute blip; a two-window check confirms the incident is still hot after enough signal has accumulated to separate noise from a real outage.
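A toy model makes the AND gate concrete. The sketch below assumes uniform traffic and a 99.9% SLO, and models a window as some minutes of elevated errors followed by clean traffic; `windowBurn` and `ShouldPage` are hypothetical helpers, not Prometheus APIs. A 5-minute spike at 50% errors trips the 1-hour window but not the 6-hour one, while a steady 2% error rate trips both:

```go
package main

import "fmt"

// windowBurn returns the burn rate seen by a lookback window that contains
// badMinutes of traffic failing at errorRate and is otherwise error-free.
// Assumes uniform traffic volume, which real services rarely have.
func windowBurn(badMinutes, errorRate, windowMinutes, slo float64) float64 {
	if badMinutes > windowMinutes {
		badMinutes = windowMinutes
	}
	ratio := errorRate * badMinutes / windowMinutes
	return ratio / (1 - slo)
}

// ShouldPage is the AND gate: both windows must exceed the threshold.
func ShouldPage(shortBurn, longBurn, threshold float64) bool {
	return shortBurn > threshold && longBurn > threshold
}

func main() {
	const slo, threshold = 0.999, 14.4

	// Transient: 5 minutes at 50% errors, then recovery.
	s, l := windowBurn(5, 0.5, 60, slo), windowBurn(5, 0.5, 360, slo)
	fmt.Printf("spike:     1h=%.1fx 6h=%.1fx page=%v\n", s, l, ShouldPage(s, l, threshold))

	// Sustained: 6 hours at a steady 2% error rate.
	s, l = windowBurn(360, 0.02, 60, slo), windowBurn(360, 0.02, 360, slo)
	fmt.Printf("sustained: 1h=%.1fx 6h=%.1fx page=%v\n", s, l, ShouldPage(s, l, threshold))
}
```

Note that a tighter SLO shrinks the allowance, so the same spike produces a proportionally higher burn rate in both windows.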
groups:
- name: error_budget_burn
rules:
# CRITICAL: 14.4× burn sustained for 1h + 6h
- alert: ErrorBudgetBurnCritical
expr: |
(1 - sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[1h]))
/ sum(rate(http_requests_total{service="checkout"}[1h])))
/ (1 - 0.9995) > 14.4
AND
(1 - sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[6h]))
/ sum(rate(http_requests_total{service="checkout"}[6h])))
/ (1 - 0.9995) > 14.4
for: 2m
labels:
severity: critical
annotations:
summary: "Error budget burning at 14.4× sustainable rate (50 hr to exhaustion)"
# HIGH: 6× burn sustained for 6h + 1d
- alert: ErrorBudgetBurnHigh
expr: |
(1 - sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[6h]))
/ sum(rate(http_requests_total{service="checkout"}[6h])))
/ (1 - 0.9995) > 6
AND
(1 - sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[1d]))
/ sum(rate(http_requests_total{service="checkout"}[1d])))
/ (1 - 0.9995) > 6
for: 15m
labels:
severity: high
annotations:
summary: "Error budget burning at 6× rate (~5 days to exhaustion)"
Add these queries to your Grafana dashboard for real-time visibility:
# Error budget remaining (0 to 1)
1 - ((1 - sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[30d]))
/ sum(rate(http_requests_total{service="checkout"}[30d])))
/ (1 - 0.9995))
# Hours until a full budget is exhausted at the current 6h burn rate
# clamp_min() avoids division by zero; max() is an aggregation operator and does not take a scalar floor
(1 - 0.9995) * 30 * 24
/ clamp_min((1 - sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[6h]))
/ sum(rate(http_requests_total{service="checkout"}[6h]))), 0.000001)
Step 4: Error budgets in practice
In our experience, when a team's error budget hits a critical threshold — say, 22% remaining with high burn rate — the policy triggers automatically: feature freeze until budget recovers. The features ship next quarter. Incident frequency drops because the team focused on reliability instead of shipping under pressure.
Without error budgets, that conversation is a negotiation between engineers (who feel the pain) and product managers (who feel shipping pressure). With error budgets, the policy makes the decision weeks before the pressure arrives. Build the framework before you need it.
Common error budget policies in practice:
- Quarterly budget reviews: Services consistently consuming <30% of budget get tighter SLOs in the next quarter (cost-benefit trade-off).
- Deployment gating: Pipeline checks error budget before releasing. If budget < 25%, deployment is blocked with a "wait for recovery or file exception" message[Beyer et al., 2016]. (Google's SRE Workbook documents this pattern.)
- Post-mortem budget tracking: Incident post-mortems include a "budget impact" section. A 2% budget burn from a partial outage is triaged differently than a 15% burn from a complete outage of the same duration.
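The "budget impact" figure in a post-mortem is a short calculation. A sketch, assuming uniform traffic, a 30-day window, and a 99.9% SLO; `IncidentBudgetImpact` is an illustrative name. It shows how two incidents of the same duration can land at 15% and 2% depending on the share of requests that failed:

```go
package main

import "fmt"

// IncidentBudgetImpact returns the fraction of the window's error budget
// consumed by an incident lasting the given minutes, during which
// errorFraction of requests failed (1.0 = complete outage).
func IncidentBudgetImpact(minutes, errorFraction, slo, windowDays float64) float64 {
	budgetMinutes := (1 - slo) * windowDays * 24 * 60
	return minutes * errorFraction / budgetMinutes
}

func main() {
	const slo, windowDays = 0.999, 30.0
	const duration = 6.48 // minutes, same for both incidents

	full := IncidentBudgetImpact(duration, 1.0, slo, windowDays)
	partial := IncidentBudgetImpact(duration, 2.0/15, slo, windowDays)

	fmt.Printf("complete outage: %.0f%% of budget\n", full*100)
	fmt.Printf("partial outage:  %.0f%% of budget\n", partial*100)
}
```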
Production checklist
- Instrumentation: Wrap HTTP router with `SLIMiddleware`. Verify `http_requests_total` and `http_request_duration_seconds` appear in Prometheus with correct labels.
- SLO definition: Write SLO document with error budget policy. Get sign-off from engineering lead and product manager.
- Histogram buckets: Include your latency SLO threshold (e.g., 0.2s) as an explicit bucket boundary. Without it, PromQL measurement is imprecise.
- Alerting rules: Deploy multi-window burn-rate alerts (critical: 14.4× for 1h+6h; high: 6× for 6h+1d). Test PagerDuty integration.
- Dashboards: Add "error budget remaining" and "hours to exhaustion" queries to main SRE dashboard.
- Monitoring: Track actual vs. predicted budget consumption monthly. Adjust SLO targets quarterly based on performance data.
Multi-window burn-rate alerts you can paste directly
The two-tier policy from Google's SRE workbook — page on fast burn (14.4× depleting the monthly budget in ~2 days unchecked), ticket on slow burn (6× depleting in ~5 days). The 1h+6h pair catches localised incidents; the 6h+1d pair catches sustained drift the short-window check misses:
# alerts/slo.yml: 99.9% availability SLO over a 30-day window
# Label names match the SLIMiddleware metrics from Step 1 (service, status_class).
groups:
  - name: slo.checkout.availability
    interval: 30s
    rules:
      - record: slo:sli_error:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{service="checkout",status_class="5xx"}[1h]))
          / sum(rate(http_requests_total{service="checkout"}[1h]))
      - record: slo:sli_error:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{service="checkout",status_class="5xx"}[6h]))
          / sum(rate(http_requests_total{service="checkout"}[6h]))
      - record: slo:sli_error:ratio_rate1d
        expr: |
          sum(rate(http_requests_total{service="checkout",status_class="5xx"}[1d]))
          / sum(rate(http_requests_total{service="checkout"}[1d]))
      - alert: ErrorBudgetBurnFast
        expr: |
          slo:sli_error:ratio_rate1h > (14.4 * 0.001)
          and
          slo:sli_error:ratio_rate6h > (14.4 * 0.001)
        for: 2m
        labels: { severity: critical, team: checkout }
        annotations:
          summary: "Checkout burning 14.4× SLO budget, page now"
          description: "At the current 1h burn rate the 30-day budget exhausts in about 50 hours. Investigate before users notice."
      - alert: ErrorBudgetBurnSlow
        expr: |
          slo:sli_error:ratio_rate6h > (6 * 0.001)
          and
          slo:sli_error:ratio_rate1d > (6 * 0.001)
        for: 15m
        labels: { severity: warning, team: checkout }
        annotations:
          summary: "Checkout burning 6× SLO budget, open a ticket"
          description: "Sustained low-grade error rate. Will exhaust budget in ~5 days if uncorrected."
The query that powers your "budget remaining" dashboard tile emits the fraction of the 30-day budget still available (multiply by 100 for a percentage), so the on-call sees a figure such as 73% remaining instead of guessing from raw error rates:
# slo_error_budget_remaining_ratio: paste as a recording rule for caching
1 - (
  sum(increase(http_requests_total{service="checkout",status_class="5xx"}[30d]))
  /
  (
    sum(increase(http_requests_total{service="checkout"}[30d]))
    * 0.001 # the 0.1% allowed-error fraction = (1 - 0.999)
  )
)
The 0.001 is the only knob: change it to match your SLO target (0.0001 for 99.99%, 0.005 for 99.5%). Everything else stays[Beyer et al., 2016].
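The same ratio is easy to sanity-check offline, for example in a unit test for a dashboard exporter. A minimal sketch with made-up request counts; `BudgetRemaining` is an illustrative name that mirrors the query's arithmetic, and a negative result means the budget is overspent:

```go
package main

import "fmt"

// BudgetRemaining mirrors the query: 1 - failed / (total * allowed fraction).
// It returns 1 when there is no traffic and goes negative once the budget
// is overspent.
func BudgetRemaining(failed, total, slo float64) float64 {
	if total == 0 {
		return 1
	}
	return 1 - failed/(total*(1-slo))
}

func main() {
	// Hypothetical month: 1,000,000 requests, 270 of them failed, 99.9% SLO.
	fmt.Printf("remaining: %.1f%%\n", BudgetRemaining(270, 1_000_000, 0.999)*100)

	// Same traffic with 2,000 failures: twice the allowance has been spent.
	fmt.Printf("remaining: %.1f%%\n", BudgetRemaining(2000, 1_000_000, 0.999)*100)
}
```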
Pair it with a Slack notification webhook so engineers see the budget tile drift before the page fires:
# alertmanager.yml — route the warning-tier ErrorBudgetBurnSlow to Slack
receivers:
- name: checkout-slack
slack_configs:
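# Alertmanager does not expand environment variables; substitute this placeholder when rendering the config.
- api_url: '${SLACK_WEBHOOK_URL}'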
channel: '#oncall-checkout'
title: '{{ .CommonLabels.alertname }} — {{ .CommonLabels.severity }}'
text: |
{{ range .Alerts }}
*Summary:* {{ .Annotations.summary }}
*Burn rate:* {{ .Annotations.description }}
*Runbook:* https://runbooks.example/slo-burn
{{ end }}
send_resolved: true
A simple budget-remaining query you can run ad hoc (e.g., from promtool query instant) when an executive asks "how close are we to violating the SLO this month?":
promtool query instant http://prometheus:9090 \
'slo_error_budget_remaining_ratio * 100'
# The rule aggregates with sum(), so its output carries no job label to filter on.
# A value of 73.4 means 73.4% of the 30-day budget is still available.
User-journey SLOs vs API-endpoint SLOs
Per-endpoint SLOs lie about user experience. A checkout flow that calls product-service, cart-service, payment-service, and order-service can have every endpoint at 99.95% availability and still drop one in five hundred customers — because each microservice failure compounds along the journey. The user does not care that POST /payments/charge met its SLO. The user cares that the journey from "Add to cart" to "Order confirmed" worked end-to-end[Beyer et al., 2016].
Composite SLOs measure what users actually experience. A checkout journey SLO instruments the user-facing funnel — typically with a journey-id attached at the gateway and propagated through every hop — and counts a journey successful only when every step in the chain succeeds within the latency budget. The math is unforgiving: four 99.95% services chained sequentially produce a journey availability of 0.9995^4 = 99.80%, which equals 86 minutes of monthly downtime instead of the 21.6 minutes each individual service promises[Beyer et al., 2016].
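The compounding is worth verifying rather than trusting. A small sketch, assuming every hop is required and failures are independent (real services often fail together, which changes the result); `JourneyAvailability` is an illustrative name:

```go
package main

import "fmt"

// JourneyAvailability multiplies the availability of each sequential hop.
// Assumes failures are independent and every hop is required.
func JourneyAvailability(hops ...float64) float64 {
	a := 1.0
	for _, h := range hops {
		a *= h
	}
	return a
}

func main() {
	journey := JourneyAvailability(0.9995, 0.9995, 0.9995, 0.9995)
	downtime := (1 - journey) * 30 * 24 * 60 // minutes per 30 days

	fmt.Printf("journey availability: %.2f%%\n", journey*100)
	fmt.Printf("allowed downtime implied: %.0f min per 30 days\n", downtime)
	fmt.Printf("failed journeys: about 1 in %.0f\n", 1/(1-journey))
}
```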
The instrumentation pattern is a journey counter recorded at the terminal step, with a label for the failed stage when the journey fails. This lets you keep one SLO for the whole flow and still answer "where are journeys dying?" without joining four endpoint metrics:
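package journey
import (
	"context"
	"time"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)
// journeyLatencyBudget is the user-perceived end-to-end latency threshold.
const journeyLatencyBudget = 4 * time.Second
// Journey is a minimal shape for one finished checkout attempt (field names are illustrative).
type Journey struct {
	Err         error
	Duration    time.Duration
	FailedStage string // populated by the step that returned the error
}
// Emit one metric per completed journey, labelled by terminal status and failed stage.
var journeyOutcomes = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "checkout_journey_outcomes_total",
		Help: "Checkout journeys by terminal status and failed stage",
	},
	// failed_stage = "" on success; "cart" | "payment" | "fulfilment" | ... on failure
	[]string{"outcome", "failed_stage"},
)
func RecordJourney(ctx context.Context, j Journey) {
	if j.Err == nil && j.Duration <= journeyLatencyBudget {
		journeyOutcomes.WithLabelValues("success", "").Inc()
		return
	}
	stage := j.FailedStage
	if stage == "" {
		stage = "unknown" // keep unattributed errors visible to failed_stage!="" queries
		if j.Err == nil {
			stage = "latency" // no error, so the journey only missed the latency budget
		}
	}
	journeyOutcomes.WithLabelValues("failure", stage).Inc()
}
The recording rule then computes both the headline journey SLI and a per-stage attribution view that points the on-call at the right service without dashboard hunting: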
# Headline: journey-level success rate over 30 days (the SLO that matters)
sum(rate(checkout_journey_outcomes_total{outcome="success"}[30d]))
/ sum(rate(checkout_journey_outcomes_total[30d]))
# Attribution: which stage is consuming the most journey error budget right now?
topk(3,
sum by (failed_stage) (
rate(checkout_journey_outcomes_total{outcome="failure", failed_stage!=""}[6h])
)
)
In our experience, the attribution query is the operational payoff. When a journey-level burn-rate alert fires, the first dashboard tile shows that 73% of failures in the last six hours are tagged failed_stage="payment", so the on-call pages the payments team instead of opening four runbooks. Without the journey label, you would see only "checkout availability dropped" and spend twenty minutes correlating endpoint dashboards by hand.
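The ranking that the attribution query performs can be reproduced in plain Go, which is handy for testing alert-routing logic without a Prometheus server. A sketch with made-up failure counts; `StageShare` and `TopFailedStages` are hypothetical names:

```go
package main

import (
	"fmt"
	"sort"
)

// StageShare is one stage's share of all failed journeys (0 to 1).
type StageShare struct {
	Stage string
	Share float64
}

// TopFailedStages returns up to k stages ranked by their share of failures,
// breaking ties alphabetically so the output is deterministic.
func TopFailedStages(failures map[string]float64, k int) []StageShare {
	var total float64
	for _, n := range failures {
		total += n
	}
	if total == 0 || k <= 0 {
		return nil
	}
	out := make([]StageShare, 0, len(failures))
	for stage, n := range failures {
		out = append(out, StageShare{Stage: stage, Share: n / total})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Share != out[j].Share {
			return out[i].Share > out[j].Share
		}
		return out[i].Stage < out[j].Stage
	})
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	// Hypothetical failure counts over the last six hours.
	failures := map[string]float64{"payment": 73, "cart": 17, "latency": 10}
	for _, s := range TopFailedStages(failures, 3) {
		fmt.Printf("%-8s %.0f%%\n", s.Stage, s.Share*100)
	}
}
```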
Two non-obvious rules from running journey SLOs in production:
- Set the journey latency budget at the user-perceived boundary, not the sum of per-service p99s. If product expects checkout to feel snappy under 4 seconds end-to-end, that 4 seconds is the SLO — even if the four downstream services budget for 1.5 seconds each and "fit" on paper. Tail amplification (a 1% slow rate at each of four services compounds to ~3.9% slow journeys) means the per-service math always understates the journey p99[Dean & Barroso, 2013].
- Do not double-count budget. Keep per-service SLOs as health signals that page the owning team, but only the journey SLO gates the deployment policy. Otherwise a payments deploy gets blocked because cart-service burned its independent budget on an unrelated incident, which trains teams to ignore the policy.
The endpoint SLOs still earn their keep — they tell the cart team their service is fine when checkout is failing — but the journey SLO is what product, on-call, and the error-budget policy meeting all reference. Build it the moment you have more than two services in a critical user flow.
Frequently Asked Questions
What is an error budget and how is it calculated?
An error budget is the inverse of your SLO target — the amount of unreliability you can tolerate. A 99.9% SLO gives you a 0.1% error budget, which translates to about 43 minutes of allowed downtime per month[Beyer et al., 2016]. When the budget is exhausted, the policy dictates halting risky deployments until it recovers.
What is the difference between an SLO and an SLA?
An SLO (Service Level Objective) is an internal reliability target your team sets and monitors. An SLA (Service Level Agreement) is an external contractual commitment with financial penalties for violations. SLOs should always be stricter than SLAs to provide a safety margin.
What is multi-window burn rate alerting?
Multi-window burn rate alerting (from the Google SRE Workbook) triggers alerts based on how fast you are consuming your error budget relative to the budget period. It uses multiple time windows (e.g., 1-hour and 6-hour) to distinguish sustained burns from brief spikes, reducing alert noise while catching real incidents.
How do you choose the right SLO target for a service?
Base your SLO on user expectations and business impact, not on what your system currently achieves. Start with a slightly lower target than current performance, measure for a quarter, then tighten. Four nines (99.99%) means only 4 minutes of monthly downtime — most services should start at 99.5-99.9%[Beyer et al., 2016].
Keep Reading
- The 3 Pillars of Observability — Prometheus metrics, structured logging, and OpenTelemetry traces that feed the SLIs your error budgets depend on
- Terraform in Production — Modules, state management, and CI/CD pipelines for the infrastructure that backs your SLO targets
- Building Resilient Distributed Systems with Go — Circuit breakers, bulkheads, and retry policies that protect your error budget from cascading failures
- Linux Commands Cheat Sheet — When the burn-rate alert fires, the next layer of triage is SSH into the box: ss, lsof, journalctl, top
- Distributed Rate Limiting (Probabilistic Drop) — Shed load before it burns the error budget; drop_ratio converges global RPS to the SLO without per-request coordination
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.