SRE Guide to SLOs, SLIs, and Error Budgets: A Production Playbook

BackendBytes Engineering Team

Key Takeaways

  • Four nines (99.99%) means your entire month's error budget is gone after a 4-minute outage — most services should start at 99.5-99.9%, chosen based on business impact not current performance
  • Error budgets turn 'we should focus on reliability' from an engineering opinion into an organizational policy — when budget hits zero, deployments stop automatically, and teams routinely pre-save budget heading into peak seasons
  • Multi-window burn-rate alerting (1h + 6h windows) catches real incidents while ignoring spikes: a 14.4× burn rate exhausts a month's budget in 50 hours and pages immediately, whereas single-threshold alerts fire too often or too late
  • Your histogram buckets must include your SLO threshold as an explicit boundary: without `le="0.2"` as a bucket, Prometheus interpolation makes the latency measurement imprecise and the SLI unreliable
  • SLI definition matters more than SLO value — measure what users experience (requests succeeded, under latency threshold, data correctness), not infrastructure (CPU, memory); application-level failures like `{status: declined}` are invisible to HTTP metrics

The pattern error budgets enable, sketched as a composite of the kind of incident postmortems you'll read on engineering blogs:

  • A team's quarterly error budget is mostly gone by November after a string of small incidents. None of the incidents triggered a P0 alert; CPU and latency dashboards stayed green. But the SLO burn rate doesn't lie — keep the same pace through December and the budget is exhausted before holiday traffic peaks.
  • Policy triggers automatically: feature freeze, reliability squad, work resumes when the budget recovers. Features that get delayed ship in January. Incident frequency drops the following quarter.

The real value: error budgets convert "we should focus on reliability" from an engineering opinion into an organizational fact. When the budget is gone, the data makes the decision — not the loudest VP. This is the framework Google's SRE practice formalised[Beyer et al., 2016].

Key Points

Error budgets = the inverse of your SLO. A 99.9% SLO gives you 43 minutes of monthly downtime to spend on incidents or risky deploys[Beyer et al., 2016]. Multi-window burn-rate alerts (combining 1h + 6h windows) catch real incidents while ignoring spikes. When budget hits zero, deployment policy stops feature work.

  • SLI = what users experience (requests succeeded, under 200ms latency, data correct)
  • SLO = internal target (99.9% of requests succeed over 30 days)[Beyer et al., 2016]
  • Burn rate = how fast you're consuming budget; 14.4× = exhausts entire month in 50 hours

The quick start: SLIs, SLOs, and error budgets

The relationship between SLI, SLO, and error budget in one picture — measure → target → spend:

graph LR
    Users[Users send requests] --> Service[Service handles them]
    Service -->|measure user experience| SLI[SLI<br/>Service Level Indicator<br/>e.g. p99 latency, success rate]
    SLI -->|compare to target| SLO[SLO<br/>Service Level Objective<br/>e.g. 99.9% under 200 ms]
    SLO -->|inverse| Budget[Error Budget<br/>0.1% = 43 min/month<br/>downtime allowance]
    Budget -->|spent on| Risk[Risky deploys<br/>infra changes<br/>incidents]
    Budget -.->|burn rate alert<br/>14.4x = page immediately| Page[On-call page]
    Risk -->|drains| Budget
    Page --> Triage[Triage + halt<br/>risky deploys]
    style SLI fill:#dfd
    style SLO fill:#ffd
    style Budget fill:#ffd
    style Page fill:#fdd

The SLI/SLO/Error Budget terminology in this section follows Google's SRE Book[Beyer et al., 2016]. The minute-level downtime conversions come from the standard "nines" math (e.g. 99.9% over 30 days = 43.2 minutes of allowed downtime).

| Concept [Beyer et al., 2016] | What it measures | Example |
| --- | --- | --- |
| SLI (Service Level Indicator) | Quantitative metric users experience: success rate, latency, quality. Not CPU, not memory. | "94% of requests completed under 200ms this hour" |
| SLO (Service Level Objective) | Internal reliability target over a time window (usually 30 days). | "95% of requests must complete under 200ms" |
| Error Budget | The failure allowance: the inverse of your SLO. A 99.9% SLO = 0.1% error budget = 43 minutes of downtime per month. | "Our 99.9% SLO gives us 43 min/month to spend on incidents or risky deploys." |
| Burn Rate | How fast you're consuming your error budget relative to sustainable speed (1.0× = on-track for month-end). | "A 14.4× burn rate exhausts a 30-day budget in ~50 hours. Page immediately." |

SLO targets depend on service criticality. Tier examples we've seen in production: checkout and payments at 99.95% (21.6 min/month allowed downtime), feed and search at 99.9% (43 min/month), recommendations and analytics at 99.5% (3.6 hours/month). The rule: an SLO that merely mirrors current performance (a system running at 99.95% with a 99.95% target) leaves no budget to spend and says nothing about what users need. Start slightly below what you achieve today and tighten the target after a quarter of good performance data[Beyer et al., 2016].
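The downtime figures for each tier come from a single multiplication. A minimal Go sketch of that arithmetic, assuming a 30-day window and treating the budget purely as minutes of full downtime; `AllowedDowntimeMinutes` is an illustrative name, not a library function:

```go
package main

import "fmt"

// AllowedDowntimeMinutes returns the downtime allowance, in minutes, for an
// availability target (for example 0.999) over a window of windowDays days.
func AllowedDowntimeMinutes(slo float64, windowDays int) float64 {
    windowMinutes := float64(windowDays) * 24 * 60
    return (1 - slo) * windowMinutes
}

func main() {
    // Print the allowance for each tier mentioned in the text, plus four nines.
    for _, slo := range []float64{0.995, 0.999, 0.9995, 0.9999} {
        fmt.Printf("%.2f%% -> %.1f minutes per 30 days\n",
            slo*100, AllowedDowntimeMinutes(slo, 30))
    }
}
```

For 99.5% the result is 216 minutes, which is the 3.6 hours quoted for the recommendations tier.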


Step 1: Instrument your service with SLI metrics

[Prometheus Best Practices]

Wrap your HTTP router with a middleware that captures what users experience: success (2xx/3xx), latency, and errors (5xx). Record these as Prometheus counters and histograms.

package sli
 
import (
    "net/http"
    "strconv"
    "time"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)
 
var (
    requestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests by status class",
        },
        []string{"service", "method", "status_class"},
    )
 
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request latency in seconds",
            // Critical: include your SLO threshold (e.g., 0.2s) as an explicit bucket.
            // histogram_quantile() can only interpolate within bucket boundaries.
            Buckets: []float64{0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.5, 5.0},
        },
        []string{"service", "method"},
    )
)
 
// Middleware for HTTP router
func SLIMiddleware(service string, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(rec, r)
 
        duration := time.Since(start).Seconds()
        statusClass := strconv.Itoa(rec.status/100) + "xx"
 
        requestsTotal.WithLabelValues(service, r.Method, statusClass).Inc()
        requestDuration.WithLabelValues(service, r.Method).Observe(duration)
    })
}
 
type statusRecorder struct {
    http.ResponseWriter
    status int
}
 
func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

Then compute your SLIs directly from these metrics in PromQL:

# Availability: % of requests that succeeded (2xx/3xx)
sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[30d]))
/ sum(rate(http_requests_total{service="checkout"}[30d]))
 
# Latency: % of requests under 200ms threshold
sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.2"}[30d]))
/ sum(rate(http_request_duration_seconds_count{service="checkout"}[30d]))

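Key rule: Your histogram buckets must include your SLO threshold as an explicit boundary. Without an le="0.2" bucket there is no series that counts requests under 200ms, so the latency SLI has to be estimated by interpolating between the neighbouring buckets, and that estimate can drift well away from the true value. See The 3 Pillars of Observability for the full metrics + logging + traces stack.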
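To see why the explicit boundary matters, here is a rough Go sketch that estimates the share of requests under a threshold from cumulative bucket counts. It assumes observations are spread uniformly inside each bucket (the same assumption `histogram_quantile()` makes when it interpolates). The `Bucket` type, the `FractionUnder` helper, and the request counts are all hypothetical:

```go
package main

import "fmt"

// Bucket mirrors one cumulative histogram bucket: Count is the number of
// observations less than or equal to UpperBound (the "le" label).
type Bucket struct {
    UpperBound float64
    Count      float64
}

// FractionUnder estimates the share of observations at or below threshold.
// exact is true only when threshold is itself a bucket boundary. buckets must
// be sorted by UpperBound and total is the histogram's overall count.
func FractionUnder(buckets []Bucket, total, threshold float64) (fraction float64, exact bool) {
    if total == 0 {
        return 0, false
    }
    lowerBound, lowerCount := 0.0, 0.0
    for _, b := range buckets {
        if threshold == b.UpperBound {
            return b.Count / total, true
        }
        if threshold < b.UpperBound {
            // Interpolate linearly between the two surrounding boundaries.
            share := (threshold - lowerBound) / (b.UpperBound - lowerBound)
            return (lowerCount + share*(b.Count-lowerCount)) / total, false
        }
        lowerBound, lowerCount = b.UpperBound, b.Count
    }
    // Threshold is above the largest finite bucket: only a lower bound is known.
    return lowerCount / total, false
}

func main() {
    // Hypothetical traffic: 1000 requests, 980 of which finished within 200ms.
    without := []Bucket{{0.1, 900}, {0.5, 990}, {1.0, 1000}}
    with := []Bucket{{0.1, 900}, {0.2, 980}, {0.5, 990}, {1.0, 1000}}

    f, exact := FractionUnder(without, 1000, 0.2)
    fmt.Printf("no 0.2 bucket:   %.4f (exact=%v)\n", f, exact)
    f, exact = FractionUnder(with, 1000, 0.2)
    fmt.Printf("with 0.2 bucket: %.4f (exact=%v)\n", f, exact)
}
```

With these made-up counts the interpolated estimate is 92.25% while the true value is 98%, which is enough of a gap to flip a 95% latency SLO from met to missed.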


Step 2: Define SLO targets and error budget policy

Write down your SLO and the actions that follow when budget is spent. Without a written policy, reliability discussions default to hierarchy and shouting.

A 99.95% SLO (checkout, payments) gives you 21.6 minutes of monthly error budget. A 99.9% SLO (feed, search) gives you 43 minutes, and a 99.5% SLO (recommendations, analytics) gives you 3.6 hours. Choose based on user impact, not on what your current system achieves[Beyer et al., 2016].

The SLO Document — one page, signed off by engineering and product leadership:

### Checkout Service SLO — Q1 2026
 
### SLO Targets (30-day rolling window)
 
- Availability: 99.95% → 21.6 min/month error budget
- Latency: 95% of requests < 200ms → 5% slow budget
- Quality: 99.9% error-free responses
 
### Error Budget Policy
 
- > 75% remaining: Normal deployment velocity
- 50–75%: Cautious deploys; staging validation required
- 25–50%: Reliability focus; defer risky changes
- < 25%: Feature freeze; reliability work only
- 0%: Emergency fixes only; VP Engineering notified
 
### Review Cadence
 
Monthly SLO review. Quarterly target adjustment.

Store this in your wiki or runbook. The policy removes politics: when a product manager asks "why can't we ship feature X?", the answer is "our error budget is at 18%, and our own policy says no non-critical deploys below 25%." Agreed in advance, enforced by data, not opinions.
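The policy tiers are simple enough to encode, which is what makes them enforceable by a pipeline rather than a meeting. A minimal Go sketch, assuming the thresholds from the example SLO document; the `PolicyAction` name is hypothetical, and the handling of values that land exactly on 25%, 50%, or 75% is a choice made here because the document does not specify it:

```go
package main

import "fmt"

// PolicyAction maps the fraction of error budget remaining (0.0 to 1.0) to the
// deployment posture from the example SLO document.
func PolicyAction(remaining float64) string {
    switch {
    case remaining <= 0:
        return "emergency fixes only; notify VP Engineering"
    case remaining < 0.25:
        return "feature freeze; reliability work only"
    case remaining < 0.50:
        return "reliability focus; defer risky changes"
    case remaining <= 0.75:
        return "cautious deploys; staging validation required"
    default:
        return "normal deployment velocity"
    }
}

func main() {
    for _, r := range []float64{0.90, 0.60, 0.40, 0.18, 0} {
        fmt.Printf("%3.0f%% remaining: %s\n", r*100, PolicyAction(r))
    }
}
```

The 18% case from the product-manager example lands in the feature-freeze tier.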


Step 3: Multi-window burn-rate alerting

[Beyer et al., 2016]

A single-threshold alert ("alert when error rate > 1% for 5 min") fires too often or too late. A 5-minute spike from a transient failure looks identical to the start of a real incident.

Burn rate = (current error rate) / (tolerable error rate). For a 99.9% SLO, the tolerable error rate is 0.1%. If you're actually erroring at 1.4%, burn rate = 1.4 / 0.1 = 14×. The Google SRE Workbook uses 14.4× as its 1-hour-window paging threshold: at that rate a full 30-day budget is gone in ~50 hours (720 h / 14.4), and 2% of it disappears in the first hour[Beyer et al., 2016]. A 6× burn rate (monthly budget exhausted in 5 days) warrants investigation but not paging at 2am.

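Multi-window alerting requires both a short window (to detect incidents quickly) and a long window (to confirm they're sustained, not spikes). A 5-minute error spike lights up the 1-hour window but barely registers in a 6-hour window, so the alert stays silent. The Workbook's own recipe pairs each window with a shorter one a twelfth its length (1h with 5m, 6h with 30m) so that alerts also reset quickly; the 1h + 6h pairing used here is a coarser variant of the same idea.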

graph TD
    ErrRate["Current error rate"] --> Burn["burn rate<br/>= error rate / (1 - SLO)"]
    Burn --> Short{"Short window<br/>(1h) burn ≥ 14.4×?"}
    Burn --> Long{"Long window<br/>(6h) burn ≥ 14.4×?"}
    Short -->|yes| AndGate{"AND"}
    Long -->|yes| AndGate
    AndGate -->|both| Page["PAGE<br/>(budget dies in ~50h)"]
    Short -->|yes, long=no| Ignore["suppress<br/>(transient spike)"]
    Long -->|yes, short=no| Ignore

The AND gate is what turns raw burn-rate into actionable paging. A single-window check fires on every 5-minute blip; a two-window check confirms the incident is still hot after enough signal has accumulated to separate noise from a real outage.
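The gate is easier to reason about with numbers. A small Go sketch of the burn-rate arithmetic and the two-window check, assuming a 99.9% SLO, a 14.4× threshold, and uniform traffic; the function names are illustrative and the error ratios are invented:

```go
package main

import "fmt"

// BurnRate is the observed error ratio divided by the ratio the SLO tolerates.
func BurnRate(errorRatio, slo float64) float64 {
    return errorRatio / (1 - slo)
}

// HoursToExhaustion is how long a full budget lasts at a constant burn rate.
func HoursToExhaustion(burnRate float64, windowDays int) float64 {
    return float64(windowDays) * 24 / burnRate
}

// ShouldPage applies the AND gate: both windows must exceed the threshold.
func ShouldPage(shortWindowBurn, longWindowBurn, threshold float64) bool {
    return shortWindowBurn > threshold && longWindowBurn > threshold
}

func main() {
    const slo, threshold = 0.999, 14.4

    // A 5-minute total outage: the error ratio is 5/60 over the last hour
    // but only 5/360 over the last six hours.
    short, long := BurnRate(5.0/60, slo), BurnRate(5.0/360, slo)
    fmt.Printf("spike:     1h=%.1fx 6h=%.1fx page=%v\n",
        short, long, ShouldPage(short, long, threshold))

    // A sustained 2% error ratio looks the same in both windows.
    short, long = BurnRate(0.02, slo), BurnRate(0.02, slo)
    fmt.Printf("sustained: 1h=%.1fx 6h=%.1fx page=%v\n",
        short, long, ShouldPage(short, long, threshold))

    fmt.Printf("14.4x exhausts a 30-day budget in %.0f hours\n",
        HoursToExhaustion(14.4, 30))
}
```

The spike burns at roughly 83× in the 1-hour window but just under 14× in the 6-hour window, so the gate stays closed. The sustained 2% error ratio burns at 20× in both windows and pages.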

groups:
  - name: error_budget_burn
    rules:
      # CRITICAL: 14.4× burn sustained for 1h + 6h
      - alert: ErrorBudgetBurnCritical
        expr: |
          (1 - sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[1h]))
               / sum(rate(http_requests_total{service="checkout"}[1h])))
          / (1 - 0.9995) > 14.4
          AND
          (1 - sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[6h]))
               / sum(rate(http_requests_total{service="checkout"}[6h])))
          / (1 - 0.9995) > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning at 14.4× sustainable rate (50 hr to exhaustion)"
 
      # HIGH: 6× burn sustained for 6h + 1d
      - alert: ErrorBudgetBurnHigh
        expr: |
          (1 - sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[6h]))
               / sum(rate(http_requests_total{service="checkout"}[6h])))
          / (1 - 0.9995) > 6
          AND
          (1 - sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[1d]))
               / sum(rate(http_requests_total{service="checkout"}[1d])))
          / (1 - 0.9995) > 6
        for: 15m
        labels:
          severity: high
        annotations:
          summary: "Error budget burning at 6× rate (~5 days to exhaustion)"

Add these queries to your Grafana dashboard for real-time visibility:

# Error budget remaining (0 to 1)
1 - ((1 - sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[30d]))
           / sum(rate(http_requests_total{service="checkout"}[30d])))
     / (1 - 0.9995))
 
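# Hours a full 30-day budget would last at the current 6h error rate
# (multiply by the remaining-budget ratio above to get time left).
# clamp_min() guards against division by zero; max() is an aggregation and takes no floor argument.
(1 - 0.9995) * 30 * 24
/ clamp_min(1 - sum(rate(http_requests_total{service="checkout", status_class=~"2xx|3xx"}[6h]))
                / sum(rate(http_requests_total{service="checkout"}[6h])), 0.000001)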

Step 4: Error budgets in practice

In our experience, when a team's error budget hits a critical threshold — say, 22% remaining with high burn rate — the policy triggers automatically: feature freeze until budget recovers. The features ship next quarter. Incident frequency drops because the team focused on reliability instead of shipping under pressure.

Without error budgets, that conversation is a negotiation between engineers (who feel the pain) and product managers (who feel shipping pressure). With error budgets, the policy makes the decision weeks before the pressure arrives. Build the framework before you need it.

Common error budget policies in practice:

  • Quarterly budget reviews: Services consistently consuming <30% of budget get tighter SLOs in the next quarter (cost-benefit trade-off).
  • Deployment gating: Pipeline checks error budget before releasing. If budget < 25%, deployment is blocked with "wait for recovery or file exception" message[Beyer et al., 2016]. (Google's SRE Workbook documents this pattern.)
  • Post-mortem budget tracking: Incident post-mortems include a "budget impact" section. A 2% budget burn from a partial outage is triaged differently than a 15% burn from a complete outage of the same duration.
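For the post-mortem "budget impact" line, the number can be derived from the incident's duration and the share of requests that failed. A rough Go sketch, assuming traffic is roughly uniform across the window; `BudgetImpact` is a hypothetical helper and the incident figures are invented:

```go
package main

import "fmt"

// BudgetImpact returns the fraction of the window's error budget consumed by
// an incident that lasted incidentMinutes while failedFraction of requests
// failed (1.0 for a complete outage).
func BudgetImpact(incidentMinutes, failedFraction, slo float64, windowDays int) float64 {
    budgetMinutes := (1 - slo) * float64(windowDays) * 24 * 60
    return incidentMinutes * failedFraction / budgetMinutes
}

func main() {
    // Two 10-minute incidents against a 99.9% SLO over 30 days.
    full := BudgetImpact(10, 1.0, 0.999, 30)
    partial := BudgetImpact(10, 0.1, 0.999, 30)
    fmt.Printf("complete outage: %.1f%% of budget\n", full*100)
    fmt.Printf("10%% of requests failing: %.1f%% of budget\n", partial*100)
}
```

Same duration, very different cost: about 23% of the monthly budget for the complete outage against about 2% for the partial one.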

Production checklist

  • Instrumentation: Wrap HTTP router with SLIMiddleware. Verify http_requests_total and http_request_duration_seconds appear in Prometheus with correct labels.
  • SLO definition: Write SLO document with error budget policy. Get sign-off from engineering lead and product manager.
  • Histogram buckets: Include your latency SLO threshold (e.g., 0.2s) as an explicit bucket boundary. Without it, PromQL measurement is imprecise.
  • Alerting rules: Deploy multi-window burn-rate alerts (critical: 14.4× for 1h+6h; high: 6× for 6h+1d). Test PagerDuty integration.
  • Dashboards: Add "error budget remaining" and "hours to exhaustion" queries to main SRE dashboard.
  • Monitoring: Track actual vs. predicted budget consumption monthly. Adjust SLO targets quarterly based on performance data.

Multi-window burn-rate alerts you can paste directly

The two-tier policy follows the approach in Google's SRE Workbook: page on fast burn (14.4×, which depletes the monthly budget in ~50 hours if unchecked), ticket on slow burn (6×, which depletes it in ~5 days). The 1h+6h pair catches acute incidents; the 6h+1d pair catches sustained drift the short-window check misses:

# alerts/slo.yml: 99.9% availability SLO over a 30-day window
# Label names match the Step 1 middleware (service, status_class).
# Only 5xx responses count as errors here; 4xx is treated as the client's fault.
groups:
  - name: slo.checkout.availability
    interval: 30s
    rules:
      - record: slo:sli_error:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{service="checkout",status_class="5xx"}[1h]))
          / sum(rate(http_requests_total{service="checkout"}[1h]))

      - record: slo:sli_error:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{service="checkout",status_class="5xx"}[6h]))
          / sum(rate(http_requests_total{service="checkout"}[6h]))

      - record: slo:sli_error:ratio_rate1d
        expr: |
          sum(rate(http_requests_total{service="checkout",status_class="5xx"}[1d]))
          / sum(rate(http_requests_total{service="checkout"}[1d]))

      # Both windows must exceed the same burn-rate threshold.
      - alert: ErrorBudgetBurnFast
        expr: |
          slo:sli_error:ratio_rate1h > (14.4 * 0.001)
          and
          slo:sli_error:ratio_rate6h > (14.4 * 0.001)
        for: 2m
        labels: { severity: critical, team: checkout }
        annotations:
          summary: "Checkout burning SLO budget at 14.4×: page now"
          description: "At the current burn rate the 30-day budget exhausts in about 50 hours. Investigate before users notice."

      - alert: ErrorBudgetBurnSlow
        expr: |
          slo:sli_error:ratio_rate6h > (6 * 0.001)
          and
          slo:sli_error:ratio_rate1d > (6 * 0.001)
        for: 15m
        labels: { severity: warning, team: checkout }
        annotations:
          summary: "Checkout burning SLO budget at 6×: open a ticket"
          description: "Sustained low-grade error rate. Will exhaust budget in ~5 days if uncorrected."

The query that powers your "budget remaining" dashboard tile emits the fraction of the 30-day budget still available (multiply by 100 for a percentage), so the on-call sees a figure such as 73% remaining instead of guessing from raw error rates:

# slo_error_budget_remaining_ratio: paste as a recording rule for caching
# Label names match the Step 1 middleware (service, status_class).
1 - (
  sum(increase(http_requests_total{service="checkout",status_class="5xx"}[30d]))
  /
  (
    sum(increase(http_requests_total{service="checkout"}[30d]))
    * 0.001    # the 0.1% allowed-error fraction = (1 - 0.999)
  )
)

The 0.001 is the only knob — change it to match your SLO target (0.0001 for 99.99%, 0.005 for 99.5%). Everything else stays[Beyer et al., 2016].
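The same ratio is easy to sanity-check by hand or in a unit test. A minimal Go sketch of the arithmetic behind the budget-remaining query, assuming you already have failed and total request counts for the window; `BudgetRemaining` is a hypothetical helper and the counts are made up:

```go
package main

import "fmt"

// BudgetRemaining returns the fraction of error budget left given the failed
// and total request counts in the window. It goes negative once the budget is
// overspent and returns 1 when there has been no traffic.
func BudgetRemaining(failed, total, slo float64) float64 {
    if total == 0 {
        return 1
    }
    allowedFailures := total * (1 - slo)
    return 1 - failed/allowedFailures
}

func main() {
    // 1,000,000 requests at 99.9% allow 1,000 failures; 270 have occurred.
    fmt.Printf("remaining: %.0f%%\n", BudgetRemaining(270, 1_000_000, 0.999)*100)
    // Twice the allowance has been spent: the budget is 100% overdrawn.
    fmt.Printf("overspent: %.0f%%\n", BudgetRemaining(2000, 1_000_000, 0.999)*100)
}
```

A negative result is worth keeping rather than clamping to zero, since it tells the policy review how far past the target the service went.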

Pair it with a Slack notification webhook so engineers see the budget tile drift before the page fires:

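# alertmanager.yml (fragment): route the warning-tier ErrorBudgetBurnSlow to Slack.
# Merge the child route under your existing top-level route.
route:
  routes:
    - matchers:
        - alertname = "ErrorBudgetBurnSlow"
      receiver: checkout-slack

receivers:
  - name: checkout-slack
    slack_configs:
      # Alertmanager does not expand environment variables in its config file.
      # Substitute this placeholder at deploy time (for example with envsubst).
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#oncall-checkout'
        title: '{{ .CommonLabels.alertname }} — {{ .CommonLabels.severity }}'
        text: |
          {{ range .Alerts }}
          *Summary:* {{ .Annotations.summary }}
          *Details:* {{ .Annotations.description }}
          *Runbook:* https://runbooks.example/slo-burn
          {{ end }}
        send_resolved: true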

A simple budget-remaining query you can run ad-hoc (e.g., from promtool query instant) when an executive asks "how close are we to violating the SLO this month?":

promtool query instant http://prometheus:9090 \
  'slo_error_budget_remaining_ratio * 100'
# No label selector: sum() in the recording rule aggregates the labels away.
# A result of 73.4 means 73.4% of the 30-day budget is still available.

User-journey SLOs vs API-endpoint SLOs

Per-endpoint SLOs lie about user experience. A checkout flow that calls product-service, cart-service, payment-service, and order-service can have every endpoint at 99.95% availability and still drop one in five hundred customers — because each microservice failure compounds along the journey. The user does not care that POST /payments/charge met its SLO. The user cares that the journey from "Add to cart" to "Order confirmed" worked end-to-end[Beyer et al., 2016].

Composite SLOs measure what users actually experience. A checkout journey SLO instruments the user-facing funnel — typically with a journey-id attached at the gateway and propagated through every hop — and counts a journey successful only when every step in the chain succeeds within the latency budget. The math is unforgiving: four 99.95% services chained sequentially produce a journey availability of 0.9995^4 = 99.80%, which equals 86 minutes of monthly downtime instead of the 21.6 minutes each individual service promises[Beyer et al., 2016].
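The compounding is worth computing for your own chain rather than taking on faith. A short Go sketch, assuming the steps run in sequence and fail independently (real dependencies often share failure causes, so treat the result as an estimate); `JourneyAvailability` is an illustrative name:

```go
package main

import "fmt"

// JourneyAvailability multiplies the availability of each sequential step.
// It assumes step failures are independent of one another.
func JourneyAvailability(steps ...float64) float64 {
    availability := 1.0
    for _, s := range steps {
        availability *= s
    }
    return availability
}

func main() {
    // Four services, each meeting a 99.95% availability SLO.
    journey := JourneyAvailability(0.9995, 0.9995, 0.9995, 0.9995)
    downtimeMinutes := (1 - journey) * 30 * 24 * 60
    fmt.Printf("journey availability: %.2f%%\n", journey*100)
    fmt.Printf("equivalent downtime:  %.0f minutes per 30 days\n", downtimeMinutes)
}
```

Adding a fifth 99.95% hop drops the journey to about 99.75%, so every service added to the critical path spends budget before any incident happens.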

The instrumentation pattern is a journey counter recorded at the terminal step, with a label for the failed stage when the journey fails. This lets you keep one SLO for the whole flow and still answer "where are journeys dying?" without joining four endpoint metrics:

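import (
    "context"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

const journeyLatencyBudget = 4 * time.Second

// Journey is the minimal shape this sketch assumes; adapt it to your own journey tracking.
type Journey struct {
    Err         error         // non-nil if any step failed
    FailedStage string        // populated by the step that returned the error
    Duration    time.Duration // wall-clock time from first step to terminal step
}

// Emit one metric per completed journey, labelled by terminal status and failed stage.
var journeyOutcomes = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "checkout_journey_outcomes_total",
        Help: "Checkout journeys by terminal status and failed stage",
    },
    // failed_stage = "" on success; "cart" | "payment" | "fulfilment" | "latency" | "unknown" on failure
    []string{"outcome", "failed_stage"},
)

func RecordJourney(ctx context.Context, j Journey) {
    if j.Err == nil && j.Duration <= journeyLatencyBudget {
        journeyOutcomes.WithLabelValues("success", "").Inc()
        return
    }
    stage := j.FailedStage
    if stage == "" && j.Duration > journeyLatencyBudget {
        stage = "latency"
    }
    if stage == "" {
        stage = "unknown" // keeps the failure visible to the failed_stage!="" attribution query
    }
    journeyOutcomes.WithLabelValues("failure", stage).Inc()
}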

The recording rule then computes both the headline journey SLI and a per-stage attribution view that points the on-call at the right service without dashboard hunting:

# Headline: journey-level success rate over 30 days (the SLO that matters)
sum(rate(checkout_journey_outcomes_total{outcome="success"}[30d]))
/ sum(rate(checkout_journey_outcomes_total[30d]))
 
# Attribution: which stage is consuming the most journey error budget right now?
topk(3,
  sum by (failed_stage) (
    rate(checkout_journey_outcomes_total{outcome="failure", failed_stage!=""}[6h])
  )
)

In our experience, the attribution query is the operational payoff. When a journey-level burn-rate alert fires, the first dashboard tile shows that 73% of failures in the last six hours are tagged failed_stage="payment", so the on-call pages the payments team instead of opening four runbooks. Without the journey label, you would see only "checkout availability dropped" and spend twenty minutes correlating endpoint dashboards by hand.

Two non-obvious rules from running journey SLOs in production:

  • Set the journey latency budget at the user-perceived boundary, not the sum of per-service p99s. If product expects checkout to feel snappy under 4 seconds end-to-end, that 4 seconds is the SLO, even if the four downstream services budget for 1 second each and "fit" on paper. Tail amplification (a 1% slow rate at each of four services compounds to ~3.9% slow journeys, since 1 - 0.99^4 ≈ 0.039) means the per-service math always understates the journey p99[Dean & Barroso, 2013].
  • Do not double-count budget. Keep per-service SLOs as health signals that page the owning team, but only the journey SLO gates the deployment policy. Otherwise a payments deploy gets blocked because cart-service burned its independent budget on an unrelated incident, which trains teams to ignore the policy.

The endpoint SLOs still earn their keep — they tell the cart team their service is fine when checkout is failing — but the journey SLO is what product, on-call, and the error-budget policy meeting all reference. Build it the moment you have more than two services in a critical user flow.


Frequently Asked Questions

What is an error budget and how is it calculated?

An error budget is the inverse of your SLO target — the amount of unreliability you can tolerate. A 99.9% SLO gives you a 0.1% error budget, which translates to about 43 minutes of allowed downtime per month[Beyer et al., 2016]. When the budget is exhausted, the policy dictates halting risky deployments until it recovers.

What is the difference between an SLO and an SLA?

An SLO (Service Level Objective) is an internal reliability target your team sets and monitors. An SLA (Service Level Agreement) is an external contractual commitment with financial penalties for violations. SLOs should always be stricter than SLAs to provide a safety margin.

What is multi-window burn rate alerting?

Multi-window burn rate alerting (from the Google SRE Workbook) triggers alerts based on how fast you are consuming your error budget relative to the budget period. It uses multiple time windows (e.g., 1-hour and 6-hour) to distinguish sustained burns from brief spikes, reducing alert noise while catching real incidents.

How do you choose the right SLO target for a service?

Base your SLO on user expectations and business impact, not on what your system currently achieves. Start with a slightly lower target than current performance, measure for a quarter, then tighten. Four nines (99.99%) means only 4 minutes of monthly downtime — most services should start at 99.5-99.9%[Beyer et al., 2016].


BackendBytes Engineering Team

A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.
