Postgres EXPLAIN Cheat Sheet: Reading Query Plans Like a Pro
Key Takeaways
- →Compare estimated vs actual rows — divergence >10x signals stale statistics; run ANALYZE on affected tables immediately
- →Nested Loop + large outer table = missing index on the join column; the planner thinks the outer table is small and sequentially probes the inner for each row
- →Rows Removed by Filter >> output means the WHERE clause isn't indexed; the index returns candidates, but filters away most of them — use partial indexes for skewed predicates
- →Hash Join with Batches > 1 means the hash table spilled to disk; either raise work_mem for the session or reduce the input size with earlier WHERE/GROUP BY aggregation
The classic Postgres production query regression. A dashboard query that loaded in 800 ms at launch now takes 12 seconds on 8 million rows. EXPLAIN ANALYZE shows a Nested Loop over a Seq Scan — the planner estimated 200 rows but actual was 4.2 million. A missing composite index on
(tenant_id, created_at)turned a 2 ms Index Scan into a full table scan. We debugged this exact regression on multiple production teams.
How to read a Postgres EXPLAIN plan
The dashboard query that loaded in 800ms at launch now takes 12 seconds on 8M rows. EXPLAIN ANALYZE shows a Nested Loop over a Seq Scan — the planner estimated 200 rows but actual was 4.2 million. A missing composite index on (tenant_id, created_at) turned a 2ms Index Scan into a full table scan. This cheat sheet teaches you to read those numbers and spot the pattern in under a minute.
EXPLAIN shows a tree of operations (read bottom-up)[PostgreSQL Docs]. Compare estimated rows vs actual rows — in our experience, divergence beyond ~10× is the threshold where the planner reliably picks the wrong join strategy and you should run ANALYZE. Use BUFFERS to spot I/O bottlenecks. Watch for Nested Loops with large outer tables, disk spills, and filter throwaway.
- Add
BUFFERSoption to spot cache misses vs CPU bottlenecks - Compare estimated vs actual row counts; in our experience, divergence beyond ~10× = stale stats requiring ANALYZE
- Nested Loop + large outer table = missing index on join column
The quick start — node types
Plan trees are read bottom-up — leaves are scans, internals are joins/aggregations, root is the final shape:
graph TB
Root[Limit + Sort<br/>top operator<br/>final result] --> J1[Hash Join]
J1 --> Build[Build side<br/>smaller relation]
J1 --> Probe[Probe side<br/>larger relation]
Build --> S1[Seq Scan customers<br/>1000 rows]
Probe --> Idx[Index Scan orders<br/>using idx_customer_id]
Idx --> NL[Nested Loop<br/>inside larger plan]
NL --> Outer[Outer: small batch<br/>10 rows]
NL --> Inner[Inner: index lookup<br/>per outer row]
style Root fill:#dfd
style J1 fill:#ffd
style S1 fill:#fdd
style Idx fill:#dfd
style NL fill:#ffd
Postgres plans are trees. Read bottom-up. Each node takes input from children, transforms, emits to parent.
| Node | What it does | Good sign | Bad sign |
|---|---|---|---|
| Seq Scan | Read every row | Small tables, bitmap input | Large table + WHERE clause (missing index) |
| Index Scan | Walk index, fetch rows | Selective predicate (< 5% of table) | Fetching >10% via index (seq scan faster) |
| Index Only Scan | Answer from index alone | Covering query, visibility map warm | Heap fetches (visibility map stale) |
| Bitmap Heap Scan | Combine multiple indexes | Multi-column AND/OR on indexes | High heap recheck (working set > shared_buffers) |
| Nested Loop | Outer loop: probe inner | Tiny outer, indexed inner side | Large outer table, unindexed inner |
| Hash Join | Build hash on small side | Both sides sized correctly | Spill to disk (Batches: > 1) |
| Sort | Sort input rows | Fits in work_mem (quicksort) | Disk sort (external merge Disk: ...kB) |
Reading the numbers
[PostgreSQL Docs, Resource Usage]Seq Scan on orders (cost=0.00..18334.00 rows=1000000 width=8) (actual time=0.010..85.2 rows=998422 loops=1)
cost— startup .. total (units arbitrary; compare estimates only).rows=1000000— planner estimate. Compare with actual. >10x divergence = stale stats.actual time— startup .. total milliseconds per loop.loops— execution count. Multiply time by loops for total cost. [PostgreSQL Docs, Resource Usage]
Diagnose by symptom, not by node type
When EXPLAIN ANALYZE prints a 200-line plan tree, the question is "where does the time go?" Route by symptom:
graph TD
Slow[Query is slow] --> Where{Where does<br/>the time go?}
Where -->|Single node, big actual time| Hot[Find that node<br/>read its inputs and filter]
Where -->|Many small nodes, sum is big| Loops[Look at loops<br/>multiply time per loop]
Hot --> Type{Node type?}
Type -->|Seq Scan on big table| SeqIdx[Add index on<br/>WHERE / JOIN column]
Type -->|Nested Loop, big outer| Nested[Index inner table<br/>or rewrite as Hash Join]
Type -->|Sort, Disk: ...kB| Sort[Raise work_mem<br/>or add covering index]
Type -->|Hash Join, Batches over 1| Hash[Raise work_mem<br/>or pre-aggregate]
Type -->|Index Scan, slow| IdxSlow{Estimate vs<br/>actual rows?}
IdxSlow -->|Off over 10x| Stats[Run ANALYZE<br/>raise default_statistics_target]
IdxSlow -->|Close| Selectivity[Index not selective enough<br/>add columns to make composite]
Loops -->|Loops over 10000| LoopMany[Same answer as Nested Loop<br/>index inner table]
style SeqIdx fill:#dfd
style Nested fill:#dfd
style Sort fill:#dfd
style Hash fill:#dfd
style Stats fill:#dfd
style Selectivity fill:#ffd
The diagram is the cheat-sheet for performance triage[PostgreSQL Docs]. The red flags table below is the same answers in tabular form — pick whichever shape matches your debugging style.
Red flags and fixes
[PostgreSQL Docs]| Red flag | Cause | Fix |
|---|---|---|
| Nested Loop, large outer table | Per-row inner scan | Add index on join column |
Rows Removed by Filter >> output | Throwaway reads | Index filter column, use partial index |
Buffers: read=X (high) | Cache miss | Warm cache, raise shared_buffers, replica |
Batches: > 1 (Hash Join) | Spill to disk | Raise work_mem, pre-aggregate |
Sort ... Disk: ...kB | work_mem overflow | Raise work_mem, index for sorted output |
| Estimate off >10x | Stale statistics | Run ANALYZE, raise default_statistics_target |
EXPLAIN ANALYZE safely
[PostgreSQL Docs]-- Plan only (no execution, safe on prod).
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- Plan + actual numbers (runs query; use replica).
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
-- Full diagnostics: buffers + settings + JSON.
EXPLAIN (ANALYZE, BUFFERS, SETTINGS, FORMAT JSON)
SELECT * FROM orders WHERE customer_id = 42;Always include BUFFERS on production investigations. It reports shared-buffer cache hits vs disk reads — the single best signal for I/O vs CPU bottleneck.
For writes, wrap in a rolled-back transaction:
BEGIN;
EXPLAIN ANALYZE UPDATE orders SET status='shipped' WHERE id = 42;
ROLLBACK;Six anti-patterns
[Postgres B-tree]SELECT *from wide tables — defeats index-only scans. Project only needed columns.WHERE lower(email) = $1— function on column blocks index. Fix: expression index or stored lowercase.WHERE created_at::date = $1— cast kills index. Use range:WHERE created_at >= $1 AND created_at < $1::date + '1 day'.OFFSET 10000 LIMIT 50— scans 10k rows then discards. Use keyset:WHERE (created_at, id) < ($ts, $id) LIMIT 50.IN (SELECT ...)with large subquery — rewrite asJOINorEXISTSfor better join strategy.COUNT(col)instead ofCOUNT(*)—COUNT(*)counts rows;COUNT(col)counts non-nulls. Never confuse them.
EXPLAIN (FORMAT JSON) for Tooling
For machine-readable output, use EXPLAIN (FORMAT JSON, ANALYZE, BUFFERS). The JSON format feeds plan visualisers, regression-detection scripts, and slow-query review pipelines:
EXPLAIN (FORMAT JSON, ANALYZE, BUFFERS, SETTINGS, WAL)
SELECT order_id, total_cents
FROM orders
WHERE customer_id = 42 AND status = 'pending'
ORDER BY created_at DESC LIMIT 50;Key JSON fields a regression-detection script should check:
Plan.Total Cost— planner's estimated total cost; spikes vs. baseline indicate stats drift or new index neededPlan.Actual Total Time— wall-clock execution time; the SLO indicatorPlan.Plans[].Plan RowsvsActual Rows— divergence over 10× signals stale stats; runANALYZEPlan.Shared Hit BlocksvsShared Read Blocks— cache hit ratio; low hit ratio means warm cache or raiseshared_buffersPlan.Plans[].Node Type— change fromIndex ScantoSeq Scanbetween deploys is a smoke alarm
A simple regression check in CI:
# pseudo-code: load baseline plan + new plan, compare key fields
import json
def regression_score(baseline: dict, current: dict) -> dict:
b_plan = baseline["Plan"]
c_plan = current["Plan"]
return {
"cost_ratio": c_plan["Total Cost"] / b_plan["Total Cost"],
"time_ratio": c_plan["Actual Total Time"] / b_plan["Actual Total Time"],
"node_changed": b_plan["Node Type"] != c_plan["Node Type"],
"estimate_drift": abs(c_plan.get("Plan Rows", 0) - c_plan.get("Actual Rows", 0))
/ max(c_plan.get("Actual Rows", 1), 1),
}Wire this into your nightly slow-query report so plans that regress (cost > 2× baseline, node-type changed, estimate-drift > 10×) page the on-call DBA before they cause a customer incident.
For interactive review, paste JSON output into explain.depesz.com (free, no signup) — colour-codes hot nodes by time and rows-removed-by-filter.
Incident-time queries: paste these into psql at 3am
When the on-call page fires and the dashboard shows latency climbing, EXPLAIN on a single query is rarely the right starting point. You need cluster-wide signals: which queries are burning CPU, which sessions are blocking each other, which tables have rotted, and which indexes are paying their rent. The four queries below are the ones we paste into psql first, and they answer those questions in roughly that order.
Start with the slowest queries by cumulative execution time. total_exec_time (renamed from total_time in PG13+) is the right ranking — mean_exec_time hides queries that are individually fast but called millions of times per minute:
-- Top-10 slowest queries by cumulative execution time.
-- Requires: CREATE EXTENSION pg_stat_statements; in postgresql.conf.
SELECT
substring(query, 1, 80) AS short_query,
calls,
round(total_exec_time::numeric, 1) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 1) AS pct_of_total,
rows
FROM pg_stat_statements
WHERE query NOT ILIKE '%pg_stat_statements%'
ORDER BY total_exec_time DESC
LIMIT 10;If pct_of_total for a single query is over 30%, you have a clear culprit. If the top 10 are all under 5%, the load is broadly distributed and you should look at connection count or shared_buffers instead. Reset the counters after each investigation with SELECT pg_stat_statements_reset(); to get clean before/after numbers when you ship a fix. [PostgreSQL Docs]
Next, find blocking locks. The classic 3am page is "writes are stalled" — almost always one long-running transaction holding RowExclusiveLock on a table that everything else needs. This query reports the blocking PID, the blocked PID, and the SQL of both so you can decide which to kill:
-- Live blocking-lock detector. Shows which session is blocking which.
SELECT
blocked.pid AS blocked_pid,
blocked.usename AS blocked_user,
blocking.pid AS blocking_pid,
blocking.usename AS blocking_user,
blocking.state AS blocking_state,
age(now(), blocking.xact_start) AS blocking_xact_age,
substring(blocked.query, 1, 60) AS blocked_query,
substring(blocking.query, 1, 60) AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE NOT blocked.pid = blocking.pid
ORDER BY blocking_xact_age DESC;pg_blocking_pids() (PG9.6+) is the canonical way to walk the lock graph — it correctly handles transitive blocking through multiple sessions, which a manual pg_locks join does not. If blocking_xact_age is over a minute and blocking_state is idle in transaction, the application forgot to commit; kill it with SELECT pg_terminate_backend(blocking_pid);.
Now check for table bloat. When autovacuum can't keep up with churn, dead tuples accumulate, the heap grows, and sequential scans get slower even though the live row count is unchanged. The pgstattuple extension gives you the truth without estimating:
-- Bloat estimate per table. Requires CREATE EXTENSION pgstattuple;
-- Note: scans the full table, so run on a replica for tables > 100GB.
SELECT
schemaname || '.' || tablename AS table_name,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
round(stat.dead_tuple_percent::numeric, 1) AS dead_pct,
pg_size_pretty(stat.dead_tuple_len) AS dead_bytes,
round(stat.free_percent::numeric, 1) AS free_pct
FROM pg_tables
CROSS JOIN LATERAL pgstattuple(schemaname || '.' || tablename) AS stat
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
AND pg_total_relation_size(schemaname || '.' || tablename) > 100 * 1024 * 1024
ORDER BY stat.dead_tuple_len DESC
LIMIT 20;Anything over 20% dead_pct warrants a VACUUM (VERBOSE, ANALYZE) tablename; and a look at autovacuum tuning (autovacuum_vacuum_scale_factor, autovacuum_naptime). For tables over 100GB, prefer pg_repack over VACUUM FULL — it rewrites the heap without taking an exclusive lock for the duration. [pg_repack]
Finally, find indexes that are pure write tax — indexes that exist, slow down every INSERT/UPDATE, and serve zero reads. Drop them. This is the single highest-ROI cleanup you can do on most aging databases:
-- Indexes that have never been scanned since stats were reset.
-- Excludes UNIQUE constraints (still needed for correctness) and primary keys.
SELECT
schemaname || '.' || relname AS table_name,
indexrelname AS index_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan AS times_used,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes ui
JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE idx_scan = 0
AND NOT i.indisunique
AND NOT i.indisprimary
AND pg_relation_size(ui.indexrelid) > 1024 * 1024
ORDER BY pg_relation_size(ui.indexrelid) DESC
LIMIT 20;Before you DROP INDEX, verify the stats window is wide enough — SELECT stats_reset FROM pg_stat_database WHERE datname = current_database(); should be at least one full business cycle ago (a week minimum, ideally a month). Indexes that look unused on Monday may be hammered during Friday batch jobs. When in doubt, mark them invalid first with UPDATE pg_index SET indisvalid = false WHERE indexrelid = ...::regclass; — this lets you roll back instantly if a query starts seqscanning instead.
auto_explain: capture slow plans before the user complains
Running EXPLAIN ANALYZE after the fact only works when you can reproduce the slow run, and production traffic rarely hands you the same row distribution twice. The auto_explain extension solves this by logging the actual plan of every query that exceeds a duration threshold — so the next time the dashboard query takes 12 seconds, you already have the plan in your log stream, with the real row counts that triggered the regression.
Enable it cluster-wide by editing postgresql.conf and reloading. The configuration below logs the full plan with buffer counters and trigger timing for any statement that exceeds 500 ms, samples 10% of fast queries to keep noise down, and skips nested statements to avoid blowing up the log volume on stored-procedure-heavy workloads: [PostgreSQL Docs]
-- Add to postgresql.conf and run SELECT pg_reload_conf();
-- shared_preload_libraries requires a full restart, the rest reload live.
ALTER SYSTEM SET shared_preload_libraries = 'auto_explain,pg_stat_statements';
ALTER SYSTEM SET auto_explain.log_min_duration = '500ms';
ALTER SYSTEM SET auto_explain.log_analyze = on;
ALTER SYSTEM SET auto_explain.log_buffers = on;
ALTER SYSTEM SET auto_explain.log_timing = on;
ALTER SYSTEM SET auto_explain.log_triggers = on;
ALTER SYSTEM SET auto_explain.log_verbose = off;
ALTER SYSTEM SET auto_explain.log_format = 'json';
ALTER SYSTEM SET auto_explain.log_nested_statements = off;
ALTER SYSTEM SET auto_explain.sample_rate = 1.0;
SELECT pg_reload_conf();The log_format = 'json' setting matters: it lets your log shipper (Loki, Splunk, Datadog) parse the plan as structured fields rather than treating it as a multi-line text blob. With JSON, you can write alerts like "page on-call when any query has Plan.Total Cost over 100000 and Node Type = Seq Scan" without regex-scraping log lines. log_timing = on adds about 5% overhead on a CPU-bound workload — measure first, then decide whether the plan visibility is worth it. On most OLTP systems with auto_explain set to log only queries above 500 ms, the overhead is in the noise. [PostgreSQL Docs]
One caveat: auto_explain.log_analyze = on instruments every query at the executor level, even if the query finishes below the duration threshold and is never logged. The instrumentation cost is the price you pay for being able to see plans at all. If your p99 latency budget is tight, set auto_explain.sample_rate to 0.1 (sample 10% of statements) — you still catch chronic offenders without paying instrumentation cost on every fast query. [PostgreSQL Docs]
pg_hint_plan: emergency override for a planner gone wrong
Sometimes the planner picks the wrong plan and you cannot fix the root cause in the moment. Stats refresh is queued, the index build will take six hours, and the customer is complaining now. The pg_hint_plan extension lets you pin a specific join order, scan type, or join method per query via SQL comments — without changing application code or postgresql.conf. Treat this as a circuit breaker, not a long-term fix. Hints freeze the plan against the planner's future improvements and hide the underlying problem.
The hint syntax is a structured comment immediately before the SELECT. The example below forces a Hash Join between orders and customers, an index scan on orders using idx_orders_customer_status, and a leading-join order so customers is the build side:
-- Force Hash Join with explicit build side and index scan.
-- Remove this hint after stats are refreshed and the planner picks the right plan organically.
/*+
HashJoin(o c)
IndexScan(o idx_orders_customer_status)
Leading((c o))
*/
SELECT o.id, o.total_cents, c.email
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
AND o.created_at >= now() - interval '24 hours'
ORDER BY o.created_at DESC LIMIT 100;After applying the hint, verify it took effect with EXPLAIN (ANALYZE, BUFFERS) — pg_hint_plan silently ignores invalid hints rather than throwing an error, so a typo gives you no signal. If the plan looks unchanged, check SHOW pg_hint_plan.message_level; (set to notice for visible feedback during testing) and SHOW pg_hint_plan.enable_hint; (must be on).
Track every hint you add in a hints_audit table with the issue ticket, expected removal date, and the underlying root cause. Stale hints rot fast — a hint pinned in 2024 against a 100GB table is actively wrong by the time the table is 800GB and the index has been rebuilt twice. Audit quarterly and remove any hint whose root cause has been fixed.
Plan stability tracking with pg_stat_statements over time
pg_stat_statements shows you the current top-N slow queries, but it does not show you whether a specific query has been getting slower over time — and that's the question that matters for capacity planning and slow-rollout regression detection. The pattern is a periodic snapshot job that copies the relevant counters into a history table, then resets the live stats so the next window starts clean.
The schema below stores per-query mean and total time, plus shared-buffer hit rate, with a snapshot timestamp. A nightly cron job inserts the current row, then resets:
-- One-time setup: history table for plan-stability snapshots.
CREATE TABLE IF NOT EXISTS pg_stat_statements_history (
snapshot_at timestamptz NOT NULL DEFAULT now(),
queryid bigint NOT NULL,
query_fingerprint text NOT NULL,
calls bigint NOT NULL,
total_exec_time_ms double precision NOT NULL,
mean_exec_time_ms double precision NOT NULL,
rows_emitted bigint NOT NULL,
cache_hit_ratio double precision NOT NULL,
PRIMARY KEY (snapshot_at, queryid)
);
CREATE INDEX ON pg_stat_statements_history (queryid, snapshot_at DESC);
-- Cron job (runs every hour or daily): snapshot then reset.
INSERT INTO pg_stat_statements_history (
queryid, query_fingerprint, calls,
total_exec_time_ms, mean_exec_time_ms, rows_emitted, cache_hit_ratio
)
SELECT
queryid,
substring(query, 1, 200),
calls,
total_exec_time,
mean_exec_time,
rows,
CASE WHEN (shared_blks_hit + shared_blks_read) = 0 THEN 1.0
ELSE shared_blks_hit::float / (shared_blks_hit + shared_blks_read)
END
FROM pg_stat_statements
WHERE calls > 10;
SELECT pg_stat_statements_reset();With a few weeks of history, regression questions become trivial. "Has this query gotten slower since the v2.4 deploy on April 19th?" is now a single comparison query against pg_stat_statements_history filtered by queryid and snapshot_at. Plot mean execution time over time and you'll see exactly when the cliff happened — usually the day someone added a column without updating an index, or the day a partition crossed the boundary where the planner stopped using a partial index. The snapshot window also gives you a safer baseline for the regression-detection script earlier in this article: instead of comparing against a single hand-picked baseline plan, you can compare today's mean execution time to the rolling 30-day p50 and alert when the divergence exceeds a threshold for at least three consecutive snapshots, which filters out noise from one-off slow runs.
Frequently Asked Questions
EXPLAIN vs EXPLAIN ANALYZE?
EXPLAIN returns a cost estimate without executing the query. EXPLAIN ANALYZE actually runs the query and returns real numbers (row counts, times, buffers). Run ANALYZE on replicas only — never on production writes, since it executes mutations like any other statement.
Estimate way off (e.g., 10k estimated, 100 actual)?
Underestimating is worse than overestimating. Underestimates push the planner toward Nested Loops that explode at runtime; overestimates are conservative. If skew is recurrent, raise default_statistics_target to 500 and re-ANALYZE the affected tables.
What's "Rows Removed by Filter"?
Rows read from the heap but discarded by a WHERE clause that wasn't pushed into the index. High values signal a missing index or a partial-index opportunity for the predicate.
Hash Batches > 1?
The hash table spilled to disk because it didn't fit in work_mem. Either raise work_mem for the session, or reduce the input size with an earlier WHERE / GROUP BY pre-aggregation.
Disable Nested Loop to force Hash Join?
Last resort. Prefer fixing the root cause: add an index on the join column or refresh statistics. Disabled planner features (set enable_nestloop = off) cause future outages when data shape changes and the planner is no longer free to pick the right strategy.
Keep Reading
- Indexes and query planning strategies
- Postgres Query Planner Internals
- Zero-Downtime Database Migrations — when an EXPLAIN points to "this needs a different schema."
- Caching Strategies at Scale — when "the query is fundamentally expensive" pushes you off the database.
- Production Go API Design — context-aware DB calls (
QueryContext) so EXPLAIN-discovered slow queries get cancelled.
Engineering Team
Backend engineers writing production-grade references for Go, Java, and distributed systems.
Read Next
Database Indexing Strategies: B-Trees, GIN, GiST, and Production Tuning
B-tree internals, composite index ordering, GIN for full-text search, partial indexes, and preventing index bloat in production.
PostgreSQL Query Planner Internals: From EXPLAIN to Expert Tuning
How PostgreSQL's query optimizer decides, why it gets it wrong, and how to fix it with statistics targets and covering indexes.
Zero-Downtime Database Migrations at Scale
Schema migrations on billion-row tables without downtime: expand-contract, pg_repack, gh-ost, blue-green migrations, and rollbacks.