#database-internals #migrations #postgresql #mysql #devops

Zero-Downtime Database Migrations at Scale

BackendBytes Engineering Team

Dec 22, 2024

15 min read

Zero-Downtime Database Migrations at Scale

Part of Series: PostgreSQL Deep Dive

Lesson 4 of 4

→ALTER TABLE on 900M rows triggered 26 minutes of rewriting + 47 minutes of replication lag; replicas served stale fraud detection data the entire time
→Expand-contract pattern splits breaking schema changes into safe phases: expand (add new column/table), deploy new code, contract (drop old) — each phase is independently deployable and rollback-safe
→pg_repack and gh-ost avoid AccessExclusive locks by creating shadow copies; for PostgreSQL 11+ with nullable columns, ADD COLUMN is nearly free (no table rewrite); with defaults/NOT NULL, use expand-contract
→Backfill large tables in batches with replication-lag throttling; never run one UPDATE on billions of rows — it creates massive transactions, locks, and spikes disk I/O

The ALTER TABLE reported "complete." The replicas hadn't — lag climbed into the tens of minutes, and every consumer reading from a replica saw stale data. A team adds a column to a transactions table with hundreds of millions of rows. The DBA runs the standard ALTER TABLE; it runs for many minutes, then replica monitoring lights up. We've debugged this exact incident on multiple Postgres deployments.

The Replication-Lag Cascade

A single ALTER TABLE that looks harmless is the trigger:

ALTER TABLE transactions
    ADD COLUMN idempotency_key VARCHAR(64) NOT NULL DEFAULT gen_random_uuid()::text;

The root cause: a column added with a volatile default like gen_random_uuid() forces PostgreSQL^{[PostgreSQL Docs]} to rewrite the entire table — every existing row gets a distinct value, so the value can't be stored once in the catalog the way a constant default can. (On PostgreSQL 11+, a nullable ADD COLUMN with no default, or with a constant default, is metadata-only and near-instant — see the quick-start table below. The trap is the NOT NULL DEFAULT <volatile expr> combination, which silently triggers the full rewrite.) Every write during that rewrite has to be replicated afterward, so at sustained write rates hundreds of thousands of writes queue in the replication log by the time the migration finishes.

The migration itself isn't the problem. The cascade effects — replication lag, lock contention, schema coupling — are where teams get into trouble.

TL;DR

Database migrations on billion-row tables can cascade into replication lag, lock contention, and application downtime. Use the expand-contract pattern to split breaking schema changes into safe, independently deployable phases. For operations requiring table rewrites, use pg_repack (PostgreSQL)^{[PostgreSQL Docs]} or gh-ost (MySQL) to avoid table locks entirely.

Expand-contract: add new column + dual-write → deploy new code → drop old column
Backfill large tables in batches with replication-lag throttling
Use CREATE INDEX CONCURRENTLY and online DDL tools to avoid AccessExclusive locks

The Quick-Start: Know Your Lock Level

Before running any DDL, check what lock it acquires. AccessExclusive blocks all reads and writes — on large tables, "brief" means minutes.

Operation	PostgreSQL	MySQL 8	Blocks?
`ADD COLUMN (nullable)`	AccessExclusive (brief)	Instant	No — safe (metadata-only)
`ADD COLUMN (with DEFAULT)`	Pg 11+: Safe	Instant	No — safe on modern versions
`ADD NOT NULL COLUMN`	AccessExclusive	Copy	Yes — full lock
`DROP COLUMN`	AccessExclusive	Instant	Pg: Yes
`CREATE INDEX CONCURRENTLY`	ShareUpdateExclusive	Online	No — safe
`ALTER COLUMN TYPE`	AccessExclusive	Copy	Yes — full rewrite
`ADD FOREIGN KEY`	ShareRowExclusive	Instant	Yes — validates all rows

For operations that require a lock: use pg_repack (PostgreSQL) or gh-ost (MySQL) to avoid table-level locks entirely.

The Expand-Contract Pattern

^{[PostgreSQL Docs]}

The expand-contract pattern (also called "parallel change") splits a breaking schema change into safe, independently deployable phases. Each phase is independently rollback-able, so if Deploy 2 has a bug, you roll back to Deploy 1 without any schema changes — no data loss, no recovery scripts.

graph LR
    subgraph Phase1 ["Phase 1: Expand"]
        E1["ADD COLUMN new_col<br/>(nullable, no lock)"]
    end
    subgraph Phase2 ["Phase 2: Transition"]
        E2["Deploy dual-write code"] --> E3["Backfill old → new<br/>(batched)"]
    end
    subgraph Phase3 ["Phase 3: Contract"]
        E4["Drop old column<br/>(pg_repack / gh-ost)"]
    end

    Phase1 -->|"Deploy 1<br/>rollback-safe"| Phase2
    Phase2 -->|"Deploy 2<br/>rollback-safe"| Phase3

Phase 1 (Expand): Add the new column or table without removing the old one. PostgreSQL's ADD COLUMN (nullable) is nearly free in Pg 11+ — it only stores the definition in the catalog, not in every row. This phase is fully backward-compatible: old code reads/writes the old column, new code is not yet deployed.

Phase 2 (Transition): Deploy application code that writes to both columns and reads from the new one. The old column is still populated by dual-write, so any old code instances that get redeployed by mistake still work. This phase is where you backfill existing rows into the new column (see the batched backfill pattern below).

Phase 3 (Contract): Once all application instances are running new code and no code reads the old column anymore, drop the old column. Use pg_repack or gh-ost to avoid table locks during this step.

Backfill Large Tables Safely

^{[PostgreSQL Docs]}

Never run a single UPDATE on billion-row tables — it creates a massive transaction that locks and spikes disk I/O. The throttled-batch pattern is the entire backfill discipline in one diagram:

graph TD
    Start[Start backfill] --> NextBatch[Read next 1000-10000<br/>rows by primary key]
    NextBatch --> Empty{Any rows<br/>returned?}
    Empty -->|No| Done[Backfill complete]
    Empty -->|Yes| Update[UPDATE chunk<br/>in single TX]
    Update --> CheckLag[Query replication lag:<br/>SELECT pg_replication_lag]
    CheckLag --> Lag{Lag over<br/>threshold?}
    Lag -->|Yes — paused| Wait[Sleep + recheck<br/>every 5 seconds]
    Wait --> CheckLag
    Lag -->|No — within budget| Throttle[Sleep N ms<br/>between batches]
    Throttle --> NextBatch
    Update -->|Failure or context cancelled| Abort[Abort + log<br/>last completed ID<br/>so resume is idempotent]
    style Done fill:#dfd
    style Wait fill:#ffd
    style Abort fill:#fdd

Batch in smaller chunks with replication-lag monitoring:

type BackfillConfig struct {
	BatchSize         int           // rows per batch (tune: 1,000–10,000)
	SleepBetweenBatch time.Duration // throttle between batches
	MaxReplicationLag  time.Duration // pause if lag exceeds this
}
 
func Backfill(ctx context.Context, pool *pgxpool.Pool, cfg BackfillConfig) error {
	log := slog.Default()
	var lastID int64 = 0
 
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}
 
		// Check replication lag before each batch
		var lagSeconds float64
		if err := pool.QueryRow(ctx, `
			SELECT COALESCE(MAX(EXTRACT(EPOCH FROM replay_lag)), 0)
			FROM pg_stat_replication
		`).Scan(&lagSeconds); err != nil {
			log.Error("replication lag check failed", "error", err)
		}
		lag := time.Duration(lagSeconds * float64(time.Second))
		if lag > cfg.MaxReplicationLag {
			time.Sleep(10 * time.Second)
			continue
		}
 
		// Batch update: CTE to get max ID and count in one round-trip
		var maxID int64
		var batchCount int64
		err := pool.QueryRow(ctx, `
			WITH batch AS (
				SELECT id FROM transactions
				WHERE id > $1 AND idempotency_key IS NULL
				ORDER BY id LIMIT $2
			),
			updated AS (
				UPDATE transactions t
				SET idempotency_key = gen_random_uuid()::text
				FROM batch WHERE t.id = batch.id
				RETURNING t.id
			)
			SELECT COALESCE(MAX(id), 0), COUNT(*) FROM updated
		`, lastID, cfg.BatchSize).Scan(&maxID, &batchCount)
 
		if err != nil {
			return fmt.Errorf("batch update: %w", err)
		}
 
		if maxID == 0 {
			log.Info("backfill complete")
			return nil
		}
 
		lastID = maxID
		log.Info("batch complete", "last_id", lastID, "count", batchCount)
		time.Sleep(cfg.SleepBetweenBatch)
	}
}

During the transition, your application writes to both columns, ensuring both are populated regardless of which code version runs:

func (r *TransactionRepo) Insert(ctx context.Context, tx Transaction) error {
	_, err := r.db.Exec(ctx, `
		INSERT INTO transactions (amount, idempotency_key, legacy_key)
		VALUES ($1, $2, $3)
	`, tx.Amount, tx.IdempotencyKey, tx.IdempotencyKey) // write to both
	return err
}
 
// Feature flag controls which column we read from
func (r *TransactionRepo) Find(ctx context.Context, key string) (*Transaction, error) {
	if r.flags.UseNewColumn() {
		return r.findNewColumn(ctx, key)
	}
	return r.findLegacyColumn(ctx, key)
}

Online DDL Tools: pg_repack and gh-ost

For operations requiring full table rewrites (changing column types, adding NOT NULL constraints), use an online tool that copies the table in the background while keeping writes available. These tools avoid the AccessExclusive locks that would block all reads and writes.

PostgreSQL: pg_repack^[pg_repack] creates a new table with the desired schema, copies all existing rows, captures changes that happen during the copy via triggers, and atomically swaps the tables at the end. It requires superuser access but is non-invasive if you have custom trigger logic (test on a replica first).

# Install: apt install postgresql-<version>-repack
 
# Rewrite a table (equivalent to ALTER TABLE ... ADD COLUMN with NOT NULL)
pg_repack --host primary.db --dbname production --table transactions --no-order
 
# Or rebuild an index online
pg_repack --host primary.db --dbname production --table transactions \
  --index idx_transactions_idempotency_key

MySQL: gh-ost^[gh-ost] — GitHub's online schema tool. Creates a ghost table, copies data in background tailing the binary log, then atomically swaps. It's the gold standard for online MySQL schema changes because MySQL's native ALTER TABLE acquires table-level locks.

gh-ost --host primary.db --database production --table orders \
  --alter="ADD COLUMN idempotency_key VARCHAR(64)" \
  --chunk-size=2000 --max-lag-millis=1500 --execute
 
# Pause/resume a running migration
echo "throttle" | nc -U /tmp/gh-ost.orders.sock
echo "no-throttle" | nc -U /tmp/gh-ost.orders.sock

For indexes, PostgreSQL's CREATE INDEX CONCURRENTLY is the built-in equivalent — it takes a ShareUpdateExclusive lock instead of blocking reads/writes, though it's slower (2-3x) than a regular blocking index build: ^{[Postgres B-tree]}

-- Reads/writes continue (slower than blocking index build, but zero downtime)
CREATE INDEX CONCURRENTLY idx_transactions_idempotency_key
    ON transactions (idempotency_key)
    WHERE idempotency_key IS NOT NULL;

Foreign keys also block for validation. Add as NOT VALID first, then validate separately during low-traffic hours:

ALTER TABLE orders ADD CONSTRAINT fk_orders_customers
    FOREIGN KEY (customer_id) REFERENCES customers(id) NOT VALID;
 
-- Validate takes ShareUpdateExclusive lock (reads/writes continue)
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customers;

Migration Best Practices

Set lock_timeout before running DDL — if you can't acquire the lock in 5 seconds, fail fast instead of blocking the entire connection pool:

SET lock_timeout = '5s';
SET statement_timeout = '30min';
ALTER TABLE transactions ADD COLUMN idempotency_key VARCHAR(64);

If a lock_timeout fires, retry during lower-traffic windows. This is vastly preferable to indefinitely blocking the connection pool, which causes application 503s.

Check for blocking transactions before migrations:

SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '30 seconds'
ORDER BY duration DESC;

A long-running transaction that's idle in a transaction is the #1 blocker for DDL. Identify the application owner, confirm it's safe to terminate, and kill it with pg_terminate_backend(pid) if necessary. Long-abandoned transactions cost migrations dearly.

Monitor replication lag during and after migrations. Alert if it exceeds 30 seconds:

SELECT client_addr, replay_lag
FROM pg_stat_replication
ORDER BY replay_lag DESC;

Replication lag during a migration is your canary. If lag is climbing, your migration is generating writes faster than replicas can keep up. The migration in the opening story generated 208,000 queued writes — a telltale sign that the expansion phase of your migration needs throttling.

Use a dark-launch approach during the transition phase: write to both columns but read from the old one. Shadow-read from the new column to detect bugs before users hit them:

oldResult := r.readOld(ctx, id)
newResult, _ := r.readNew(ctx, id)  // non-blocking error
if !reflect.DeepEqual(oldResult, newResult) {
	log.Error("mismatch", "old", oldResult, "new", newResult)
}
return oldResult  // always return old during dark launch

This catches data inconsistencies that would otherwise ship to production undetected.

For blue-green deployments: expand-contract maps directly. Add new columns (expand), deploy code that dual-writes (transition), then drop old columns (contract) only after confirming blue version is fully drained. Check the load balancer metrics — the blue version must have zero in-flight requests for at least 15 minutes before the contract phase runs.

Column Drops Kill Old App Versions

Dropping a column while the old code is still running causes immediate 500 errors. The blue code's SELECT status FROM orders fails when status no longer exists. Always verify zero active connections to the old version before running the contract phase. A 15-minute soak period after traffic cutover catches long-running requests that started on the old version.

Index builds on replicas: if you run CREATE INDEX CONCURRENTLY on the primary, the replicas will replay it as a blocking CREATE INDEX. Build the index on each replica first, then on the primary — PostgreSQL will skip the build when it replicates. Managed PostgreSQL services (RDS, Cloud SQL) typically don't allow direct DDL on read replicas, so schedule primary index builds during low-traffic windows and monitor replica lag closely.

Production Checklist

Before running any migration on a production table > 1M rows:

Determine the exact lock level (consult the quick-start table above)
Test on a production-sized replica with real load patterns
Estimate duration with EXPLAIN ANALYZE on staging
If > 5 minutes or AccessExclusive lock: use pg_repack/gh-ost
Set lock_timeout and statement_timeout on the migration connection
Have a rollback script ready (DROP COLUMN, reverse migration)
Schedule during low-traffic windows with stakeholder approval
Don't run during deployments (code + schema changes = chaos)
Monitor replication lag (target: < 30 seconds during migration)
Run ANALYZE on the table post-migration to update planner statistics
Keep rollback ready for 48 hours post-migration

Feature-Flag-Driven Migrations

Schema migrations and code deploys are two independent change vectors. Coupling them — "this PR ships the new column AND the new read path" — turns rollbacks into branch-revert exercises that take 20 minutes when the incident clock is already burning. Feature flags decouple them. The schema lives ahead of the code, dual-writes happen behind a flag, and the read switch is a runtime toggle, not a deploy.

The pattern: every step that changes user-visible behaviour gets a kill switch. The flag service (LaunchDarkly, Unleash, or a homegrown table in Postgres) returns a boolean per request. The application reads from the new column only when the flag returns true for that request's percentage rollout cohort. If the new path explodes — wrong index, bad cardinality, missing constraint — flip the flag back to zero. No deploy, no rollback PR, no waiting for CI. The schema is unchanged; the read just routes back to the legacy column that's still being dual-written.

-- Feature flag table read by the application on each request
CREATE TABLE IF NOT EXISTS feature_flags (
    name              TEXT PRIMARY KEY,
    rollout_percent   SMALLINT NOT NULL DEFAULT 0
        CHECK (rollout_percent BETWEEN 0 AND 100),
    enabled_for_users BIGINT[] NOT NULL DEFAULT '{}',
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);
 
-- Start at 0% — schema is live, code is dual-writing, no reads use it
INSERT INTO feature_flags (name, rollout_percent)
VALUES ('read_idempotency_key_column', 0)
ON CONFLICT (name) DO NOTHING;
 
-- Bump to 1% canary, then 10%, 50%, 100% as p99 latency holds steady
UPDATE feature_flags
SET rollout_percent = 10, updated_at = now()
WHERE name = 'read_idempotency_key_column';

The dual-write phase runs at zero rollout for 24-48 hours before the first canary. That window catches drift before any user reads the new column: a nightly diff job samples 100,000 rows, compares legacy and new values, and pages if mismatches exceed 0.01%. Catching a dual-write bug at 0% rollout is a Slack message; catching it at 50% rollout is an incident. ^{[Beyer et al., 2016]}

Backfill Backpressure with Chunked UPDATE

The Go backfill loop earlier handles batching, but production-grade backfills need backpressure that responds to more than just replication lag. Hot-key contention, vacuum I/O, and connection pool saturation all degrade application latency in ways replication lag misses. The chunked UPDATE pattern checks four signals between batches and adapts batch size dynamically.

-- Adaptive backfill driver: shrink chunks when latency rises, grow when stable
DO $$
DECLARE
    chunk_size       INT := 5000;
    last_id          BIGINT := 0;
    rows_updated     INT;
    p99_ms           NUMERIC;
    active_queries   INT;
BEGIN
    LOOP
        -- Backpressure signal 1: long-running queries blocking the table
        SELECT count(*) INTO active_queries
        FROM pg_stat_activity
        WHERE state = 'active'
          AND query_start < now() - interval '5 seconds'
          AND query ILIKE '%transactions%';
 
        IF active_queries > 3 THEN
            PERFORM pg_sleep(5);
            chunk_size := GREATEST(chunk_size / 2, 500);
            CONTINUE;
        END IF;
 
        -- Backpressure signal 2: read p99 from pg_stat_statements
        SELECT COALESCE(MAX(mean_exec_time), 0) INTO p99_ms
        FROM pg_stat_statements
        WHERE query LIKE 'SELECT%FROM transactions%'
          AND calls > 100;
 
        IF p99_ms > 50 THEN
            chunk_size := GREATEST(chunk_size / 2, 500);
        ELSIF p99_ms < 10 THEN
            chunk_size := LEAST(chunk_size * 2, 20000);
        END IF;
 
        -- The actual chunk: bounded by primary key range, not LIMIT-OFFSET
        WITH batch AS (
            SELECT id FROM transactions
            WHERE id > last_id AND idempotency_key IS NULL
            ORDER BY id
            LIMIT chunk_size
            FOR UPDATE SKIP LOCKED
        )
        UPDATE transactions t
        SET idempotency_key = encode(gen_random_bytes(16), 'hex')
        FROM batch
        WHERE t.id = batch.id;
 
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;
 
        SELECT COALESCE(MAX(id), last_id) INTO last_id
        FROM transactions
        WHERE id > last_id AND idempotency_key IS NOT NULL
        ORDER BY id DESC
        LIMIT 1;
 
        PERFORM pg_sleep(0.05);
    END LOOP;
END $$;

FOR UPDATE SKIP LOCKED is the critical primitive. Without it, the backfill blocks on rows currently being written by application traffic and your batches stall. With it, the backfill steps over contended rows and circles back to them on the next pass — application writers never see the backfill in their wait events.

The adaptive chunk sizing is more important than the absolute chunk number. A backfill that runs at 5,000 rows during a quiet 3am hour and 500 rows during a Black Friday checkout spike completes in roughly the same wall-clock time as a fixed-size backfill but with bounded p99 impact. The fixed-size version either finishes too slowly off-peak or causes incidents on-peak.

The Reversal Playbook

The migration is half done. Forty percent of the rows have new values, sixty percent are null, the dual-write code is deployed, and the canary is showing a 4x latency regression on the new index. You need to roll back without losing the partial progress and without leaving the schema in a state that confuses the next deploy. This is the scenario most migration guides skip. ^{[Postgres B-tree]}

The reversal rules: never DROP COLUMN during a partial backfill — the column may be referenced by trigger functions, materialised views, or pg_dump jobs that will fail loudly at 4am.^{[PostgreSQL Docs]} Never reset the backfill to NULL and start over — you've already paid the I/O cost; preserve it. Always pause first, diagnose second, decide third.

-- Step 1: pause the backfill but preserve partial progress
UPDATE feature_flags
SET rollout_percent = 0, updated_at = now()
WHERE name = 'read_idempotency_key_column';
 
-- Step 2: kill any in-flight backfill statements (preserves committed batches)
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE query ILIKE '%UPDATE transactions%idempotency_key%'
  AND state = 'active'
  AND application_name = 'backfill_worker';
 
-- Step 3: snapshot the partial state for forensics
CREATE TABLE migration_snapshot_2026_05_01 AS
SELECT id, idempotency_key, created_at
FROM transactions
WHERE idempotency_key IS NOT NULL;
 
-- Step 4: confirm dual-write is still populating new rows correctly
SELECT count(*) FILTER (WHERE idempotency_key IS NULL) AS unfilled,
       count(*) FILTER (WHERE idempotency_key IS NOT NULL) AS filled,
       max(created_at) FILTER (WHERE idempotency_key IS NULL) AS last_unfilled
FROM transactions
WHERE created_at > now() - interval '1 hour';

If last_unfilled is older than your deploy completion timestamp, dual-write is healthy and the rollback is clean: just leave the partial backfill in place, fix the underlying issue (usually a missing index or a wrong default), and resume. If last_unfilled is recent, the dual-write code path itself is broken — a far more serious incident that requires reverting the application deploy before touching the schema.

The decision tree: broken read path -> flip flag, leave schema alone, fix forward. Broken write path -> flip flag, revert app deploy, leave schema alone. Broken schema (wrong type, wrong nullability) -> flip flag, revert app deploy, then add a new column with the correct definition rather than mutating the broken one. PostgreSQL's catalog accumulates dead columns cheaply; mutating a column that's already been written to a million times is the expensive operation. The discipline of "always add, never mutate" turns half-completed migrations into recoverable states instead of all-night incidents.

Frequently Asked Questions

What is the expand-contract pattern for database migrations?

Expand-contract splits a breaking schema change into safe phases: expand (add the new column/table alongside the old one, dual-write to both), deploy new code that reads from the new location, then contract (drop the old column once nothing reads it). Each phase is independently deployable and rollback-safe.

How do you add a column to a large table without locking it?

In PostgreSQL 11+, ADD COLUMN with a nullable column and no default takes only a brief ACCESS EXCLUSIVE lock (metadata-only). For columns with defaults or NOT NULL constraints, use the expand-contract pattern: add a nullable column first, backfill in batches, then add the constraint. Use gh-ost for MySQL to avoid table-level locks entirely.

What is gh-ost and when should you use it?

gh-ost is GitHub's online schema migration tool for MySQL. It creates a shadow copy of the table, applies the schema change to the copy, streams binlog changes to keep it in sync, then atomically swaps the tables. Use it for any DDL on large MySQL tables to avoid the table locks that native ALTER TABLE requires.

How do you monitor database migrations in production?

Watch replication lag (a long-running DDL can cause replicas to fall behind by minutes), lock wait times (pg_stat_activity for blocked queries), query latency percentiles, and error rates. Set alerts on replication lag exceeding 30 seconds and have a rollback plan ready for 24-48 hours post-migration.

Keep Reading

Database Indexing Strategies — B-tree internals, composite index design, and bloat monitoring for newly-created indexes
PostgreSQL Query Planner Internals — Read EXPLAIN output and verify the planner uses your new indexes after migration
Graceful Shutdown in Production Go Services — Drain in-flight requests during rolling deploys that accompany database migrations
Postgres EXPLAIN Cheat Sheet — Verify post-migration plans against pre-migration baseline; catch regressions before they hit production
SRE: SLOs, SLIs, and Error Budgets — Multi-window burn-rate alerting watches the migration's error budget consumption in real time

Was this article helpful?

Your feedback directly shapes our editorial depth and technical accuracy.