Database migrations are among the riskiest operations in production systems. A poorly executed migration on a billion-row table can lock your database for hours and cause a full outage. Let's cover strategies for running migrations safely at any scale.
The expand-contract pattern is the foundation of zero-downtime migrations. Instead of making a breaking change in one step, you expand the schema first, migrate the data, update the application, and then contract.
```mermaid
graph TD
    Starting["Start: App uses v1 Column"]
    subgraph P1 [Phase 1: Expand]
        Add["Add v2 Column (Nullable)"]
    end
    subgraph P2 [Phase 2: Dual Write]
        Write["App Writes to v1 AND v2"]
    end
    subgraph P3 [Phase 3: Backfill]
        Copy["Background Job: Copy v1 -> v2"]
    end
    subgraph P4 [Phase 4: Cutover]
        Read["App Reads from v2"]
    end
    subgraph P5 [Phase 5: Contract]
        Drop["Drop v1 Column"]
    end
    Starting --> Add
    Add --> Write
    Write --> Copy
    Copy --> Read
    Read --> Drop
    style Write fill:#fff3e0,stroke:#ff9800
    style Read fill:#e8f5e9,stroke:#4caf50
```
Add the new column without removing the old one. The new column must be nullable, or have a default the engine can apply without rewriting the table; otherwise the ALTER holds a lock for the duration of a full rewrite.
```sql
-- Safe: a nullable column with no default touches no existing rows, so only
-- a brief metadata change is needed. Existing rows stay NULL, which is what
-- the backfill below keys on. (On PostgreSQL 11+ a constant default would
-- also avoid a rewrite; see the lock table below.)
ALTER TABLE users ADD COLUMN email_verified boolean;
```

Warning: be aware of lock levels!
| Database | Operation | Lock type | Impact |
| --- | --- | --- | --- |
| Postgres 11+ | `ADD COLUMN` (nullable or constant default) | ACCESS EXCLUSIVE (brief) | Safe: no rewrite, lock held only momentarily |
| Postgres <11 | `ADD COLUMN ... DEFAULT val` | ACCESS EXCLUSIVE (held for rewrite) | DANGER: rewrites and locks the table |
| MySQL 8.0 | `ADD COLUMN` | Metadata lock | Generally safe (INSTANT algorithm) |
| MySQL 5.7 | `ADD COLUMN` | Table copy (ALGORITHM=COPY) | Slow: rebuilds the table |
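Even "safe" DDL briefly needs an exclusive lock, and the ALTER will queue behind any long-running transaction, blocking every query that arrives after it. A common guard is a lock timeout so the DDL fails fast and can be retried. Here is a minimal sketch, assuming a psycopg-style connection object (the function name and timeout value are illustrative, not from this article):

```python
def add_column_safely(conn):
    """Run the expand-phase DDL guarded by a lock timeout."""
    with conn.cursor() as cur:
        # Fail fast instead of queueing behind long transactions; a DDL
        # statement waiting on a lock blocks all later queries too.
        cur.execute("SET lock_timeout = '5s'")
        cur.execute("ALTER TABLE users ADD COLUMN email_verified boolean")
    conn.commit()
```

If the statement times out, retry it later rather than letting it sit in the lock queue.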
Backfill existing rows in batches to avoid overwhelming the database:
```python
import time

def backfill_email_verified(db, batch_size=1000):
    # Walk the id space in fixed windows. Computing the upper bound once
    # means a window with no matching rows (id gaps, already-backfilled
    # ranges) cannot end the loop early.
    max_id = db.execute("SELECT COALESCE(MAX(id), 0) FROM users").fetchone()[0]
    last_id = 0
    while last_id < max_id:
        db.execute("""
            UPDATE users u
            SET email_verified = EXISTS(
                SELECT 1 FROM email_verifications ev
                WHERE ev.user_id = u.id AND ev.verified_at IS NOT NULL
            )
            WHERE u.id > %s AND u.id <= %s
              AND u.email_verified IS NULL
        """, (last_id, last_id + batch_size))
        last_id += batch_size
        time.sleep(0.1)  # Throttle each batch to reduce replication lag
```

Update your application to write to both old and new locations:
```python
class UserService:
    def verify_email(self, user_id: str):
        # New location (v2): the column added in the expand phase.
        self.db.execute(
            "UPDATE users SET email_verified = true WHERE id = %s",
            (user_id,),
        )
        # Old location (v1): keep writing until the contract phase, so a
        # rollback to the old read path never sees stale data. Ideally both
        # statements run in a single transaction.
        self.db.execute(
            "INSERT INTO email_verifications (user_id, verified_at) "
            "VALUES (%s, NOW()) ON CONFLICT (user_id) DO UPDATE SET verified_at = NOW()",
            (user_id,),
        )
```

Switch reads to the new column and monitor for discrepancies.
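Before cutting reads over completely, it helps to quantify drift between the two locations. A minimal consistency check, assuming the same `db` handle as the backfill code (the query is specific to this running example):

```python
def count_discrepancies(db):
    # Count rows where the new column disagrees with the old source of
    # truth. IS DISTINCT FROM treats NULL as a mismatch, so rows the
    # backfill missed are counted too.
    row = db.execute("""
        SELECT count(*)
        FROM users u
        WHERE u.email_verified IS DISTINCT FROM EXISTS(
            SELECT 1 FROM email_verifications ev
            WHERE ev.user_id = u.id AND ev.verified_at IS NOT NULL
        )
    """).fetchone()
    return row[0]
```

Cut over only when this count is zero and stays zero under live dual writes.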
Remove the old column or table after the migration is verified.
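In PostgreSQL, `DROP COLUMN` still takes a brief ACCESS EXCLUSIVE lock, so the same lock-timeout guard applies. A sketch of a generic contract step, where `old_status` is a hypothetical stand-in for whichever v1 artifact you are retiring:

```python
def contract(conn):
    with conn.cursor() as cur:
        cur.execute("SET lock_timeout = '5s'")  # fail fast rather than queue
        # old_status is hypothetical; substitute the real v1 column or table.
        # Run this only after reads have been on v2 long enough to trust it.
        cur.execute("ALTER TABLE users DROP COLUMN old_status")
    conn.commit()
```

Deploy the application change that stops referencing the old column at least one release before dropping it, so a rollback never reintroduces reads of a missing column.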
For MySQL, gh-ost creates a ghost table with the desired schema, copies data in the background, and atomically swaps the tables:
```bash
gh-ost \
  --host=primary.db.internal \
  --database=production \
  --table=orders \
  --alter="ADD COLUMN shipping_method VARCHAR(50) DEFAULT 'standard'" \
  --chunk-size=1000 \
  --max-load="Threads_running=50" \
  --critical-load="Threads_running=200" \
  --execute
```

While it runs, monitor replication lag (keep it under 5 s), lock wait time (it should not increase), query latency (it should stay stable), and disk I/O (within capacity). Set alerts that automatically pause the migration if any threshold is exceeded.
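gh-ost can be paused from the outside: when launched with `--throttle-flag-file` (not shown in the command above), it stops copying rows while that file exists. A minimal watchdog sketch built on that mechanism; the `fetch_replica_lag` helper and the threshold are assumptions to adapt to your own monitoring:

```python
import os
import time

THROTTLE_FILE = "/tmp/gh-ost.throttle"  # pass to gh-ost via --throttle-flag-file
MAX_LAG_SECONDS = 5

def replica_lag_seconds(db):
    # Hypothetical helper: pull replication lag from your replica or your
    # monitoring system (e.g. Seconds_Behind_Source on MySQL 8.0).
    return db.fetch_replica_lag()

def watchdog(db):
    while True:
        if replica_lag_seconds(db) > MAX_LAG_SECONDS:
            open(THROTTLE_FILE, "a").close()  # gh-ost pauses while this file exists
        elif os.path.exists(THROTTLE_FILE):
            os.remove(THROTTLE_FILE)          # lag recovered: let gh-ost resume
        time.sleep(10)
```

Note that gh-ost already throttles itself on its own `--max-lag-millis` heartbeat check; an external watchdog like this is for signals gh-ost cannot see, such as application latency or disk I/O.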
Zero-downtime migrations require discipline and planning, but they are achievable at any scale. The expand-contract pattern provides a safe framework, and proper monitoring ensures you catch problems before they impact users. Never rush a migration.