DNS Records: The Complete Production Guide for Backend Engineers
Key Takeaways
- Facebook disappeared for almost six hours when a BGP script withdrew routes to their DNS servers; third-party revenue impact estimates ranged into the tens of millions. DNS is critical infrastructure that most teams treat as a checkbox
- TTL controls propagation speed: 300s for services that might fail over (enables roughly 5-minute recovery), 3600s for stable records (up to 12x fewer authoritative queries than at 300s), 60s temporarily during migrations
- Never CNAME the apex domain (example.com); use your provider's ALIAS/ANAME record instead. Without one, you fall back to A records and lose CDN integration flexibility
- CAA records lock down certificate issuance: without them, any CA in the world can issue certificates for your domain (a real attack vector). SPF/DKIM/DMARC secure email, and SPF is limited to 10 DNS lookups per evaluation
On October 4, 2021, Facebook disappeared from the internet for nearly six hours. A routine BGP maintenance script withdrew the routes to their DNS nameservers[Meta 2021-10-04 outage]. Engineers couldn't diagnose the problem because the remote management tools also depended on DNS. The fix required physical access to the Santa Clara data center. Three billion users lost access to Facebook, Instagram, and WhatsApp. The cascade is well-documented in Cloudflare's network-side analysis of the BGP and DNS withdrawal. And it started with DNS[RFC 1035].
DNS[RFC 1035] resolves domains to IPs through a hierarchical cache. Every record type — A, CNAME, MX, CAA, SRV, NS — has a specific production use. TTLs control propagation speed; choose 300s for failover-critical services, 3600s for stable records. Secure DNS with CAA records, authenticate email with SPF/DKIM/DMARC, and always manage DNS as code.
- A/AAAA route traffic; CNAME aliases; CAA locks down certificates
- TTL 300s for services, 3600s for stable records, 60s before migrations
- SPF declares senders, DKIM signs messages, DMARC enforces policy
- Manage all records in Terraform — never console-click production DNS
The DNS resolution path
What actually happens when a client looks up api.example.com — the cache hierarchy, then the authoritative chain:
graph LR
Client[Client app] --> Stub[Stub resolver<br/>OS / libc]
Stub -->|cache hit| Done1[Return IP]
Stub -->|cache miss| Recurse[Recursive resolver<br/>1.1.1.1, 8.8.8.8, ISP]
Recurse -->|cache hit| Done1
Recurse -->|cache miss| Root[Root nameservers<br/>13 anycasted]
Root -->|.com NS| TLD[TLD nameservers<br/>Verisign for .com]
TLD -->|example.com NS| Auth[Authoritative nameserver<br/>your DNS provider]
Auth -->|A record| Recurse
Recurse -->|cache + return| Stub
Stub -->|cache + return| Client
Auth -.->|TTL governs<br/>cache lifetime| Recurse
style Done1 fill:#dfd
style Auth fill:#ffd
Most lookups hit the recursive resolver's cache and never reach root. New domains and TTL-expired entries walk the full chain — adding 50-200 ms to first-byte latency. Set TTLs to balance freshness (low TTL = faster failover) against load on your authoritative servers (high TTL = fewer queries).
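To make the caching behavior concrete, here is a minimal sketch of a TTL-bound cache of the kind a stub or recursive resolver keeps. The class name `TtlCache` and the injectable `clock` parameter are illustrative assumptions, not part of any real resolver's API.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Toy resolver cache: an answer is served until its TTL has elapsed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # name -> (ip, absolute expiry time in clock units)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def put(self, name: str, ip: str, ttl_seconds: int) -> None:
        """Store an answer received from upstream together with its TTL."""
        self._entries[name] = (ip, self._clock() + ttl_seconds)

    def get(self, name: str) -> Optional[str]:
        """Return the cached IP, or None when the caller must go upstream."""
        entry = self._entries.get(name)
        if entry is None:
            return None  # never seen: cache miss
        ip, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # TTL elapsed: treat as a miss
            return None
        return ip
```

With a 300s TTL, a lookup at second 299 is still answered from the cache; at second 300 the entry is gone and the next query walks upstream again.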
The quick start: Record types by purpose
| Record Type | Purpose | TTL | Notes |
|---|---|---|---|
| A | Domain → IPv4 | 300s | Multiple A records for round-robin. Failover needs low TTL. |
| AAAA | Domain → IPv6 | 300s | Required for IPv6-only clients. Publish both A and AAAA. |
| CNAME | Domain alias | 3600s | Never use at apex (example.com); use ALIAS/ANAME instead. |
| MX | Mail routing | 3600s | Priority values; lower = preferred. Always 2+ for redundancy. |
| TXT | Text records | 3600s | SPF, DKIM, DMARC, ACME validation — arbitrary text. |
| CAA | Cert authority | 3600s | Locks down TLS issuance. No CAA = any CA can issue certs. |
| SRV | Service discovery | 300s | Port + priority. Used by SIP, XMPP, K8s external DNS. |
| NS | Nameservers | 86400s | Zone delegation. Set at registrar; rarely change. |
| DNSSEC | Signing | variable | RRSIG, DNSKEY, DS. Optional unless regulated. |
A and AAAA: Domain-to-IP Mapping
An A record maps a domain to an IPv4 address; AAAA maps to IPv6. Multiple A records for the same domain enable round-robin load distribution:
api.example.com. 300 IN A 203.0.113.10
api.example.com. 300 IN A 203.0.113.11
api.example.com. 300 IN AAAA 2001:db8::1
DNS round-robin distributes connections without health checking. If one IP fails, clients still hit it until TTL expires. Use it for coarse distribution across redundant load balancers, never as a replacement for proper load balancing.
Major cloud providers, mobile carriers, and ISPs increasingly route over IPv6. IPv6-only clients must rely on NAT64/DNS64 translation if you omit AAAA records, adding latency and a failure point. Publish both A and AAAA for any public-facing service.
CNAME: Aliases and the Apex Restriction
A CNAME record[RFC 1035] aliases one domain to another, which is essential for CDN integration and PaaS hosting:
blog.example.com. 3600 IN CNAME d1234abcd.cloudfront.net.
staging.example.com. 300 IN CNAME example-app.fly.dev.
Critical constraint: A CNAME cannot coexist with other record types. Since the apex domain (example.com) requires SOA and NS records, you cannot CNAME at apex. Use your provider's ALIAS/ANAME record (Cloudflare CNAME flattening, Route 53 ALIAS, DNSimple ALIAS, NS1 linked record) or fall back to an A record pointing to a stable IP.
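The coexistence rule is easy to lint before a zone change ships. The sketch below assumes a zone is available as a list of (owner name, record type) pairs; the function name `cname_conflicts` is hypothetical, and it deliberately ignores the DNSSEC record types that are allowed to sit alongside a CNAME.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Tuple[str, str]  # (owner name, record type), e.g. ("example.com.", "A")


def cname_conflicts(records: List[Record]) -> List[str]:
    """Return owner names where a CNAME shares the name with another type."""
    types_by_name: Dict[str, Set[str]] = defaultdict(set)
    for name, rtype in records:
        types_by_name[name.lower()].add(rtype.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )
```

Run against a zone that tries to CNAME the apex, it flags example.com. because SOA and NS already live on that name, while blog.example.com. passes.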
MX: Email Routing with Priority
MX records direct incoming email to mail servers with priority values (lower = preferred). Always configure at least 2 MX records for redundancy:
example.com. 3600 IN MX 10 mail1.example.com.
example.com. 3600 IN MX 20 mail2.example.com.
; MX targets need their own A records
mail1.example.com. 3600 IN A 203.0.113.50
mail2.example.com. 3600 IN A 203.0.113.51
MX targets must be hostnames with A/AAAA records; never point MX at an IP or CNAME. If using Google Workspace or Microsoft 365, they provide the MX records.
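As a rough illustration of how a sending server uses these records, the sketch below orders MX records by preference and flags targets that are IP literals. Both function names are assumptions for this example; detecting a CNAME target would need a real lookup and is left out.

```python
import ipaddress
from typing import List, Tuple

MxRecord = Tuple[int, str]  # (preference, target hostname)


def mx_delivery_order(mx_records: List[MxRecord]) -> List[str]:
    """Targets in the order a sender tries them: lowest preference first."""
    return [target for _, target in sorted(mx_records, key=lambda r: r[0])]


def ip_literal_targets(mx_records: List[MxRecord]) -> List[str]:
    """Return MX targets that are IP addresses instead of hostnames."""
    bad = []
    for _, target in mx_records:
        try:
            ipaddress.ip_address(target.rstrip("."))
        except ValueError:
            continue  # not an IP literal, so it is a hostname: fine
        bad.append(target)
    return bad
```

A sender works down the list returned by `mx_delivery_order`, moving to the next target only when the preferred one is unreachable.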
TXT: SPF, DKIM, and DMARC for Email Authentication
TXT records store arbitrary text, primarily used for email authentication. Without SPF/DKIM/DMARC, receiving servers can't verify your domain — result: spam folder.
When a recipient's mail server receives mail claiming to be from your domain, it runs three DNS lookups in parallel. All three must pass (or DMARC's policy explicitly permit the failure) for inbox placement.
graph LR
Mail["Incoming message<br/>From: alice@example.com"] --> SPF{"SPF check<br/>(TXT lookup)"}
Mail --> DKIM{"DKIM check<br/>(selector._domainkey)"}
Mail --> DMARC{"DMARC policy<br/>(_dmarc.example.com)"}
SPF -->|"sender IP in<br/>SPF allowlist?"| Verdict["Combined verdict"]
DKIM -->|"signature valid<br/>vs public key?"| Verdict
DMARC -->|"p=reject /<br/>quarantine / none"| Verdict
Verdict -->|"all pass"| Inbox["✓ Inbox"]
Verdict -->|"SPF or DKIM fail<br/>+ p=reject"| Rejected["✗ Rejected"]
Verdict -->|"fail but p=none"| Spam["⚠ Spam folder"]
Misconfigure any one of the three and mail lands in spam silently — the sender sees a successful SMTP response and no bounce.
SPF (Sender Policy Framework) declares who can send email for your domain:
example.com. 3600 IN TXT "v=spf1 include:_spf.google.com include:sendgrid.net -all"
DKIM (DomainKeys Identified Mail) signs outgoing messages cryptographically:
google._domainkey.example.com. 3600 IN TXT "v=DKIM1; k=rsa; p=MIIBIjANBgkqhki..."
DMARC enforces policy when SPF/DKIM fail:
_dmarc.example.com. 3600 IN TXT "v=DMARC1; p=reject; rua=mailto:dmarc@example.com; pct=100"
Deployment order: start with p=none to collect reports, then move to p=quarantine, then p=reject.
SPF has a hard limit of 10 DNS lookups per evaluation. Each include: mechanism triggers a lookup, and nested includes count toward the same limit. If the limit is exceeded, SPF returns permerror and receivers may reject your email. Consolidate includes and use ip4:/ip6: mechanisms (which don't count) for static IPs.
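Counting lookups by hand gets error-prone once includes nest. The sketch below walks an SPF policy recursively against an in-memory table of TXT records, so it runs without network access. The function name `count_spf_lookups` and the `.test` domains are hypothetical, and the table stands in for real DNS queries.

```python
from typing import Dict, Tuple

# Mechanisms that cost a DNS lookup; ip4, ip6 and all are free.
LOOKUP_MECHANISMS = ("include", "a", "mx", "ptr", "exists")


def count_spf_lookups(
    domain: str, txt_records: Dict[str, str], _stack: Tuple[str, ...] = ()
) -> int:
    """Count DNS-querying terms in an SPF policy, following includes."""
    if domain in _stack:
        return 0  # include loop: stop instead of recursing forever
    total = 0
    for term in txt_records.get(domain, "").split()[1:]:  # skip "v=spf1"
        term = term.lstrip("+-~?")  # drop the qualifier
        if term.startswith("redirect="):
            target = term.split("=", 1)[1]
            total += 1 + count_spf_lookups(target, txt_records, _stack + (domain,))
            continue
        name, _, arg = term.partition(":")
        if name.split("/")[0] in LOOKUP_MECHANISMS:
            total += 1
            if name == "include":
                total += count_spf_lookups(arg, txt_records, _stack + (domain,))
    return total
```

A policy with two includes can already cost five lookups once one of them nests a further include plus an mx mechanism, which is how teams drift past 10 without noticing.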
CAA: Lock Down Certificate Issuance
CAA records specify which Certificate Authorities can issue TLS certificates for your domain. Without CAA, any CA in the world can issue certificates for you, which is a real attack vector.
example.com. 3600 IN CAA 0 issue "letsencrypt.org"
example.com. 3600 IN CAA 0 issue "amazon.com"
example.com. 3600 IN CAA 0 issuewild "letsencrypt.org"
example.com. 3600 IN CAA 0 iodef "mailto:security@example.com"
Since September 2017 (mandated by the CA/Browser Forum under RFC 6844, later superseded by RFC 8659 in 2019), all CAs must check CAA records before issuing. If your domain has CAA records that don't include the requesting CA, issuance is denied.
Verify with: dig example.com CAA +short
If you add CAA records after certificates are issued, ensure the issuing CA is listed. Let's Encrypt ACME clients fail
renewal if letsencrypt.org is missing — a common source of 2 AM pages when certificates expire. Check CAA records
any time you change DNS providers.
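The decision a CA makes can be sketched in a few lines. This is a simplified reading of the CAA rules, assuming the relevant record set has already been found (real CAs climb the DNS tree to locate it) and ignoring critical flags and account parameters. The function name `ca_may_issue` is hypothetical.

```python
from typing import List, Tuple

CaaRecord = Tuple[str, str]  # (tag, value), e.g. ("issue", "letsencrypt.org")


def ca_may_issue(caa: List[CaaRecord], ca_domain: str, wildcard: bool = False) -> bool:
    """Decide whether ca_domain is allowed to issue under these CAA records."""

    def issuers(tag: str) -> List[str]:
        # Keep only the issuer domain; parameters after ';' are ignored here.
        return [value.split(";")[0].strip() for t, value in caa if t == tag]

    issue, issuewild = issuers("issue"), issuers("issuewild")
    # issuewild, when present, governs wildcard requests; otherwise issue applies.
    relevant = issuewild if (wildcard and issuewild) else issue
    if not relevant:
        return True  # no restricting property: CAA does not limit issuance
    return ca_domain in relevant
```

Applied to the records above, letsencrypt.org may issue both regular and wildcard certificates, amazon.com may issue only non-wildcard ones, and any unlisted CA is refused.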
SRV: Service Discovery with Port Numbers
SRV records advertise service locations with port numbers (A records can't do this). Used by SIP, XMPP, LDAP, and Kubernetes external DNS:
; Format: _service._protocol.domain TTL IN SRV priority weight port target
_sip._tcp.example.com. 300 IN SRV 10 60 5060 sip1.example.com.
_sip._tcp.example.com. 300 IN SRV 10 40 5060 sip2.example.com.
Priority works like MX (lower = preferred). Weight distributes traffic proportionally among records that share the same priority. Most HTTP services use A/CNAME records with well-known ports (80/443).
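Client-side selection can be approximated as "lowest priority group first, then a weighted random pick inside it". The sketch below uses Python's standard `random.choices` for the weighted step, which is a simplification of the exact ordering procedure in the SRV specification. The function name `pick_srv_target` and the backup record in the usage note are assumptions.

```python
import random
from typing import List, Optional, Tuple

SrvRecord = Tuple[int, int, int, str]  # (priority, weight, port, target)


def pick_srv_target(
    records: List[SrvRecord], rng: Optional[random.Random] = None
) -> SrvRecord:
    """Pick one record: lowest priority wins, weight splits traffic within it."""
    if rng is None:
        rng = random.Random()
    best_priority = min(r[0] for r in records)
    candidates = [r for r in records if r[0] == best_priority]
    weights = [r[1] for r in candidates]
    if sum(weights) == 0:
        return rng.choice(candidates)  # all weights zero: pick uniformly
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With the two records above, roughly 60% of picks land on sip1 and 40% on sip2; a third record at priority 20 would only be used once every priority-10 target is unreachable, which is logic the caller has to add.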
NS: Zone Delegation
NS records declare which nameservers are authoritative for a domain or subdomain. Use them for subdomain delegation when a separate team manages their own DNS infrastructure:
internal.example.com. 3600 IN NS ns1.internal-infra.example.com.
internal.example.com. 3600 IN NS ns2.internal-infra.example.com.
After delegation, your primary DNS provider stops answering queries under internal.example.com; those go to the delegated nameservers.
DNSSEC: Optional but Recommended
DNSSEC adds cryptographic signatures to DNS responses, preventing tampering. Enable it if you're in a regulated industry, are a high-value phishing target, or your provider makes it one-click (Cloudflare, Route 53, Google Cloud DNS).
Key steps:
- Enable DNSSEC at your DNS provider
- Publish the DS record at your registrar (commonly missed)
- Verify with:
dig example.com +dnssec +short
Keys need periodic rotation. If your provider handles it automatically, you're fine. If you manage it yourself, a missed rotation makes your zone unresolvable for DNSSEC-validating resolvers — worse than no DNSSEC.
TTL Strategy: Balancing Speed vs. Query Volume
TTL (Time to Live)[RFC 1035] controls caching duration. Choose based on change frequency:
| Record Type | TTL | Rationale |
|---|---|---|
| A/AAAA (services) | 300s | Fast failover, acceptable query load |
| A/AAAA (stable) | 3600s | Mail servers, nameservers rarely change |
| CNAME | 3600s | CDN targets stable; lower for blue/green deploys |
| MX, TXT, CAA, NS | 3600s–86400s | Rarely change; planned updates only |
| Pre-migration | 60s | Lower 24h before any IP change |
Critical: if your A record has a 1-hour TTL and you change IPs, clients hit the old IP for up to an hour. Lower to 60s a full day before migration, make the change, verify, then raise back. Some resolvers (particularly ISP resolvers) ignore low TTLs and cache for a minimum of 5-30 minutes. Verify propagation across multiple resolvers with tools like dnschecker.org.
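The timing rules reduce to simple arithmetic. The helpers below are a sketch with hypothetical names. They assume resolvers honor TTLs exactly, which the paragraph above notes is not always true, so treat the results as lower bounds.

```python
from datetime import datetime, timedelta


def latest_time_to_lower_ttl(cutover: datetime, old_ttl_s: int) -> datetime:
    """Latest moment to publish the short TTL so it is in force at cutover.

    A resolver that cached the record just before the TTL change keeps the
    old TTL, so the short TTL only applies everywhere one old-TTL later.
    """
    return cutover - timedelta(seconds=old_ttl_s)


def worst_case_stale_window(ttl_s: int) -> timedelta:
    """Longest time a well-behaved resolver can keep serving the old IP."""
    return timedelta(seconds=ttl_s)


def query_load_ratio(short_ttl_s: int, long_ttl_s: int) -> float:
    """How many times more upstream queries the shorter TTL can cause."""
    return long_ttl_s / short_ttl_s
```

For a record at 3600s, the short TTL has to be published at least one hour before cutover; lowering it a full day ahead, as recommended above, adds margin for resolvers that cache longer than they should.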
GeoDNS and Health-Checked Failover
Geolocation routing returns different IPs based on client location — use for data residency (GDPR) or region-specific deployments. Latency-based routing measures client-to-endpoint RTT and returns the fastest — better for global APIs. Weighted routing distributes traffic by percentage — useful for canary deployments. Failover routing designates primary/secondary, returning secondary only when primary health checks fail.
Health-checked DNS adds active monitoring: the DNS provider probes endpoints and removes unhealthy ones from responses:
# Route 53: create a health check
aws route53 create-health-check --caller-reference "api-prod-$(date +%s)" \
  --health-check-config '{
    "IPAddress": "203.0.113.10",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/healthz",
    "RequestInterval": 10,
    "FailureThreshold": 3,
    "EnableSNI": true,
    "FullyQualifiedDomainName": "api.example.com"
  }'
Design your /healthz endpoint to verify real dependencies (database, cache, etc.), not just return 200. Set failure threshold high (3+ failures) to survive transient issues.
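One possible shape for the logic behind /healthz is sketched below, independent of any web framework. The function name `healthz` and the check names are hypothetical, and each check is assumed to enforce its own timeout (for example via a socket or driver timeout) so that a hung dependency cannot stall the probe.

```python
from typing import Callable, Dict, Tuple


def healthz(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and map the outcome to an HTTP status.

    Returns 200 only when all checks pass; any failure or exception yields
    503 so the DNS provider's health checker counts it as a failure.
    """
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception as exc:  # a crashing check is a failing dependency
            results[name] = f"error: {type(exc).__name__}"
    status = 200 if all(r == "ok" for r in results.values()) else 503
    return status, results
```

Returning the per-dependency detail in the response body makes it much faster to see which dependency tripped the failover.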
Provider support:
| Type | Route 53 | Cloudflare | Google DNS | NS1 |
|---|---|---|---|---|
| Geolocation | Yes | Yes | Yes | Yes |
| Latency | Yes | Yes | No | Yes |
| Weighted | Yes | Yes | Yes | Yes |
| Failover | Yes (health checks) | Yes | No | Yes |
DNS as Code: Terraform
Making DNS changes through a web console causes outages. DNS belongs in version control.
Route 53 example:
resource "aws_route53_zone" "primary" {
  name = "example.com"
}

resource "aws_route53_record" "apex" {
  zone_id = aws_route53_zone.primary.zone_id
  name    = "example.com"
  type    = "A"
  ttl     = 300
  records = ["203.0.113.10"]
}

# Primary half of a failover pair. A matching SECONDARY record and the
# aws_route53_health_check.api_primary resource are assumed to be defined elsewhere.
resource "aws_route53_record" "api_primary" {
  zone_id = aws_route53_zone.primary.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier  = "api-primary"
  records         = ["203.0.113.20"]
  health_check_id = aws_route53_health_check.api_primary.id
}

# Two MX targets for redundancy, matching the zone-file example earlier.
resource "aws_route53_record" "mx" {
  zone_id = aws_route53_zone.primary.zone_id
  name    = "example.com"
  type    = "MX"
  ttl     = 3600
  records = ["10 mail1.example.com", "20 mail2.example.com"]
}

resource "aws_route53_record" "caa" {
  zone_id = aws_route53_zone.primary.zone_id
  name    = "example.com"
  type    = "CAA"
  ttl     = 3600
  records = [
    "0 issue \"letsencrypt.org\"",
    "0 iodef \"mailto:security@example.com\"",
  ]
}
With DNS in Terraform you get peer-reviewed changes, rollback via git revert + terraform apply, an audit trail, and environment parity. Always use terraform import to bring existing records under management before applying changes.
Production Checklist
- A/AAAA records: Publish both IPv4 and IPv6 for public services; use 300s TTL for services that might fail over
- CNAME: Never at apex; use ALIAS/ANAME or CNAME flattening
- MX records: At least 2 with priority values (10, 20); targets need A/AAAA records
- SPF: Audit nested includes; stay under 10-lookup limit; use -all (hard fail)
- DKIM: Publish public key as TXT record; selector should identify key pair
- DMARC: Start with p=none, review reports, move to p=quarantine, then p=reject
- CAA records: List only authorized CAs; verify after DNS provider changes
- TTL planning: Lower to 60s at least 24h before IP changes; raise back after verification
- Health checks: Verify real dependencies (database, cache); set failure threshold to 3+
- DNS as code: All records in Terraform; peer review before apply; terraform import existing records first
- Testing: Verify propagation across multiple resolvers (dnschecker.org, dig @8.8.8.8, dig @1.1.1.1)
DNS During Incidents: First Five Minutes
When a DNS-rooted outage hits, the temptation is to start mutating records. Don't. The Facebook 2021 cascade lasted nearly six hours partly because the same DNS withdrawal also broke the tooling engineers used to diagnose the problem[Meta 2021-10-04 outage]. Triage in this order: confirm authoritative reachability before touching records, isolate the resolver path before assuming a record is wrong, and only then consider rollback.
# 1. Are your authoritative nameservers reachable at all?
dig +trace example.com NS
dig @ns-1234.awsdns-12.com example.com SOA
dig @ns-1234.awsdns-12.com example.com SOA +tcp +nsid
# 2. Do recursive resolvers see the same answer?
for resolver in 1.1.1.1 8.8.8.8 9.9.9.9 208.67.222.222; do
  echo "=== $resolver ==="
  dig @"$resolver" api.example.com A +short +tries=1 +time=2
done
# 3. What is each resolver actually caching, and for how long?
dig @8.8.8.8 api.example.com A | awk '/^api\.example\.com\./ {print $2, $4, $5}'
# 4. Is a DNSSEC chain break the cause? (SERVFAIL with AD bit demand)
dig api.example.com +dnssec +cd   # +cd disables validation
dig api.example.com +dnssec       # without +cd: SERVFAIL = chain break
If dig +trace stalls at the TLD step, your delegation is broken: fix the registrar's NS records, not the zone. If +trace succeeds but resolvers disagree, you have a propagation issue and need to wait for TTL expiry, not push more changes. If +cd returns answers but the validating query returns SERVFAIL, your DNSSEC signatures or DS record are out of sync; disable DNSSEC at the registrar (remove the DS record) before mutating anything else. Maintain at least one out-of-band channel (status page on a different domain, mobile-only Slack workspace, paper runbook with phone numbers) so the team can coordinate when the primary domain itself is the failure.
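The decision order above can be written down as a small function, which is handy for a runbook or a chat-ops bot. This is only a sketch: the function name `triage`, its boolean inputs, and the returned strings are assumptions, and the inputs are meant to be filled in from the dig commands by a human or a wrapper script.

```python
def triage(
    trace_reaches_authoritative: bool,
    resolvers_agree: bool,
    answers_with_cd: bool,
    answers_when_validating: bool,
) -> str:
    """Map the four observations from the dig checks to a next step."""
    if not trace_reaches_authoritative:
        # dig +trace stalled before the authoritative servers.
        return "delegation broken: fix NS records at the registrar"
    if not resolvers_agree:
        # Authoritative answers are fine, caches are still draining.
        return "propagation: wait for TTL expiry, do not push more changes"
    if answers_with_cd and not answers_when_validating:
        # Only validating queries fail: signatures or DS are out of sync.
        return "dnssec chain break: fix or remove the DS record"
    return "dns looks healthy: investigate other layers"
```

The checks run in the same order as the paragraph above, so a broken delegation is always reported before anything else.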
Multi-Region DNS: Pick the Right Routing Policy
Routing policies are not interchangeable. Picking the wrong one is how teams end up serving European traffic from us-east-1 because the resolver IP geolocates incorrectly.
| Workload | Routing Policy | Why |
|---|---|---|
| Stateless API, global users | Latency-based | Resolver-to-region RTT is the right proxy for user latency; ignores legal boundaries. |
| Regulated data (GDPR, PIPL) | Geolocation (continent) | Hard residency boundary; latency-based may route a Frankfurt user to Virginia at 3 AM. |
| Active-passive DR | Failover with health checks | Health check pulls primary on three consecutive failures; secondary stays cold otherwise. |
| Canary or blue-green release | Weighted (1/99, then 50/50) | Shift traffic incrementally; combine with health checks for automatic rollback. |
| Single-region with anycast in front | Simple A/AAAA | The CDN handles geographic distribution; DNS just points at the anycast VIP. |
A common anti-pattern: latency-based routing for a workload that writes to a region-pinned database. Users get routed to the geographically nearest read replica, then writes fail or replicate slowly because the primary is in another region. Either pin writes through a separate hostname (writes.api.example.com) with simple routing, or use geolocation routing aligned to your data residency.
resource "aws_route53_record" "api_eu" {
zone_id = aws_route53_zone.primary.zone_id
name = "api.example.com"
type = "A"
set_identifier = "eu-west-1"
geolocation_routing_policy { continent = "EU" }
alias {
name = aws_lb.eu_west_1.dns_name
zone_id = aws_lb.eu_west_1.zone_id
evaluate_target_health = true
}
}
Always configure a default record (country = "*" in Route 53). Without it, queries from unmapped locations get an empty (NODATA) response and your service is invisible to those users.
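Why the default record matters becomes obvious once the match order is spelled out. The sketch below models geolocation routing as "most specific location wins": country, then continent, then default. The function name `resolve_geo` and the key format are hypothetical, and real providers also handle subdivisions and resolver-IP geolocation errors.

```python
from typing import Dict, Optional


def resolve_geo(records: Dict[str, str], country: str, continent: str) -> Optional[str]:
    """Return the target for a client location, most specific match first."""
    for key in (f"country:{country}", f"continent:{continent}", "default"):
        if key in records:
            return records[key]
    return None  # nothing matched and no default: the client gets no answer
```

A client in Brazil querying a zone that only has a continent:EU record falls through every key and gets nothing back; adding a default entry gives that client a working answer.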
DNSSEC Done Right: KSK Rotation Without Outages
DNSSEC has two key types: the Key Signing Key (KSK) signs the DNSKEY RRset and is referenced by the DS record at the parent zone; the Zone Signing Key (ZSK) signs everything else. Rotating the ZSK is mostly automatic; rotating the KSK requires coordinating with your registrar. Get this wrong and the entire zone goes SERVFAIL for validating resolvers — closer to half the internet today[RFC 1035].
The safe sequence is double-DS rollover: publish the new DS alongside the old one, wait for parent TTL plus a safety margin, then remove the old DS and old DNSKEY.
# Generate a new KSK (BIND example; cloud providers automate this)
dnssec-keygen -a ECDSAP256SHA256 -f KSK -n ZONE example.com
# Generate the DS record to publish at the registrar
dnssec-dsfromkey -2 Kexample.com.+013+12345.key
# Verify the chain after publishing the new DS
dig example.com DNSKEY +dnssec +short
dig example.com DS @a.gtld-servers.net. +short
delv @1.1.1.1 example.com A +rtrace # validates end-to-end
# Watch for SERVFAIL during the rollover window
dig @1.1.1.1 example.com +dnssec | grep -E 'status|flags'
Algorithm choice matters: prefer ECDSA P-256 (algorithm 13) over RSA-2048 for smaller signatures, smaller responses, and a lower amplification factor. Schedule KSK rotation no more than annually; ZSK rotation every 30-90 days is reasonable. Most teams should let the DNS provider (Route 53, Cloudflare, NS1) handle this automatically; only roll your own DNSSEC key management if you have a compliance requirement that forbids the provider holding key material. [Terraform Docs]
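The waiting step in the double-DS rollover is the part most often cut short. A minimal sketch of the arithmetic follows. The function name is hypothetical, and the one-hour default margin and the 86400s TTL in the usage note are illustrative values, not recommendations from any registry.

```python
from datetime import datetime, timedelta


def earliest_old_ds_removal(
    new_ds_published: datetime, parent_ds_ttl_s: int, safety_margin_s: int = 3600
) -> datetime:
    """Earliest time the old DS (and old DNSKEY) can be withdrawn.

    Resolvers may hold a DS set cached just before the new DS appeared for
    one full parent TTL, so the old key must stay valid until then.
    """
    return new_ds_published + timedelta(seconds=parent_ds_ttl_s + safety_margin_s)
```

If the new DS went live at midnight and the parent serves DS records with an 86400s TTL, the old DS should stay in place until at least 01:00 the following day.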
Frequently Asked Questions
Why can't I CNAME at apex?
The apex domain (example.com) must have SOA and NS records. A CNAME cannot coexist with other record types on the same name. Use ALIAS/ANAME records (Cloudflare CNAME flattening, Route 53 ALIAS, DNSimple ALIAS) or fall back to A records pointing to a stable IP.
How do I know if my DNS change has propagated?
Use dig api.example.com to see the TTL remaining on cached records. Lower TTLs propagate faster, but some ISP resolvers ignore low TTLs and cache for 5-30 minutes anyway. Verify across multiple public resolvers: dig @8.8.8.8, dig @1.1.1.1, dig @9.9.9.9.
What happens if I exceed the SPF 10-lookup limit?
SPF evaluation returns permerror and receivers may reject or defer your email entirely. Count lookups with dig TXT _spf.google.com to see nested includes. Consolidate include directives and use ip4:/ip6: for static IPs (which don't count as lookups).
Can I change my DNS provider without breaking anything?
Yes, but verify CAA records after the migration. If the CAA records served by the new provider do not list your issuing CA, certificate renewal will fail, often unnoticed until the certificate expires. Always check: dig example.com CAA +short.
Why should health checks check the database?
Health checks should verify real dependencies. If your healthz endpoint always returns 200 but your database is down, DNS will keep returning the IP and traffic still fails. Check database connectivity with a timeout; set failure threshold high (3 failures) to survive transient issues.
Keep Reading
- What Happens When You Type a URL: The Complete Production Guide — DNS is layer one of the request lifecycle; this walks through every subsequent layer from TCP handshake to CDN edge logic
- Terraform in Production: Modules, State Management, and CI/CD Patterns — The Terraform patterns used in this article's DNS-as-code section, plus state locking, module design, and CI/CD integration
- HTTP/1.1 vs HTTP/2 vs HTTP/3: The Protocol Evolution Guide — After DNS resolves the IP, the protocol negotiation determines how fast data moves; covers HTTP/2 multiplexing and QUIC's 0-RTT handshake
- Kubernetes Networking Deep Dive — How CoreDNS handles in-cluster service resolution and the ndots:5 tax that surprises every team migrating to Kubernetes
- TCP vs UDP vs QUIC: Protocol Selection Under Production Load — DNS over UDP (port 53), DoT/DoH transport, and the resolution failover semantics
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.