DNS Records: The Complete Production Guide for Backend Engineers

BackendBytes Engineering Team

Key Takeaways

  • Facebook disappeared for almost six hours when a BGP script withdrew routes to their DNS servers; third-party revenue impact estimates ranged into the tens of millions — DNS is critical infrastructure that most treat as a checkbox
  • TTL controls propagation speed: 300s for services that might failover (enables 5-minute recovery), 3600s for stable records (reduces resolver load 12x), 60s temporarily during migrations
  • Never CNAME the apex domain (example.com); use your provider's ALIAS/ANAME record instead. Without one, you fall back to A records and lose CDN integration flexibility
  • CAA records lock down certificate issuance — without them, any CA in the world can issue certificates for your domain (a real attack vector); SPF/DKIM/DMARC secure email with 10-lookup limit on SPF includes

On October 4, 2021, Facebook disappeared from the internet for nearly six hours. A routine BGP maintenance script withdrew the routes to their DNS nameservers[Meta 2021-10-04 outage]. Engineers couldn't diagnose the problem because the remote management tools also depended on DNS. The fix required physical access to the Santa Clara data center. Three billion users lost access to Facebook, Instagram, and WhatsApp. The cascade is well-documented in Cloudflare's network-side analysis of the BGP and DNS withdrawal. And it started with DNS[RFC 1035].

Quick Take

DNS[RFC 1035] resolves domains to IPs through a hierarchical cache. Every record type — A, CNAME, MX, CAA, SRV, NS — has a specific production use. TTLs control propagation speed; choose 300s for failover-critical services, 3600s for stable records. Secure DNS with CAA records, authenticate email with SPF/DKIM/DMARC, and always manage DNS as code.

  • A/AAAA route traffic; CNAME aliases; CAA locks down certificates
  • TTL 300s for services, 3600s for stable records, 60s before migrations
  • SPF declares senders, DKIM signs messages, DMARC enforces policy
  • Manage all records in Terraform — never console-click production DNS

The DNS resolution path

What actually happens when a client looks up api.example.com — the cache hierarchy, then the authoritative chain:

graph LR
    Client[Client app] --> Stub[Stub resolver<br/>OS / libc]
    Stub -->|cache hit| Done1[Return IP]
    Stub -->|cache miss| Recurse[Recursive resolver<br/>1.1.1.1, 8.8.8.8, ISP]
    Recurse -->|cache hit| Done1
    Recurse -->|cache miss| Root[Root nameservers<br/>13 anycasted]
    Root -->|.com NS| TLD[TLD nameservers<br/>Verisign for .com]
    TLD -->|example.com NS| Auth[Authoritative nameserver<br/>your DNS provider]
    Auth -->|A record| Recurse
    Recurse -->|cache + return| Stub
    Stub -->|cache + return| Client
    Auth -.->|TTL governs<br/>cache lifetime| Recurse
    style Done1 fill:#dfd
    style Auth fill:#ffd

Most lookups hit the recursive resolver's cache and never reach root. New domains and TTL-expired entries walk the full chain — adding 50-200 ms to first-byte latency. Set TTLs to balance freshness (low TTL = faster failover) against load on your authoritative servers (high TTL = fewer queries).
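To make the caching behaviour concrete, here is a minimal sketch of a TTL-bounded cache like the one a stub or recursive resolver keeps. The class name `TtlCache` and the injectable `clock` parameter are illustrative assumptions, not part of any real resolver's API; real resolvers also handle negative caching, prefetching, and TTL clamping, which this toy ignores.

```python
import time
from typing import Callable, Optional


class TtlCache:
    """Toy resolver cache: an answer is served until its TTL has elapsed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # The clock is injectable so the expiry logic can be exercised without sleeping.
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def put(self, name: str, address: str, ttl: int) -> None:
        """Store an answer together with the moment it stops being valid."""
        self._entries[name] = (address, self._clock() + ttl)

    def get(self, name: str) -> Optional[str]:
        """Return the cached address, or None on a miss or an expired entry."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: the next lookup has to walk the chain again.
            del self._entries[name]
            return None
        return address
```

A record stored with a 300s TTL keeps being served from the cache for five minutes regardless of what the authoritative server says in the meantime, which is why the TTL, not the moment you edit the record, bounds how quickly a change is seen.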

The quick start: Record types by purpose

| Record Type | Purpose | TTL | Notes |
|---|---|---|---|
| A | Domain → IPv4 | 300s | Multiple A records for round-robin. Failover needs low TTL. |
| AAAA | Domain → IPv6 | 300s | Required for IPv6-only clients. Publish both A and AAAA. |
| CNAME | Domain alias | 3600s | Never use at apex (example.com); use ALIAS/ANAME instead. |
| MX | Mail routing | 3600s | Priority values; lower = preferred. Always 2+ for redundancy. |
| TXT | Text records | 3600s | Arbitrary text: SPF, DKIM, DMARC, ACME validation. |
| CAA | Cert authority | 3600s | Locks down TLS issuance. No CAA = any CA can issue certs. |
| SRV | Service discovery | 300s | Port + priority. Used by SIP, XMPP, K8s external DNS. |
| NS | Nameservers | 86400s | Zone delegation. Set at registrar; rarely change. |
| DNSSEC | Signing | variable | RRSIG, DNSKEY, DS. Optional unless regulated. |

A and AAAA: Domain-to-IP Mapping

An A record maps a domain to an IPv4 address; AAAA maps to IPv6. Multiple A records for the same domain enable round-robin load distribution:

api.example.com.    300    IN    A    203.0.113.10
api.example.com.    300    IN    A    203.0.113.11
api.example.com.    300    IN    AAAA 2001:db8::1

DNS round-robin distributes connections without health checking. If one IP fails, clients still hit it until TTL expires. Use it for coarse distribution across redundant load balancers — never as a replacement for proper load balancing.

IPv6 Is Not Optional

Major cloud providers, mobile carriers, and ISPs increasingly route over IPv6. IPv6-only clients must rely on NAT64/DNS64 translation if you omit AAAA records, adding latency and a failure point. Publish both A and AAAA for any public-facing service.


CNAME: Aliases and the Apex Restriction

[RFC 1035]

A CNAME record aliases one domain to another — essential for CDN integration and PaaS hosting:

blog.example.com.       3600    IN    CNAME    d1234abcd.cloudfront.net.
staging.example.com.    300     IN    CNAME    example-app.fly.dev.

Critical constraint: A CNAME cannot coexist with other record types. Since the apex domain (example.com) requires SOA and NS records, you cannot CNAME at apex. Use your provider's ALIAS/ANAME record (Cloudflare CNAME flattening, Route 53 ALIAS, DNSimple ALIAS, NS1 linked record) or fall back to an A record pointing to a stable IP.
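The coexistence rule is easy to check mechanically before a zone change ships. The sketch below assumes the zone is available as a list of `(name, type)` pairs; the function name `find_cname_conflicts` is hypothetical. It flags only the CNAME-plus-other-type case described above and does not model DNSSEC records (RRSIG, NSEC), which are allowed alongside a CNAME.

```python
from collections import defaultdict


def find_cname_conflicts(records: list[tuple[str, str]]) -> list[str]:
    """Return the names that hold a CNAME together with any other record type."""
    types_by_name: dict[str, set[str]] = defaultdict(set)
    for name, rtype in records:
        # Normalise case and the trailing dot so "Example.com." and "example.com" match.
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


zone = [
    ("example.com.", "SOA"),
    ("example.com.", "NS"),
    ("example.com.", "CNAME"),       # illegal: the apex already holds SOA and NS
    ("blog.example.com.", "CNAME"),  # fine: nothing else lives at this name
    ("api.example.com.", "A"),
]
conflicts = find_cname_conflicts(zone)
```

Running a check like this in CI catches an apex CNAME before the provider's API rejects it or, worse, silently accepts it.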

MX: Email Routing with Priority

MX records direct incoming email to mail servers with priority values (lower = preferred). Always configure at least 2 MX records for redundancy:

example.com.    3600    IN    MX    10    mail1.example.com.
example.com.    3600    IN    MX    20    mail2.example.com.
 
; MX targets need their own A records
mail1.example.com.    3600    IN    A    203.0.113.50
mail2.example.com.    3600    IN    A    203.0.113.51

MX targets must be hostnames with A/AAAA records — never point MX at an IP or CNAME. If using Google Workspace or Microsoft 365, they provide the MX records.

TXT: SPF, DKIM, and DMARC for Email Authentication

TXT records store arbitrary text, primarily used for email authentication. Without SPF/DKIM/DMARC, receiving servers can't verify your domain — result: spam folder.

When a recipient's mail server receives mail claiming to be from your domain, it runs three DNS lookups. DMARC passes when SPF or DKIM passes and the authenticated domain aligns with the From: domain; if both fail, your published DMARC policy tells the receiver what to do with the message.

graph LR
    Mail["Incoming message<br/>From: alice@example.com"] --> SPF{"SPF check<br/>(TXT lookup)"}
    Mail --> DKIM{"DKIM check<br/>(selector._domainkey)"}
    Mail --> DMARC{"DMARC policy<br/>(_dmarc.example.com)"}
    SPF -->|"sender IP in<br/>SPF allowlist?"| Verdict["Combined verdict"]
    DKIM -->|"signature valid<br/>vs public key?"| Verdict
    DMARC -->|"p=reject /<br/>quarantine / none"| Verdict
    Verdict -->|"SPF or DKIM passes<br/>and aligns"| Inbox["✓ Inbox"]
    Verdict -->|"both fail<br/>+ p=reject"| Rejected["✗ Rejected"]
    Verdict -->|"both fail<br/>+ p=quarantine"| Spam["⚠ Spam folder"]

Misconfigure these and mail can land in spam silently: unless the receiver rejects outright, the sender sees a successful SMTP response and no bounce.

SPF (Sender Policy Framework) declares who can send email for your domain:

example.com.    3600    IN    TXT    "v=spf1 include:_spf.google.com include:sendgrid.net -all"

DKIM (DomainKeys Identified Mail) signs outgoing messages cryptographically:

google._domainkey.example.com.    3600    IN    TXT    "v=DKIM1; k=rsa; p=MIIBIjANBgkqhki..."

DMARC enforces policy when SPF/DKIM fail:

_dmarc.example.com.    3600    IN    TXT    "v=DMARC1; p=reject; rua=mailto:dmarc@example.com; pct=100"

Deployment order: start with p=none to collect reports, then move to p=quarantine, then p=reject.
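How the three mechanisms combine can be sketched as a small decision function. This is a simplified reading of DMARC, assuming the SPF and DKIM results have already been computed and checked for alignment with the From: domain; the function name and the returned labels are illustrative, and real receivers layer their own spam filtering on top.

```python
def dmarc_disposition(spf_aligned_pass: bool, dkim_aligned_pass: bool, policy: str) -> str:
    """Return what the domain owner's DMARC policy asks the receiver to do.

    policy is the p= tag value: "none", "quarantine", or "reject".
    """
    if policy not in ("none", "quarantine", "reject"):
        raise ValueError(f"unknown DMARC policy: {policy!r}")
    # DMARC passes if EITHER mechanism passes with alignment.
    if spf_aligned_pass or dkim_aligned_pass:
        return "pass"
    # Both failed: the published policy decides the requested handling.
    return policy
```

With `p=none` a failing message is only reported, which is why the rollout starts there: you see who fails before any mail is quarantined or rejected.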

The SPF 10-Lookup Limit

SPF has a hard limit of 10 DNS lookups per evaluation. The include:, a, mx, exists, and ptr mechanisms and the redirect= modifier each trigger a lookup, and lookups inside nested includes count toward the same total. If the limit is exceeded, SPF returns permerror and receivers may reject your email. Consolidate includes and use ip4:/ip6: mechanisms (which don't count) for static IPs.
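Counting lookups by hand gets error-prone once includes nest. The sketch below walks an SPF policy recursively and totals the lookup-triggering terms (include, a, mx, exists, ptr, redirect). To stay self-contained it reads TXT records from an in-memory mapping rather than querying DNS; the function names and the `.example` domains are hypothetical, and macro expansion and the separate void-lookup limit are not modeled.

```python
from typing import Mapping

LOOKUP_MECHANISMS = ("a", "mx", "exists", "ptr")


def _split_term(term: str) -> tuple[str, str]:
    """Split one SPF term into (mechanism-or-modifier name, target domain)."""
    term = term.lstrip("+-~?").lower()  # drop the qualifier, e.g. "-all" -> "all"
    if term.startswith("redirect="):
        return "redirect", term[len("redirect="):]
    name, _, target = term.partition(":")
    # Strip CIDR suffixes such as "a/24" or "mx:mail.example/28".
    return name.split("/", 1)[0], target.split("/", 1)[0]


def count_spf_lookups(
    domain: str, txt_records: Mapping[str, str], _path: tuple[str, ...] = ()
) -> int:
    """Count DNS lookups needed to evaluate the SPF policy published at domain."""
    if domain in _path:
        return 0  # include loop: stop descending
    record = txt_records.get(domain, "")
    if not record.lower().startswith("v=spf1"):
        return 0
    total = 0
    for term in record.split()[1:]:
        name, target = _split_term(term)
        if name in ("include", "redirect"):
            # One lookup for the referenced policy, plus whatever it needs itself.
            total += 1 + count_spf_lookups(target, txt_records, _path + (domain,))
        elif name in LOOKUP_MECHANISMS:
            total += 1
    return total


# Hypothetical records, keyed by lowercase domain name.
records = {
    "example.com": "v=spf1 include:_spf.mail.example include:bulk.example ip4:203.0.113.0/24 mx -all",
    "_spf.mail.example": "v=spf1 include:_netblocks.mail.example include:_netblocks2.mail.example ~all",
    "_netblocks.mail.example": "v=spf1 ip4:198.51.100.0/24 ~all",
    "_netblocks2.mail.example": "v=spf1 ip6:2001:db8::/32 ~all",
    "bulk.example": "v=spf1 a:send.bulk.example ~all",
}
lookups = count_spf_lookups("example.com", records)
```

Here the policy needs 6 lookups: two top-level includes, two nested includes, one a, and one mx. The nested includes are the ones that tend to push real policies over the limit without anyone editing the top-level record.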

CAA: Lock Down Certificate Issuance

[RFC 1035]

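CAA records specify which Certificate Authorities can issue TLS certificates for your domain. Without CAA, any publicly trusted CA may issue a certificate for your domain to whoever passes its domain validation, so one compromised or misled CA is enough. That is a real attack vector.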

example.com.    3600    IN    CAA    0 issue "letsencrypt.org"
example.com.    3600    IN    CAA    0 issue "amazon.com"
example.com.    3600    IN    CAA    0 issuewild "letsencrypt.org"
example.com.    3600    IN    CAA    0 iodef "mailto:security@example.com"

Since September 2017 (mandated by the CA/Browser Forum under RFC 6844, later superseded by RFC 8659 in 2019), all CAs must check CAA records before issuing. If your domain has CAA records that don't include the requesting CA, issuance is denied.

Verify with: dig example.com CAA +short
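The check a CA performs can be approximated in a few lines. This sketch assumes the relevant CAA record set has already been found and is passed in as `(tag, value)` pairs; the function name `ca_may_issue` is hypothetical. It follows the RFC 8659 rules for the issue and issuewild tags but does not model climbing to parent domains or CA-specific parameters.

```python
def _issuer(value: str) -> str:
    """Extract the issuer domain from a CAA value, ignoring any ';' parameters."""
    return value.split(";", 1)[0].strip().lower()


def ca_may_issue(
    ca_domain: str, caa_records: list[tuple[str, str]], wildcard: bool = False
) -> bool:
    """Return True if CAA does not forbid ca_domain from issuing."""
    issue = [value for tag, value in caa_records if tag == "issue"]
    issuewild = [value for tag, value in caa_records if tag == "issuewild"]
    # For wildcard requests, issuewild takes precedence when present.
    relevant = issuewild if (wildcard and issuewild) else issue
    if not relevant:
        return True  # nothing restricts this kind of issuance
    return any(_issuer(value) == ca_domain.lower() for value in relevant)


# The records from the example above.
caa = [
    ("issue", "letsencrypt.org"),
    ("issue", "amazon.com"),
    ("issuewild", "letsencrypt.org"),
    ("iodef", "mailto:security@example.com"),
]
```

With these records, `amazon.com` may issue for `example.com` but not for `*.example.com`, because the issuewild entry narrows wildcard issuance to Let's Encrypt.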

CAA Can Block Cert Renewal

If you add CAA records after certificates are issued, ensure the issuing CA is listed. Let's Encrypt ACME clients fail renewal if letsencrypt.org is missing — a common source of 2 AM pages when certificates expire. Check CAA records any time you change DNS providers.

SRV: Service Discovery with Port Numbers

SRV records advertise service locations with port numbers (A records can't do this). Used by SIP, XMPP, LDAP, and Kubernetes external DNS:

; Format: _service._protocol.domain  TTL  IN  SRV  priority  weight  port  target
_sip._tcp.example.com.    300    IN    SRV    10    60    5060    sip1.example.com.
_sip._tcp.example.com.    300    IN    SRV    10    40    5060    sip2.example.com.

Priority works like MX (lower = preferred). Weight distributes traffic proportionally. Most HTTP services use A/CNAME records with well-known ports (80/443).
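A client-side sketch of that selection: take the lowest-priority group, then pick within it in proportion to weight. The function name `pick_srv_target` and the `backup.example.com.` record are hypothetical, and this simplifies RFC 2782, which also describes falling back to the next priority group when every target in the current one is unreachable.

```python
import random

# (priority, weight, port, target)
SrvRecord = tuple[int, int, int, str]


def pick_srv_target(records: list[SrvRecord], rng=random) -> tuple[str, int]:
    """Pick (target, port) from the lowest-priority group, weighted by weight."""
    lowest = min(priority for priority, _, _, _ in records)
    candidates = [record for record in records if record[0] == lowest]
    weights = [weight for _, weight, _, _ in candidates]
    if sum(weights) == 0:
        # All weights zero: no preference expressed, choose uniformly.
        chosen = rng.choice(candidates)
    else:
        chosen = rng.choices(candidates, weights=weights, k=1)[0]
    return chosen[3], chosen[2]


srv = [
    (10, 60, 5060, "sip1.example.com."),
    (10, 40, 5060, "sip2.example.com."),
    (20, 100, 5060, "backup.example.com."),  # only relevant if priority 10 is down
]
target, port = pick_srv_target(srv)
```

Over many calls roughly 60% of picks land on `sip1` and 40% on `sip2`; the priority-20 record is never chosen while a priority-10 record exists.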

NS: Zone Delegation

NS records declare which nameservers are authoritative for a domain or subdomain. Use them for subdomain delegation when a separate team manages their own DNS infrastructure:

internal.example.com.    3600    IN    NS    ns1.internal-infra.example.com.
internal.example.com.    3600    IN    NS    ns2.internal-infra.example.com.

After delegation, your primary DNS provider no longer answers queries under internal.example.com itself. It returns a referral, and resolvers follow it to the delegated nameservers.

DNSSEC adds cryptographic signatures to DNS responses, preventing tampering. Enable it if you're in a regulated industry, are a high-value phishing target, or your provider makes it one-click (Cloudflare, Route 53, Google Cloud DNS).

Key steps:

  1. Enable DNSSEC at your DNS provider
  2. Publish the DS record at your registrar (commonly missed)
  3. Verify with: dig example.com +dnssec +short

DNSSEC Key Rotation

Keys need periodic rotation. If your provider handles it automatically, you're fine. If you manage it yourself, a botched rollover or expired signatures make your zone unresolvable for DNSSEC-validating resolvers, which is worse than no DNSSEC.

TTL Strategy: Balancing Speed vs. Query Volume

[RFC 1035]

TTL (Time to Live) controls caching duration. Choose based on change frequency:

| Record Type | TTL | Rationale |
|---|---|---|
| A/AAAA (services) | 300s | Fast failover, acceptable query load |
| A/AAAA (stable) | 3600s | Mail servers, nameservers rarely change |
| CNAME | 3600s | CDN targets stable; lower for blue/green deploys |
| MX, TXT, CAA, NS | 3600s–86400s | Rarely change; planned updates only |
| Pre-migration | 60s | Lower 24h before any IP change |

Critical: if your A record has a 1-hour TTL and you change IPs, clients hit the old IP for up to an hour. Lower to 60s a full day before migration, make the change, verify, then raise back. Some resolvers (particularly ISP resolvers) ignore low TTLs and cache for a minimum of 5-30 minutes. Verify propagation across multiple resolvers with tools like dnschecker.org.
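The timing follows from the TTL arithmetic, sketched here under the assumption that resolvers honor the published TTL. The function name `migration_plan` and its field names are illustrative. It computes the latest moment to lower the TTL (one full old TTL before cutover, so every cache has picked up the short TTL) and the moment after cutover by which well-behaved caches have dropped the old IP.

```python
from datetime import datetime, timedelta


def migration_plan(cutover: datetime, old_ttl: int, low_ttl: int = 60) -> dict[str, datetime]:
    """Compute the key deadlines around an IP change.

    old_ttl: the record's current TTL in seconds.
    low_ttl: the temporary TTL used during the migration window.
    """
    return {
        # Any later and some caches still hold the record with the old, long TTL.
        "lower_ttl_no_later_than": cutover - timedelta(seconds=old_ttl),
        # After this, TTL-respecting resolvers no longer serve the old IP.
        "old_ip_expired_by": cutover + timedelta(seconds=low_ttl),
    }


plan = migration_plan(datetime(2025, 1, 15, 12, 0, 0), old_ttl=3600)
```

For a 3600s record and a noon cutover, the TTL must be lowered by 11:00 at the latest. The 24-hour lead time recommended above is that minimum plus a generous margin for resolvers that cache longer than they should; keep the old IP serving until you have verified traffic has actually drained.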

GeoDNS and Health-Checked Failover

Geolocation routing returns different IPs based on client location; use it for data residency (GDPR) or region-specific deployments. Latency-based routing returns the endpoint with the lowest measured latency from the querying resolver's network, which suits global APIs. Weighted routing distributes traffic by percentage, which is useful for canary deployments. Failover routing designates primary/secondary, returning secondary only when primary health checks fail.

Health-checked DNS adds active monitoring: the DNS provider probes endpoints and removes unhealthy ones from responses:

## Route 53: Create a health check
aws route53 create-health-check --caller-reference "api-prod-$(date +%s)" \
  --health-check-config '{
    "IPAddress": "203.0.113.10",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/healthz",
    "RequestInterval": 10,
    "FailureThreshold": 3,
    "EnableSNI": true,
    "FullyQualifiedDomainName": "api.example.com"
  }'

Design your /healthz endpoint to verify real dependencies (database, cache, etc.), not just return 200. Set failure threshold high (3+ failures) to survive transient issues.
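One way to structure such an endpoint is to keep the dependency checks separate from the HTTP layer. The sketch below is framework-agnostic: `evaluate_health` and the check names are hypothetical, and each check is assumed to enforce its own timeout and return True when the dependency is usable.

```python
from typing import Callable, Mapping


def evaluate_health(checks: Mapping[str, Callable[[], bool]]) -> tuple[int, dict[str, bool]]:
    """Run every dependency check and return (HTTP status, per-check results)."""
    results: dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises (timeout, connection refused) counts as failed.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


def database_unreachable() -> bool:
    """Stand-in for a real check, e.g. running SELECT 1 with a short timeout."""
    raise TimeoutError("database did not answer in time")


status, detail = evaluate_health({"database": database_unreachable, "cache": lambda: True})
```

The handler for /healthz then returns `status` with `detail` as the body. Returning 503 when a hard dependency is down is what lets the DNS health checker pull the endpoint; the failure threshold absorbs one-off blips.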

Provider support:

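| Type | Route 53 | Cloudflare | Google DNS | NS1 |
|---|---|---|---|---|
| Geolocation | Yes | Yes | Yes | Yes |
| Latency | Yes | Yes | No | Yes |
| Weighted | Yes | Yes | Yes | Yes |
| Failover | Yes (health checks) | Yes | No | Yes |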

DNS as Code: Terraform

Changes made by hand in a web console are unreviewed, unversioned, and hard to roll back, which makes them a common source of outages. DNS belongs in version control.

Route 53 example:

resource "aws_route53_zone" "primary" {
  name = "example.com"
}
 
resource "aws_route53_record" "apex" {
  zone_id = aws_route53_zone.primary.zone_id
  name    = "example.com"
  type    = "A"
  ttl     = 300
  records = ["203.0.113.10"]
}
 
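# Health check referenced by the failover record below
resource "aws_route53_health_check" "api_primary" {
  ip_address        = "203.0.113.20"
  fqdn              = "api.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"
  request_interval  = 10
  failure_threshold = 3
}
 
# Pair this with a record of type "SECONDARY" (not shown) so failover has a target
resource "aws_route53_record" "api_primary" {
  zone_id = aws_route53_zone.primary.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60
  failover_routing_policy {
    type = "PRIMARY"
  }
  set_identifier  = "api-primary"
  records         = ["203.0.113.20"]
  health_check_id = aws_route53_health_check.api_primary.id
}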
 
resource "aws_route53_record" "mx" {
  zone_id = aws_route53_zone.primary.zone_id
  name    = "example.com"
  type    = "MX"
  ttl     = 3600
  records = ["10 mail1.example.com", "20 mail2.example.com"]
}
 
resource "aws_route53_record" "caa" {
  zone_id = aws_route53_zone.primary.zone_id
  name    = "example.com"
  type    = "CAA"
  ttl     = 3600
  records = [
    "0 issue \"letsencrypt.org\"",
    "0 iodef \"mailto:security@example.com\"",
  ]
}

With DNS in Terraform: peer-reviewed changes, rollback via git revert + terraform apply, audit trail, environment parity. Always use terraform import to bring existing records under management before applying changes.

Production Checklist

  • A/AAAA records: Publish both IPv4 and IPv6 for public services; use 300s TTL for services that might failover
  • CNAME: Never at apex; use ALIAS/ANAME or CNAME flattening
  • MX records: At least 2 with priority values (10, 20); targets need A/AAAA records
  • SPF: Audit nested includes; stay under 10-lookup limit; use -all (hard fail)
  • DKIM: Publish public key as TXT record; selector should identify key pair
  • DMARC: Start with p=none, review reports, move to p=quarantine, then p=reject
  • CAA records: List only authorized CAs; verify after DNS provider changes
  • TTL planning: Lower to 60s at least 24h before IP changes; raise back after verification
  • Health checks: Verify real dependencies (database, cache); set failure threshold to 3+
  • DNS as code: All records in Terraform; peer review before apply; terraform import existing records first
  • Testing: Verify propagation across multiple resolvers (dnschecker.org, dig @8.8.8.8, dig @1.1.1.1)

DNS During Incidents: First Five Minutes

When a DNS-rooted outage hits, the temptation is to start mutating records. Don't. The Facebook 2021 cascade lasted nearly six hours partly because the same DNS withdrawal also broke the tooling engineers used to diagnose the problem[Meta 2021-10-04 outage]. Triage in this order: confirm authoritative reachability before touching records, isolate the resolver path before assuming a record is wrong, and only then consider rollback.

## 1. Are your authoritative nameservers reachable at all?
dig +trace example.com NS
dig @ns-1234.awsdns-12.com example.com SOA
dig @ns-1234.awsdns-12.com example.com SOA +tcp +nsid
 
## 2. Do recursive resolvers see the same answer?
for resolver in 1.1.1.1 8.8.8.8 9.9.9.9 208.67.222.222; do
  echo "=== $resolver ==="
  dig @$resolver api.example.com A +short +tries=1 +time=2
done
 
## 3. What is each resolver actually caching, and for how long?
dig @8.8.8.8 api.example.com A | awk '/^api\.example\.com\./ {print $2, $4, $5}'
 
## 4. Is a DNSSEC chain break the cause? (answers with +cd, SERVFAIL without)
dig api.example.com +dnssec +cd   # +cd (checking disabled) asks the resolver to skip validation
dig api.example.com +dnssec       # without +cd: SERVFAIL points to a chain break

If dig +trace stalls at the TLD step, your delegation is broken — fix the registrar's NS records, not the zone. If +trace succeeds but resolvers disagree, you have a propagation issue and need to wait for TTL expiry, not push more changes. If +cd returns answers but the validating query returns SERVFAIL, your DNSSEC signatures or DS record are out of sync — disable DNSSEC at the registrar before mutating anything else. Maintain at least one out-of-band channel (status page on a different domain, mobile-only Slack workspace, paper runbook with phone numbers) so the team can coordinate when the primary domain itself is the failure.
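The triage rules above can be written down as a small decision function, which is handy for a runbook script. This is a sketch only: the function name, parameters, and returned labels are hypothetical, and the inputs are assumed to have been gathered with the dig commands shown earlier (None stands for SERVFAIL or no answer).

```python
from typing import Mapping, Optional


def classify_dns_incident(
    trace_reaches_authoritative: bool,
    resolver_answers: Mapping[str, Optional[str]],
    answer_with_cd: Optional[str],
    answer_validated: Optional[str],
) -> str:
    """Map the triage observations to the most likely class of DNS fault."""
    if not trace_reaches_authoritative:
        return "delegation"   # fix NS records at the registrar, not the zone
    if answer_with_cd is not None and answer_validated is None:
        return "dnssec"       # signatures or DS record out of sync
    if len(set(resolver_answers.values())) > 1:
        return "propagation"  # wait for TTL expiry; do not push more changes
    return "no-dns-fault-found"
```

The DNSSEC test runs before the disagreement test on purpose: a broken chain makes validating and non-validating resolvers disagree, which would otherwise be misread as a propagation delay.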

Multi-Region DNS: Pick the Right Routing Policy

Routing policies are not interchangeable. Picking the wrong one is how teams end up serving European traffic from us-east-1 because the resolver IP geolocates incorrectly.

| Workload | Routing Policy | Why |
|---|---|---|
| Stateless API, global users | Latency-based | Resolver-to-region RTT is the right proxy for user latency; ignores legal boundaries. |
| Regulated data (GDPR, PIPL) | Geolocation (continent) | Hard residency boundary; latency-based may route a Frankfurt user to Virginia at 3 AM. |
| Active-passive DR | Failover with health checks | Health check pulls primary on three consecutive failures; secondary stays cold otherwise. |
| Canary or blue-green release | Weighted (1/99, then 50/50) | Shift traffic incrementally; combine with health checks for automatic rollback. |
| Single-region with anycast in front | Simple A/AAAA | The CDN handles geographic distribution; DNS just points at the anycast VIP. |

A common anti-pattern: latency-based routing for a workload that writes to a region-pinned database. Users get routed to the lowest-latency region, then writes fail or replicate slowly because the primary is in another region. Either pin writes through a separate hostname (writes.api.example.com) with simple routing, or use geolocation routing aligned to your data residency.

resource "aws_route53_record" "api_eu" {
  zone_id        = aws_route53_zone.primary.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "eu-west-1"
  geolocation_routing_policy { continent = "EU" }
  alias {
    name                   = aws_lb.eu_west_1.dns_name
    zone_id                = aws_lb.eu_west_1.zone_id
    evaluate_target_health = true
  }
}

Always configure a default record (country = "*" in Route 53's geolocation policy). Without it, queries from unmapped locations get no answer and your service is invisible to those users.
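Why the default matters is easiest to see in the matching order: most specific location first, then the catch-all. The sketch below mimics that order with an in-memory table; the function name, the key shape, and the region values are assumptions for illustration, not the provider's implementation.

```python
from typing import Mapping, Optional

GeoKey = tuple[str, str]  # ("country", "DE"), ("continent", "EU"), or ("country", "*")


def pick_geo_record(country: str, continent: str, records: Mapping[GeoKey, str]) -> Optional[str]:
    """Return the record for the most specific matching location, else the default."""
    for key in (("country", country), ("continent", continent), ("country", "*")):
        if key in records:
            return records[key]
    return None  # no default configured: the query gets no answer


with_default = {("continent", "EU"): "eu-west-1", ("country", "*"): "us-east-1"}
without_default = {("continent", "EU"): "eu-west-1"}
```

A query from Brazil resolves to `us-east-1` with the first table and to nothing at all with the second, which is the invisible-service failure described above.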

DNSSEC Done Right: KSK Rotation Without Outages

DNSSEC has two key types: the Key Signing Key (KSK) signs the DNSKEY RRset and is referenced by the DS record at the parent zone; the Zone Signing Key (ZSK) signs everything else. Rotating the ZSK is mostly automatic; rotating the KSK requires coordinating with your registrar. Get this wrong and the entire zone goes SERVFAIL for validating resolvers, which serve a large and growing share of internet users today[RFC 1035].

The safe sequence is double-DS rollover: publish the new DS alongside the old one, wait for parent TTL plus a safety margin, then remove the old DS and old DNSKEY.

## Generate a new KSK (BIND example; cloud providers automate this)
dnssec-keygen -a ECDSAP256SHA256 -f KSK -n ZONE example.com
 
## Generate the DS record to publish at the registrar
dnssec-dsfromkey -2 Kexample.com.+013+12345.key
 
## Verify the chain after publishing the new DS
dig example.com DNSKEY +dnssec +short
dig example.com DS @a.gtld-servers.net. +short
delv @1.1.1.1 example.com A +rtrace   # validates end-to-end
 
## Watch for SERVFAIL during the rollover window
dig @1.1.1.1 example.com +dnssec | grep -E 'status|flags'

Algorithm choice matters: prefer ECDSA P-256 (algorithm 13) over RSA-2048 — smaller signatures, smaller responses, lower amplification factor. Schedule KSK rotation no more than annually; ZSK rotation every 30-90 days is reasonable. Most teams should let the DNS provider (Route 53, Cloudflare, NS1) handle this automatically; only roll your own DNSSEC key management if you have a compliance requirement that forbids the provider holding key material. [Terraform Docs]

Frequently Asked Questions

Why can't I CNAME at apex?

The apex domain (example.com) must have SOA and NS records. A CNAME cannot coexist with other record types on the same name. Use ALIAS/ANAME records (Cloudflare CNAME flattening, Route 53 ALIAS, DNSimple ALIAS) or fall back to A records pointing to a stable IP.

How do I know if my DNS change has propagated?

Use dig api.example.com to see the TTL remaining on cached records. Lower TTLs propagate faster, but some ISP resolvers ignore low TTLs and cache for 5-30 minutes anyway. Verify across multiple public resolvers: dig @8.8.8.8, dig @1.1.1.1, dig @9.9.9.9.

What happens if I exceed the SPF 10-lookup limit?

SPF evaluation returns permerror and receivers may reject or defer your email entirely. Count lookups with dig TXT _spf.google.com to see nested includes. Consolidate include directives and use ip4:/ip6: for static IPs (which don't count as lookups).

Can I change my DNS provider without breaking anything?

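Yes, but verify CAA records after the migration. CAA lists certificate authorities, not DNS providers: if the migrated records omit the CA that issues your certificates (including any CA the new provider uses for managed certificates), renewal will fail, often unnoticed until a certificate expires. Always check: dig example.com CAA +short.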

Why should health checks check the database?

Health checks should verify real dependencies. If your healthz endpoint always returns 200 but your database is down, DNS will keep returning the IP and traffic still fails. Check database connectivity with a timeout; set failure threshold high (3 failures) to survive transient issues.

Engineering Team

A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.