Kubernetes Networking Deep Dive: From Pods to Production
Key Takeaways
- VXLAN encapsulation adds a 50-byte header. Without MTU tuning, packets are silently fragmented, adding 5–15ms of latency that is invisible to APM; one platform team spent a week chasing phantom slowness
- iptables mode at 5,000+ services creates 50,000+ rules. Each packet traverses the chain, so lookup latency is O(n); switch to IPVS (hash table, O(1)) or migrate to Cilium eBPF entirely
- CoreDNS `ndots:5` causes up to 5 failed DNS queries per external hostname lookup. Set `ndots:1` or use node-local DNS caching to eliminate this overhead at scale
- Cilium eBPF replaces kube-proxy, eliminating the iptables bottleneck: service resolution drops from 40µs to <1µs. It requires kernel 4.19+ and brings higher debugging complexity
- Choose CNI by infrastructure fit: VXLAN (anywhere, zero setup), BGP (bare metal, <1ms), eBPF (high scale, zero-trust). Don't migrate until you measure real latency problems
The Invisible Latency
This is the classic VXLAN MTU fragmentation incident. A platform team migrates a service to Kubernetes and notices it is measurably slower than the same code running on the previous VM environment. CPU is fine, memory is fine, the application code is identical. A network trace eventually shows pod-to-pod requests being fragmented. The VXLAN overlay tunnel their CNI plugin used has a default MTU of 1500 bytes, the same as the underlying network interface. But VXLAN [Kubernetes docs] adds a 50-byte header, so the effective MTU for encapsulated pod traffic is 1450 bytes, and nobody set --mtu=1450 in the CNI config. Every packet exceeding 1450 bytes is silently fragmented and reassembled by the kernel, invisible to every APM tool they had. We've seen this exact bug surface on multiple production migrations.
This is the reality of Kubernetes networking: it works most of the time, breaks in ways that are difficult to observe, and requires understanding the actual implementation — not the API — to debug.
Kubernetes networking [Kubernetes docs] happens in four stacked layers: pod-to-pod routing (CNI), service load balancing (kube-proxy), DNS resolution (CoreDNS), and ingress. Most production failures hide in layers 2 and 3. Understanding MTU offsets, iptables vs IPVS tradeoffs, and the ndots DNS tax will resolve most of what you encounter.
- CNI choice drives latency: VXLAN is safe but requires MTU tuning (1450 bytes). Cilium eBPF eliminates kube-proxy overhead but needs kernel 4.19+.
- kube-proxy at scale: Switch from iptables to IPVS at 5,000+ services or migrate to Cilium entirely.
- CoreDNS is a bottleneck: Set `ndots:1` in pod specs to eliminate the 5-query DNS tax. Monitor CoreDNS CPU and scale horizontally.
graph LR
subgraph Node1 ["Node A"]
P1["Pod A<br/>10.244.1.5"] -->|veth pair| BR1["cbr0 bridge"]
end
subgraph Node2 ["Node B"]
BR2["cbr0 bridge"] -->|veth pair| P2["Pod B<br/>10.244.2.8"]
end
BR1 -->|"CNI overlay<br/>(VXLAN/eBPF)"| BR2
P1 -.->|"via Service ClusterIP"| KP["kube-proxy<br/>iptables/IPVS"]
KP -->|"DNAT to pod IP"| P2
The Quick Start: Three CNI Approaches
Kubernetes imposes three networking constraints: every pod gets its own IP, pods on any node can reach pods on any other node without NAT, and node agents can reach all pods on that node. How those constraints are implemented is delegated to a CNI (Container Network Interface) plugin.
| CNI Type | Mechanism | Latency | Setup Complexity | Common Use |
|---|---|---|---|---|
| VXLAN Overlay (Flannel, Calico) | Encapsulates packets in UDP tunnels | +5-15ms when MTU is left untuned (fragmentation) | Low: works anywhere | Cloud, dev, on-prem |
| BGP Underlay (Calico native) | Routes directly via BGP announcements | <1ms (zero encapsulation) | Medium: needs a BGP-capable network | Bare metal, VPC peering |
| eBPF (Cilium) | Socket-level redirection in the kernel | <1µs service resolution (kube-proxy replaced) | Medium: kernel 4.19+ required | High scale, zero-trust |
CNI Plugins: Three Approaches to the Same Problem
VXLAN Overlay (Flannel, Calico in VXLAN mode) [CNI spec]
VXLAN encapsulates Ethernet frames inside UDP packets. The problem: the 50-byte VXLAN header reduces effective MTU from 1500 to 1450 bytes. Miss this tuning and every oversized packet gets silently fragmented by the kernel, adding 5-15ms per request.
## Check VXLAN MTU is set correctly
ip link show flannel.1 | grep mtu
## Should show: mtu 1450 (not 1500)
## Inspect the MTU configured for Calico VXLAN
kubectl get cm -n kube-system calico-config -o yaml | grep mtu
When to use: Cloud VPCs, on-prem, air-gapped environments. Works everywhere; no special network config required.
When it fails: Forget the MTU offset and you get invisible 10ms+ latency on database queries.
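The 50-byte figure is the sum of the headers VXLAN wraps around each inner packet when the outer transport is IPv4. A minimal sketch of that arithmetic follows; the function names are illustrative, and the 9000-byte jumbo-frame MTU is only an assumed example of a non-default link.

```python
# Header sizes in bytes for VXLAN carried over IPv4 (protocol constants).
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14

VXLAN_OVERHEAD = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET


def pod_mtu(node_mtu: int) -> int:
    """Largest inner IP packet that still fits in one outer packet."""
    return node_mtu - VXLAN_OVERHEAD


def fits_without_fragmentation(inner_packet_size: int, node_mtu: int = 1500) -> bool:
    """True if the encapsulated packet does not exceed the node link MTU."""
    return inner_packet_size + VXLAN_OVERHEAD <= node_mtu


print(VXLAN_OVERHEAD)                    # 50
print(pod_mtu(1500))                     # 1450
print(pod_mtu(9000))                     # 8950 (assumed jumbo-frame link)
print(fits_without_fragmentation(1500))  # False: pod MTU left at 1500
print(fits_without_fragmentation(1450))  # True: pod MTU tuned to 1450
```

An outer IPv6 transport has a 40-byte header, so the same calculation would yield a larger offset there.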
BGP Underlay (Calico in BGP mode)
Instead of tunneling, Calico uses BGP to advertise pod CIDR routes directly to the network. Routers install these as L3 routes — zero encapsulation, zero MTU concerns.
## Verify Calico BGP peers are established
kubectl exec -it -n kube-system \
  $(kubectl get pods -n kube-system -l k8s-app=calico-node -o name | head -1) \
  -- calicoctl node status
When to use: Bare-metal clusters or clouds that support VPC peering (AWS). Absolute minimum latency (<1ms).
When it fails: Your network doesn't speak BGP or you're in a subnet-constrained corporate network.
eBPF (Cilium)
eBPF programs run inside the kernel and short-circuit the iptables path entirely. Direct socket-level redirection replaces kube-proxy.
In large clusters (thousands of nodes), iptables rule counts can exceed one million entries, causing measurable latency during chain traversal. Migrating to Cilium eliminates kube-proxy entirely: eBPF socket-level redirection bypasses iptables, and operators report service resolution latency dropping from double-digit microseconds to sub-microsecond.
## Verify Cilium eBPF is active
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep "KubeProxyReplacement"
## Should show: KubeProxyReplacement: True
When to use: Scale >1,000 nodes, strict zero-trust security (L7 policies), minimum latency required.
When it fails: Running Linux kernel <4.19 or unable to absorb higher debugging complexity.
kube-proxy: Service Load Balancing
The packet path through Kubernetes networking [Kubernetes docs]; every layer adds CPU + latency:
graph TD
Client[Client pod<br/>10.244.1.5] -->|GET service:8080| LookupDNS[CoreDNS lookup<br/>service.ns.svc.cluster.local]
LookupDNS -->|ClusterIP<br/>10.96.42.10| KubeProxy[kube-proxy node-local<br/>iptables DNAT or IPVS]
KubeProxy -->|select random endpoint<br/>10.244.2.7| CNI[CNI dataplane<br/>VXLAN / BGP / eBPF]
CNI -->|encapsulated<br/>or BGP-routed| TargetPod[Target pod<br/>10.244.2.7]
TargetPod -->|response| Client
KubeProxy -.->|at 5000 services<br/>iptables = 50000 rules<br/>O of n| Slow[CPU + latency<br/>per packet]
KubeProxy -.->|switch to IPVS:<br/>hash table O of 1| Fast[Constant time<br/>at any service count]
style Slow fill:#fdd
style Fast fill:#dfd
style CNI fill:#ffd
Three layers, three failure modes: DNS resolution (ndots tax), kube-proxy (rule explosion at scale), CNI (MTU, encapsulation overhead).
When you create a Service, kube-proxy programs rules on every node to DNAT traffic from the service's ClusterIP to a randomly selected pod IP.
iptables mode (default): Rule traversal is O(n). At 5,000+ services, you have ~50,000 iptables rules. Every packet traverses the chain, causing measurable latency and CPU overhead.
## See the rule explosion
iptables-save | grep "KUBE-SVC-" | wc -l
## > 50,000 lines = time to switch to IPVS
IPVS mode (recommended at scale): Hash table lookup is O(1), regardless of service count.
## Enable IPVS mode
kubectl edit configmap kube-proxy -n kube-system
## Set: mode: "ipvs"
## Set: ipvs.scheduler: "lc" # least-connection (often a better fit than the default round-robin)
## Restart kube-proxy so it picks up the new mode (assumes it runs as a DaemonSet named kube-proxy)
kubectl rollout restart daemonset kube-proxy -n kube-system
## Verify IPVS rules
ipvsadm -Ln | grep -A3 "10.96.0.1"
## TCP 10.96.0.1:443 lc
## -> 10.0.1.10:6443 Masq 1 0 0
## -> 10.0.1.11:6443 Masq 1 0 0
Better choice at scale: Migrate to Cilium's eBPF-based service routing, which eliminates kube-proxy entirely and reduces service resolution latency from 40µs to <1µs.
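To make the O(n) versus O(1) distinction concrete, here is a rough Python model, not a reproduction of how netfilter or IPVS are implemented. It treats the iptables chain as an ordered list that is scanned until a match and the IPVS table as a dictionary keyed by ClusterIP. The service names and the 10.96.0.0 base address are made up for illustration.

```python
import ipaddress


def build_services(count):
    """Generate (cluster_ip, service_name) pairs starting at 10.96.0.1."""
    base = int(ipaddress.IPv4Address("10.96.0.0"))
    return [(str(ipaddress.IPv4Address(base + i + 1)), f"svc-{i}") for i in range(count)]


def linear_lookup(rules, cluster_ip):
    """Scan rules in order, like a packet walking a chain.

    Returns (service_name, rules_checked).
    """
    for checked, (ip, name) in enumerate(rules, start=1):
        if ip == cluster_ip:
            return name, checked
    return None, len(rules)


def hash_lookup(table, cluster_ip):
    """Single hash-table probe, independent of how many services exist."""
    return table.get(cluster_ip)


rules = build_services(5000)
table = dict(rules)
last_ip = rules[-1][0]

print(linear_lookup(rules, last_ip))  # ('svc-4999', 5000)
print(hash_lookup(table, last_ip))    # svc-4999
```

The worst case for the linear scan grows with the number of services, while the dictionary probe stays flat, which is the property the IPVS recommendation relies on.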
CoreDNS: The DNS Tax
The default pod /etc/resolv.conf sets ndots:5 [CoreDNS docs], which means: if a hostname has fewer than 5 dots, the resolver appends each search domain in turn before trying the name as-is.
For a fully qualified internal name like redis.default.svc.cluster.local (4 dots), the resolver generates 4 queries:
redis.default.svc.cluster.local.default.svc.cluster.local → NXDOMAIN
redis.default.svc.cluster.local.svc.cluster.local → NXDOMAIN
redis.default.svc.cluster.local.cluster.local → NXDOMAIN
redis.default.svc.cluster.local → SUCCESS
At scale, this query amplification can overwhelm the CoreDNS replicas. Under load, UDP packets drop and applications hang waiting for DNS timeouts.
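The expansion above can be sketched in a few lines of Python. This is a simplified model of glibc-style search-list behaviour, not the resolver's actual code. The three search domains are an assumption: they are what a pod in the `default` namespace typically gets, and nodes with extra search domains would produce more candidates.

```python
# Assumed search list for a pod in the "default" namespace.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def candidate_queries(name, search_domains, ndots):
    """Names in the order a glibc-style resolver would try them."""
    if name.endswith("."):
        return [name]  # trailing dot: absolute name, search list skipped
    searched = [f"{name}.{domain}." for domain in search_domains]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        return [absolute] + searched  # enough dots: try as-is first
    return searched + [absolute]      # too few dots: search list first


def queries_until_absolute(name, search_domains, ndots):
    """How many candidates are tried up to and including the name as-is."""
    absolute = name if name.endswith(".") else f"{name}."
    return candidate_queries(name, search_domains, ndots).index(absolute) + 1


fqdn = "redis.default.svc.cluster.local"
print(queries_until_absolute(fqdn, SEARCH, ndots=5))        # 4
print(queries_until_absolute(fqdn, SEARCH, ndots=1))        # 1
print(queries_until_absolute(fqdn + ".", SEARCH, ndots=5))  # 1
```

The last two calls correspond to the two client-side fixes that follow: lowering ndots, or writing the name with a trailing dot.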
Solution 1: Set ndots:1 per pod
apiVersion: v1
kind: Pod
metadata:
  name: app # placeholder name
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "1"
    - name: use-vc # Use TCP to avoid truncation
  containers:
  - name: app
    image: registry.example.com/app:latest # placeholder image
Solution 2: Use FQDNs with trailing dot
// In Go: trailing dot tells resolver "skip search domains"
conn, err := pgxpool.New(ctx, "postgres://redis.default.svc.cluster.local.:5432/app")
Solution 3: Scale CoreDNS horizontally
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
## Monitor CoreDNS CPU and error rates
kubectl top pod -n kube-system -l k8s-app=kube-dns
Services and Ingress
ClusterIP (default): Virtual IP inside cluster only. kube-proxy DNAT rules route traffic to endpoints. [Kubernetes docs]
NodePort: Exposes service on all node IPs (ports 30000–32767). Useful for debugging, not production.
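LoadBalancer: Provisions an external LB (AWS NLB, GCP Network LB). Two hops: external LB → NodePort → pod. Use externalTrafficPolicy: Local to preserve the client IP and avoid the extra hop. Traffic is then delivered only to nodes that run a ready pod for the service, so every node that receives traffic needs at least one.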
apiVersion: v1
kind: Service
metadata:
  name: api-server
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: api-server
  ports:
  - port: 443
    targetPort: 8080
  externalTrafficPolicy: Local # Preserves client IP, avoids extra hop
ExternalName: CNAME alias to external service. No proxying, just DNS.
apiVersion: v1
kind: Service
metadata:
  name: legacy-db
spec:
  type: ExternalName
  externalName: postgres-legacy.us-east-1.rds.amazonaws.com
Ingress routes external HTTP/HTTPS traffic to services. Standard Ingress only supports host+path routing; advanced features require controller-specific annotations.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls-cert
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: api-v1
            port:
              number: 80
Network Policies: Zero-Trust Segmentation
By default, all pods can communicate with all others across all namespaces. Network Policies enforce zero-trust: default-deny, then explicitly allow.
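As a rough mental model of "default-deny, then explicitly allow", the Python sketch below evaluates ingress for a single namespace. It is deliberately simplified and is not how any CNI implements enforcement: it assumes all policies live in the destination pod's namespace, that every peer specifies both a pod selector and a namespace selector, and that ports are plain TCP numbers. The policy contents mirror the payments manifests in this section.

```python
def labels_match(selector, labels):
    """matchLabels semantics: every selector pair must be present; {} matches all."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policies, dst_labels, src_labels, src_ns_labels, port):
    """Decide whether traffic to a pod is allowed under additive policies."""
    selecting = [p for p in policies if labels_match(p["podSelector"], dst_labels)]
    if not selecting:
        return True  # no policy selects the pod, so nothing is restricted
    for policy in selecting:
        for rule in policy["ingress"]:
            # Both selectors in one peer entry must match (logical AND).
            peer_ok = any(
                labels_match(peer["podSelector"], src_labels)
                and labels_match(peer["namespaceSelector"], src_ns_labels)
                for peer in rule["from"]
            )
            if peer_ok and port in rule["ports"]:
                return True
    return False  # selected by a policy, but no rule allowed the traffic


policies = [
    {"podSelector": {}, "ingress": []},  # default-deny-all
    {
        "podSelector": {"app": "payment-service"},  # payments-ingress
        "ingress": [{
            "from": [{
                "podSelector": {"app": "checkout-service"},
                "namespaceSelector": {"kubernetes.io/metadata.name": "checkout"},
            }],
            "ports": [8080],
        }],
    },
]

payment = {"app": "payment-service"}
checkout = {"app": "checkout-service"}
in_checkout_ns = {"kubernetes.io/metadata.name": "checkout"}
in_default_ns = {"kubernetes.io/metadata.name": "default"}

print(ingress_allowed(policies, payment, checkout, in_checkout_ns, 8080))  # True
print(ingress_allowed(policies, payment, checkout, in_default_ns, 8080))   # False
print(ingress_allowed([], payment, checkout, in_default_ns, 8080))         # True
```

The last call shows why the deny-all policy matters: with no policy selecting the pod, everything is allowed.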
## Step 1: Deny all ingress and egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
## Step 2: Allow payments to accept only from checkout on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payments-ingress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: checkout-service
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: checkout
    ports:
    - protocol: TCP
      port: 8080
---
## Step 3: Allow payments to reach only PostgreSQL and DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payments-egress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
  - Egress
  egress:
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP # DNS falls back to TCP for large responses
      port: 53
  - to:
    - podSelector:
        matchLabels:
          app: postgres
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: data
    ports:
    - protocol: TCP
      port: 5432
Critical: Network Policies are enforced by the CNI plugin, not by Kubernetes itself. Flannel on its own does not enforce them, so they are silently ignored. Always test:
## Create deny-all and verify it actually blocks traffic
kubectl apply -f deny-all-policy.yaml
kubectl run test-pod --rm -it --image=alpine -- sh
## Inside test-pod:
wget -qO- --timeout=3 http://payment-service.payments:8080/health
## Should fail if enforcement is working
Debugging Production Issues
See the Kubernetes Commands Cheat Sheet for a complete reference.
DNS failures: CoreDNS unhealthy, or ndots causing query floods.
## Check CoreDNS health
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns | tail -20
## Test DNS from a pod
kubectl exec -it pod/api-server -- nslookup redis.default.svc.cluster.local
## Check NXDOMAIN error rate
kubectl exec -n kube-system $(kubectl get pods -n kube-system -l k8s-app=kube-dns -o name | head -1) \
  -- wget -qO- http://localhost:9153/metrics | grep 'rcode="NXDOMAIN"'
Service routing failures: Endpoints empty or kube-proxy not programming rules.
## Check if service has healthy endpoints
kubectl get endpoints api-server -o yaml
## View iptables rules for a service
SERVICE_IP=$(kubectl get svc api-server -o jsonpath='{.spec.clusterIP}')
iptables-save | grep "$SERVICE_IP"
## For IPVS, check virtual servers
ipvsadm -Ln | grep -A5 "$SERVICE_IP"
Pod connectivity failures: CNI IP exhaustion or policy enforcement.
## Debug pod stuck in ContainerCreating
kubectl describe pod <name>
## Look for: "NetworkPlugin cni failed to set up pod network"
## Check CNI plugin logs
kubectl logs -n kube-system ds/cilium | grep -i error
kubectl logs -n kube-system -l k8s-app=calico-node -c calico-node | tail -50
## For AWS VPC CNI: check IP pool exhaustion
kubectl describe node <name> | grep -A5 "vpc.amazonaws.com/eni-max-pods"
Network latency: Use netshoot debug container for packet-level analysis.
kubectl debug -it --image=nicolaka/netshoot --target=payment-service \
  pod/payment-service-xyz
## Inside netshoot:
tcpdump -i eth0 -n host 10.244.2.3 and port 5432
dig +trace redis.default.svc.cluster.local
iptables-save | grep KUBE-SVC
Production Checklist
Before shipping any Kubernetes workload to production, verify these networking fundamentals:
- Pod IP allocation: Is your VPC subnet sized for 3x peak pod count? For AWS VPC CNI, enable prefix delegation (`ENABLE_PREFIX_DELEGATION=true`) to batch IP allocation.
- MTU offset: If using VXLAN (Flannel, Calico overlay), explicitly set pod MTU to 1450. Verify: `ip link show | grep mtu`.
- kube-proxy mode: At 5,000+ services, switch from iptables to IPVS or migrate to Cilium eBPF.
- CoreDNS scale: Set pod `dnsConfig.ndots: 1` to eliminate DNS query floods. Scale CoreDNS HPA to at least 2 replicas with CPU-based autoscaling.
- Network policies: Test that deny-all policies actually block traffic. Flannel doesn't enforce them by default.
- Service traffic policy: Use `externalTrafficPolicy: Local` on LoadBalancer/NodePort services to preserve client IP and avoid extra hop (requires one pod per node).
- DNS endpoints: Always use FQDN with trailing dot for external hostnames, or configure `ndots:1` in pod specs. [Kubernetes docs]
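The first checklist item is simple arithmetic, sketched here with Python's standard `ipaddress` module. The function name, the example CIDRs, and the peak of 300 pods are hypothetical; the default of 5 reserved addresses reflects what AWS holds back per subnet and should be adjusted for other environments. Nodes and other resources also consume addresses, so treat the result as an upper bound.

```python
import ipaddress


def subnet_has_headroom(cidr, peak_pods, headroom=3, reserved=5):
    """True if the subnet can hold `headroom` times the peak pod count."""
    usable = ipaddress.ip_network(cidr).num_addresses - reserved
    return usable >= peak_pods * headroom


print(subnet_has_headroom("10.0.0.0/24", peak_pods=300))  # False: 251 < 900
print(subnet_has_headroom("10.0.0.0/22", peak_pods=300))  # True: 1019 >= 900
```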
Why These Four Layers Matter
Kubernetes networking is four stacked problems:
- Pod networking (CNI): Packets from pod to pod across nodes.
- Service routing (kube-proxy/eBPF): ClusterIP to pod IP mapping and load balancing.
- Name resolution (CoreDNS): "redis.default" → 10.96.42.1.
- Ingress: External traffic into the cluster.
Most production failures hide in layers 2 and 3. The team from the opening discovered their VXLAN MTU misconfiguration via packet trace, patched the CNI config to set MTU 1450, and their checkout service returned to baseline latency within minutes.
Eight milliseconds. Invisible to every monitoring tool except a raw packet trace. That's why understanding the implementation — not just the API — is critical.
Cilium eBPF in Production
Replacing kube-proxy with Cilium is the most consequential networking decision a large cluster makes. The headline win is service resolution latency, but the operational benefits compound: no iptables rule explosion, native L7 policy enforcement, identity-based security, and Hubble flow observability without sidecars. Verify the replacement is actually engaged — running Cilium alongside kube-proxy is a common misconfiguration that gives you the worst of both worlds.
## Confirm full kube-proxy replacement and key datapath features
kubectl -n kube-system exec ds/cilium -- cilium status --verbose \
  | grep -E "KubeProxyReplacement|Host Routing|Masquerading|BPF"
## Should report:
## KubeProxyReplacement: True [eth0 10.0.1.5 (Direct Routing)]
## Host Routing: BPF
## Masquerading: BPF [eth0] 10.244.0.0/16 [IPv4]
The metrics that prove it worked: p99 service-call latency drops by the iptables traversal cost (typically 30–80µs at 5,000 services), node_netfilter_conntrack_count flattens, and CPU consumed by kube-proxy and iptables-restore disappears entirely from node profiles. Watch cilium_bpf_map_pressure for any map approaching capacity (> 0.9 means raise the limits before the datapath starts dropping). Pair Cilium with Hubble to get per-flow visibility: hubble observe --verdict DROPPED replaces three hours of tcpdump archaeology when a NetworkPolicy denies traffic you didn't expect.
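The 0.9 threshold on `cilium_bpf_map_pressure` is easy to check mechanically against a Prometheus-format scrape. The sketch below is one way to do it in plain Python; the sample text, its values, and the `map_name` label are illustrative assumptions rather than captured output, and in practice an alerting rule in your metrics stack would do the same job.

```python
import re

# Hypothetical scrape excerpt; values are made up for illustration.
SAMPLE = """\
# TYPE cilium_bpf_map_pressure gauge
cilium_bpf_map_pressure{map_name="cilium_lb4_services_v2"} 0.12
cilium_bpf_map_pressure{map_name="cilium_ct4_global"} 0.93
cilium_bpf_map_pressure{map_name="cilium_policy_01234"} 0.40
"""

LINE = re.compile(r'^cilium_bpf_map_pressure\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')


def maps_over_threshold(metrics_text, threshold=0.9):
    """Return (labels, value) for every map whose pressure exceeds the threshold."""
    hot = []
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # comments and unrelated metrics are skipped
        value = float(match.group("value"))
        if value > threshold:
            hot.append((match.group("labels"), value))
    return hot


print(maps_over_threshold(SAMPLE))
# [('map_name="cilium_ct4_global"', 0.93)]
```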
Multi-Cluster Networking
Once you outgrow a single cluster — for blast-radius isolation, regional locality, or tenant separation — you need a story for cross-cluster service discovery. There is no single right answer; pick by the failure mode you most want to avoid.
- Submariner tunnels pod and service CIDRs between clusters via IPsec or WireGuard. Simplest to bolt onto existing clusters with non-overlapping CIDRs; the tunnel gateway is a throughput and failure pinch point.
- Cilium Cluster Mesh federates clusters at the eBPF layer, sharing service endpoints natively without an overlay. Lowest latency and the cleanest policy story, but requires Cilium on every cluster and stable, routable pod CIDRs between them.
- Istio multi-cluster layers cross-cluster service discovery on top of mTLS. The richest L7 traffic-management story (locality-aware failover, traffic shifting), at the cost of running the full mesh control plane and east-west gateways.
The operational tax is real: certificate rotation, asymmetric routing, and split-brain DNS show up first. Start with one shared service (auth, payments) before federating the entire catalogue.
Debugging "Pod Can't Reach Service"
When a pod fails to reach a service, walk the layers in order. Skipping a step costs hours.
1. DNS resolves. `kubectl exec pod -- nslookup api.default.svc.cluster.local` returns a ClusterIP. If `NXDOMAIN`, check CoreDNS pods, the pod's `/etc/resolv.conf`, and any NetworkPolicy blocking UDP 53 to `kube-system`.
2. Service has endpoints. `kubectl get endpoints api -o wide` lists at least one ready pod IP. An empty list means the selector doesn't match running pods or readiness probes are failing.
3. kube-proxy programmed the rules. On the source node, confirm DNAT rules exist for the ClusterIP, or that Cilium has loaded the service map.
4. Target pod is reachable. `kubectl exec pod -- nc -vz <pod-ip> <port>` succeeds when run from the source pod's network namespace. Failure here points at CNI, MTU, or NetworkPolicy.
5. NetworkPolicy isn't denying it. Test the path with the policy temporarily relaxed; with Cilium, `hubble observe --to-pod default/api --verdict DROPPED` names the rule.
## A tight loop that walks the tree on a real cluster
SRC=$(kubectl get pod -l app=checkout -o jsonpath='{.items[0].metadata.name}')
SVC=api
kubectl exec "$SRC" -- nslookup "$SVC".default.svc.cluster.local
kubectl get endpoints "$SVC" -o wide
SVC_IP=$(kubectl get svc "$SVC" -o jsonpath='{.spec.clusterIP}')
kubectl exec "$SRC" -- nc -vz "$SVC_IP" 80
kubectl exec "$SRC" -- wget -qO- --timeout=3 "http://$SVC/healthz"
In our experience, ~70% of "service unreachable" tickets stop at step 2 (no ready endpoints) or step 5 (a default-deny policy nobody remembered).
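The point of walking the layers in order is to stop at the first one that fails. A minimal Python sketch of that control flow follows. The checks are stubbed lambdas standing in for the real commands (a real version might wrap `kubectl` calls), and the function name is hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Each step is (label, check); a check returns True when the layer is healthy.
Step = Tuple[str, Callable[[], bool]]


def first_failing_step(steps: List[Step]) -> Optional[Tuple[int, str]]:
    """Run checks in order and stop at the first failure.

    Returns (step_number, label), or None when every check passes.
    """
    for number, (label, check) in enumerate(steps, start=1):
        if not check():
            return number, label
    return None


# Stubbed results standing in for real command output.
steps: List[Step] = [
    ("DNS resolves", lambda: True),
    ("Service has endpoints", lambda: False),
    ("kube-proxy programmed the rules", lambda: True),
    ("Target pod is reachable", lambda: True),
    ("NetworkPolicy isn't denying it", lambda: True),
]

print(first_failing_step(steps))  # (2, 'Service has endpoints')
```

Because later checks never run after a failure, a broken lower layer cannot produce misleading results higher up.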
Frequently Asked Questions
What is a CNI plugin in Kubernetes?
A CNI (Container Network Interface) plugin implements the Kubernetes networking model — every pod gets its own IP, and pods communicate across nodes without NAT. Common plugins: Flannel (VXLAN), Calico (BGP or VXLAN), Cilium (eBPF).
What causes mystery latency in pod-to-pod networking?
VXLAN overlays add a 50-byte header per packet. If the node MTU is 1500 but CNI MTU is not set to 1450, packets over 1450 bytes are silently fragmented by the kernel, adding 5-15ms latency invisible to application monitoring.
How does kube-proxy route traffic to pods?
kube-proxy watches the API for Service and Endpoint changes, then programs iptables rules (or IPVS entries) on each node. When traffic hits a Service ClusterIP, these rules DNAT the packet to a randomly selected pod IP for load balancing.
What is the CoreDNS ndots tax?
The default ndots:5 setting causes up to 5 failed DNS queries for every external hostname lookup. Set ndots:1 or use node-local DNS caching to eliminate this overhead at scale.
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.