Kubernetes Networking Deep Dive: From Pods to Production

BackendBytes Engineering Team

Jan 8, 2025

Part of Series: Kubernetes in Production

Lesson 1 of 4

→VXLAN encapsulation adds a 50-byte header — missing MTU tuning causes silent packet fragmentation adding 5–15ms latency invisible to APM; a platform team spent a week chasing phantom slowness
→iptables mode at 5,000+ services creates 50,000+ rules — each packet traverses the chain, O(n) latency on lookup; switch to IPVS hash table (O(1)) or migrate to Cilium eBPF entirely
→CoreDNS `ndots:5` causes up to 5 failed DNS queries per external hostname lookup — set `ndots:1` or use node-local DNS caching to eliminate this overhead at scale
→Cilium eBPF replaces kube-proxy, eliminating iptables bottleneck — service resolution drops from 40µs to <1µs; requires kernel 4.19+ and higher debugging complexity
→Choose CNI by infrastructure fit: VXLAN (anywhere, zero setup), BGP (bare metal, <1ms), eBPF (high scale, zero-trust) — don't migrate until you measure real latency problems

The Invisible Latency

Same binary, same CPU budget, same memory ceiling, and it ran slower the moment it landed on Kubernetes. A platform team migrated a service to Kubernetes and noticed it was measurably slower than the same code on the previous VM environment. CPU was fine, memory was fine, and the application code was identical. A network trace eventually showed pod-to-pod requests being fragmented. The VXLAN overlay interface their CNI plugin created had been left at an MTU of 1500 bytes, the same as the underlying network interface. But VXLAN^{[Kubernetes docs]} adds a 50-byte header, so the effective MTU for encapsulated pod traffic is 1450 bytes, and nobody had set --mtu=1450 in the CNI config. Every packet larger than 1450 bytes was silently fragmented and reassembled by the kernel, invisible to every APM tool they had. We've seen this exact bug surface on multiple production migrations.

This is the reality of Kubernetes networking: it works most of the time, breaks in ways that are difficult to observe, and requires understanding the actual implementation — not the API — to debug.

TL;DR

Kubernetes networking^{[Kubernetes docs]} happens in four stacked layers: pod-to-pod routing (CNI), service load balancing (kube-proxy), DNS resolution (CoreDNS), and ingress. Most production failures hide in layers 2 and 3. Understanding MTU offsets, iptables vs IPVS tradeoffs, and the ndots DNS tax will solve 90% of what you encounter.

CNI choice drives latency: VXLAN is safe but requires MTU tuning (1450 bytes). Cilium eBPF eliminates kube-proxy overhead but needs kernel 4.19+.
kube-proxy at scale: Switch from iptables to IPVS at 5,000+ services or migrate to Cilium entirely.
CoreDNS is a bottleneck: Set ndots:1 in pod specs to eliminate the 5-query DNS tax. Monitor CoreDNS CPU and scale horizontally.

graph LR
    subgraph Node1 ["Node A"]
        P1["Pod A<br/>10.244.1.5"] -->|veth pair| BR1["cbr0 bridge"]
    end
    subgraph Node2 ["Node B"]
        BR2["cbr0 bridge"] -->|veth pair| P2["Pod B<br/>10.244.2.8"]
    end

    BR1 -->|"CNI overlay<br/>(VXLAN/eBPF)"| BR2

    P1 -.->|"via Service ClusterIP"| KP["kube-proxy<br/>iptables/IPVS"]
    KP -->|"DNAT to pod IP"| P2

The Quick Start: Three CNI Approaches

Kubernetes imposes three networking constraints: every pod gets its own IP, pods on any node can reach pods on any other node without NAT, and node agents can reach all pods on that node. How those constraints are implemented is delegated to a CNI (Container Network Interface) plugin.

CNI Type	Mechanism	Latency Impact	Setup Complexity	Common Use
VXLAN Overlay (Flannel, Calico)	Encapsulates packets in UDP tunnels	+5-15ms if MTU is mistuned (fragmentation)	Low, works anywhere	Cloud, dev, on-prem
BGP Underlay (Calico native)	Routes directly via BGP announcements	`<1ms` (zero encapsulation)	Medium, needs BGP network	Bare metal, VPC peering
eBPF (Cilium)	Socket-level redirection in the kernel	`<1µs` service resolution (kube-proxy replaced)	Medium, kernel 4.19+ required	High scale, zero-trust

CNI Plugins: Three Approaches to the Same Problem

^{[CNI spec]}

VXLAN Overlay (Flannel, Calico in VXLAN mode)

VXLAN encapsulates Ethernet frames inside UDP packets. The problem: the 50-byte VXLAN header reduces effective MTU from 1500 to 1450 bytes. Miss this tuning and every oversized packet gets silently fragmented by the kernel, adding 5-15ms per request.

## Check VXLAN MTU is set correctly
ip link show flannel.1 | grep mtu
## Should show: mtu 1450 (not 1500)
 
## Check the MTU configured for Calico VXLAN
kubectl get cm -n kube-system calico-config -o yaml | grep mtu

When to use: Cloud VPCs, on-prem, air-gapped environments. Works everywhere; no special network config required.

When it fails: Forget the MTU offset and you get invisible 10ms+ latency on database queries.
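The 50-byte figure is the sum of the headers VXLAN wraps around each inner frame on an IPv4 underlay. The sketch below is only an arithmetic model (the function names are made up for illustration, and it assumes an IPv4 underlay with no TCP options); it shows where 1450 comes from and how much TCP payload is left per segment.

```python
# Bytes added when an inner Ethernet frame is wrapped in VXLAN over IPv4.
VXLAN_OVERHEAD = {
    "inner_ethernet": 14,
    "vxlan_header": 8,
    "outer_udp": 8,
    "outer_ipv4": 20,
}


def overlay_mtu(underlay_mtu: int) -> int:
    """Largest pod-side IP packet that still fits in one underlay packet."""
    return underlay_mtu - sum(VXLAN_OVERHEAD.values())


def tcp_mss(mtu: int) -> int:
    """TCP payload per segment: MTU minus IPv4 (20) and TCP (20) headers."""
    return mtu - 40


def fits_without_fragmentation(ip_packet_len: int, underlay_mtu: int) -> bool:
    """True if a pod packet of this size survives encapsulation in one piece."""
    return ip_packet_len <= overlay_mtu(underlay_mtu)


if __name__ == "__main__":
    print(overlay_mtu(1500))                       # overlay MTU on a 1500-byte underlay
    print(tcp_mss(overlay_mtu(1500)))              # TCP payload per segment
    print(fits_without_fragmentation(1500, 1500))  # a full-size pod packet
```

On a 1500-byte underlay this gives an overlay MTU of 1450 and a TCP MSS of 1410, and a pod that still sends 1500-byte packets does not fit. An IPv6 underlay has a 40-byte outer header, so the offset would be larger there.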

BGP Underlay (Calico in BGP mode)

Instead of tunneling, Calico uses BGP to advertise pod CIDR routes directly to the network. Routers install these as L3 routes — zero encapsulation, zero MTU concerns.

## Verify Calico BGP peers are established
kubectl exec -it -n kube-system \
  $(kubectl get pods -n kube-system -l k8s-app=calico-node -o name | head -1) \
  -- calicoctl node status

When to use: Bare-metal clusters or clouds that support VPC peering (AWS). Absolute minimum latency (<1ms).

When it fails: Your network doesn't speak BGP or you're in a subnet-constrained corporate network.

eBPF (Cilium)

Cilium attaches eBPF programs inside the kernel that short-circuit the iptables and netfilter path for service traffic. Direct socket-level redirection replaces kube-proxy.

In very large clusters (thousands of nodes), iptables rule counts can exceed one million entries, causing measurable latency during chain traversal. Migrating to Cilium eliminates kube-proxy entirely: eBPF socket-level redirection bypasses iptables, and operators report service resolution latency dropping from double-digit microseconds to sub-microsecond.

## Verify Cilium eBPF is active
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep "KubeProxyReplacement"
## Should show: KubeProxyReplacement: True

When to use: Scale >1,000 nodes, strict zero-trust security (L7 policies), minimum latency required.

When it fails: Running Linux kernel <4.19 or unable to absorb higher debugging complexity.

kube-proxy: Service Load Balancing

^{[Kubernetes docs]}

The packet path through Kubernetes networking — every layer adds CPU + latency:

graph TD
    Client[Client pod<br/>10.244.1.5] -->|GET service:8080| LookupDNS[CoreDNS lookup<br/>service.ns.svc.cluster.local]
    LookupDNS -->|ClusterIP<br/>10.96.42.10| KubeProxy[kube-proxy node-local<br/>iptables DNAT or IPVS]
    KubeProxy -->|select random endpoint<br/>10.244.2.7| CNI[CNI dataplane<br/>VXLAN / BGP / eBPF]
    CNI -->|encapsulated<br/>or BGP-routed| TargetPod[Target pod<br/>10.244.2.7]
    TargetPod -->|response| Client
    KubeProxy -.->|at 5000 services<br/>iptables = 50000 rules<br/>O of n| Slow[CPU + latency<br/>per packet]
    KubeProxy -.->|switch to IPVS:<br/>hash table O of 1| Fast[Constant time<br/>at any service count]
    style Slow fill:#fdd
    style Fast fill:#dfd
    style CNI fill:#ffd

Three layers, three failure modes: DNS resolution (ndots tax), kube-proxy (rule explosion at scale), CNI (MTU, encapsulation overhead).

When you create a Service, kube-proxy programs rules on every node to DNAT traffic from the service's ClusterIP to a randomly selected pod IP.
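In iptables mode, the "random" pick is typically expressed as a cascade of rules using the statistic match: with n endpoints, rule i fires with probability 1/(n-i), and the last rule always matches. The sketch below is a model of that arithmetic, not kube-proxy's code, and the function names are illustrative. It shows why the cascade still gives every endpoint an equal share.

```python
from fractions import Fraction


def rule_probabilities(n_endpoints: int) -> list[Fraction]:
    """Probability attached to each rule in order; the last one is 1."""
    return [Fraction(1, n_endpoints - i) for i in range(n_endpoints)]


def effective_share(n_endpoints: int) -> list[Fraction]:
    """Overall chance that a new connection lands on each endpoint."""
    shares = []
    remaining = Fraction(1)  # probability that no earlier rule matched
    for p in rule_probabilities(n_endpoints):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


if __name__ == "__main__":
    print(rule_probabilities(3))
    print(effective_share(3))
```

For three endpoints the per-rule probabilities are 1/3, 1/2, and 1, yet each endpoint ends up with exactly one third of new connections. The choice is made per connection, not per packet, because conntrack pins later packets to the same backend.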

iptables mode (default): Rule traversal is O(n). At 5,000+ services, you have ~50,000 iptables rules. Every packet traverses the chain, causing measurable latency and CPU overhead.

## See the rule explosion
iptables-save | grep "KUBE-SVC-" | wc -l
## > 50,000 lines = time to switch to IPVS

IPVS mode (recommended at scale): Hash table lookup is O(1), regardless of service count.

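## Enable IPVS mode
kubectl edit configmap kube-proxy -n kube-system
## Set: mode: "ipvs"
## Set: ipvs.scheduler: "lc"  # least-connection (the default is rr, round-robin)
## Restart kube-proxy so the running pods pick up the new config
kubectl rollout restart daemonset kube-proxy -n kube-system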
 
## Verify IPVS rules
ipvsadm -Ln | grep -A3 "10.96.0.1"
## TCP  10.96.0.1:443 lc
##   -> 10.0.1.10:6443               Masq    1      0          0
##   -> 10.0.1.11:6443               Masq    1      0          0
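To make the O(n) versus O(1) difference concrete, here is a toy model: a sequential rule list stands in for an iptables chain and a dictionary stands in for the IPVS hash table. The ClusterIPs and backend names are hypothetical, and real kernels do far more per rule than a tuple comparison, so treat the counts as an illustration of the growth rate only.

```python
from typing import Optional


def make_services(count: int) -> list[tuple[str, int]]:
    """Hypothetical ClusterIPs 10.96.x.y:80, one per service."""
    return [(f"10.96.{i // 256}.{i % 256}", 80) for i in range(count)]


def linear_match(rules: list[tuple[str, int]], dst: tuple[str, int]) -> int:
    """Scan rules in order, as a sequential chain would; return comparisons made."""
    for comparisons, rule in enumerate(rules, start=1):
        if rule == dst:
            return comparisons
    return len(rules)


def hashed_match(table: dict[tuple[str, int], str], dst: tuple[str, int]) -> Optional[str]:
    """A single hash lookup, independent of how many services exist."""
    return table.get(dst)


if __name__ == "__main__":
    services = make_services(5000)
    table = {svc: f"backend-{i}" for i, svc in enumerate(services)}
    print(linear_match(services, services[-1]))  # comparisons for the last service
    print(hashed_match(table, services[-1]))     # one lookup, same answer path
```

With 5,000 services, a packet for the last service costs 5,000 comparisons in the list model and one lookup in the dictionary model.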

Better choice at scale: Migrate to Cilium's eBPF-based service routing — eliminates kube-proxy entirely and reduces service resolution latency from 40µs to <1µs.

CoreDNS: The DNS Tax

^{[CoreDNS docs]}

The default pod /etc/resolv.conf sets ndots:5, which means: if a hostname has fewer than 5 dots, the resolver appends each search domain in turn before trying the name as-is.

For a fully qualified internal name like redis.default.svc.cluster.local (4 dots), the resolver generates 4 queries:

redis.default.svc.cluster.local.default.svc.cluster.local → NXDOMAIN
redis.default.svc.cluster.local.svc.cluster.local → NXDOMAIN
redis.default.svc.cluster.local.cluster.local → NXDOMAIN
redis.default.svc.cluster.local → SUCCESS

At scale, this DNS query flood can overwhelm the CoreDNS replicas. Under load, UDP packets drop and applications hang waiting for DNS timeouts.
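The query expansion above follows a simple ordering rule, sketched here as a simplified model of a glibc-style stub resolver. The function name is illustrative, the search list is the one a pod in the default namespace would typically get, and other resolvers (musl, for example) behave differently.

```python
def candidate_queries(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Names the resolver would try, in order, until one succeeds."""
    if name.endswith("."):
        return [name]  # trailing dot: absolute name, search list skipped
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]      # too few dots: walk the search list first


# Assumed search list for a pod in the "default" namespace.
POD_SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

if __name__ == "__main__":
    for query in candidate_queries("redis.default.svc.cluster.local", POD_SEARCH):
        print(query)
```

With ndots:5 the as-is name comes last, after three search-domain attempts. With ndots:1, or with a trailing dot, it comes first, so a successful lookup costs one query. Resolvers usually ask for A and AAAA records separately, which doubles the real query count.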

Solution 1: Set ndots:1 per pod

apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"
      - name: use-vc # Use TCP to avoid truncation
  containers:
    - name: app

Solution 2: Use FQDNs with trailing dot

// In Go: trailing dot tells resolver "skip search domains"
pool, err := pgxpool.New(ctx, "postgres://postgres.data.svc.cluster.local.:5432/app")

Solution 3: Scale CoreDNS horizontally

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

## Monitor CoreDNS CPU and error rates
kubectl top pod -n kube-system -l k8s-app=kube-dns
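For a sense of how the autoscaler above reacts, the core of the documented HPA calculation is desired = ceil(current replicas × current utilization / target utilization), clamped to the min and max. The sketch below is a simplified model with an illustrative function name; it ignores the tolerance band and stabilization windows the real controller applies.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Core HPA ratio, clamped to the configured replica bounds."""
    raw = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two CoreDNS replicas averaging 140% of requested CPU, target 70%.
    print(desired_replicas(2, 140, 70, min_replicas=2, max_replicas=10))
```

Two replicas at 140% against a 70% target scale to four. Utilization is measured against the CPU request, so the CoreDNS Deployment needs a CPU request set for this metric to work at all.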

Services and Ingress

^{[Kubernetes docs]}

ClusterIP (default): Virtual IP inside cluster only. kube-proxy DNAT rules route traffic to endpoints.

NodePort: Exposes service on all node IPs (ports 30000–32767). Useful for debugging, not production.

LoadBalancer: Provisions external LB (AWS NLB, GCP Network LB). Two hops: external LB → NodePort → pod. Use externalTrafficPolicy: Local to preserve client IP and avoid the extra hop. Only nodes running a ready pod for the service pass the load balancer health check, so spread replicas across nodes.

apiVersion: v1
kind: Service
metadata:
  name: api-server
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: api-server
  ports:
    - port: 443
      targetPort: 8080
  externalTrafficPolicy: Local # Preserves client IP, avoids extra hop

ExternalName: CNAME alias to external service. No proxying, just DNS.

apiVersion: v1
kind: Service
metadata:
  name: legacy-db
spec:
  type: ExternalName
  externalName: postgres-legacy.us-east-1.rds.amazonaws.com

Ingress routes external HTTP/HTTPS traffic to services. Standard Ingress only supports host+path routing; advanced features require controller-specific annotations.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls-cert
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: api-v1
                port:
                  number: 80
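The pathType: Prefix rule above matches on whole path segments, not on raw string prefixes, which is a common source of surprise. Here is a minimal sketch of that documented behaviour, with an illustrative function name; individual ingress controllers may add their own rules on top.

```python
def prefix_matches(rule_path: str, request_path: str) -> bool:
    """Element-wise prefix match on '/'-separated path segments."""
    rule_parts = [p for p in rule_path.split("/") if p]
    request_parts = [p for p in request_path.split("/") if p]
    return request_parts[: len(rule_parts)] == rule_parts


if __name__ == "__main__":
    for path in ("/v1", "/v1/", "/v1/users", "/v1beta", "/v2"):
        print(path, prefix_matches("/v1", path))
```

Under this model /v1, /v1/, and /v1/users match the /v1 rule, while /v1beta and /v2 do not. A rule path of / matches every request.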

Network Policies: Zero-Trust Segmentation

By default, all pods can communicate with all others across all namespaces. Network Policies enforce zero-trust: default-deny, then explicitly allow.

## Step 1: Deny all ingress and egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
 
---
## Step 2: Allow payments to accept only from checkout on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payments-ingress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: checkout-service
          namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: checkout
      ports:
        - protocol: TCP
          port: 8080
 
---
## Step 3: Allow payments to reach only PostgreSQL and DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payments-egress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP # DNS uses TCP for large answers and whenever use-vc is set
          port: 53
    - to:
        - podSelector:
            matchLabels:
              app: postgres
          namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: data
      ports:
        - protocol: TCP
          port: 5432
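One detail in the policies above is easy to get wrong: a single `from` or `to` entry that carries both podSelector and namespaceSelector means AND, while separate list entries mean OR. The sketch below models only that rule with matchLabels-style dictionaries. The function names are illustrative, and it deliberately does not model entries that omit namespaceSelector (which Kubernetes scopes to the policy's own namespace).

```python
def labels_match(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """matchLabels semantics: every selector pair must be present; {} matches all."""
    return all(labels.get(key) == value for key, value in selector.items())


def peer_allows(peer: dict, pod_labels: dict[str, str], ns_labels: dict[str, str]) -> bool:
    """One peer entry: both of its selectors must match (AND)."""
    return (labels_match(peer["podSelector"], pod_labels)
            and labels_match(peer["namespaceSelector"], ns_labels))


def ingress_allowed(peers: list[dict], pod_labels: dict[str, str],
                    ns_labels: dict[str, str]) -> bool:
    """Entries in the peer list are alternatives (OR)."""
    return any(peer_allows(peer, pod_labels, ns_labels) for peer in peers)


CHECKOUT_NS = {"kubernetes.io/metadata.name": "checkout"}

# Step 2 as written: one entry, both selectors.
and_form = [{"podSelector": {"app": "checkout-service"}, "namespaceSelector": CHECKOUT_NS}]

# The accidental variant: two entries, each with one wide-open selector.
or_form = [
    {"podSelector": {"app": "checkout-service"}, "namespaceSelector": {}},
    {"podSelector": {}, "namespaceSelector": CHECKOUT_NS},
]

if __name__ == "__main__":
    stranger = {"app": "batch-job"}  # hypothetical pod in the checkout namespace
    print(ingress_allowed(and_form, stranger, CHECKOUT_NS))
    print(ingress_allowed(or_form, stranger, CHECKOUT_NS))
```

The AND form rejects the unrelated pod in the checkout namespace; the OR form lets it through. In YAML the difference is a single extra dash in front of namespaceSelector.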

Critical: Network Policies are enforced by the CNI plugin, not Kubernetes. Flannel doesn't enforce them by default — they're silently ignored. Always test:

## Create deny-all and verify it actually blocks traffic
kubectl apply -f deny-all-policy.yaml
kubectl run test-pod --rm -it --image=alpine -- sh
## Inside test-pod:
wget -qO- --timeout=3 http://payment-service.payments:8080/health
## Should fail if enforcement is working

Debugging Production Issues

See the Kubernetes Commands Cheat Sheet for a complete reference.

DNS failures: CoreDNS unhealthy, or ndots causing query floods.

## Check CoreDNS health
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
 
## Test DNS from a pod
kubectl exec -it pod/api-server -- nslookup redis.default.svc.cluster.local
 
## Check NXDOMAIN error rate (the CoreDNS image ships no shell or wget, so port-forward to the metrics port)
kubectl -n kube-system port-forward deploy/coredns 9153:9153 &
sleep 2
curl -s http://localhost:9153/metrics | grep 'rcode="NXDOMAIN"'

Service routing failures: Endpoints empty or kube-proxy not programming rules.

## Check if service has healthy endpoints
kubectl get endpoints api-server -o yaml
 
## View iptables rules for a service
SERVICE_IP=$(kubectl get svc api-server -o jsonpath='{.spec.clusterIP}')
iptables-save | grep $SERVICE_IP
 
## For IPVS, check virtual servers
ipvsadm -Ln | grep -A5 $SERVICE_IP

Pod connectivity failures: CNI IP exhaustion or policy enforcement.

## Debug pod stuck in ContainerCreating
kubectl describe pod <name>
## Look for: "NetworkPlugin cni failed to set up pod network"
 
## Check CNI plugin logs
kubectl logs -n kube-system ds/cilium | grep -i error
kubectl logs -n kube-system -l k8s-app=calico-node -c calico-node --tail=50
 
## For AWS VPC CNI: check IP pool exhaustion
kubectl describe node <name> | grep -A5 "vpc.amazonaws.com/eni-max-pods"

Network latency: Use netshoot debug container for packet-level analysis.

kubectl debug -it --image=nicolaka/netshoot --target=payment-service \
  pod/payment-service-xyz
 
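## Inside netshoot (shares the pod's network namespace):
tcpdump -i eth0 -n host 10.244.2.3 and port 5432
dig +search redis.default.svc.cluster.local
## Service DNAT rules live in the node's network namespace; run this on the node, not in the pod:
## iptables-save | grep KUBE-SVC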

Production Checklist

Before shipping any Kubernetes workload to production, verify these networking fundamentals:

Pod IP allocation: Is your VPC subnet sized for 3x peak pod count? For AWS VPC CNI, enable prefix delegation (ENABLE_PREFIX_DELEGATION=true) to batch IP allocation.
MTU offset: If using VXLAN (Flannel, Calico overlay), explicitly set pod MTU to 1450. Verify: ip link show | grep mtu.
kube-proxy mode: At 5,000+ services, switch from iptables to IPVS or migrate to Cilium eBPF.
CoreDNS scale: Set ndots to 1 through the pod's dnsConfig.options to eliminate DNS query floods. Run at least 2 CoreDNS replicas with CPU-based autoscaling.
Network policies: Test that deny-all policies actually block traffic. Flannel doesn't enforce them by default.
Service traffic policy: Use externalTrafficPolicy: Local on LoadBalancer/NodePort services to preserve client IP and avoid the extra hop. Only nodes running a ready pod for the service receive traffic, so spread replicas across nodes.
DNS endpoints: Always use FQDN with trailing dot for external hostnames, or configure ndots:1 in pod specs. ^{[Kubernetes docs]}
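The subnet sizing item in the checklist can be checked with a few lines of arithmetic. The sketch below assumes an AWS-style subnet, where 5 addresses per subnet are reserved, and uses the 3x headroom from the checklist. The function names and CIDRs are illustrative, and it ignores the addresses consumed by nodes and other VPC resources.

```python
import ipaddress


def usable_ips(cidr: str, reserved: int = 5) -> int:
    """Addresses left for workloads after the provider's reserved ones."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - reserved, 0)


def subnet_fits(cidr: str, peak_pods: int, headroom: float = 3.0) -> bool:
    """True if the subnet covers peak pod count times the headroom factor."""
    return usable_ips(cidr) >= peak_pods * headroom


if __name__ == "__main__":
    print(usable_ips("10.0.0.0/24"), subnet_fits("10.0.0.0/24", peak_pods=100))
    print(usable_ips("10.0.0.0/20"), subnet_fits("10.0.0.0/20", peak_pods=1000))
```

A /24 leaves 251 usable addresses, which is not enough for 100 peak pods at 3x headroom; a /20 leaves 4,091, which covers 1,000 peak pods.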

Why These Four Layers Matter

Kubernetes networking is four stacked problems:

1. Pod networking (CNI): Packets from pod to pod across nodes.
2. Service routing (kube-proxy/eBPF): ClusterIP to pod IP mapping and load balancing.
3. Name resolution (CoreDNS): "redis.default" → 10.96.42.1.
4. Ingress: External traffic into the cluster.

Most production failures hide in layers 2 and 3. The team from the opening discovered their VXLAN MTU misconfiguration via packet trace, patched the CNI config to set MTU 1450, and their checkout service returned to baseline latency within minutes.

Eight milliseconds. Invisible to every monitoring tool except a raw packet trace. That's why understanding the implementation — not just the API — is critical.

Cilium eBPF in Production

Replacing kube-proxy with Cilium is the most consequential networking decision a large cluster makes. The headline win is service resolution latency, but the operational benefits compound: no iptables rule explosion, native L7 policy enforcement, identity-based security, and Hubble flow observability without sidecars. Verify the replacement is actually engaged — running Cilium alongside kube-proxy is a common misconfiguration that gives you the worst of both worlds.

## Confirm full kube-proxy replacement and key datapath features
kubectl -n kube-system exec ds/cilium -- cilium status --verbose \
  | grep -E "KubeProxyReplacement|Host Routing|Masquerading|BPF"
 
## Should report:
##   KubeProxyReplacement:   True   [eth0 10.0.1.5 (Direct Routing)]
##   Host Routing:           BPF
##   Masquerading:           BPF   [eth0]   10.244.0.0/16 [IPv4]

The metrics that prove it worked: p99 service-call latency drops by the iptables traversal cost (typically 30–80µs at 5,000 services), node_nf_conntrack_entries (the node_exporter conntrack gauge) flattens, and CPU consumed by kube-proxy and iptables-restore disappears entirely from node profiles. Watch cilium_bpf_map_pressure for any map approaching capacity (> 0.9 means raise the limits before the datapath starts dropping). Pair Cilium with Hubble to get per-flow visibility: hubble observe --verdict DROPPED replaces three hours of tcpdump archaeology when a NetworkPolicy denies traffic you didn't expect.

Multi-Cluster Networking

Once you outgrow a single cluster — for blast-radius isolation, regional locality, or tenant separation — you need a story for cross-cluster service discovery. There is no single right answer; pick by the failure mode you most want to avoid.

Submariner tunnels pod and service CIDRs between clusters via IPsec or WireGuard. Simplest to bolt onto existing clusters with non-overlapping CIDRs; the tunnel gateway is a throughput and failure pinch point.
Cilium Cluster Mesh federates clusters at the eBPF layer, sharing service endpoints natively without an overlay. Lowest latency and the cleanest policy story, but requires Cilium on every cluster and stable, routable pod CIDRs between them.
Istio multi-cluster layers cross-cluster service discovery on top of mTLS. The richest L7 traffic-management story (locality-aware failover, traffic shifting), at the cost of running the full mesh control plane and east-west gateways.

The operational tax is real: certificate rotation, asymmetric routing, and split-brain DNS show up first. Start with one shared service (auth, payments) before federating the entire catalogue.

Debugging "Pod Can't Reach Service"

When a pod fails to reach a service, walk the layers in order. Skipping a step costs hours.

1. DNS resolves. kubectl exec pod -- nslookup api.default.svc.cluster.local returns a ClusterIP. If NXDOMAIN, check CoreDNS pods, the pod's /etc/resolv.conf, and any NetworkPolicy blocking port 53 to kube-system.
2. Service has endpoints. kubectl get endpoints api -o wide lists at least one ready pod IP. An empty list means the selector doesn't match running pods or readiness probes are failing.
3. kube-proxy programmed the rules. On the source node, confirm DNAT rules exist for the ClusterIP, or that Cilium has loaded the service map.
4. Target pod is reachable. kubectl exec pod -- nc -vz <pod-ip> <port>, run from the source pod, connects. Failure here points at CNI, MTU, or NetworkPolicy.
5. NetworkPolicy isn't denying it. Test the path with the policy temporarily relaxed; with Cilium, hubble observe --to-pod default/api --verdict DROPPED names the rule.

## A tight loop that walks the tree on a real cluster
SRC=$(kubectl get pod -l app=checkout -o jsonpath='{.items[0].metadata.name}')
SVC=api
kubectl exec "$SRC" -- nslookup "$SVC".default.svc.cluster.local
kubectl get endpoints "$SVC" -o wide
SVC_IP=$(kubectl get svc "$SVC" -o jsonpath='{.spec.clusterIP}')
kubectl exec "$SRC" -- nc -vz "$SVC_IP" 80
kubectl exec "$SRC" -- wget -qO- --timeout=3 "http://$SVC/healthz"

The large majority of "service unreachable" investigations end at step 2 (no ready endpoints) or step 5 (a default-deny NetworkPolicy nobody remembered) — check those two first before tracing deeper.

Frequently Asked Questions

What is a CNI plugin in Kubernetes?

A CNI (Container Network Interface) plugin implements the Kubernetes networking model — every pod gets its own IP, and pods communicate across nodes without NAT. Common plugins: Flannel (VXLAN), Calico (BGP or VXLAN), Cilium (eBPF).

What causes mystery latency in pod-to-pod networking?

VXLAN overlays add a 50-byte header per packet. If the node MTU is 1500 but CNI MTU is not set to 1450, packets over 1450 bytes are silently fragmented by the kernel, adding 5-15ms latency invisible to application monitoring.

How does kube-proxy route traffic to pods?

kube-proxy watches the API for Service and Endpoint changes, then programs iptables rules (or IPVS entries) on each node. When traffic hits a Service ClusterIP, these rules DNAT the packet to a randomly selected pod IP for load balancing.

What is the CoreDNS ndots tax?

The default ndots:5 setting causes up to 5 failed DNS queries for every external hostname lookup. Set ndots:1 or use node-local DNS caching to eliminate this overhead at scale.
