My pod is stuck in CrashLoopBackOff. Where do I look first?

Run kubectl describe pod and scroll to the Events section—it shows why the container failed. If logs are empty, use kubectl logs --previous to see the last run's logs before the crash.

How do I safely update a deployment without downtime?

kubectl set image deployment/myapp myapp=myapp:v2.0 triggers a rolling update. Use kubectl rollout status deployment/myapp to watch progress. If it fails, kubectl rollout undo deployment/myapp rolls back to the previous version.

A pod is OOMKilled. How do I increase memory without editing and reapplying manifests?

kubectl set resources deployment/myapp --limits=memory=1Gi --requests=memory=512Mi patches the Deployment's pod template. That change triggers a rolling update, so Kubernetes replaces the pods with new ones that carry the new limits. For production, update your manifests too, or the next kubectl apply can revert the manual patch.

How do I run a temporary debugging pod in my cluster?

kubectl run debug --image=nicolaka/netshoot -it brings up a pod with networking tools. Or, kubectl debug -it pod/mypod --image=nicolaka/netshoot injects a debug container into a running pod.

What's the difference between Deployments and StatefulSets?

Deployments scale stateless services in any order (APIs, web servers). StatefulSets guarantee pod identity (pod-0, pod-1, pod-2) and bind PersistentVolumes per pod (databases, Kafka, Redis). Use StatefulSets for services that need stable networking or persistent storage.

Essential Kubernetes Commands: The Complete kubectl Cheat Sheet

BackendBytes Engineering Team

Feb 12, 2026

Key Takeaways

→`kubectl describe pod` Events section reveals the root cause: for `CrashLoopBackOff`, check the container's last state and exit reason there, because logs may not exist if the container dies before it writes anything
→`kubectl logs --previous` shows the previous crash's logs; crucial when a pod has restarted and current logs are clean but the failure happened on the last run
→`kubectl set resources deployment/webapp --limits=memory=512Mi` patches the Deployment without a manifest change or image rebuild; it triggers a rolling restart with the new limit, a fast fix for OOMKilled during incidents
→`kubectl top pods --sort-by=memory` finds the memory leak that dashboards miss: a 30Mi/minute leak is invisible in p50 latency but reaches a 512Mi limit in well under an hour
→StatefulSets order pods as `pod-0`, `pod-1`, etc. and bind persistent storage — use for databases/Kafka; Deployments for stateless services where order doesn't matter

The alert fired at 3 AM: CrashLoopBackOff on the payment service. The on-call engineer ran kubectl logs — nothing. The container was dying before writing to stdout. A quick kubectl describe pod revealed OOMKilled. The memory limit was 512Mi, but the service was leaking 30Mi per minute. kubectl top pods --sort-by=memory confirmed. She bumped the limit with kubectl set resources, drained traffic, and pushed a hotfix. Triage: 8 minutes. Without fluent kubectl, two hours of guessing.
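
The incident's numbers allow a quick estimate: headroom divided by leak rate gives the minutes until the limit is hit. A rough sketch, where `minutes_until_oom` is a hypothetical helper and the 200Mi baseline is an assumed figure that does not come from the incident:

```shell
# minutes_until_oom is a hypothetical helper for back-of-envelope triage.
# Arguments: memory limit (Mi), current usage (Mi), leak rate (Mi per minute).
minutes_until_oom() {
  limit_mi="$1"; current_mi="$2"; leak_mi_per_min="$3"
  if [ "$leak_mi_per_min" -le 0 ]; then
    echo "no growth"
    return 0
  fi
  headroom=$(( limit_mi - current_mi ))
  if [ "$headroom" -lt 0 ]; then headroom=0; fi
  # Integer division rounds down, so this is a lower bound in whole minutes
  echo $(( headroom / leak_mi_per_min ))
}

# 512Mi limit, assumed 200Mi right after a restart, leaking 30Mi per minute
minutes_until_oom 512 200 30   # prints 10
```

Raising the limit buys time in proportion to the added headroom, which is why the limit bump is only a stopgap until the hotfix lands.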

TL;DR

kubectl get, logs, describe, and exec are your core triage verbs. Pair them with --previous, --all-containers, and field selectors for 80% of production incidents. Deployments scale and roll out; StatefulSets order pods and bind storage. Use the tables to decide what workload type you need, then apply. ^{[Kubernetes docs]}

Inspect first: get, describe, logs with timestamps and multi-container support
Triage systematically: pending → events; crash → --previous; wrong → exec into the pod
Control deployments: rollout, scale, patch, and diff before applying

Triage by Symptom, Not by Concept

When pages fire, the question is never "what does kubectl do" — it's "where is my pod broken." Route by symptom:

graph TD
    Page[Pod or service is broken] --> What{What is<br/>the symptom?}
    What -->|Pod stuck Pending| Pending[describe pod<br/>→ Events section]
    What -->|Pod CrashLoopBackOff| Crash[logs --previous<br/>→ describe pod]
    What -->|Pod Running but wrong| Wrong[exec -it pod -- sh<br/>+ logs -f]
    What -->|Service unreachable| Net[get endpoints<br/>+ get svc<br/>+ describe svc]
    What -->|Deployment stuck rolling| Roll[rollout status<br/>+ rollout history<br/>+ rollout undo]
    What -->|Resource pressure| Top[top pods --sort-by=memory<br/>+ describe node]
    Pending -->|FailedScheduling| Sched[Check node taints,<br/>resource requests,<br/>nodeSelector]
    Pending -->|ImagePullBackOff| Pull[Check imagePullSecrets,<br/>registry creds, image tag]
    Crash -->|Exit code| Exit[1: app error<br/>137: OOMKilled<br/>143: SIGTERM timeout]
    style Pending fill:#fdd
    style Crash fill:#fdd
    style Wrong fill:#ffd
    style Net fill:#fdd
    style Roll fill:#ffd
    style Top fill:#dfd

Most kubectl confusion is "I don't know which command to run" — the diagram routes you to one of seven leaf commands. Every section below is the deep dive on one branch^{[Kubernetes docs]}.
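
The exit codes in the diagram follow the convention that a code above 128 means the process was killed by a signal numbered (code minus 128). The helper below is a minimal sketch of that decoding; `explain_exit_code` is a hypothetical name, and the commented kubectl line assumes the first container in the pod is the one that crashed.

```shell
# explain_exit_code is a hypothetical helper, not part of kubectl.
# One way to fetch the last exit code (assumption: container index 0 crashed):
#   kubectl get pod {pod} -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "completed successfully" ;;
    1)   echo "application error" ;;
    137) echo "SIGKILL (128+9): often OOMKilled" ;;
    143) echo "SIGTERM (128+15): terminated, e.g. during shutdown" ;;
    *)
      # Codes above 128 conventionally encode the terminating signal
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}

explain_exit_code 137
```

A 137 is not proof of an out-of-memory kill on its own; the Reason field in kubectl describe pod confirms whether it was OOMKilled.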

The Quick Start

These 10 commands handle 80% of triage. Bookmark this table. ^{[Kubernetes docs]}

Command	Purpose	Example
`kubectl get pods`	List pods in namespace	`get pods -A` for all namespaces
`kubectl describe pod {name}`	Pod state + events	Scroll to Events section for root cause
`kubectl logs {pod}`	Container stdout/stderr	`logs -f` for live tail; `-p` for previous crash
`kubectl logs {pod} --all-containers`	All containers in pod	For multi-container pods; use `-c` for one
`kubectl exec -it {pod} -- sh`	Shell into pod	For inspecting state at runtime
`kubectl port-forward {pod} 8080:8080`	Access pod from localhost	For dev debugging without exposing service
`kubectl get deployment {name}`	Deployment status	`scale deployment/{name} --replicas=5` to scale
`kubectl rollout status deployment/{name}`	Rolling update progress	Waits until rollout completes
`kubectl top pods --sort-by=memory`	Pod resource usage	Find memory leaks and CPU hotspots
`kubectl get events -A --sort-by='.metadata.creationTimestamp'`	Cluster-wide events	Pipe to `tail -10` for the 10 most recent

Pod Inspection and Logs

^{[Kubernetes docs]}

# List pods with node and IP
kubectl get pods -o wide --show-labels
 
# Get logs with timestamps (live)
kubectl logs -f {pod} --timestamps=true
 
# Previous logs after crash
kubectl logs {pod} --previous
 
# All containers in one pod
kubectl logs {pod} --all-containers=true --tail=50 --since=10m
 
# Get events (often the root cause)
kubectl describe pod {pod}  # Scroll to Events section
 
# Execute a one-off command
kubectl exec {pod} -- curl localhost:8080/health
 
# Interactive shell
kubectl exec -it {pod} -- /bin/bash
 
# Port forward for debugging
kubectl port-forward {pod} 8080:8080
 
# Ephemeral debug container (shares PID namespace)
kubectl debug -it pod/{pod} --image=nicolaka/netshoot --target={container}
 
# Resource usage
kubectl top pods --all-namespaces --sort-by=memory

Workload Types

^{[Kubernetes docs]}

Pick the right abstraction first:

Workload	Use	Pod Names	Storage	Scale
Deployment	Stateless (APIs, web)	Interchangeable	Shared	Any order
StatefulSet	Stateful (databases, Kafka)	`pod-0`, `pod-1`, ...	Per-pod PVC	Ordered
DaemonSet	Node agents (logging, monitoring)	One per node	Host	Auto (1 per node)

Deployments and Rollouts

# Create deployment
kubectl create deployment webapp --image=nginx:1.27-alpine --replicas=3
 
# Update image (rolling update); {new-tag} is a placeholder for the target version
kubectl set image deployment/webapp nginx=nginx:{new-tag}
 
# Restart pods without config change
kubectl rollout restart deployment/webapp
 
# Watch rollout progress
kubectl rollout status deployment/webapp
 
# View rollout history
kubectl rollout history deployment/webapp
 
# Rollback to previous revision
kubectl rollout undo deployment/webapp
 
# Scale deployment
kubectl scale deployment/webapp --replicas=5
 
# Auto-scale by CPU
kubectl autoscale deployment/webapp --cpu-percent=80 --min=2 --max=10
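
The CPU target passed to kubectl autoscale feeds the autoscaler's documented scaling formula, desired = ceil(currentReplicas * currentUtilization / targetUtilization). The sketch below reproduces only that arithmetic; `hpa_desired_replicas` is a hypothetical helper, and it ignores the controller's tolerance band and stabilization windows, so treat the result as an estimate.

```shell
# hpa_desired_replicas is a hypothetical helper that mirrors the documented formula:
#   desired = ceil(currentReplicas * currentUtilization / targetUtilization)
# and then clamps the result to [min, max].
# Arguments: current replicas, current CPU %, target CPU %, min replicas, max replicas.
hpa_desired_replicas() {
  current_replicas="$1"; current_util="$2"; target_util="$3"; min="$4"; max="$5"
  # Integer ceiling division: (a + b - 1) / b
  desired=$(( (current_replicas * current_util + target_util - 1) / target_util ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

# 3 replicas averaging 160% of requested CPU against an 80% target, bounds 2..10
hpa_desired_replicas 3 160 80 2 10   # prints 6
```

Utilization is measured against each pod's CPU request, so a Deployment without CPU requests gives the autoscaler nothing to compute against.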

StatefulSets and DaemonSets

# StatefulSet pods are ordered: db-0, db-1, db-2
kubectl get pods -l app=db
 
# Scale StatefulSet (ordered creation/deletion)
kubectl scale statefulset/db --replicas=5
 
# Delete a StatefulSet pod (recreates with same PVC)
kubectl delete pod db-2
 
# List PVCs for StatefulSet
kubectl get pvc -l app=db
 
# List DaemonSets across cluster
kubectl get daemonset -A
 
# Update DaemonSet image (rolling per node)
kubectl set image daemonset/fluentd fluentd=fluentd:v1.17
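
PVCs created from a StatefulSet's volumeClaimTemplates are named after the template, the StatefulSet, and the pod ordinal, which is why a recreated pod finds the same claim again. A small sketch of that naming pattern; `sts_pvc_names` is a hypothetical helper, and the template name `data` is an assumption chosen for illustration.

```shell
# sts_pvc_names is a hypothetical helper that lists the PVC names a StatefulSet
# would own, following the pattern <template>-<statefulset>-<ordinal>.
sts_pvc_names() {
  template="$1"; sts="$2"; replicas="$3"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${template}-${sts}-${i}"
    i=$(( i + 1 ))
  done
}

# A 3-replica StatefulSet named db with a volumeClaimTemplate named data
sts_pvc_names data db 3
```

The names it prints are the ones to pass to kubectl describe pvc or kubectl delete pvc when inspecting or cleaning up storage for a given replica.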

Services and Networking

^{[Kubernetes docs]}

Type	Access	Use
`ClusterIP`	Internal only	Microservice-to-microservice
`NodePort`	`<NodeIP>:30000-32767`	Dev/testing
`LoadBalancer`	External LB	Production external traffic
`ExternalName`	DNS CNAME	External services

# Expose deployment as ClusterIP
kubectl expose deployment webapp --type=ClusterIP --port=80 --target-port=8080
 
# Create LoadBalancer service
kubectl expose deployment webapp --type=LoadBalancer --port=80 --target-port=8080
 
# Port forward from pod to localhost
kubectl port-forward pod/webapp 8080:8080
 
# Port forward from service
kubectl port-forward service/webapp 8080:80
 
# Test DNS inside cluster (service FQDN)
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- nslookup webapp-service.default.svc.cluster.local
 
# Get service endpoints (pod IPs backing the service)
kubectl get endpoints webapp-service
 
# Get all network policies
kubectl get networkpolicy -A
 
# Describe ingress
kubectl describe ingress webapp-ingress
 
# Networking deep dive: [Kubernetes Networking Deep Dive](/articles/kubernetes-networking-deep-dive/)
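
The service FQDN used in the DNS test follows the pattern service, namespace, `svc`, cluster domain. A minimal sketch of building it; `svc_fqdn` is a hypothetical helper, and `cluster.local` is the usual default cluster domain, which your cluster may override.

```shell
# svc_fqdn is a hypothetical helper that builds a Service's in-cluster DNS name.
# Arguments: service name, namespace (default: default), cluster domain (default: cluster.local).
svc_fqdn() {
  service="$1"; namespace="${2:-default}"; cluster_domain="${3:-cluster.local}"
  echo "${service}.${namespace}.svc.${cluster_domain}"
}

svc_fqdn webapp-service default   # prints webapp-service.default.svc.cluster.local
```

Pods in the same namespace can use the short service name alone; the full form matters when calling across namespaces or when ruling out search-path problems during DNS debugging.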

ConfigMaps and Secrets

# ConfigMap from literals
kubectl create configmap app-config \
  --from-literal=db_host=postgres.example.com \
  --from-literal=db_port=5432
 
# ConfigMap from files
kubectl create configmap app-config --from-file=config/
 
# View ConfigMap
kubectl get configmap app-config -o yaml
 
# Generic secret
kubectl create secret generic db-credentials \
  --from-literal=username=admin \
  --from-literal=password=secret
 
# TLS secret
kubectl create secret tls webapp-tls --cert=webapp.crt --key=webapp.key
 
# Image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com --docker-username=user --docker-password=pass
 
# Decode a secret (base64 -d, not encrypted)
kubectl get secret db-credentials -o jsonpath='{.data.password}' | base64 -d

Base64 is not encryption. For production, enable encryption at rest or use Vault/Sealed Secrets.
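
To see why, the round trip below needs no key at all. It is a small illustration that uses the literal value `secret` from the example above and assumes a `base64` binary that accepts `-d` for decoding (GNU coreutils does).

```shell
# Encode the same literal used for the password above; no key is involved
encoded=$(printf '%s' 'secret' | base64)
echo "$encoded"    # c2VjcmV0

# Anyone who can read the Secret object can reverse it just as easily
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"    # secret
```

The value stored under `.data` in a Secret is exactly this encoded form, so read access to the object is equivalent to read access to the plaintext.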

Jobs and CronJobs

# Create a one-off job
kubectl create job db-migrate --image=myapp:latest -- /app/migrate.sh
 
# Watch job
kubectl get jobs -w
 
# Get job logs
kubectl logs job/db-migrate
 
# Raise parallelism on a running Job; completions is immutable after creation
# for ordinary Jobs, so set it in the manifest
kubectl create job batch-process --image=worker:latest -- /process.sh
kubectl patch job batch-process -p '{"spec":{"parallelism":5}}'
 
# Create CronJob
kubectl create cronjob daily-backup --image=backup:latest --schedule="0 2 * * *" -- /backup.sh
 
# List CronJobs
kubectl get cronjobs
 
# Manually trigger CronJob (test without waiting)
kubectl create job manual-backup --from=cronjob/daily-backup
 
# Suspend CronJob
kubectl patch cronjob daily-backup -p '{"spec":{"suspend":true}}'

Default: CronJobs allow concurrent runs. Set concurrencyPolicy: Forbid to prevent overlaps.

Storage and Volumes

# List PersistentVolumes (cluster-wide)
kubectl get pv
 
# List PersistentVolumeClaims (namespace-scoped)
kubectl get pvc
 
# Describe PVC (binding status, events)
kubectl describe pvc data-db-0
 
# List storage classes
kubectl get storageclass
 
# Expand PVC (StorageClass must allow it)
kubectl patch pvc data-db-0 -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
 
# Check expansion status
kubectl get pvc data-db-0

Dynamically provisioned volumes default to the Delete reclaim policy, so the underlying disk is destroyed when its PVC is deleted. For stateful workloads, set reclaimPolicy: Retain on the StorageClass.

Quotas and Limits

# List resource quotas in namespace
kubectl get resourcequota -n production
 
# Describe quota usage (cpu, memory, pods, pvcs)
kubectl describe resourcequota compute-quota -n production
 
# List LimitRanges (default pod limits)
kubectl get limitrange -n production
 
# Describe LimitRange
kubectl describe limitrange default-limits -n production

Scheduling, Taints, and Affinity

# Label a node
kubectl label node worker-1 disk=ssd
 
# View node labels
kubectl get nodes --show-labels
 
# Taint a node (prevents scheduling unless tolerated)
kubectl taint nodes worker-3 dedicated=gpu:NoSchedule
 
# Remove taint
kubectl taint nodes worker-3 dedicated=gpu:NoSchedule-
 
# View taints on a node
kubectl describe node worker-3 | grep -A5 Taints

Affinity and topology spread constraints are defined in pod specs, not via kubectl commands. Pod anti-affinity spreads replicas across zones or nodes.

Pod Disruption Budgets

# List PDBs
kubectl get pdb
 
# Describe PDB status
kubectl describe pdb webapp-pdb

PDBs protect availability during voluntary disruptions (drains, upgrades). Set minAvailable < replicas or PDB will block node drains.
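
The drain-blocking rule is simple arithmetic: the evictions a PDB permits equal the healthy pods minus minAvailable, floored at zero. A sketch under the assumption that every replica is currently healthy; `pdb_allowed_disruptions` is a hypothetical helper, and the live figure appears in the ALLOWED DISRUPTIONS column of kubectl get pdb.

```shell
# pdb_allowed_disruptions is a hypothetical helper.
# Arguments: number of healthy pods, the PDB's minAvailable.
pdb_allowed_disruptions() {
  healthy="$1"; min_available="$2"
  allowed=$(( healthy - min_available ))
  if [ "$allowed" -lt 0 ]; then allowed=0; fi
  echo "$allowed"
}

pdb_allowed_disruptions 3 2   # prints 1: a drain can evict one pod at a time
pdb_allowed_disruptions 3 3   # prints 0: every eviction is refused
```

A result of 0 is the case that stalls kubectl drain, because the eviction API refuses each request until another pod becomes healthy.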

RBAC and Permissions

^{[Kubernetes docs]}

# Check if ServiceAccount can perform action
kubectl auth can-i get pods --as=system:serviceaccount:webapp:webapp-sa -n webapp
 
# List all ServiceAccount permissions
kubectl auth can-i --list --as=system:serviceaccount:webapp:webapp-sa -n webapp
 
# Get pod's ServiceAccount
kubectl get pod {pod} -o jsonpath='{.spec.serviceAccountName}'
 
# Describe ClusterRoleBinding
kubectl describe clusterrolebinding webapp-admin
 
# Get all RoleBindings in namespace
kubectl get rolebindings,clusterrolebindings -n webapp -o wide
 
# Create a Role
kubectl create role pod-reader --verb=get,list,watch --resource=pods -n webapp
 
# Bind Role to ServiceAccount
kubectl create rolebinding pod-reader-binding --role=pod-reader --serviceaccount=webapp:webapp-sa -n webapp
 
# Create ClusterRole
kubectl create clusterrole node-reader --verb=get,list --resource=nodes
 
# Bind ClusterRole
kubectl create clusterrolebinding node-reader-binding --clusterrole=node-reader --serviceaccount=webapp:webapp-sa

Custom Resources and Operators

# List all CRDs (operators, cert-manager, Istio, etc.)
kubectl get crd
 
# List instances of a custom resource
kubectl get certificates.cert-manager.io -A
 
# Describe a custom resource
kubectl describe certificate webapp-tls -n production
 
# Explore CRD schema (field reference)
kubectl explain certificate.spec
kubectl explain certificate.spec.issuerRef

Troubleshooting Triage

^{[Kubernetes docs]}

Pod Status → Action:

Pending: kubectl describe pod → Check Events section (scheduling, resources, PVC)
CrashLoopBackOff: kubectl logs --previous → app crash, config error, or OOM
ImagePullBackOff: kubectl describe pod → image name typo, missing imagePullSecret, registry auth
Running but misbehaving: kubectl exec -it -- sh → check env, network, DNS, service discovery

# Core debugging
kubectl describe pod {pod}  # See Events section
 
# Get warning events (failures, OOM, probe failures)
kubectl get events --field-selector type=Warning --sort-by='.metadata.creationTimestamp'
 
# Get recent events cluster-wide
kubectl get events -A --sort-by='.metadata.creationTimestamp' | tail -20
 
# Check service endpoints (does label selector match?)
kubectl get endpoints {service}
 
# Test connectivity inside cluster
kubectl run debug-pod --image=nicolaka/netshoot --rm -it --restart=Never -- bash
 
# Diff before applying
kubectl diff -f deployment.yaml
 
# Wait for pods to be ready (CI/CD)
kubectl wait --for=condition=ready pod -l app=webapp --timeout=120s
 
# Resource usage
kubectl top nodes --sort-by=cpu
kubectl top pods -A --sort-by=memory

Applying and Patching

# Apply from YAML
kubectl apply -f deployment.yaml
 
# Patch a deployment (strategic merge patch, the default)
kubectl patch deployment webapp -p '{"spec":{"replicas":3}}'
 
# Patch a ConfigMap (JSON merge patch via --type merge)
kubectl patch configmap app-config --type merge -p '{"data":{"debug":"false"}}'
 
# Dry-run with server-side validation
kubectl apply -f deployment.yaml --validate=true --dry-run=server
 
# Apply with pruning (delete resources not in manifests)
kubectl apply -f ./k8s/ --prune -l app=webapp
 
# Set resource requests/limits
kubectl set resources deployment/webapp --requests=cpu=100m,memory=128Mi --limits=cpu=200m,memory=256Mi
 
# Set environment variables
kubectl set env deployment/webapp NODE_ENV=production
 
# Copy file from pod
kubectl cp {pod}:/path/to/file /local/path
 
# Copy file to pod
kubectl cp /local/file {pod}:/path/to/file

Helm Commands

# Add chart repo
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
 
# Search for charts
helm search repo postgres
 
# Install chart
helm install my-postgres bitnami/postgresql \
  --namespace databases --create-namespace \
  --set auth.postgresPassword=secret \
  --set primary.persistence.size=50Gi
 
# List releases
helm list -A
 
# Get release values
helm get values my-postgres -n databases
 
# Upgrade release (--reuse-values keeps previous settings)
helm upgrade my-postgres bitnami/postgresql \
  --namespace databases \
  --set primary.persistence.size=100Gi \
  --reuse-values
 
# Rollback to revision 1 (omit the revision number to return to the previous release)
helm rollback my-postgres 1 -n databases
 
# Uninstall release
helm uninstall my-postgres -n databases
 
# Preview rendered YAML (no install)
helm template my-postgres bitnami/postgresql --values custom-values.yaml

Kustomize and kubectl debug

# Apply with kustomize overlay
kubectl apply -k overlays/production/
 
# Preview kustomize output
kubectl kustomize overlays/production/
 
# Diff kustomize against live cluster
kubectl diff -k overlays/production/
 
# Debug container (ephemeral, shares PID namespace)
kubectl debug -it pod/{pod} --image=nicolaka/netshoot --target={container}
 
# Debug pod with copy (non-disruptive)
kubectl debug pod/{pod} --copy-to=debug-pod --image=ubuntu --share-processes
 
# Debug node (privileged pod with host filesystem)
kubectl debug node/{node} -it --image=ubuntu

Kustomize is built into kubectl. No Helm required — patches YAML declaratively.

Shortcuts and Aliases

# Add to ~/.bashrc or ~/.zshrc
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgs='kubectl get services'
alias kgd='kubectl get deployments'
alias kdp='kubectl describe pod'
alias kl='kubectl logs -f'
 
# Shell completion (bash)
source <(kubectl completion bash)
complete -o default -F __start_kubectl k
 
# Context and namespace
kubectl config use-context production-cluster
kubectl config set-context --current --namespace=webapp-namespace
kubectl config current-context
 
# krew plugins (plugin manager)
kubectl krew install neat   # Clean YAML (remove managed fields)
kubectl krew install tree   # Resource ownership hierarchy
kubectl krew install ctx    # Fast context switching
kubectl krew install ns     # Fast namespace switching
 
# JSONPath queries
kubectl get pods -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[*].restartCount,NODE:.spec.nodeName"
 
# Find pods with >5 restarts
kubectl get pods -A -o jsonpath='{range .items[?(@.status.containerStatuses[*].restartCount>5)]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'
 
# All node IPs
kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'
 
# Image versions in namespace
kubectl get pods -n my-app -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'

When to use what

Workload Type	Resource	When	Watch Out
Stateless services (API, web, workers)	Deployment	Default choice; rolling updates, scale any order, tolerates pod loss	Don't use for databases or Kafka — need ordering
Stateful services (databases, Kafka, Redis)	StatefulSet	Pods have stable names (db-0, db-1); each has own PVC; ordered startup/shutdown	Deleting StatefulSet doesn't auto-delete PVCs; you lose data if you're not careful
Node agents (logging, monitoring, CNI)	DaemonSet	One pod per eligible node; scales automatically as nodes join; add tolerations to run on tainted nodes	Don't use for APIs: it runs a copy on every matching node, including control-plane nodes if it tolerates their taints
Transient work (batch, migration, CI)	Job	Run once or N times in parallel; completes and exits gracefully	CronJobs allow concurrent runs by default; set `concurrencyPolicy: Forbid` to prevent pile-up
External service integration	ExternalName Service	Route to external domain (RDS, managed database, SaaS API)	Limited to DNS; no load balancing within cluster
Internal service discovery	ClusterIP Service	Default; pods find each other by DNS name within cluster	Reachable only from inside the cluster; the ClusterIP stays stable while the endpoints behind it track pod IP changes
Dev/testing external access	NodePort Service	Cheap external access during development; exposes a port in the 30000-32767 range on every node	Avoid as a production entry point: clients must target node IPs directly, and nothing in front balances or health-checks across nodes
Production external traffic	LoadBalancer Service	Cloud provider LB (AWS, GCP, Azure); automatic DNS/cert management integration	Expensive per service; use Ingress for 10+ services
HTTP(S) routing by hostname/path	Ingress	Route `api.example.com` and `web.example.com` to different services; TLS termination; path-based routing	Single Ingress per domain saves costs; complex rules get hard to debug
Newer API-driven networking	Gateway API	More flexible than Ingress; standardizes cross-cloud routing (Kubernetes + OpenShift + Envoy)	Still maturing; support varies by gateway controller, so check that yours implements the features you need
Prevent cluster resource exhaustion	ResourceQuota + LimitRange	Quota = namespace-level hard limits; LimitRange = per-pod defaults	Quota blocks new pods if namespace is full; test quota limits before prod
Horizontal scaling by CPU	HPA (HorizontalPodAutoscaler)	React to CPU/memory surge in minutes; cost-optimal for unpredictable traffic	Slow — takes 1-3 min to spin up new pods; won't save you from traffic spikes
Vertical scaling (bigger pods)	VPA (VerticalPodAutoscaler)	Auto-adjust resource requests based on actual usage; reduces OOM risk without manual tuning	Applying a recommendation typically recreates the pod, so run multiple replicas; a single-replica StatefulSet takes downtime on each resize
Fast scaling to external metrics	KEDA (Kubernetes Event-driven Autoscaling)	Scale on queue depth, HTTP latency, Prometheus queries (not just CPU)	More complex; separate component to maintain
Keep replicas available during drains and upgrades	Pod Disruption Budget (PDB)	Set `minAvailable: 2` for critical services; protects against voluntary disruptions only, not crashes	Too strict (`minAvailable` equal to replicas) blocks node drains indefinitely

Gotchas that bite in production

kubectl delete pod without grace period kills mid-request traffic
- You force-delete a pod with kubectl delete pod myapp-0 --grace-period=0. The container gets SIGKILL immediately (no shutdown hook). In-flight requests fail with connection reset. SLA breach.
- Fix: Default grace period is 30 seconds (good). The pod receives SIGTERM, has 30 seconds to drain requests, then gets SIGKILL. Only use --grace-period=0 for truly stuck pods. Always pair with pre-stop hooks: lifecycle: { preStop: { exec: { command: ["/bin/sh", "-c", "sleep 15"] } } } to finish in-flight requests.
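
One detail worth budgeting for: the grace period clock starts when termination begins, so time spent in the preStop hook is subtracted from what the application gets after SIGTERM. A minimal sketch of that arithmetic; `shutdown_budget` is a hypothetical helper, with arguments in seconds.

```shell
# shutdown_budget is a hypothetical helper.
# Arguments: terminationGracePeriodSeconds, seconds spent in the preStop hook.
# Prints the seconds left for the app to drain between SIGTERM and SIGKILL.
shutdown_budget() {
  grace_seconds="$1"; prestop_seconds="$2"
  remaining=$(( grace_seconds - prestop_seconds ))
  if [ "$remaining" -lt 0 ]; then remaining=0; fi
  echo "$remaining"
}

# Default 30s grace period with the 15s preStop sleep shown above
shutdown_budget 30 15   # prints 15
```

If the application needs longer than that to finish in-flight requests, raise terminationGracePeriodSeconds rather than shortening the preStop sleep.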
OOMKilled pods restart silently; metrics dashboards don't catch it until 10 restarts
- Pod is leaking memory, gets OOMKilled every 2 minutes. Kubelet restarts it automatically. Your dashboards only alert on "pod restarts > 5 in 10 minutes". By then it's restarted 50 times. Users hit errors for 30 minutes.
- Fix: Alert on the OOMKilled termination reason itself, not only on restart counts, so the first kill is visible. Size the memory limit from measured peak usage plus headroom rather than far above it, so a leak fails fast instead of hiding for hours. Monitor container_memory_working_set_bytes in Prometheus for creeping growth.
Missing readiness probe means Kubernetes sends traffic before app is ready
- Pod starts, DNS registered, Service endpoint added. App is still initializing (connecting to DB, warming caches). First 10 requests fail with 503. Traffic sent before readiness check passed.
- Fix: Always define readinessProbe: { httpGet: { path: /health, port: 8080 }, initialDelaySeconds: 5, periodSeconds: 5 }. Kubernetes waits for the probe to pass before adding to Service endpoints. Set initialDelaySeconds to cover your longest startup time.
Deleting a StatefulSet leaves its PVCs behind — but a later PVC delete can still destroy the disk
- kubectl delete statefulset my-db does not delete the PVCs from its volumeClaimTemplates — by default they're retained (persistentVolumeClaimRetentionPolicy defaults to Retain), so your 500GB volumes silently keep costing money after the workload is gone. The real data-loss trap is the reverse: when you later kubectl delete pvc data-my-db-0, if the StorageClass reclaim policy is Delete (the common dynamic-provisioner default), the bound PV and its underlying disk are destroyed with it — no undo.
- Fix: To reclaim storage after removing a StatefulSet you must delete the PVCs explicitly. To protect production data, set the StorageClass reclaimPolicy: Retain so deleting a PVC detaches the PV instead of wiping it, and only use Delete for dev/test. Set the StatefulSet's persistentVolumeClaimRetentionPolicy if you actually want scale-down/deletion to clean up PVCs automatically.
HPA can't keep up with traffic spikes; pods are still scaling while users error out
- Traffic spikes from 10 to 1,000 RPS. HPA sees CPU at 80%, triggers scale from 3 to 20 pods. Takes 2 minutes to provision 17 new pods and pass readiness checks. Meanwhile, remaining 3 pods are getting 333 RPS each, timing out.
- Fix: Set the HPA's behavior.scaleDown.stabilizationWindowSeconds: 300 (don't scale down for 5 min) to prevent flapping. Pair with a PodDisruptionBudget (minAvailable: 2) so cluster maintenance doesn't evict pods during scale-up. For predictable spikes, use time-based scaling (for example, KEDA's cron scaler) or pre-warm clusters. ^{[Kubernetes docs]}
Kubernetes DNS not resolving in pods because CoreDNS is stuck or evicted
- Pod can't reach db.default.svc.cluster.local. nslookup times out. Whole app degraded. You think it's a network issue; actually CoreDNS pod was evicted and hasn't restarted.
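- Fix: Run kubectl get pods -n kube-system | grep coredns and confirm that 2+ replicas are Running. Give CoreDNS resource requests that match its real usage and keep its system-cluster-critical priority class (the typical default) so it is among the last pods evicted under node pressure. Use kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default.svc.cluster.local as a canary for DNS health.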

Production Checklist

Resources: every pod has requests and limits (prevents starvation and runaway consumption)
Probes: liveness and readiness probes defined (liveness restarts hung containers; readiness keeps unready pods out of Service endpoints)
PVC reclaim: reclaimPolicy: Retain for databases (prevents accidental deletion)
PDB: minAvailable set for critical services (protects against voluntary disruptions)
RBAC: ServiceAccount restricted to minimal permissions (principle of least privilege)
Secrets: use external manager (Vault, Sealed Secrets) or encryption at rest
Events: monitor cluster events regularly (kubectl get events)
Quotas: set ResourceQuota per namespace (prevents runaway resource consumption)

Frequently Asked Questions

How do I find a pod by label?

Use kubectl get pods -l app=webapp,env=production to filter by one or more labels. Combine with -A to search across all namespaces.

Why is my pod stuck in Pending?

Run kubectl describe pod {pod} and read the Events section. Common causes: insufficient cluster resources, an unbound PVC, a node selector that no node satisfies, or a missing image pull secret.

How do I capture traffic from a pod?

Use ephemeral debug containers: kubectl debug -it pod/{pod} --image=nicolaka/netshoot --target={container}, then tcpdump -i eth0 -w capture.pcap from inside the debug container.

Can I edit a running pod?

Mostly no. Most pod spec fields are immutable, and any change you do make with kubectl edit pod is lost when the controller replaces the pod. Edit the Deployment spec instead: kubectl set image deployment/{name} {container}={image}, or edit the YAML and kubectl apply -f.

What's the difference between kubectl exec and kubectl debug?

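exec runs a binary that must already exist in the container image, so an interactive session needs a shell in that image. debug adds an ephemeral container with its own tooling (works against distroless images); it shares the pod's network namespace, and with --target it also shares the target container's process namespace.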

How do I know if my change will break anything?

Always kubectl diff -f deployment.yaml or use --dry-run=server before applying. For Helm: helm template to render the chart locally and review the output.

Keep Reading

Kubernetes Networking Deep Dive — CNI plugins, kube-proxy, CoreDNS debugging, and production failures
Essential Docker Commands Cheat Sheet — Container lifecycle, image layers, multi-stage builds
Terraform in Production — Provisioning clusters and state management
Linux Commands Cheat Sheet — When kubectl exec lands you in a container, the next layer is Linux triage: ss, lsof, journalctl
SRE: SLOs, SLIs, and Error Budgets — When the burn-rate alert fires, kubectl is the first triage tool

BackendBytes Engineering Team

Engineering Team

A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.
