Essential Kubernetes Commands: The Complete kubectl Cheat Sheet
Key Takeaways
- The Events section of `kubectl describe pod` reveals the root cause: for `CrashLoopBackOff`, check the container's state and events first, because logs may be empty if the container dies before writing anything
- `kubectl logs --previous` shows the logs of the previous container instance; crucial when a pod has restarted and the current logs are clean but the failure happened on the last run
- `kubectl set resources deployment/webapp --limits=memory=512Mi` patches the Deployment in place and triggers a rolling update; a fast fix for OOMKilled during incidents when you can't wait for a manifest change to go through the pipeline
- `kubectl top pods --sort-by=memory` finds the memory leak that dashboards don't; a 30Mi/minute leak is invisible in p50 latency but exhausts a 512Mi limit in under 20 minutes
- StatefulSets name pods `pod-0`, `pod-1`, etc. in order and bind persistent storage per pod; use them for databases/Kafka, and Deployments for stateless services where order doesn't matter
The alert fired at 3 AM: `CrashLoopBackOff` on the payment service. The on-call engineer ran `kubectl logs`: nothing. The container was dying before writing to stdout. A quick `kubectl describe pod` revealed `OOMKilled`. The memory limit was 512Mi, but the service was leaking 30Mi per minute. `kubectl top pods --sort-by=memory` confirmed it. She bumped the limit with `kubectl set resources`, drained traffic, and pushed a hotfix. Triage: 8 minutes. Without fluent `kubectl`, two hours of guessing.
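The numbers in the incident make the urgency easy to estimate. The following is a rough sketch of that arithmetic, assuming memory grows linearly; the function name and the 212Mi starting usage are hypothetical and not taken from the incident.

```shell
# Estimate whole minutes until a container hits its memory limit,
# assuming usage grows linearly (a simplification).
# Arguments: limit in Mi, current usage in Mi, leak rate in Mi per minute.
minutes_to_oom() {
  limit_mi=$1
  current_mi=$2
  leak_mi_per_min=$3
  echo $(( (limit_mi - current_mi) / leak_mi_per_min ))
}

# The 512Mi limit and 30Mi/minute leak come from the incident;
# 212Mi of current usage is a made-up starting point.
minutes_to_oom 512 212 30
```

Even starting from zero usage, the same limit and leak rate leave roughly 17 minutes, which is why raising the limit only buys time for the hotfix.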
`kubectl get`, `logs`, `describe`, and `exec` are your core triage verbs. Pair them with `--previous`, `--all-containers`, and field selectors for 80% of production incidents. Deployments scale and roll out; StatefulSets order pods and bind storage. Use tables to decide what workload type you need, then apply. [Kubernetes docs]
- Inspect first: `get`, `describe`, `logs` with timestamps and multi-container support
- Triage systematically: pending → events; crash → `--previous`; wrong → `exec` into the pod
- Control deployments: rollout, scale, patch, and diff before applying
Triage by Symptom, Not by Concept
When pages fire, the question is never "what does kubectl do" — it's "where is my pod broken." Route by symptom:
graph TD
Page[Pod or service is broken] --> What{What is<br/>the symptom?}
What -->|Pod stuck Pending| Pending[describe pod<br/>→ Events section]
What -->|Pod CrashLoopBackOff| Crash[logs --previous<br/>→ describe pod]
What -->|Pod Running but wrong| Wrong[exec -it pod -- sh<br/>+ logs -f]
What -->|Service unreachable| Net[get endpoints<br/>+ get svc<br/>+ describe svc]
What -->|Deployment stuck rolling| Roll[rollout status<br/>+ rollout history<br/>+ rollout undo]
What -->|Resource pressure| Top[top pods --sort-by=memory<br/>+ describe node]
Pending -->|FailedScheduling| Sched[Check node taints,<br/>resource requests,<br/>nodeSelector]
Pending -->|ImagePullBackOff| Pull[Check imagePullSecrets,<br/>registry creds, image tag]
Crash -->|Exit code| Exit[1: app error<br/>137: SIGKILL, usually OOMKilled<br/>143: SIGTERM, asked to stop]
style Pending fill:#fdd
style Crash fill:#fdd
style Wrong fill:#ffd
style Net fill:#fdd
style Roll fill:#ffd
style Top fill:#dfd
Most kubectl confusion is "I don't know which command to run"; the diagram routes you to one of seven leaf commands. Every section below is the deep dive on one branch. [Kubernetes docs]
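The exit-code branch of the diagram follows a general convention: codes above 128 usually mean the process was killed by signal number (code minus 128). A minimal sketch of that mapping, with a hypothetical function name and hint wording:

```shell
# Translate a container exit code into a triage hint.
# Codes above 128 conventionally mean "killed by signal (code - 128)".
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly" ;;
    137) echo "SIGKILL (signal 9): often OOMKilled, check describe pod" ;;
    143) echo "SIGTERM (signal 15): asked to stop, check shutdown handling" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error: check logs --previous"
      fi
      ;;
  esac
}

explain_exit_code 137
explain_exit_code 1
```

On a live cluster the code to feed in can be read with `kubectl get pod {pod} -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'`.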
The Quick Start
These 10 commands handle 80% of triage. Bookmark this table. [Kubernetes docs]
| Command | Purpose | Example |
|---|---|---|
| `kubectl get pods` | List pods in namespace | `get pods -A` for all namespaces |
| `kubectl describe pod {name}` | Pod state + events | Scroll to Events section for root cause |
| `kubectl logs {pod}` | Container stdout/stderr | `logs -f` for live tail; `-p` for previous crash |
| `kubectl logs {pod} --all-containers` | All containers in pod | For multi-container pods; use `-c` for one |
| `kubectl exec -it {pod} -- sh` | Shell into pod | For inspecting state at runtime |
| `kubectl port-forward {pod} 8080:8080` | Access pod from localhost | For dev debugging without exposing service |
| `kubectl get deployment {name}` | Deployment status | `kubectl scale deployment/{name} --replicas=5` to scale |
| `kubectl rollout status deployment/{name}` | Rolling update progress | Waits until rollout completes |
| `kubectl top pods --sort-by=memory` | Pod resource usage | Find memory leaks; `--sort-by=cpu` for CPU hotspots |
| `kubectl get events -A --sort-by='.metadata.creationTimestamp'` | Cluster-wide events | Pipe through `tail -10` for the last 10 |
Pod Inspection and Logs
[Kubernetes docs]
# List pods with node, IP, and labels
kubectl get pods -o wide --show-labels
# Get logs with timestamps (live)
kubectl logs -f {pod} --timestamps=true
# Previous logs after crash
kubectl logs {pod} --previous
# All containers in one pod
kubectl logs {pod} --all-containers=true --tail=50 --since=10m
# Get events (often the root cause)
kubectl describe pod {pod} # Scroll to Events section
# Execute a one-off command
kubectl exec {pod} -- curl localhost:8080/health
# Interactive shell
kubectl exec -it {pod} -- /bin/bash
# Port forward for debugging
kubectl port-forward {pod} 8080:8080
# Ephemeral debug container (shares PID namespace)
kubectl debug -it pod/{pod} --image=nicolaka/netshoot --target={container}
# Resource usage
kubectl top pods --all-namespaces --sort-by=memory
Workload Types
[Kubernetes docs]
Pick the right abstraction first:
| Workload | Use | Pod Names | Storage | Scale |
|---|---|---|---|---|
| Deployment | Stateless (APIs, web) | Interchangeable | Shared | Any order |
| StatefulSet | Stateful (databases, Kafka) | pod-0, pod-1, ... | Per-pod PVC | Ordered |
| DaemonSet | Node agents (logging, monitoring) | One per node | Host | Auto (1 per node) |
Deployments and Rollouts
# Create deployment
kubectl create deployment webapp --image=nginx:1.27-alpine --replicas=3
# Update image (rolling update; the new tag must differ from the running one)
kubectl set image deployment/webapp nginx=nginx:{new-tag}
# Restart pods without config change
kubectl rollout restart deployment/webapp
# Watch rollout progress
kubectl rollout status deployment/webapp
# View rollout history
kubectl rollout history deployment/webapp
# Rollback to previous revision
kubectl rollout undo deployment/webapp
# Scale deployment
kubectl scale deployment/webapp --replicas=5
# Auto-scale by CPU
kubectl autoscale deployment/webapp --cpu-percent=80 --min=2 --max=10
StatefulSets and DaemonSets
# StatefulSet pods are ordered: db-0, db-1, db-2
kubectl get pods -l app=db
# Scale StatefulSet (ordered creation/deletion)
kubectl scale statefulset/db --replicas=5
# Delete a StatefulSet pod (recreates with same PVC)
kubectl delete pod db-2
# List PVCs for StatefulSet
kubectl get pvc -l app=db
# List DaemonSets across cluster
kubectl get daemonset -A
# Update DaemonSet image (rolling per node)
kubectl set image daemonset/fluentd fluentd=fluentd:v1.17
Services and Networking
[Kubernetes docs]
| Type | Access | Use |
|---|---|---|
| ClusterIP | Internal only | Microservice-to-microservice |
| NodePort | `<NodeIP>:30000-32767` | Dev/testing |
| LoadBalancer | External LB | Production external traffic |
| ExternalName | DNS CNAME | External services |
# Expose deployment as ClusterIP
kubectl expose deployment webapp --type=ClusterIP --port=80 --target-port=8080
# Create LoadBalancer service
kubectl expose deployment webapp --type=LoadBalancer --port=80 --target-port=8080
# Port forward from pod to localhost
kubectl port-forward pod/webapp 8080:8080
# Port forward from service
kubectl port-forward service/webapp 8080:80
# Test DNS inside cluster (service FQDN)
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- nslookup webapp.default.svc.cluster.local
# Get service endpoints (pod IPs backing the service)
kubectl get endpoints webapp
# Get all network policies
kubectl get networkpolicy -A
# Describe ingress
kubectl describe ingress webapp-ingress
# Networking deep dive: [Kubernetes Networking Deep Dive](/articles/kubernetes-networking-deep-dive/)
ConfigMaps and Secrets
# ConfigMap from literals
kubectl create configmap app-config \
--from-literal=db_host=postgres.example.com \
--from-literal=db_port=5432
# ConfigMap from files
kubectl create configmap app-config --from-file=config/
# View ConfigMap
kubectl get configmap app-config -o yaml
# Generic secret
kubectl create secret generic db-credentials \
--from-literal=username=admin \
--from-literal=password=secret
# TLS secret
kubectl create secret tls webapp-tls --cert=webapp.crt --key=webapp.key
# Image pull secret
kubectl create secret docker-registry regcred \
--docker-server=registry.example.com --docker-username=user --docker-password=pass
# Decode a secret (base64 -d, not encrypted)
kubectl get secret db-credentials -o jsonpath='{.data.password}' | base64 -d
Base64 is not encryption. For production, enable encryption at rest or use Vault/Sealed Secrets.
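To see why base64 offers no protection, here is a small round trip that needs no cluster. It uses the same `secret` literal as the `db-credentials` example; the variable names are illustrative only.

```shell
# Kubernetes stores Secret values base64-encoded, which is reversible
# by anyone who can read the Secret object.
encoded=$(printf '%s' 'secret' | base64)
echo "stored in .data: $encoded"

# Decoding needs no key, which is why this is not encryption.
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "decoded:         $decoded"
```

Anyone with `get` permission on Secrets in the namespace can do the same, so RBAC and encryption at rest are the real controls.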
Jobs and CronJobs
# Create a one-off job
kubectl create job db-migrate --image=myapp:latest -- /app/migrate.sh
# Watch job
kubectl get jobs -w
# Get job logs
kubectl logs job/db-migrate
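# Job with parallelism and completions (completions cannot be patched after creation, so set both in the manifest)
kubectl create job batch-process --image=worker:latest --dry-run=client -o yaml -- /process.sh > batch-process.yaml
# Add spec.parallelism: 5 and spec.completions: 100 to batch-process.yaml, then apply it
kubectl apply -f batch-process.yaml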
# Create CronJob
kubectl create cronjob daily-backup --image=backup:latest --schedule="0 2 * * *" -- /backup.sh
# List CronJobs
kubectl get cronjobs
# Manually trigger CronJob (test without waiting)
kubectl create job manual-backup --from=cronjob/daily-backup
# Suspend CronJob
kubectl patch cronjob daily-backup -p '{"spec":{"suspend":true}}'
Default: CronJobs allow concurrent runs. Set concurrencyPolicy: Forbid to prevent overlaps.
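Since `kubectl create cronjob` has no flag for the concurrency policy, one option is to keep the CronJob as a manifest. The sketch below writes one for the `daily-backup` example; the file name and the container name `backup` are assumptions made for illustration.

```shell
# Write a CronJob manifest that refuses to start a run while the
# previous one is still active. Name, image, schedule, and command mirror
# the daily-backup example; the container name is a placeholder.
cat > daily-backup-cronjob.yaml <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-backup
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid   # default is Allow
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: backup:latest
              command: ["/backup.sh"]
EOF

# Apply it with: kubectl apply -f daily-backup-cronjob.yaml
grep concurrencyPolicy daily-backup-cronjob.yaml
```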
Storage and Volumes
# List PersistentVolumes (cluster-wide)
kubectl get pv
# List PersistentVolumeClaims (namespace-scoped)
kubectl get pvc
# Describe PVC (binding status, events)
kubectl describe pvc data-db-0
# List storage classes
kubectl get storageclass
# Expand PVC (StorageClass must allow it)
kubectl patch pvc data-db-0 -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
# Check expansion status
kubectl get pvc data-db-0
Default reclaim policy for dynamically provisioned volumes is Delete: the disk is destroyed with the PVC. For stateful workloads, use reclaimPolicy: Retain.
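For a volume that already exists, the reclaim policy lives on the PersistentVolume, not the PVC. A hedged sketch of switching it to Retain follows; the helper name `retain_pv_for_pvc` is hypothetical, and it assumes the PVC is already bound.

```shell
# Switch the PersistentVolume behind a PVC to the Retain policy so that
# deleting the PVC later leaves the disk in place.
# Arguments: PVC name, namespace (defaults to "default").
retain_pv_for_pvc() {
  pvc=$1
  ns=${2:-default}
  # A bound PVC records the name of its PV in .spec.volumeName.
  pv=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.volumeName}') || return 1
  if [ -z "$pv" ]; then
    echo "PVC $pvc is not bound to a volume" >&2
    return 1
  fi
  kubectl patch pv "$pv" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# Usage against a real cluster: retain_pv_for_pvc data-db-0
```

Verify the result with `kubectl get pv`, which lists the reclaim policy of each volume.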
Quotas and Limits
# List resource quotas in namespace
kubectl get resourcequota -n production
# Describe quota usage (cpu, memory, pods, pvcs)
kubectl describe resourcequota compute-quota -n production
# List LimitRanges (default pod limits)
kubectl get limitrange -n production
# Describe LimitRange
kubectl describe limitrange default-limits -n production
Scheduling, Taints, and Affinity
# Label a node
kubectl label node worker-1 disk=ssd
# View node labels
kubectl get nodes --show-labels
# Taint a node (prevents scheduling unless tolerated)
kubectl taint nodes worker-3 dedicated=gpu:NoSchedule
# Remove taint
kubectl taint nodes worker-3 dedicated=gpu:NoSchedule-
# View taints on a node
kubectl describe node worker-3 | grep -A5 Taints
Affinity and topology spread constraints are defined in pod specs, not via kubectl commands. Pod anti-affinity spreads replicas across zones or nodes.
Pod Disruption Budgets
# List PDBs
kubectl get pdb
# Describe PDB status
kubectl describe pdb webapp-pdb
PDBs protect availability during voluntary disruptions (drains, upgrades). Keep minAvailable below the replica count, or the PDB will block node drains.
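A PDB is normally created from a manifest. The sketch below writes a minimal one for the `webapp-pdb` name used above; the `app: webapp` selector, the `minAvailable: 2` value, and the file name are assumptions for illustration.

```shell
# Write a minimal PodDisruptionBudget for pods labeled app=webapp.
cat > webapp-pdb.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: webapp-pdb
spec:
  minAvailable: 2        # must stay below the Deployment's replica count
  selector:
    matchLabels:
      app: webapp
EOF

# Apply with: kubectl apply -f webapp-pdb.yaml
# Then check the ALLOWED DISRUPTIONS column of: kubectl get pdb webapp-pdb
grep minAvailable webapp-pdb.yaml
```

If ALLOWED DISRUPTIONS shows 0 while all pods are healthy, the budget is too strict for the current replica count and drains will hang.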
RBAC and Permissions
[Kubernetes docs]
# Check if ServiceAccount can perform action
kubectl auth can-i get pods --as=system:serviceaccount:webapp:webapp-sa -n webapp
# List all ServiceAccount permissions
kubectl auth can-i --list --as=system:serviceaccount:webapp:webapp-sa -n webapp
# Get pod's ServiceAccount
kubectl get pod {pod} -o jsonpath='{.spec.serviceAccountName}'
# Describe ClusterRoleBinding
kubectl describe clusterrolebinding webapp-admin
# Get all RoleBindings in namespace
kubectl get rolebindings,clusterrolebindings -n webapp -o wide
# Create a Role
kubectl create role pod-reader --verb=get,list,watch --resource=pods -n webapp
# Bind Role to ServiceAccount
kubectl create rolebinding pod-reader-binding --role=pod-reader --serviceaccount=webapp:webapp-sa -n webapp
# Create ClusterRole
kubectl create clusterrole node-reader --verb=get,list --resource=nodes
# Bind ClusterRole
kubectl create clusterrolebinding node-reader-binding --clusterrole=node-reader --serviceaccount=webapp:webapp-sa
Custom Resources and Operators
# List all CRDs (operators, cert-manager, Istio, etc.)
kubectl get crd
# List instances of a custom resource
kubectl get certificates.cert-manager.io -A
# Describe a custom resource
kubectl describe certificate webapp-tls -n production
# Explore CRD schema (field reference)
kubectl explain certificate.spec
kubectl explain certificate.spec.issuerRef
Troubleshooting Triage
[Kubernetes docs]
Pod Status → Action:
- Pending: `kubectl describe pod` → Check Events section (scheduling, resources, PVC)
- CrashLoopBackOff: `kubectl logs --previous` → app crash, config error, or OOM
- ImagePullBackOff: `kubectl describe pod` → image name typo, missing imagePullSecret, registry auth
- Running but misbehaving: `kubectl exec -it {pod} -- sh` → check env, network, DNS, service discovery
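The status-to-action list above can be sketched as a small lookup. This is only an illustration: the function name and the pod name are hypothetical, and it prints the suggested command rather than running it.

```shell
# Map a pod status (the STATUS column of `kubectl get pods`) to the
# first command worth running. Prints the command instead of executing it.
next_triage_command() {
  status=$1
  pod=$2
  case "$status" in
    Pending|ImagePullBackOff|ErrImagePull)
      echo "kubectl describe pod $pod" ;;
    CrashLoopBackOff)
      echo "kubectl logs $pod --previous" ;;
    Running)
      echo "kubectl exec -it $pod -- sh" ;;
    *)
      # Unknown status: events are the safest starting point.
      echo "kubectl describe pod $pod" ;;
  esac
}

next_triage_command CrashLoopBackOff payment-7d9f
```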
# Core debugging
kubectl describe pod {pod} # See Events section
# Get warning events (failures, OOM, probe failures)
kubectl get events --field-selector type=Warning --sort-by='.metadata.creationTimestamp'
# Get recent events cluster-wide
kubectl get events -A --sort-by='.metadata.creationTimestamp' | tail -20
# Check service endpoints (does label selector match?)
kubectl get endpoints {service}
# Test connectivity inside cluster
kubectl run debug-pod --image=nicolaka/netshoot --rm -it --restart=Never -- bash
# Diff before applying
kubectl diff -f deployment.yaml
# Wait for pods to be ready (CI/CD)
kubectl wait --for=condition=ready pod -l app=webapp --timeout=120s
# Resource usage
kubectl top nodes --sort-by=cpu
kubectl top pods -A --sort-by=memory
Applying and Patching
# Apply from YAML
kubectl apply -f deployment.yaml
# Patch a deployment (strategic merge patch, the default)
kubectl patch deployment webapp -p '{"spec":{"replicas":3}}'
# Patch a ConfigMap (JSON merge patch)
kubectl patch configmap app-config --type merge -p '{"data":{"debug":"false"}}'
# Dry-run with server-side validation
kubectl apply -f deployment.yaml --validate=true --dry-run=server
# Apply with pruning (delete resources not in manifests)
kubectl apply -f ./k8s/ --prune -l app=webapp
# Set resource requests/limits
kubectl set resources deployment/webapp --requests=cpu=100m,memory=128Mi --limits=cpu=200m,memory=256Mi
# Set environment variables
kubectl set env deployment/webapp NODE_ENV=production
# Copy file from pod
kubectl cp {pod}:/path/to/file /local/path
# Copy file to pod
kubectl cp /local/file {pod}:/path/to/file
Helm Commands
# Add chart repo
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
# Search for charts
helm search repo postgres
# Install chart
helm install my-postgres bitnami/postgresql \
--namespace databases --create-namespace \
--set auth.postgresPassword=secret \
--set primary.persistence.size=50Gi
# List releases
helm list -A
# Get release values
helm get values my-postgres -n databases
# Upgrade release (--reuse-values keeps previous settings)
helm upgrade my-postgres bitnami/postgresql \
--namespace databases \
--set primary.persistence.size=100Gi \
--reuse-values
# Rollback to revision 1 (omit the number to roll back to the previous release)
helm rollback my-postgres 1 -n databases
# Uninstall release
helm uninstall my-postgres -n databases
# Preview rendered YAML (no install)
helm template my-postgres bitnami/postgresql --values custom-values.yaml
Kustomize and kubectl debug
# Apply with kustomize overlay
kubectl apply -k overlays/production/
# Preview kustomize output
kubectl kustomize overlays/production/
# Diff kustomize against live cluster
kubectl diff -k overlays/production/
# Debug container (ephemeral, shares PID namespace)
kubectl debug -it pod/{pod} --image=nicolaka/netshoot --target={container}
# Debug pod with copy (non-disruptive)
kubectl debug pod/{pod} --copy-to=debug-pod --image=ubuntu --share-processes
# Debug node (privileged pod with host filesystem)
kubectl debug node/{node} -it --image=ubuntu
Kustomize is built into kubectl. No Helm required; it patches YAML declaratively.
Shortcuts and Aliases
# Add to ~/.bashrc or ~/.zshrc
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgs='kubectl get services'
alias kgd='kubectl get deployments'
alias kdp='kubectl describe pod'
alias kl='kubectl logs -f'
# Shell completion (bash)
source <(kubectl completion bash)
complete -o default -F __start_kubectl k
# Context and namespace
kubectl config use-context production-cluster
kubectl config set-context --current --namespace=webapp-namespace
kubectl config current-context
# krew plugins (plugin manager)
kubectl krew install neat # Clean YAML (remove managed fields)
kubectl krew install tree # Resource ownership hierarchy
kubectl krew install ctx # Fast context switching
kubectl krew install ns # Fast namespace switching
# Custom columns and JSONPath queries
kubectl get pods -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[*].restartCount,NODE:.spec.nodeName"
# Find pods with >5 restarts
kubectl get pods -A -o jsonpath='{range .items[?(@.status.containerStatuses[*].restartCount>5)]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'
# All node IPs
kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'
# Image versions in namespace
kubectl get pods -n my-app -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
When to use what
| Workload Type | Resource | When | Watch Out |
|---|---|---|---|
| Stateless services (API, web, workers) | Deployment | Default choice; rolling updates, scale any order, tolerates pod loss | Don't use for databases or Kafka — need ordering |
| Stateful services (databases, Kafka, Redis) | StatefulSet | Pods have stable names (db-0, db-1); each has own PVC; ordered startup/shutdown | Deleting StatefulSet doesn't auto-delete PVCs; you lose data if you're not careful |
| Node agents (logging, monitoring, CNI) | DaemonSet | One pod per node; auto-scales with cluster; tolerate node taints | Don't use for APIs — will schedule on every node including masters |
| Transient work (batch, migration, CI) | Job | Run once or N times in parallel; completes and exits gracefully | CronJobs allow concurrent runs by default; set concurrencyPolicy: Forbid to prevent pile-up |
| External service integration | ExternalName Service | Route to external domain (RDS, managed database, SaaS API) | Limited to DNS; no load balancing within cluster |
| Internal service discovery | ClusterIP Service | Default; pods find each other by DNS name within cluster | Pod IP changes don't break the Service: the ClusterIP and DNS name stay stable while endpoints are updated |
| Dev/testing external access | NodePort Service | Cheap external access during development; exposes port 30000-32767 | Never use in production — not load-balanced, port conflicts if >1 node |
| Production external traffic | LoadBalancer Service | Cloud provider LB (AWS, GCP, Azure); automatic DNS/cert management integration | Expensive per service; use Ingress for 10+ services |
| HTTP(S) routing by hostname/path | Ingress | Route api.example.com and web.example.com to different services; TLS termination; path-based routing | Single Ingress per domain saves costs; complex rules get hard to debug |
| Newer API-driven networking | Gateway API | More flexible than Ingress; standardizes cross-cloud routing (Kubernetes + OpenShift + Envoy) | Still stabilizing; not all CNIs support it yet |
| Prevent cluster resource exhaustion | ResourceQuota + LimitRange | Quota = namespace-level hard limits; LimitRange = per-pod defaults | Quota blocks new pods if namespace is full; test quota limits before prod |
| Horizontal scaling by CPU | HPA (HorizontalPodAutoscaler) | React to CPU/memory surge in minutes; cost-optimal for unpredictable traffic | Slow — takes 1-3 min to spin up new pods; won't save you from traffic spikes |
| Vertical scaling (bigger pods) | VPA (VerticalPodAutoscaler) | Auto-adjust resource requests based on actual usage; prevents OOM without manual tuning | Requires pod recreation; requires multiple replicas to be safe (can't VPA StatefulSet replicas 1) |
| Fast scaling to external metrics | KEDA (Kubernetes Event-driven Autoscaling) | Scale on queue depth, HTTP latency, Prometheus queries (not just CPU) | More complex; separate component to maintain |
| Node drains and upgrades shouldn't break service | Pod Disruption Budget (PDB) | Set minAvailable: 2 for critical services; protects against voluntary disruptions | Too strict (minAvailable: replicas) blocks cluster maintenance forever; does not protect against crashes |
Gotchas that bite in production
- `kubectl delete pod` without a grace period kills mid-request traffic
  - You force-delete a pod with `kubectl delete pod myapp-0 --grace-period=0 --force`. The container gets SIGKILL immediately (no shutdown hook). In-flight requests fail with connection reset. SLA breach.
  - Fix: The default grace period is 30 seconds (good). The pod receives SIGTERM, has 30 seconds to drain requests, then gets SIGKILL. Only use `--grace-period=0` for truly stuck pods. Always pair with a pre-stop hook such as `lifecycle: { preStop: { exec: { command: ["/bin/sh", "-c", "sleep 15"] } } }` so in-flight requests can finish.
- OOMKilled pods restart silently; metrics dashboards don't catch it until 10 restarts
  - Pod is leaking memory and gets OOMKilled every 2 minutes. The kubelet restarts it automatically. Your dashboards only alert on "pod restarts > 5 in 10 minutes". By the time anyone reacts, it has restarted repeatedly and users have hit errors for 30 minutes.
  - Fix: Set the memory limit only slightly above the app's healthy peak so a leak fails fast and visibly. Pair with a liveness probe that detects memory pressure early. Monitor `container_memory_working_set_bytes` in Prometheus for creeping growth.
- Missing readiness probe means Kubernetes sends traffic before the app is ready
  - Pod starts, DNS registered, Service endpoint added. App is still initializing (connecting to DB, warming caches). First 10 requests fail with 503 because traffic was sent before the app could serve it.
  - Fix: Always define `readinessProbe: { httpGet: { path: /health, port: 8080 }, initialDelaySeconds: 5, periodSeconds: 5 }`. Kubernetes waits for the probe to pass before adding the pod to Service endpoints. Set `initialDelaySeconds` to cover your longest startup time.
- Deleting PVCs after removing a StatefulSet destroys the data (reclaim policy `Delete` is the default for dynamically provisioned volumes)
  - You `kubectl delete statefulset my-db` to "clean up", then delete the leftover PVCs (or the whole namespace). Deleting the StatefulSet alone leaves the PVCs in place, but once a PVC is deleted, Kubernetes deletes the volume behind it. Your 500GB database is gone. Restore from backup if you have one (you do, right?).
  - Fix: Set `reclaimPolicy: Retain` on the StorageClass, or `persistentVolumeReclaimPolicy: Retain` on the PV, for databases. The volume then survives PVC deletion; you can delete it manually or bind it to a new claim. For dev/test, use `Delete`. For production stateful workloads, always use `Retain`.
- HPA can't keep up with traffic spikes; pods are still scaling while users error out
  - Traffic spikes from 10 to 1,000 RPS. HPA sees CPU at 80% and triggers a scale-up from 3 to 20 pods. It takes 2 minutes to provision 17 new pods and pass readiness checks. Meanwhile, the original 3 pods are getting 333 RPS each and timing out.
  - Fix: Set the HPA's `behavior.scaleDown.stabilizationWindowSeconds: 300` (don't scale down for 5 min) to prevent flapping. Pair with a PodDisruptionBudget (`minAvailable: 2`) so cluster maintenance doesn't evict pods during scale-up. For predictable spikes, use time-based scaling (cron HPA) or pre-warm clusters. [Kubernetes docs]
- Kubernetes DNS not resolving in pods because CoreDNS is stuck or evicted
  - Pod can't reach `db.default.svc.cluster.local`. `nslookup` times out. Whole app degraded. You think it's a network issue; actually a CoreDNS pod was evicted and hasn't restarted.
  - Fix: Run `kubectl get pods -n kube-system | grep coredns`: there should be 2+ replicas. Make sure the CoreDNS pods have resource requests and a high-priority class (`system-cluster-critical`) so they are among the last to be evicted. Monitor DNS response times with `kubectl run dns-test --image=busybox -it --rm --restart=Never -- nslookup kubernetes.default.svc.cluster.local` as a canary.
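The readiness probe and pre-stop hook from the fixes above can be added to an existing Deployment as one patch. This is a sketch under assumptions: the Deployment and its container are both assumed to be named `webapp`, and the file name is arbitrary.

```shell
# Write a strategic merge patch that adds the readiness probe and
# pre-stop hook described above to one container of a Deployment.
cat > webapp-probes-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
        - name: webapp          # hypothetical; must match the existing container name
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
EOF

# Apply with: kubectl patch deployment webapp --patch-file webapp-probes-patch.yaml
grep -c 'readinessProbe' webapp-probes-patch.yaml
```

Because the patch changes the pod template, applying it triggers a rolling update; `kubectl rollout status deployment/webapp` shows when the new pods are ready.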
Production Checklist
- Resources: every pod has `requests` and `limits` (prevents starvation and runaway consumption)
- Probes: liveness and readiness probes defined (Kubernetes can restart misbehaving pods and stop routing traffic to unready ones)
- PVC reclaim: `reclaimPolicy: Retain` for databases (prevents accidental deletion)
- PDB: minAvailable set for critical services (protects against voluntary disruptions)
- RBAC: ServiceAccount restricted to minimal permissions (principle of least privilege)
- Secrets: use external manager (Vault, Sealed Secrets) or encryption at rest
- Events: monitor cluster events regularly (`kubectl get events`)
- Quotas: set ResourceQuota per namespace (prevents runaway resource consumption)
Frequently Asked Questions
How do I find a pod by label?
Use `kubectl get pods -l app=webapp,env=production` to filter by one or more labels. Combine with `-A` to search across all namespaces.
Why is my pod stuck in Pending?
Run `kubectl describe pod {pod}` and read the Events section. Common causes: insufficient cluster resources, an unbound PVC, or a node selector that no node satisfies.
How do I capture traffic from a pod?
Use ephemeral debug containers: `kubectl debug -it pod/{pod} --image=nicolaka/netshoot --target={container}`, then `tcpdump -i eth0 -w capture.pcap` from inside the debug container.
Can I edit a running pod?
No: `kubectl edit pod` changes don't persist. Edit the Deployment spec instead: `kubectl set image deployment/{name} {container}={image}`, or edit the YAML and `kubectl apply -f`.
What's the difference between kubectl exec and kubectl debug?
`exec` requires the container image to have a shell. `debug` creates an ephemeral debug container (works against distroless images) that shares the target pod's network and process namespaces.
How do I know if my change will break anything?
Always `kubectl diff -f deployment.yaml` or use `--dry-run=server` before applying. For Helm: `helm template` to render the chart locally and review the output.
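In CI, the exit status of `kubectl diff` is what a pipeline step usually branches on. A minimal sketch of interpreting it follows; the function name and the labels it prints are hypothetical.

```shell
# Interpret the exit status of `kubectl diff` for a CI gate.
# Documented statuses: 0 = no differences, 1 = differences found,
# anything greater = kubectl or diff itself failed.
classify_diff_status() {
  status=$1
  if [ "$status" -eq 0 ]; then
    echo "no-changes"
  elif [ "$status" -eq 1 ]; then
    echo "changes-pending"
  else
    echo "error"
  fi
}

# Usage against a real cluster:
#   kubectl diff -f deployment.yaml; classify_diff_status $?
classify_diff_status 1
```

Treating status 1 as a failure is a common mistake in pipelines that run with `set -e`; it only means the manifest differs from the live object.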
Keep Reading
- Kubernetes Networking Deep Dive - CNI plugins, kube-proxy, CoreDNS debugging, and production failures
- Essential Docker Commands Cheat Sheet - Container lifecycle, image layers, multi-stage builds
- Terraform in Production - Provisioning clusters and state management
- Linux Commands Cheat Sheet - When `kubectl exec` lands you in a container, the next layer is Linux triage: ss, lsof, journalctl
- SRE: SLOs, SLIs, and Error Budgets - When the burn-rate alert fires, kubectl is the first triage tool
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.