Essential Kubernetes Commands: The Complete kubectl Cheat Sheet

BackendBytes Engineering Team

Key Takeaways

  • `kubectl describe pod` Events section reveals root cause: for `CrashLoopBackOff`, check the Events and the last termination state (reason and exit code) before the logs, because logs will be empty if the container dies before it writes anything
  • `kubectl logs --previous` shows the previous crash's logs; crucial when a pod has restarted and current logs are clean but the failure happened on the last run
  • `kubectl set resources deployment/webapp --limits=memory=512Mi` patches the live Deployment without editing manifests or waiting on CI; it still triggers a rolling update, so expect pods to restart. A fast fix for OOMKilled during incidents
  • `kubectl top pods --sort-by=memory` finds the memory leak that dashboards don't — 30Mi/minute leaks are invisible in p50 latency but compound into OOMKilled within hours
  • StatefulSets order pods as `pod-0`, `pod-1`, etc. and bind persistent storage — use for databases/Kafka; Deployments for stateless services where order doesn't matter

The alert fired at 3 AM: CrashLoopBackOff on the payment service. The on-call engineer ran kubectl logs — nothing. The container was dying before writing to stdout. A quick kubectl describe pod revealed OOMKilled. The memory limit was 512Mi, but the service was leaking 30Mi per minute. kubectl top pods --sort-by=memory confirmed. She bumped the limit with kubectl set resources, drained traffic, and pushed a hotfix. Triage: 8 minutes. Without fluent kubectl, two hours of guessing.

TL;DR

kubectl get, logs, describe, and exec are your core triage verbs. Pair them with --previous, --all-containers, and field selectors to cover 80% of production incidents. Deployments scale and roll out; StatefulSets order pods and bind storage. Use the tables below to decide which workload type you need, then apply. [Kubernetes docs]

  • Inspect first: get, describe, logs with timestamps and multi-container support
  • Triage systematically: pending → events; crash → --previous; wrong → exec into the pod
  • Control deployments: rollout, scale, patch, and diff before applying

Triage by Symptom, Not by Concept

When pages fire, the question is never "what does kubectl do" — it's "where is my pod broken." Route by symptom:

graph TD
    Page[Pod or service is broken] --> What{What is<br/>the symptom?}
    What -->|Pod stuck Pending| Pending[describe pod<br/>→ Events section]
    What -->|Pod CrashLoopBackOff| Crash[logs --previous<br/>→ describe pod]
    What -->|Pod Running but wrong| Wrong[exec -it pod -- sh<br/>+ logs -f]
    What -->|Service unreachable| Net[get endpoints<br/>+ get svc<br/>+ describe svc]
    What -->|Deployment stuck rolling| Roll[rollout status<br/>+ rollout history<br/>+ rollout undo]
    What -->|Resource pressure| Top[top pods --sort-by=memory<br/>+ describe node]
    Pending -->|FailedScheduling| Sched[Check node taints,<br/>resource requests,<br/>nodeSelector]
    Pending -->|ImagePullBackOff| Pull[Check imagePullSecrets,<br/>registry creds, image tag]
    Crash -->|Exit code| Exit[1: app error<br/>137: SIGKILL, often OOMKilled<br/>143: terminated by SIGTERM]
    style Pending fill:#fdd
    style Crash fill:#fdd
    style Wrong fill:#ffd
    style Net fill:#fdd
    style Roll fill:#ffd
    style Top fill:#dfd

Most kubectl confusion is "I don't know which command to run"; the diagram routes you to one of seven leaf commands. Every section below is the deep dive on one branch. [Kubernetes docs]
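
The exit codes in the diagram follow the shell convention that a code above 128 means the process was killed by a signal (code minus 128). A minimal sketch of that mapping follows; `explain_exit_code` is a hypothetical helper, not a kubectl feature, and the hint strings are illustrative.

```shell
# Hypothetical helper (not part of kubectl): turn a container exit code
# into a triage hint. Codes above 128 mean "killed by signal (code - 128)".
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "0: clean exit"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "$code: SIGKILL (signal 9), commonly the OOM killer" ;;
      15) echo "$code: SIGTERM (signal 15), the container was asked to stop" ;;
      *)  echo "$code: killed by signal $sig" ;;
    esac
  else
    echo "$code: application error, check logs --previous"
  fi
}

explain_exit_code 137
explain_exit_code 143
explain_exit_code 1
```

To feed it a real value, the last exit code can be read with `kubectl get pod {pod} -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'`.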

The Quick Start

These 10 commands handle 80% of triage. Bookmark this table. [Kubernetes docs]

| Command | Purpose | Example |
| --- | --- | --- |
| `kubectl get pods` | List pods in namespace | `get pods -A` for all namespaces |
| `kubectl describe pod {name}` | Pod state + events | Scroll to Events section for root cause |
| `kubectl logs {pod}` | Container stdout/stderr | `logs -f` for live tail; `-p` for previous crash |
| `kubectl logs {pod} --all-containers` | All containers in pod | For multi-container pods; use `-c` for one |
| `kubectl exec -it {pod} -- sh` | Shell into pod | For inspecting state at runtime |
| `kubectl port-forward {pod} 8080:8080` | Access pod from localhost | For dev debugging without exposing service |
| `kubectl get deployment {name}` | Deployment status | `kubectl scale deployment/{name} --replicas=5` to scale |
| `kubectl rollout status deployment/{name}` | Rolling update progress | Waits until rollout completes |
| `kubectl top pods --sort-by=memory` | Pod resource usage | Find memory leaks and CPU hotspots |
| `kubectl get events -A --sort-by='.metadata.creationTimestamp'` | Cluster-wide events | Pipe the output to `tail -10` for the last 10 |

Pod Inspection and Logs

[Kubernetes docs]
# List pods with node and IP
kubectl get pods -o wide --show-labels
 
# Get logs with timestamps (live)
kubectl logs -f {pod} --timestamps=true
 
# Previous logs after crash
kubectl logs {pod} --previous
 
# All containers in one pod
kubectl logs {pod} --all-containers=true --tail=50 --since=10m
 
# Get events (often the root cause)
kubectl describe pod {pod}  # Scroll to Events section
 
# Execute a one-off command
kubectl exec {pod} -- curl localhost:8080/health
 
# Interactive shell
kubectl exec -it {pod} -- /bin/bash
 
# Port forward for debugging
kubectl port-forward {pod} 8080:8080
 
# Ephemeral debug container (shares PID namespace)
kubectl debug -it pod/{pod} --image=nicolaka/netshoot --target={container}
 
# Resource usage
kubectl top pods --all-namespaces --sort-by=memory

Workload Types

[Kubernetes docs]

Pick the right abstraction first:

| Workload | Use | Pod Names | Storage | Scale |
| --- | --- | --- | --- | --- |
| Deployment | Stateless (APIs, web) | Interchangeable | Shared | Any order |
| StatefulSet | Stateful (databases, Kafka) | pod-0, pod-1, ... | Per-pod PVC | Ordered |
| DaemonSet | Node agents (logging, monitoring) | One per node | Host | Auto (1 per node) |

Deployments and Rollouts

# Create deployment
kubectl create deployment webapp --image=nginx:1.26-alpine --replicas=3
 
# Update image (rolling update); the container is named nginx after the image
kubectl set image deployment/webapp nginx=nginx:1.27-alpine
 
# Restart pods without config change
kubectl rollout restart deployment/webapp
 
# Watch rollout progress
kubectl rollout status deployment/webapp
 
# View rollout history
kubectl rollout history deployment/webapp
 
# Rollback to previous revision
kubectl rollout undo deployment/webapp
 
# Scale deployment
kubectl scale deployment/webapp --replicas=5
 
# Auto-scale by CPU
kubectl autoscale deployment/webapp --cpu-percent=80 --min=2 --max=10
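
The autoscale command above targets 80% average CPU between 2 and 10 replicas. The documented HPA rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. The sketch below approximates it in integer shell arithmetic; `hpa_desired_replicas` is a hypothetical name, and the controller's tolerance band and stabilization windows are deliberately ignored.

```shell
# Hypothetical helper approximating the HPA scaling rule:
#   desired = ceil(current_replicas * current_util / target_util)
# clamped to [min, max]. Ignores the controller's tolerance band and
# stabilization windows, so treat the result as an estimate.
hpa_desired_replicas() {
  current_replicas="$1"; current_util="$2"; target_util="$3"
  min="$4"; max="$5"
  # Integer ceiling division: (a + b - 1) / b
  desired=$(( (current_replicas * current_util + target_util - 1) / target_util ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

# 3 replicas running at 240% of requested CPU, target 80%, bounds 2..10
hpa_desired_replicas 3 240 80 2 10
```

Running the numbers before an incident shows whether `--max` is high enough to absorb the load you expect.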

StatefulSets and DaemonSets

# StatefulSet pods are ordered: db-0, db-1, db-2
kubectl get pods -l app=db
 
# Scale StatefulSet (ordered creation/deletion)
kubectl scale statefulset/db --replicas=5
 
# Delete a StatefulSet pod (recreates with same PVC)
kubectl delete pod db-2
 
# List PVCs for StatefulSet
kubectl get pvc -l app=db
 
# List DaemonSets across cluster
kubectl get daemonset -A
 
# Update DaemonSet image (rolling per node)
kubectl set image daemonset/fluentd fluentd=fluentd:v1.17

Services and Networking

[Kubernetes docs]
| Type | Access | Use |
| --- | --- | --- |
| ClusterIP | Internal only | Microservice-to-microservice |
| NodePort | <NodeIP>:30000-32767 | Dev/testing |
| LoadBalancer | External LB | Production external traffic |
| ExternalName | DNS CNAME | External services |

# Expose deployment as ClusterIP (named webapp-service, as used by the commands below)
kubectl expose deployment webapp --name=webapp-service --type=ClusterIP --port=80 --target-port=8080
 
# Create LoadBalancer service (separate name so it doesn't collide with the ClusterIP service)
kubectl expose deployment webapp --name=webapp-lb --type=LoadBalancer --port=80 --target-port=8080
 
# Port forward from pod to localhost
kubectl port-forward pod/{pod} 8080:8080
 
# Port forward from service
kubectl port-forward service/webapp-service 8080:80
 
# Test DNS inside cluster (service FQDN)
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- nslookup webapp-service.default.svc.cluster.local
 
# Get service endpoints (pod IPs backing the service)
kubectl get endpoints webapp-service
 
# Get all network policies
kubectl get networkpolicy -A
 
# Describe ingress
kubectl describe ingress webapp-ingress
 
# Networking deep dive: [Kubernetes Networking Deep Dive](/articles/kubernetes-networking-deep-dive/)

ConfigMaps and Secrets

# ConfigMap from literals
kubectl create configmap app-config \
  --from-literal=db_host=postgres.example.com \
  --from-literal=db_port=5432
 
# ConfigMap from files
kubectl create configmap app-config --from-file=config/
 
# View ConfigMap
kubectl get configmap app-config -o yaml
 
# Generic secret
kubectl create secret generic db-credentials \
  --from-literal=username=admin \
  --from-literal=password=secret
 
# TLS secret
kubectl create secret tls webapp-tls --cert=webapp.crt --key=webapp.key
 
# Image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com --docker-username=user --docker-password=pass
 
# Decode a secret (base64 -d, not encrypted)
kubectl get secret db-credentials -o jsonpath='{.data.password}' | base64 -d

Base64 is not encryption. For production, enable encryption at rest or use Vault/Sealed Secrets.
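
A quick round trip makes the point concrete. The sketch below encodes and decodes the sample password `secret` used in the commands above, with no key involved at any step; it assumes a `base64` binary that accepts `-d`.

```shell
# Round-trip a value the way a Secret's .data field stores it.
# 'secret' is the sample password from the commands above.
encoded=$(printf '%s' 'secret' | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "stored in .data: $encoded"
echo "anyone can read: $decoded"
```

Anyone with read access to the Secret object can run the same decode step, which is why RBAC on Secrets matters as much as encryption at rest.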

Jobs and CronJobs

# Create a one-off job
kubectl create job db-migrate --image=myapp:latest -- /app/migrate.sh
 
# Watch job
kubectl get jobs -w
 
# Get job logs
kubectl logs job/db-migrate
 
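# Job with parallelism and completions
# completions is immutable on a regular (non-indexed) Job once it exists, so set it in the manifest;
# on a live Job only parallelism can be patched
kubectl create job batch-process --image=worker:latest -- /process.sh
kubectl patch job batch-process -p '{"spec":{"parallelism":5}}'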
 
# Create CronJob
kubectl create cronjob daily-backup --image=backup:latest --schedule="0 2 * * *" -- /backup.sh
 
# List CronJobs
kubectl get cronjobs
 
# Manually trigger CronJob (test without waiting)
kubectl create job manual-backup --from=cronjob/daily-backup
 
# Suspend CronJob
kubectl patch cronjob daily-backup -p '{"spec":{"suspend":true}}'

Default: CronJobs allow concurrent runs. Set concurrencyPolicy: Forbid to prevent overlaps.

Storage and Volumes

# List PersistentVolumes (cluster-wide)
kubectl get pv
 
# List PersistentVolumeClaims (namespace-scoped)
kubectl get pvc
 
# Describe PVC (binding status, events)
kubectl describe pvc data-db-0
 
# List storage classes
kubectl get storageclass
 
# Expand PVC (StorageClass must allow it)
kubectl patch pvc data-db-0 -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
 
# Check expansion status
kubectl get pvc data-db-0

For dynamically provisioned volumes, the default reclaim policy is Delete: deleting the PVC destroys the underlying disk. For stateful workloads, set reclaimPolicy: Retain on the StorageClass.
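
For a volume that already exists, the reclaim policy can be changed on the PersistentVolume itself. The sketch below wraps that patch in a small function; `retain_pv` is a hypothetical wrapper, and `{pv-name}` in the usage comment is a placeholder you would look up first.

```shell
# Hypothetical wrapper: switch an existing PersistentVolume to Retain so the
# disk survives deletion of its PVC. The patch targets the PV, not the PVC.
retain_pv() {
  pv_name="$1"
  kubectl patch pv "$pv_name" \
    -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# Usage:
#   kubectl get pvc data-db-0 -o jsonpath='{.spec.volumeName}'   # find the PV name
#   retain_pv {pv-name}
```

Afterwards, `kubectl get pv` should show Retain in the RECLAIM POLICY column for that volume.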

Quotas and Limits

# List resource quotas in namespace
kubectl get resourcequota -n production
 
# Describe quota usage (cpu, memory, pods, pvcs)
kubectl describe resourcequota compute-quota -n production
 
# List LimitRanges (default pod limits)
kubectl get limitrange -n production
 
# Describe LimitRange
kubectl describe limitrange default-limits -n production

Scheduling, Taints, and Affinity

# Label a node
kubectl label node worker-1 disk=ssd
 
# View node labels
kubectl get nodes --show-labels
 
# Taint a node (prevents scheduling unless tolerated)
kubectl taint nodes worker-3 dedicated=gpu:NoSchedule
 
# Remove taint
kubectl taint nodes worker-3 dedicated=gpu:NoSchedule-
 
# View taints on a node
kubectl describe node worker-3 | grep -A5 Taints

Affinity and topology spread constraints are defined in pod specs, not via kubectl commands. Pod anti-affinity spreads replicas across zones or nodes.

Pod Disruption Budgets

# List PDBs
kubectl get pdb
 
# Describe PDB status
kubectl describe pdb webapp-pdb

PDBs protect availability during voluntary disruptions (drains, upgrades). Set minAvailable < replicas or PDB will block node drains.
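
The blocking behavior is simple arithmetic: a drain may evict a pod only while the number of healthy pods stays at or above minAvailable. A sketch of that calculation, using a hypothetical helper name `pdb_allowed_disruptions`:

```shell
# Hypothetical helper: how many pods a drain may evict right now,
# i.e. healthy pods minus minAvailable, never below zero.
pdb_allowed_disruptions() {
  healthy="$1"; min_available="$2"
  allowed=$((healthy - min_available))
  if [ "$allowed" -lt 0 ]; then allowed=0; fi
  echo "$allowed"
}

pdb_allowed_disruptions 3 2   # minAvailable below replicas: drains can proceed
pdb_allowed_disruptions 3 3   # minAvailable equal to replicas: drains block
```

The live value appears in the ALLOWED DISRUPTIONS column of `kubectl get pdb`.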

RBAC and Permissions

[Kubernetes docs]
# Check if ServiceAccount can perform action
kubectl auth can-i get pods --as=system:serviceaccount:webapp:webapp-sa -n webapp
 
# List all ServiceAccount permissions
kubectl auth can-i --list --as=system:serviceaccount:webapp:webapp-sa -n webapp
 
# Get pod's ServiceAccount
kubectl get pod {pod} -o jsonpath='{.spec.serviceAccountName}'
 
# Describe ClusterRoleBinding
kubectl describe clusterrolebinding webapp-admin
 
# Get all RoleBindings in namespace
kubectl get rolebindings,clusterrolebindings -n webapp -o wide
 
# Create a Role
kubectl create role pod-reader --verb=get,list,watch --resource=pods -n webapp
 
# Bind Role to ServiceAccount
kubectl create rolebinding pod-reader-binding --role=pod-reader --serviceaccount=webapp:webapp-sa -n webapp
 
# Create ClusterRole
kubectl create clusterrole node-reader --verb=get,list --resource=nodes
 
# Bind ClusterRole
kubectl create clusterrolebinding node-reader-binding --clusterrole=node-reader --serviceaccount=webapp:webapp-sa

Custom Resources and Operators

# List all CRDs (operators, cert-manager, Istio, etc.)
kubectl get crd
 
# List instances of a custom resource
kubectl get certificates.cert-manager.io -A
 
# Describe a custom resource
kubectl describe certificate webapp-tls -n production
 
# Explore CRD schema (field reference)
kubectl explain certificate.spec
kubectl explain certificate.spec.issuerRef

Troubleshooting Triage

[Kubernetes docs]

Pod Status → Action:

  • Pending: kubectl describe pod → Check Events section (scheduling, resources, PVC)
  • CrashLoopBackOff: kubectl logs --previous → app crash, config error, or OOM
  • ImagePullBackOff: kubectl describe pod → image name typo, missing imagePullSecret, registry auth
  • Running but misbehaving: kubectl exec -it -- sh → check env, network, DNS, service discovery
# Core debugging
kubectl describe pod {pod}  # See Events section
 
# Get warning events (failures, OOM, probe failures)
kubectl get events --field-selector type=Warning --sort-by='.metadata.creationTimestamp'
 
# Get recent events cluster-wide
kubectl get events -A --sort-by='.metadata.creationTimestamp' | tail -20
 
# Check service endpoints (does label selector match?)
kubectl get endpoints {service}
 
# Test connectivity inside cluster
kubectl run debug-pod --image=nicolaka/netshoot --rm -it --restart=Never -- bash
 
# Diff before applying
kubectl diff -f deployment.yaml
 
# Wait for pods to be ready (CI/CD)
kubectl wait --for=condition=ready pod -l app=webapp --timeout=120s
 
# Resource usage
kubectl top nodes --sort-by=cpu
kubectl top pods -A --sort-by=memory
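
The status-to-action list at the top of this section can be encoded as a lookup, so the first command is never a guess. A sketch only: `triage_hint` is a hypothetical function, and its output strings simply restate that list.

```shell
# Hypothetical helper: map the STATUS column of `kubectl get pods`
# to the first command worth running, following the list in this section.
triage_hint() {
  case "$1" in
    Pending)          echo "kubectl describe pod {pod}  # read Events: scheduling, resources, PVC" ;;
    CrashLoopBackOff) echo "kubectl logs {pod} --previous  # app crash, config error, or OOM" ;;
    ImagePullBackOff) echo "kubectl describe pod {pod}  # image name, imagePullSecret, registry auth" ;;
    Running)          echo "kubectl exec -it {pod} -- sh  # check env, network, DNS" ;;
    *)                echo "kubectl describe pod {pod}  # unlisted status: start with Events" ;;
  esac
}

triage_hint CrashLoopBackOff
```

Dropping a function like this into a shell profile keeps the runbook one keystroke away during an incident.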

Applying and Patching

# Apply from YAML
kubectl apply -f deployment.yaml
 
# Patch a deployment (strategic merge patch, the default)
kubectl patch deployment webapp -p '{"spec":{"replicas":3}}'
 
# Patch a ConfigMap (JSON merge patch)
kubectl patch configmap app-config --type merge -p '{"data":{"debug":"false"}}'
 
# Dry-run with server-side validation
kubectl apply -f deployment.yaml --validate=true --dry-run=server
 
# Apply with pruning (delete resources not in manifests)
kubectl apply -f ./k8s/ --prune -l app=webapp
 
# Set resource requests/limits
kubectl set resources deployment/webapp --requests=cpu=100m,memory=128Mi --limits=cpu=200m,memory=256Mi
 
# Set environment variables
kubectl set env deployment/webapp NODE_ENV=production
 
# Copy file from pod
kubectl cp {pod}:/path/to/file /local/path
 
# Copy file to pod
kubectl cp /local/file {pod}:/path/to/file

Helm Commands

# Add chart repo
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
 
# Search for charts
helm search repo postgres
 
# Install chart
helm install my-postgres bitnami/postgresql \
  --namespace databases --create-namespace \
  --set auth.postgresPassword=secret \
  --set primary.persistence.size=50Gi
 
# List releases
helm list -A
 
# Get release values
helm get values my-postgres -n databases
 
# Upgrade release (--reuse-values keeps previous settings)
helm upgrade my-postgres bitnami/postgresql \
  --namespace databases \
  --set primary.persistence.size=100Gi \
  --reuse-values
 
# Roll back to revision 1 (omit the revision number to roll back to the previous release)
helm rollback my-postgres 1 -n databases
 
# Uninstall release
helm uninstall my-postgres -n databases
 
# Preview rendered YAML (no install)
helm template my-postgres bitnami/postgresql --values custom-values.yaml

Kustomize and kubectl debug

# Apply with kustomize overlay
kubectl apply -k overlays/production/
 
# Preview kustomize output
kubectl kustomize overlays/production/
 
# Diff kustomize against live cluster
kubectl diff -k overlays/production/
 
# Debug container (ephemeral, shares PID namespace)
kubectl debug -it pod/{pod} --image=nicolaka/netshoot --target={container}
 
# Debug pod with copy (non-disruptive)
kubectl debug pod/{pod} --copy-to=debug-pod --image=ubuntu --share-processes
 
# Debug node (privileged pod with host filesystem)
kubectl debug node/{node} -it --image=ubuntu

Kustomize is built into kubectl. No Helm required — patches YAML declaratively.

Shortcuts and Aliases

# Add to ~/.bashrc or ~/.zshrc
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgs='kubectl get services'
alias kgd='kubectl get deployments'
alias kdp='kubectl describe pod'
alias kl='kubectl logs -f'
 
# Shell completion (bash)
source <(kubectl completion bash)
complete -o default -F __start_kubectl k
 
# Context and namespace
kubectl config use-context production-cluster
kubectl config set-context --current --namespace=webapp-namespace
kubectl config current-context
 
# krew plugins (plugin manager)
kubectl krew install neat   # Clean YAML (remove managed fields)
kubectl krew install tree   # Resource ownership hierarchy
kubectl krew install ctx    # Fast context switching
kubectl krew install ns     # Fast namespace switching
 
# Custom columns (one JSONPath expression per column)
kubectl get pods -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[*].restartCount,NODE:.spec.nodeName"
 
# Find pods with >5 restarts (checks the first container of each pod)
kubectl get pods -A -o jsonpath='{range .items[?(@.status.containerStatuses[0].restartCount>5)]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'
 
# All node IPs
kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'
 
# Image versions in namespace
kubectl get pods -n my-app -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
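
When a JSONPath filter gets unwieldy, the restart question can also be answered from the plain table output. The sketch below assumes the default column order of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); `high_restarts` is a hypothetical name, and the two sample rows stand in for real cluster output.

```shell
# Hypothetical filter: print namespace/name and restart count for pods whose
# RESTARTS column (5th field of `kubectl get pods -A --no-headers`) exceeds
# a threshold (default 5).
high_restarts() {
  threshold="${1:-5}"
  awk -v t="$threshold" '$5 + 0 > t { print $1 "/" $2 "\t" $5 }'
}

# Sample rows standing in for real cluster output
printf '%s\n' \
  'default   webapp-7d9c   1/1  Running           0             3d' \
  'payments  api-5f6b      0/1  CrashLoopBackOff  12 (2m ago)   1h' \
  | high_restarts 5
```

Against a live cluster the pipeline would be `kubectl get pods -A --no-headers | high_restarts 5`.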

When to use what

| Workload Type | Resource | When | Watch Out |
| --- | --- | --- | --- |
| Stateless services (API, web, workers) | Deployment | Default choice; rolling updates, scale in any order, tolerates pod loss | Don't use for databases or Kafka; they need stable identity and ordering |
| Stateful services (databases, Kafka, Redis) | StatefulSet | Pods have stable names (db-0, db-1); each has its own PVC; ordered startup/shutdown | Deleting a StatefulSet doesn't auto-delete its PVCs; orphaned claims linger, and deleting them by hand destroys the data under a Delete reclaim policy |
| Node agents (logging, monitoring, CNI) | DaemonSet | One pod per node; scales automatically with the cluster; can tolerate node taints | Don't use for APIs; it schedules on every eligible node, including control-plane nodes when tolerations allow |
| Transient work (batch, migration, CI) | Job | Run once or N times in parallel; completes and exits | CronJobs allow concurrent runs by default; set concurrencyPolicy: Forbid to prevent pile-up |
| External service integration | ExternalName Service | Route to an external domain (RDS, managed database, SaaS API) | DNS CNAME only; no proxying or load balancing within the cluster |
| Internal service discovery | ClusterIP Service | Default; pods find each other by DNS name within the cluster | Pod IP changes don't break the Service: its virtual IP is stable and endpoints update automatically (DNS only resolves the Service name) |
| Dev/testing external access | NodePort Service | Cheap external access during development; exposes a port in the 30000-32767 range on every node | Avoid in production: clients must target individual node IPs with no external load balancer or failover in front, and each Service reserves its port on every node |
| Production external traffic | LoadBalancer Service | Cloud provider LB (AWS, GCP, Azure); can integrate with provider DNS and certificate management | Expensive per service; use Ingress for 10+ services |
| HTTP(S) routing by hostname/path | Ingress | Route api.example.com and web.example.com to different services; TLS termination; path-based routing | A single Ingress per domain saves costs; complex rules get hard to debug |
| Newer API-driven networking | Gateway API | More flexible than Ingress; standardizes cross-cloud routing (Kubernetes + OpenShift + Envoy) | Core resources are GA, but feature coverage varies; not all controllers and CNIs support it yet |
| Prevent cluster resource exhaustion | ResourceQuota + LimitRange | Quota = namespace-level hard limits; LimitRange = per-container defaults and bounds | Quota rejects new pods once the namespace is full; test quota limits before prod |
| Horizontal scaling by CPU | HPA (HorizontalPodAutoscaler) | React to CPU/memory surges in minutes; cost-effective for unpredictable traffic | Slow: takes 1-3 min to spin up new pods; won't save you from sudden traffic spikes |
| Vertical scaling (bigger pods) | VPA (VerticalPodAutoscaler) | Auto-adjust resource requests based on actual usage; prevents OOM without manual tuning | Applies changes by recreating pods; needs multiple replicas to be safe (a single-replica StatefulSet goes down during each resize) |
| Fast scaling on external metrics | KEDA (Kubernetes Event-driven Autoscaling) | Scale on queue depth, HTTP latency, Prometheus queries (not just CPU) | More complex; separate component to maintain |
| Node maintenance shouldn't break service | Pod Disruption Budget (PDB) | Set minAvailable: 2 for critical services; protects against voluntary disruptions | Too strict (minAvailable equal to replicas) blocks node drains indefinitely |

Gotchas that bite in production

  1. Force-deleting a pod with --grace-period=0 --force kills mid-request traffic

    • You force-delete a pod with kubectl delete pod myapp-0 --grace-period=0 --force. The pod is removed from the API immediately and the container gets no time to shut down cleanly (no shutdown hook). In-flight requests fail with connection reset. SLA breach.
    • Fix: Keep the default grace period of 30 seconds. The pod receives SIGTERM, has 30 seconds to drain requests, then gets SIGKILL. Reserve --grace-period=0 --force for truly stuck pods. Always pair graceful shutdown with a preStop hook: lifecycle: { preStop: { exec: { command: ["/bin/sh", "-c", "sleep 15"] } } }. The sleep delays SIGTERM so endpoint removal can propagate and in-flight requests can finish.
  2. OOMKilled pods restart silently; restart-count alerts fire too late

    • Pod is leaking memory and gets OOMKilled every 2 minutes. Kubelet restarts it automatically. Your alert only fires on "pod restarts > 5 in 10 minutes", a threshold that a pod dying every 2 minutes (5 restarts in 10 minutes) never crosses. Users hit errors for 30 minutes before anyone notices.
    • Fix: Alert on the termination reason, not just the restart rate: kubectl describe pod shows Last State: Terminated with Reason: OOMKilled and Exit Code: 137. Set the memory limit with headroom above the measured peak so that only a genuine leak reaches it, and monitor container_memory_working_set_bytes in Prometheus for creeping growth.
  3. Missing readiness probe means Kubernetes sends traffic before app is ready

    • Pod starts and, with no readiness probe defined, is marked Ready as soon as its containers are running, so it is added to the Service endpoints right away. App is still initializing (connecting to DB, warming caches). The first requests fail with 503.
    • Fix: Always define readinessProbe: { httpGet: { path: /health, port: 8080 }, initialDelaySeconds: 5, periodSeconds: 5 }. Kubernetes waits for the probe to pass before adding the pod to Service endpoints. Set initialDelaySeconds to cover your typical startup time, or add a startupProbe for slow starters.
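  4. Deleting a PVC destroys the disk when the reclaim policy is Delete (the default for dynamically provisioned volumes)

    • You kubectl delete statefulset my-db to "clean up", then delete the leftover PVCs (or the whole namespace). The StatefulSet's PVCs survive the first step by default, but deleting them removes the bound PersistentVolumes and their disks. Your 500GB database is gone. Restore from backup if you have one (you do, right?).
    • Fix: For databases, set reclaimPolicy: Retain on the StorageClass, or persistentVolumeReclaimPolicy: Retain on existing PersistentVolumes. With Retain, the volume and its data outlive a deleted PVC; you can clean it up manually or bind it to a new claim. For dev/test, use Delete. For production stateful workloads, always use Retain.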
  5. HPA can't keep up with traffic spikes; pods are still scaling while users error out

    • Traffic spikes from 10 to 1,000 RPS. HPA sees CPU above its 80% target and triggers a scale from 3 to 20 pods. It takes 2 minutes to provision 17 new pods and pass readiness checks. Meanwhile, the original 3 pods are getting 333 RPS each, timing out.
    • Fix: Set the HPA's behavior.scaleDown.stabilizationWindowSeconds: 300 (don't scale down for 5 min) to prevent flapping. Pair with PodDisruptionBudget: minAvailable: 2 so cluster maintenance doesn't evict pods during scale-up. For predictable spikes, use time-based scaling (cron HPA) or pre-warm clusters. [Kubernetes docs]
  6. Kubernetes DNS not resolving in pods because CoreDNS is stuck or evicted

    • Pod can't reach db.default.svc.cluster.local. nslookup times out. Whole app degraded. You think it's a network issue; actually the CoreDNS pod was evicted and hasn't restarted.
    • Fix: Run kubectl get pods -n kube-system | grep coredns and confirm at least 2 replicas are Running. Protect CoreDNS from eviction by giving it adequate resource requests and checking that it runs with priorityClassName: system-cluster-critical, so it is among the last pods evicted under node pressure. As a canary, run kubectl run dns-test --image=busybox:1.36 -it --rm --restart=Never -- nslookup kubernetes.default.svc.cluster.local and watch the response time.
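
For gotcha 2, a back-of-the-envelope estimate shows how little time a leak leaves before the OOM killer steps in. The sketch assumes linear growth and a leak rate above zero; `minutes_to_oom` is a hypothetical helper, and the 200Mi starting usage is an illustrative figure, not a measurement.

```shell
# Hypothetical helper: whole minutes until a linearly leaking container
# reaches its memory limit. All values in Mi; integer division rounds down.
# Assumes leak_mi_per_min is greater than zero.
minutes_to_oom() {
  limit_mi="$1"; usage_mi="$2"; leak_mi_per_min="$3"
  echo $(( (limit_mi - usage_mi) / leak_mi_per_min ))
}

# 512Mi limit, 200Mi in use now, leaking 30Mi per minute
minutes_to_oom 512 200 30
```

Current usage can be read from `kubectl top pod {pod}`; sampling it twice a few minutes apart gives a rough leak rate.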

Production Checklist

  • Resources: every pod has requests and limits (prevents starvation and runaway consumption)
  • Probes: liveness and readiness probes defined (Kubernetes can restart hung containers and stop routing traffic to pods that aren't ready)
  • PVC reclaim: reclaimPolicy: Retain for databases (prevents accidental deletion)
  • PDB: minAvailable set for critical services (protects against voluntary disruptions)
  • RBAC: ServiceAccount restricted to minimal permissions (principle of least privilege)
  • Secrets: use external manager (Vault, Sealed Secrets) or encryption at rest
  • Events: monitor cluster events regularly (kubectl get events)
  • Quotas: set ResourceQuota per namespace (prevents runaway resource consumption)

Frequently Asked Questions

How do I find a pod by label?

Use kubectl get pods -l app=webapp,env=production to filter by one or more labels. Combine with -A to search across all namespaces.

Why is my pod stuck in Pending?

Run kubectl describe pod {pod} and read the Events section. Common causes: insufficient cluster resources, an unbound PVC, a node selector that no node satisfies, or a missing image pull secret.

How do I capture traffic from a pod?

Use ephemeral debug containers: kubectl debug -it pod/{pod} --image=nicolaka/netshoot --target={container}, then tcpdump -i eth0 -w capture.pcap from inside the debug container.

Can I edit a running pod?

Mostly no. Nearly all Pod spec fields are immutable, and anything you do change is lost when the controller recreates the pod. Edit the Deployment spec instead: kubectl set image deployment/{name} {container}={image}, or edit the YAML and kubectl apply -f.

What's the difference between kubectl exec and kubectl debug?

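exec runs a command inside an existing container, so the image must ship a shell or the binary you need. debug attaches an ephemeral container with its own tooling (works against distroless images); it shares the pod's network namespace and, with --target, the target container's process namespace.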

How do I know if my change will break anything?

Always kubectl diff -f deployment.yaml or use --dry-run=server before applying. For Helm: helm template to render the chart locally and review the output.
