Essential Linux Commands: A Backend Engineer's Cheat Sheet
Key Takeaways
- `ss -tnp | wc -l` revealed 48,000 orphaned TCP connections when latency spiked from 50ms to 2,000ms. A connection count far above normal points to exhausted pools and leaked sockets.
- `lsof -a -i -p <PID>` lists the network sockets a process holds (drop `-a -i` to see every open file descriptor). A missing `Close()` call leaks connections; seeing 5,000+ sockets on one PID confirms it.
- `curl -w "%{time_total}s"` reports total request time; adding `%{time_namelookup}`, `%{time_connect}`, and `%{time_appconnect}` splits it into DNS, TCP, and TLS phases. If the dashboard shows 200ms and curl attributes 180ms of it to the TLS handshake, the handshake is the real bottleneck.
- `journalctl -u <service> -b -1` shows logs from the previous boot. When a service failed before a reboot, the current boot's logs do not contain the error; boot -1 does, provided the journal is persistent.
- `dmesg | grep -i oom` or `journalctl -k` catches kernel OOM-killer activity that application logs never mention: the process was killed outright, so it never had the chance to log an error.
Production Debugging: When You SSH Into a Burning Server
An e-commerce API's latency spiked from 50ms to 2,000ms. Dashboards showed nothing wrong. One `ss -tnp | wc -l` showed 48,000 TCP connections. One `lsof -i` run pointed to a missing `Close()` call in a recent deploy. Finding it required knowing which Linux commands to reach for.

- `ss -tnp`: TCP connections with process owner (reveals hung clients, exhausted pools)
- `lsof -a -i -p <PID>`: which sockets is the process holding?
- `top -o %CPU`: which process is the CPU hog?
- `journalctl -u <service> -n 50`: what did the service log?
- `curl -w "%{time_total}s"`: total API latency (add `%{time_namelookup}` and `%{time_appconnect}` to see the DNS and TLS share)
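A raw line count says how many connections exist but not what state they are in. The sketch below groups `ss` output by TCP state; it assumes `ss` from iproute2 is installed, and `count_states` is a hypothetical helper name used only for this illustration.

```shell
# Count sockets per TCP state from `ss -tan` output.
# The state is the first column; the first line is the header, so skip it.
count_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}

# Run against the live socket table when ss is available
if command -v ss >/dev/null 2>&1; then
    ss -tan | count_states
fi
```

A pile of `CLOSE-WAIT` sockets usually means the local application is not closing connections, which fits a missing `Close()` call; a pile of `TIME-WAIT` points at connection churn instead.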
Triage by Symptom
When you SSH into a burning server, the question is "where is the bottleneck" — not "what does each command do." Route by symptom, not by command name:
```mermaid
graph TD
Burning[Server is on fire] --> Sym{What is<br/>the symptom?}
Sym -->|High CPU| CPU[top -o %CPU<br/>+ pidstat -u 1<br/>+ perf top]
Sym -->|High memory| Mem[free -h<br/>+ ps aux --sort=-%mem<br/>+ smem]
Sym -->|Disk full or slow| Disk[df -h<br/>+ du -sh */<br/>+ iostat -x 1]
Sym -->|Network slow or stuck| Net[ss -tnp<br/>+ ss -s<br/>+ tcpdump -i any port X]
Sym -->|Process stuck| Proc[strace -p PID<br/>+ lsof -p PID<br/>+ cat /proc/PID/wchan]
Sym -->|Service down| Svc[systemctl status<br/>+ journalctl -u svc -b<br/>+ journalctl -p err]
CPU -->|Single hot core| Hot[mpstat -P ALL 1<br/>find which thread]
Mem -->|OOMKilled| OOM["dmesg | grep -i kill<br/>find which proc died"]
Net -->|TCP exhaustion| TCP[ss -s shows TIME-WAIT spike<br/>fix: close, pool, or kernel knobs]
style Burning fill:#fdd
style OOM fill:#fdd
style TCP fill:#fdd
style Hot fill:#ffd
```
The diagram is the entire production-debugging discipline in one picture: classify the symptom, pick three commands, never start with "let me run htop and see what's happening."
The Essential Dozen
| Command | Use Case |
|---|---|
| `ss -tnp` | TCP connections by process; `ss -tnp state established` to filter |
| `lsof -i :8080` | What process owns a port? |
| `top -o %CPU` / `htop` | CPU/memory per process |
| `journalctl -u nginx -n 50` | Service logs |
| `grep -c "ERROR" app.log` | Count matching lines |
| `awk '{print $1}' file \| sort \| uniq -c` | Frequency analysis |
| `find / -size +100M 2>/dev/null` | Large files |
| `du -sh /path` / `df -h` | Disk usage |
| `curl -w "%{time_total}s"` | API latency |
| `nc -zv host port` | Firewall test |
| `ssh -L 5432:db:5432 jump` | Port forward |
| `dmesg \| grep -i oom` | OOM kill check |
File & Text Inspection
[Linux kernel docs]

| Command | Example |
|---|---|
| `ls -lt` / `stat /path` | `ls -lt` sorts by modification time; `stat /etc/nginx/nginx.conf` shows metadata |
| `find . -name "*.log" -mtime -1` / `find . -size +100M` | Find files by name/date/size |
| `tail -f /var/log/app.log` / `less` | Follow logs / page through a file |
| `grep -c "ERROR" app.log` | Count lines matching pattern |
| `awk '{print $1, $3}'` / `cut -d',' -f1,3` | Extract columns from text |
| `sort \| uniq -c \| sort -rn` | Count duplicates and rank by frequency |
| `sed 's/old/new/g'` | Stream replace (use `sed -i.bak` to edit in place and keep a backup) |
| `jq '.field'` / `jq '.data[] \| .id'` | Parse JSON on CLI |
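The `awk`, `sort`, and `uniq -c` rows above combine into the usual frequency-analysis pipeline. A minimal sketch follows; `top_values` is a hypothetical helper name and the log lines are made up for the demonstration.

```shell
# Rank the values found in one whitespace-separated column by frequency.
# Usage: top_values <column-number> < file
top_values() {
    awk -v col="$1" '{ print $col }' | sort | uniq -c | sort -rn
}

# Made-up access-log lines: client IP, method, path
printf '%s\n' \
    '10.0.0.1 GET /health' \
    '10.0.0.2 GET /orders' \
    '10.0.0.1 POST /orders' \
    '10.0.0.1 GET /orders' |
    top_values 1
```

The first `sort` is required because `uniq -c` only merges adjacent duplicates; the second one ranks the counts, so the busiest client ends up on the first line.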
Process & Performance Monitoring
[Linux kernel docs]

The PID → process tree → resource view in one diagram. Every triage starts at one of these vertices:

```mermaid
graph TB
PID[PID 1234] --> Top[top / htop<br/>CPU + memory ranking]
PID --> Files[lsof -p 1234<br/>open fds, sockets, files]
PID --> Stack[strace / py-spy / pprof<br/>syscall + stack profile]
PID --> Tree[pstree -p 1234<br/>parent + children + threads]
Tree --> Children[Child PIDs<br/>recursive triage]
System[System-wide<br/>aggregates] --> VM[vmstat 1<br/>cpu, mem, swap, io]
System --> Free[free -h<br/>RAM + swap totals]
System --> Disk[df -h<br/>du -sh path<br/>iostat -x]
System --> Kern[dmesg + journalctl -k<br/>OOM-killer, page faults,<br/>kernel panics]
Kern -.->|grep -i 'killed process'| OOM[OOM-killer victim<br/>+ rss at kill time]
style OOM fill:#fdd
style Top fill:#dfd
style VM fill:#dfd
```
| Command | Example |
|---|---|
| `top -o %CPU` / `htop` / `ps aux --sort=-%cpu` | Find CPU hogs; `htop` has the better UI |
| `pgrep nginx` / `pidof nginx` | Find PIDs by process name for killing/tracing |
| `lsof -p <PID>` / `lsof -i :8080` | What's open / which process owns a port? |
| `vmstat 1 10` / `free -h` / `iostat -x` | Memory pressure, swap, I/O load |
| `df -h` / `du -sh /path` | Filesystem full? Which directory is large? |
| `dmesg \| grep -i oom` / `journalctl -k` | Kernel logs (OOM-killer, crashes) |
Service Management (systemd & journalctl)
[systemd manual]

| Command | Example |
|---|---|
| `systemctl status <service>` | Status and recent logs; `systemctl restart nginx` to cycle |
| `systemctl enable <service>` / `systemctl disable <service>` | Start on boot / don't start on boot |
| `journalctl -u <service> -n 50` / `journalctl -u <service> -f` | Last 50 lines / follow logs live |
| `journalctl -u <service> -b -1` | Logs from previous boot (diagnose startup failures) |
| `journalctl -k` / `dmesg` | Kernel messages (OOM, driver crashes) |
| `systemctl --failed` | List crashed services |

Crashed service: check `systemctl status app`, then `journalctl -u app -n 100`, then `journalctl -u app -b -1` if it failed during the previous boot.
Networking & Connectivity
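[iproute2 (ss/ip)]

| Command | Example |
|---|---|
| `ss -tulnp` / `ss -tnp` | Listening TCP/UDP ports / connected TCP sockets |
| `ping host` / `traceroute host` | Basic connectivity / show route to destination |
| `dig +short host` / `dig host MX` | DNS lookup / specific record types |
| `nc -zv host port` | Firewall test (is port open/reachable?) |
| `curl -s -o /dev/null -w "%{time_total}s\n" <URL>` | Total API latency; add `%{time_namelookup}`, `%{time_connect}`, `%{time_appconnect}`, `%{time_starttransfer}` for a DNS/TCP/TLS/server breakdown, or load the format from a file with `-w "@file"` |
| `curl -X POST -H 'Content-Type: application/json' -d '{"key":"val"}' <URL>` | POST JSON to API (without the header, curl sends form encoding) |
| `ip addr show` | Local IPs and network interfaces |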
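curl's timing variables are cumulative timestamps measured from the start of the request, so a per-phase breakdown means subtracting neighbours. The sketch below shows one way to do that; `CURL_FMT` and `phase_times` are hypothetical names, the URL in the comment is a placeholder, and the sample timestamps are made up so the demonstration runs offline.

```shell
# Format string for curl -w: five cumulative timestamps in seconds.
CURL_FMT='%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n'

# Convert the cumulative timestamps into per-phase durations in milliseconds.
phase_times() {
    awk '{ printf "dns=%.0fms tcp=%.0fms tls=%.0fms server=%.0fms transfer=%.0fms\n",
           $1 * 1000, ($2 - $1) * 1000, ($3 - $2) * 1000, ($4 - $3) * 1000, ($5 - $4) * 1000 }'
}

# Against a real endpoint (placeholder URL):
#   curl -s -o /dev/null -w "$CURL_FMT" https://api.example.com/health | phase_times

# Offline demonstration with made-up timestamps
echo '0.004 0.010 0.190 0.198 0.200' | phase_times
```

For plain HTTP, `time_appconnect` is reported as 0, so the `tls` figure only makes sense for HTTPS URLs.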
SSH & Permissions
| Command | Example |
|---|---|
| `ssh -L 5432:db:5432 jump` | Local port forward through jump host |
| `ssh -J jump prod` / `ssh -D 1080 jump` | Connect through a jump host / SOCKS5 proxy |
| `ssh-keygen -t ed25519` | Generate modern SSH key |
| `ssh-copy-id -i ~/.ssh/key.pub host` | Install public key (passwordless login) |
| `chmod 755` / `chmod 644` / `chmod 600` | Executable or directory / regular file / private key |
| `chown user:group file` | Change owner:group |
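The `chmod 600` row matters because ssh ignores a private key that group or others can access. A minimal sketch of a pre-flight check follows; `check_key_mode` is a hypothetical helper, and it assumes GNU `stat` (the `-c %a` option prints the octal mode).

```shell
# Report whether a key file is private enough: no group or other permission bits.
check_key_mode() {
    mode=$(stat -c %a "$1") || return 2
    # Prefix 0 so the shell reads the mode as octal, then mask group/other bits
    if [ $(( 0$mode & 077 )) -eq 0 ]; then
        echo "ok: mode $mode"
    else
        echo "too open: mode $mode"
        return 1
    fi
}

# Demonstration on a throwaway file rather than a real key
key=$(mktemp)
chmod 644 "$key"
check_key_mode "$key" || chmod 600 "$key"
check_key_mode "$key"
rm -f "$key"
```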
Firewall & Scheduling
| Command | Example |
|---|---|
| `ufw allow 443/tcp` / `ufw status` | Ubuntu/Debian firewall |
| `firewall-cmd --permanent --add-port=8080/tcp && firewall-cmd --reload` | RHEL/CentOS firewall |
| `crontab -e` / `crontab -l` | Edit/list cron jobs (`0 2 * * *` = daily at 2 AM) |
| `systemctl list-timers` | List systemd timers, the systemd alternative to cron |
Cron gotcha: Minimal $PATH — use absolute paths.
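One way to catch the `$PATH` problem before the job ever runs is to execute the command under an environment as bare as cron's. The sketch below is an approximation: `run_like_cron` is a hypothetical helper, and the `PATH` value `/usr/bin:/bin` is an assumption about your cron daemon, so check `crontab(5)` on your distribution.

```shell
# Run a command with an empty environment plus a cron-style minimal PATH.
run_like_cron() {
    env -i HOME="${HOME:-/}" SHELL=/bin/sh PATH=/usr/bin:/bin /bin/sh -c "$1"
}

# Show the PATH the command will actually see
run_like_cron 'echo "PATH=$PATH"'

# Check whether a tool the script depends on is still reachable
run_like_cron 'command -v jq || echo "jq is not on the cron PATH"'
```

If the second command prints the fallback message, a script that calls bare `jq` would fail under cron even though it works in your login shell.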
Disk & Kernel Tuning
[Linux kernel docs]

| Command | Example |
|---|---|
| `lsblk -f` / `df -h` | Disk tree / filesystem usage |
| `du -sh /path` | Directory size (find bloated dirs) |
| `mount /dev/sdb /mnt` / `mount -a` | Mount disk / mount all from fstab |
| `sysctl net.core.somaxconn` / `sysctl -w net.core.somaxconn=65535` | View/set kernel parameter |
| `ulimit -n` / `cat /proc/<PID>/limits` | Check max open files / per-process limits |

Key kernel tunables: `net.core.somaxconn=65535`, `fs.file-max=2097152` (sysctl takes a plain integer, not `2M`), `vm.swappiness=10`

Note: Open-file limits live at 3 layers (limits.conf, systemd `LimitNOFILE`, sysctl `fs.file-max`). limits.conf applies to PAM login sessions, `LimitNOFILE` applies to systemd services, and `fs.file-max` caps the whole system, so set the layer that actually launches the process and confirm with `cat /proc/<PID>/limits`.
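Because the configured value and the value a process actually received can differ, it helps to read both the limit and the current usage straight from `/proc`. A minimal sketch, assuming a Linux host with `/proc` mounted; `fd_usage` is a hypothetical helper name.

```shell
# Print the open-descriptor count and the soft/hard "Max open files" limits for a PID.
fd_usage() {
    pid=$1
    # Each entry in /proc/<pid>/fd is one open descriptor
    open=$(ls "/proc/$pid/fd" | wc -l)
    # Columns 4 and 5 of the "Max open files" row are the soft and hard limits
    limits=$(awk '/^Max open files/ { print $4 "/" $5 }' "/proc/$pid/limits")
    echo "pid=$pid open=$open soft/hard=$limits"
}

# Inspect the current shell; pass a service PID on a real host
fd_usage "$$"
```

When `open` approaches the soft limit, the next `accept()` or `open()` call in that process fails with "Too many open files", regardless of what the config files say.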
Advanced Debugging
[Linux kernel docs]

| Command | Example |
|---|---|
| `strace -p <PID>` / `strace -e trace=network -p <PID>` | Trace syscalls / network only |
| `strace -c -p <PID>` | Count syscalls (Ctrl+C for summary) |
| `lsof -p <PID>` / `lsof -i :8080` | What's open / which process owns port? |
| `perf record -g -p <PID> -- sleep 30 && perf report` | CPU profiling with call graphs |
| `tcpdump -i any port 5432 -w /tmp/cap.pcap` | Capture packets (open in Wireshark) |
| `openssl s_client -connect host:443 </dev/null 2>/dev/null \| openssl x509 -noout -enddate` | Check TLS cert expiry |
| `tar -czf backup.tar.gz /data` / `tar -xzf backup.tar.gz` | Create / extract a gzip archive |
| `set -euo pipefail` | Shell safety (use at the top of bash scripts) |
When to use: strace = syscalls. lsof = open state. perf = CPU hotspots. tcpdump = packets.
When to use what
| Task | Modern Choice | Legacy Alternative | When Legacy Wins |
|---|---|---|---|
| Search files for text | ripgrep (rg) | grep -r | grep is built-in; rg requires installation; grep is sufficient for small codebases |
| Find files by name | fd | find | fd is faster; find is built-in; find works when fd not installed |
| Locate file by name (indexed) | locate | updatedb && locate | locate is fast when its index is current; a file created since the last updatedb run is not found until the index is refreshed |
| Monitor processes live | htop or btop | top | htop/btop have better UI; top is always available; use top in scripts |
| Monitor resources interactively | btop | top | btop prettier, more intuitive; top is universally available |
| HTTP requests from CLI | httpie (http) | curl | httpie syntax cleaner; curl is always installed; use curl for scripts and automation |
| Manage systemd services | systemctl | service | systemctl standard; service still works (translates to systemctl); both equivalent |
| View service logs | journalctl -u service -f | tail -f /var/log/... | journalctl structured; older systems use syslog files; journalctl is standard on modern Linux |
| Connection inspection | ss -tnp | netstat -tulnp | ss is faster, netstat deprecated; netstat still works on legacy systems |
| Process inspection by port | lsof -i :8080 | netstat -tulnp \| grep 8080 | lsof is clearer; netstat more portable to BSD; both work |
fd note: On Debian/Ubuntu the apt package is `fd-find` and the binary installs as `fdfind` (the `fd` name collides with an existing package). Run it as `fdfind`, or symlink it: `ln -s "$(command -v fdfind)" ~/.local/bin/fd`.
Gotchas that bite in production
- `rm -rf "$VAR"/*` with an empty variable expands to `rm -rf /*` (deletes everything)
  - You're cleaning up temp files: `rm -rf "$BUILD_DIR"/*`. The variable `BUILD_DIR` is empty (unset), so the command becomes `rm -rf /*`. GNU `rm` refuses a literal `/` by default, but here the shell has already expanded `/*` into every top-level directory, so nothing stops it. Within seconds the server is a brick.
  - Fix: Use `rm -rf "${BUILD_DIR:?BUILD_DIR is not set}"/*`, which aborts with that message if the variable is unset or empty. Or put `set -u` at the top of scripts to error on unset variables (it does not catch empty ones). Check before deletion: `test -n "$BUILD_DIR" && rm -rf "$BUILD_DIR"/*`.
- OOM-killed process is silent; exits with code 137 but logs nothing
  - The app leaks memory and the kernel OOM-killer terminates it with SIGKILL. No stack trace, no "OutOfMemory" message. The process just dies, and systemd restarts it silently. You see exit code 137 (128 + signal 9) in the logs but assume it's an ordinary crash.
  - Fix: Check kernel logs: `dmesg | grep -i oom` or `journalctl -k | grep -i oom`. Cap the unit's memory so a leak fails fast and visibly: `systemctl set-property myapp.service MemoryMax=512M` (`MemoryMax` replaces the deprecated `MemoryLimit`). Monitor `free -h` and `ps aux --sort=-%mem` during load testing to catch leaks early.
- Cron jobs fail silently because `$PATH` is minimal (missing `/usr/local/bin`)
  - You add a cron entry: `0 2 * * * /usr/local/bin/backup.sh`. Cron's `$PATH` is `/usr/bin:/bin`. The script calls `jq` (installed at `/usr/local/bin/jq`), so the job fails with "command not found" and nobody sees it. Backups don't run for 2 weeks. You only notice during an audit.
  - Fix: Use absolute paths in scripts: `/usr/local/bin/jq`, not `jq`. Or set `PATH` at the top of the crontab: `PATH=/usr/local/bin:/usr/bin:/bin`. Test under cron's minimal environment: `env -i HOME="$HOME" PATH=/usr/bin:/bin /bin/sh -c /usr/local/bin/backup.sh`.
- ulimit changes don't stick; limits are set at 3 layers (limits.conf, systemd, sysctl)
  - You set `ulimit -n 65535` in a shell. It works. You restart the service. Back to the default 1024. You edit `/etc/security/limits.conf`, restart, still 1024. `ulimit` only affects the current shell and its children, limits.conf only applies to PAM login sessions, and a systemd service takes its limit from `LimitNOFILE`; sysctl `fs.file-max` caps the system-wide total on top of that.
  - Fix: Set all 3 layers together: `echo "fs.file-max = 2097152" >> /etc/sysctl.conf && sysctl -p`; then in `/etc/security/limits.conf` add the lines `app soft nofile 65535` and `app hard nofile 65535`; then for systemd run `systemctl edit myapp.service`, add `LimitNOFILE=65535` under `[Service]`, and restart the service. Verify: `cat /proc/<PID>/limits` shows the limits actually in effect.
- SSH key with permissions looser than 600 is rejected
  - You copy a private key to a new machine with the default `umask 0022`. The key is `-rw-r--r--` (644). SSH warns that the permissions on `/home/user/.ssh/id_ed25519` are too open and ignores the key file. You spend 1 hour thinking there's a key mismatch or agent issue.
  - Fix: Always `chmod 600 ~/.ssh/id_*` after copying keys. Or set the umask before copying: `umask 077 && cp key ~/.ssh/`. SSH is paranoid: it won't use keys that anyone else can read (group/world readable).
- Port is listening but unreachable from outside; the firewall was checked in only one direction
  - The service listens on port 8080: `ss -tlnp | grep 8080` shows it. An external `nc -zv host 8080` times out. You think it's a network issue. Actually, you checked the inbound firewall on the target but not the outbound rules on the source machine.
  - Fix: Check both directions. On the target, allow inbound: `sudo ufw allow 8080/tcp` (Ubuntu) or `firewall-cmd` (RHEL). On the source, allow outbound: `sudo ufw allow out 8080/tcp`. Then probe from the source machine: `nc -zv -w 5 host 8080`. If it still times out, check the firewall on BOTH the target (inbound rule) and the source (outbound rule); many networks restrict outbound ports.
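The empty-variable guard from the first gotcha can be wrapped in a small function so the check cannot be forgotten. The sketch below is one possible shape, not a hardened tool; `clean_build_dir` is a hypothetical helper, and the demonstration only touches a throwaway directory created by `mktemp`.

```shell
# Delete the contents of a directory, refusing to run when the argument is empty.
clean_build_dir() {
    # ${1:?msg} aborts the current (sub)shell with msg if $1 is unset or empty
    dir=${1:?clean_build_dir: directory argument is empty}
    rm -rf -- "$dir"/*
}

# Demonstration on a throwaway directory
work=$(mktemp -d)
touch "$work/a.o" "$work/b.o"
clean_build_dir "$work"
ls -A "$work" | wc -l    # counts what is left in the directory

# An empty argument stops before rm runs; the subshell keeps the abort local
( clean_build_dir "" ) 2>/dev/null || echo "refused: empty path"
rmdir "$work"
```

The `--` ends option parsing, so a directory name that starts with a dash is not mistaken for a flag. The `*` glob skips dotfiles, which is usually what a build cleanup wants.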
Common Gotchas
- `rm -rf "$DIR"/*` with an empty variable: expands to `rm -rf /*`. Always verify the variable.
- OOM-killed process (silent): check `dmesg | grep -i oom`; the kernel doesn't write to app logs.
- Cron "command not found": minimal `$PATH` in cron. Use absolute paths.
- ulimit doesn't stick: check 3 layers: limits.conf, systemd `LimitNOFILE`, sysctl `fs.file-max`. A systemd service ignores limits.conf.
- SSH key permissions looser than 600: SSH refuses them. `chmod 600 ~/.ssh/id_ed25519`.
- Port open but unreachable: check firewall both directions (inbound and outbound).
Performance triage in 90 seconds
When the on-call alert hits, you have time for one connect, one paste, one decision. The five commands below answer the questions "is the box healthy" and "where is the bottleneck" before you've finished reading the alert. Run them top-to-bottom, look at the output once, and route to the right deep dive.
```shell
# 1. top: what is hot right now (one batch-mode snapshot, sorted by CPU)
top -b -n 1 -o %CPU | head -20
# 2. vmstat: run-queue depth, context switches, swap activity (5 samples, 1s apart)
vmstat 1 5
# 3. iostat: per-device queue depth, await, %util (the disk truth)
iostat -xz 1 3
# 4. ss: socket summary (established/time-wait counts), then a sample of established connections
ss -s && ss -tnp state established | head
# 5. dmesg: last 30 kernel warnings and errors (OOM-killer, segfaults, NIC resets, EXT4 errors)
dmesg -T --level=err,warn | tail -30
```

What each line tells you: `top` reveals the heavy process and the load average; load above the core count suggests CPU saturation. `vmstat`'s `r` column is the run-queue depth (anything sustained over `nproc` is CPU-starved); non-zero `si`/`so` columns mean the box is swapping, and you should stop reading and kill the leaker or reboot. `iostat`'s `%util` near 100 with `await` over 50ms is disk-bound; `r_await` > `w_await` points at read-heavy hotspots. `ss -s` shows the socket summary: if `synrecv` or `timewait` keeps climbing you are TCP-saturated, not application-slow. `dmesg` is the only place the kernel admits it killed your process.
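The "load above cores" rule is easy to script for a first glance. The sketch below is a rough heuristic, assuming a Linux host with `/proc/loadavg` and coreutils `nproc`; `load_check` is a hypothetical helper name.

```shell
# Compare the 1-minute load average with the number of available cores.
load_check() {
    cores=$(nproc)
    load=$(cut -d' ' -f1 /proc/loadavg)
    # awk handles the floating-point comparison that plain sh cannot
    awk -v l="$load" -v c="$cores" 'BEGIN {
        printf "load=%s cores=%s %s\n", l, c, (l > c) ? "saturated" : "ok" }'
}

load_check
```

On Linux the load average also counts tasks in uninterruptible sleep (usually waiting on disk), so a "saturated" result is a prompt to look at `vmstat` and `iostat`, not proof of a CPU problem.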
perf for production CPU profiles
When `top` blames a process but you need to know which function inside it is burning CPU, `perf` samples the running stack at 99 Hz and the FlameGraph scripts turn those samples into a flame graph: each bar's width is its share of samples, and each stack grows upward from the root frame. The workflow below profiles PID 1234 for 30 seconds, then renders an SVG you can open in any browser.
```shell
# Capture stack samples (-g = call graphs, -F 99 = 99 Hz, --all-user = skip kernel time)
sudo perf record -F 99 -g --all-user -p 1234 -- sleep 30
# Fold stacks into the format flamegraph.pl expects
sudo perf script | /opt/FlameGraph/stackcollapse-perf.pl > out.folded
# Render the SVG (a single self-contained file, no server needed)
/opt/FlameGraph/flamegraph.pl out.folded > flame.svg
```

Wide plateaus near the top are the functions actually consuming CPU; tall narrow towers are deep call chains that are not hot. Add `-e cycles:pp` for precise sampling on Intel skid-prone loops, and `--call-graph dwarf` if your binaries lack frame pointers (optimized C and C++ builds that omit them are the usual offender).
eBPF tracing without recompiling
bpftrace attaches probes to live kernel and userspace functions without restarting the process — so you can answer "which file is the slow read" or "which syscalls is PID 1234 actually making" on a production box that you cannot recompile. The two one-liners below cover the most common questions.
```shell
# Histogram of read() latency by PID: find the slow reader
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
  tracepoint:syscalls:sys_exit_read /@start[tid]/
  { @ns[pid, comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
# Block I/O latency by device: is the SSD actually slow, or is queueing the problem?
# Keyed by (dev, sector) so concurrent requests on one device do not overwrite each other
sudo bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
  tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/
  { @us[args->dev] = hist((nsecs - @start[args->dev, args->sector]) / 1000);
    delete(@start[args->dev, args->sector]); }'
```

Press Ctrl+C and bpftrace prints power-of-two histograms. The bucket where most events land is your real latency, not the average. Overhead is typically low on modern kernels (5.4+), though it scales with event rate, and the probes detach cleanly on exit, which is why this has largely displaced `strace -c` for production tracing. [Linux kernel docs]
Frequently Asked Questions
What's filling the disk?
Run `du -sh /*` to find the fat directory, then `du -sh /path/*` on it, level by level, to drill in. Typical culprits are `/var/log`, `/var/cache`, and Docker's `/var/lib/docker`.
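The drill-down goes faster when each level is already sorted. A minimal sketch follows, assuming GNU `du` (for `--max-depth`); `biggest` is a hypothetical helper, and the demonstration builds its own throwaway directory tree.

```shell
# List the largest entries directly under a directory, biggest first.
# -x stays on one filesystem; -k reports KiB so a plain numeric sort works.
biggest() {
    du -xk --max-depth=1 "${1:-.}" 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Demonstration tree: one large log file, one small cache file
demo=$(mktemp -d)
mkdir "$demo/logs" "$demo/cache"
head -c 2000000 /dev/urandom > "$demo/logs/app.log"
head -c 100000 /dev/urandom > "$demo/cache/blob"
biggest "$demo" 3
rm -rf "$demo"
```

The first line of output is the total for the directory itself; the lines after it are its children in descending size, so the second line is where to drill next.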
How do I kill a zombie process?
You can't kill a zombie directly: it is already dead and only waiting for its parent to reap it. Find the parent with `ps -o ppid= -p <ZOMBIE>`, then restart the parent or, as a last resort, `kill -9 <PARENT>` to force it to exit (init then adopts and reaps the orphaned zombie).
Is a port open from outside the host?
On the server: `ss -tlnp | grep :8080` shows whether anything is listening. From outside: `nc -zv host 8080` probes connectivity. If the server listens but the probe fails, check the firewall in both directions.
Keep Reading
- Essential Docker Commands Cheat Sheet: Container lifecycle and debugging.
- Essential Kubernetes Commands Cheat Sheet: When the burning service is running on Kubernetes, route through `kubectl` first.
- The 3 Pillars of Observability: Metrics, logs, traces when CLI tools aren't enough.
- SRE: SLOs, SLIs, and Error Budgets: When triage is over, the next question is "did this burn an error budget."
- Go Graceful Shutdown in Production: The application-side counterpart to `journalctl -u svc -b` after a crash loop.
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.