Essential Linux Commands: A Backend Engineer's Cheat Sheet

BackendBytes Engineering Team

Key Takeaways

  • `ss -tnp | wc -l` revealed 48,000 orphaned TCP connections when latency spiked from 50ms to 2,000ms — higher than expected connection counts hide exhausted pools and leaked sockets
  • `lsof -a -i -p <PID>` shows what file descriptors and sockets a process holds — a missing `Close()` call leaks connections; seeing 5,000+ sockets per PID confirms it
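  • `curl -w "%{time_total}s"` reports total request time only; adding `%{time_namelookup}`, `%{time_connect}`, and `%{time_appconnect}` splits it into DNS, TCP, and TLS phases. If the dashboard shows 200ms and curl shows a 180ms TLS handshake, the handshake is the real bottleneck
  • `journalctl -u <service> -b -1` shows logs from the previous boot (this needs persistent journald storage). When a service failed before the last reboot, the current boot's logs will not show it; boot -1 contains the actual error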
  • `dmesg | grep -i oom` or `journalctl -k` catches kernel OOM-killer activity that application logs never mention — the process was forcibly killed, not error-logged

Production Debugging: When You SSH Into a Burning Server

An e-commerce API's latency spiked from 50ms to 2,000ms. Dashboards showed nothing wrong. One ss -tnp | wc -l: 48,000 TCP connections. One lsof -i command showed a single process holding thousands of sockets, which traced back to a missing Close() call in a recent deploy. Finding it required knowing which Linux commands to reach for.
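
A raw line count says how many sockets exist but not what state they are in. The sketch below groups `ss -tan` output by state; `count_tcp_states` is a hypothetical helper name, and it assumes the state is the first column, as in current iproute2 output.

```shell
# Group TCP sockets by state. Reads `ss -tan` output on stdin.
# count_tcp_states is a hypothetical helper, not part of iproute2.
count_tcp_states() {
  awk 'NR > 1 { n[$1]++ }
       END { for (s in n) print n[s], s }' |
    sort -rn    # busiest state first
}

# Usage on a live host:
#   ss -tan | count_tcp_states
```

A pile of ESTAB sockets owned by one PID points at a leak in that process; a pile of TIME-WAIT points at connection churn instead.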

TL;DR
  • ss -tnp — TCP connections with process owner (reveals hung clients, exhausted pools)
  • lsof -a -i -p <PID> — what files/sockets is the process holding?
  • top -o %CPU — which process is the CPU hog?
  • journalctl -u <service> -n 50 — what did the service log?
  • curl -w "%{time_total}s" — total API latency (add time_namelookup, time_connect, time_appconnect for a DNS/TCP/TLS breakdown)

Triage by Symptom

When you SSH into a burning server, the question is "where is the bottleneck" — not "what does each command do." Route by symptom, not by command name:

graph TD
    Burning[Server is on fire] --> Sym{What is<br/>the symptom?}
    Sym -->|High CPU| CPU[top -o %CPU<br/>+ pidstat -u 1<br/>+ perf top]
    Sym -->|High memory| Mem[free -h<br/>+ ps aux --sort=-%mem<br/>+ smem]
    Sym -->|Disk full or slow| Disk[df -h<br/>+ du -sh */<br/>+ iostat -x 1]
    Sym -->|Network slow or stuck| Net[ss -tnp<br/>+ ss -s<br/>+ tcpdump -i any port X]
    Sym -->|Process stuck| Proc[strace -p PID<br/>+ lsof -p PID<br/>+ cat /proc/PID/wchan]
    Sym -->|Service down| Svc[systemctl status<br/>+ journalctl -u svc -b<br/>+ journalctl -p err]
    CPU -->|Single hot core| Hot[mpstat -P ALL 1<br/>find which thread]
    Mem -->|OOMKilled| OOM["dmesg | grep -i kill<br/>find which proc died"]
    Net -->|TCP exhaustion| TCP[ss -s shows TIME-WAIT spike<br/>fix: close, pool, or kernel knobs]
    style Burning fill:#fdd
    style OOM fill:#fdd
    style TCP fill:#fdd
    style Hot fill:#ffd

The diagram is the entire production-debugging discipline in one picture: classify the symptom, pick three commands, never start with "let me run htop and see what's happening."

The Essential Dozen

| Command | Use Case |
| --- | --- |
| `ss -tnp` | TCP connections by process; `ss -tnp state established` for established only (ss prints the state as ESTAB, so grepping for ESTABLISHED matches nothing) |
| `lsof -i :8080` | What process owns a port? |
| `top -o %CPU` / `htop` | CPU/memory per process |
| `journalctl -u nginx -n 50` | Service logs |
| `grep "ERROR" app.log \| wc -l` | Count matching lines |
| `awk '{print $1}' file \| sort \| uniq -c` | Frequency analysis |
| `find / -size +100M 2>/dev/null` | Large files |
| `du -sh /path` / `df -h` | Disk usage |
| `curl -w "%{time_total}s"` | API latency |
| `nc -zv host port` | Firewall test |
| `ssh -L 5432:db:5432 jump` | Port forward |
| `dmesg \| grep -i oom` | OOM kill check |

File & Text Inspection

[Linux kernel docs]
| Command | Purpose |
| --- | --- |
| `ls -lt` / `stat /path` | `ls -lt` → sort by modification time; `stat /etc/nginx/nginx.conf` → metadata |
| `find -name "*.log" -mtime -1` / `find -size +100M` | Find files by name/date/size |
| `tail -f /var/log/app.log` / `less` | Follow logs / pagewise view |
| `grep "ERROR" app.log \| wc -l` | Count lines matching pattern |
| `awk '{print $1, $3}'` / `cut -d',' -f1,3` | Extract columns from text |
| `sort \| uniq -c \| sort -rn` | Deduplicate and rank by frequency |
| `sed 's/old/new/g'` | Stream replace (use `sed -i.bak` to edit in-place) |
| `jq '.field'` / `jq '.data[] \| .id'` | Parse JSON on CLI |
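
The `sort | uniq -c | sort -rn` idiom from the table is easiest to see on a log file. A minimal sketch, assuming whitespace-separated fields with the value of interest in one column; `top_values` is a hypothetical helper and the log path in the usage line is only an example.

```shell
# Rank the values of one whitespace-separated column by frequency.
# top_values is a hypothetical helper; $1 is the 1-based column, $2 the row limit.
top_values() {
  field=${1:-1}
  limit=${2:-10}
  awk -v f="$field" '{ print $f }' |
    sort | uniq -c | sort -rn |
    head -n "$limit" |
    awk '{ print $1, $2 }'    # normalise uniq -c padding to "count value"
}

# Usage (path is an example):
#   top_values 1 5 < /var/log/nginx/access.log
```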

Process & Performance Monitoring

[Linux kernel docs]

The pid → process tree → resource view in one diagram — every triage starts at one of these vertices:

graph TB
    PID[PID 1234] --> Top[top / htop<br/>CPU + memory ranking]
    PID --> Files[lsof -p 1234<br/>open fds, sockets, files]
    PID --> Stack[strace / py-spy / pprof<br/>syscall + stack profile]
    PID --> Tree[pstree -p 1234<br/>parent + children + threads]
    Tree --> Children[Child PIDs<br/>recursive triage]
    System[System-wide<br/>aggregates] --> VM[vmstat 1<br/>cpu, mem, swap, io]
    System --> Free[free -h<br/>RAM + swap totals]
    System --> Disk[df -h<br/>du -sh path<br/>iostat -x]
    System --> Kern[dmesg + journalctl -k<br/>OOM-killer, page faults,<br/>kernel panics]
    Kern -.->|grep -i 'killed process'| OOM[OOM-killer victim<br/>+ rss at kill time]
    style OOM fill:#fdd
    style Top fill:#dfd
    style VM fill:#dfd

| Command | Purpose |
| --- | --- |
| `top -o %CPU` / `htop` / `ps aux --sort=-%cpu` | Find CPU hogs; htop has better UI |
| `pgrep nginx` / `pidof nginx` | Find process by name for killing/tracing |
| `lsof -p <PID>` / `lsof -i :8080` | What's open / which process owns a port? |
| `vmstat 1 10` / `free -h` / `iostat -x` | Memory pressure, swap, I/O load |
| `df -h` / `du -sh /path` | Filesystem full? Which directory is large? |
| `dmesg \| grep -i oom` / `journalctl -k` | Kernel logs (OOM-killer, crashes) |

Service Management (systemd & journalctl)

[systemd manual]
| Command | Purpose |
| --- | --- |
| `systemctl status <service>` | Status and recent logs; `systemctl restart nginx` to cycle |
| `systemctl enable <service>` / `systemctl disable <service>` | Start on boot / don't start on boot |
| `journalctl -u <service> -n 50` / `journalctl -u <service> -f` | Last 50 lines / follow logs live |
| `journalctl -u <service> -b -1` | Logs from previous boot (diagnose startup failures) |
| `journalctl -k` / `dmesg` | Kernel messages (OOM, driver crashes) |
| `systemctl --failed` | List crashed services |

Crashed service: check systemctl status app, then journalctl -u app -n 100, then journalctl -u app -b -1 if boot failure.

Networking & Connectivity

[iproute2 (ss/ip)]
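| Command | Purpose |
| --- | --- |
| `ss -tulnp` / `ss -tnp` | Listening ports / open (non-listening) TCP connections |
| `ping host` / `traceroute host` | Basic connectivity / show route to destination |
| `dig +short host` / `dig host MX` | DNS lookup / specific record types |
| `nc -zv host port` | Firewall test (is port open/reachable?) |
| `curl -w "%{time_total}s" -o /dev/null` | Total API latency; add `%{time_namelookup}`, `%{time_connect}`, `%{time_appconnect}` for a DNS/TCP/TLS breakdown, or use `-w "@-"` to read the format from stdin |
| `curl -X POST -H 'Content-Type: application/json' -d '{"key":"val"}'` | POST JSON to API (without the header, curl sends form encoding) |
| `ip addr show` | Local IPs and network interfaces |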
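
curl's timing variables are cumulative timestamps, so a per-phase breakdown means subtracting neighbours. The sketch below assumes an HTTPS endpoint (for plain HTTP `time_appconnect` is 0 and the TLS figure is meaningless); `phase_breakdown`, `CURL_TIMING_FORMAT`, and the URL are illustrative names, not part of curl.

```shell
# Five cumulative timestamps, in seconds, in the order curl reaches them.
CURL_TIMING_FORMAT='%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n'

# Convert the five cumulative timings (on stdin) into per-phase milliseconds.
phase_breakdown() {
  awk '{ printf "dns=%.0fms tcp=%.0fms tls=%.0fms server=%.0fms transfer=%.0fms total=%.0fms\n",
         $1 * 1000, ($2 - $1) * 1000, ($3 - $2) * 1000,
         ($4 - $3) * 1000, ($5 - $4) * 1000, $5 * 1000 }'
}

# Usage (URL is a placeholder):
#   curl -s -o /dev/null -w "$CURL_TIMING_FORMAT" https://api.example.com/health | phase_breakdown
```

With timings of 0.012, 0.030, 0.210, 0.250, and 0.260 seconds, the TLS phase comes out at 180ms of a 260ms request, the kind of split a single `time_total` figure cannot show.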

SSH & Permissions

| Command | Purpose |
| --- | --- |
| `ssh -L 5432:db:5432 jump` | Local port forward through jump host |
| `ssh -J jump prod` / `ssh -D 1080 jump` | Jump directly / SOCKS5 proxy |
| `ssh-keygen -t ed25519` | Generate modern SSH key |
| `ssh-copy-id -i ~/.ssh/key.pub host` | Install public key (passwordless login) |
| `chmod 755` / `chmod 644` / `chmod 600` | Executable (rwxr-xr-x) / world-readable file (rw-r--r--) / owner-only, as required for private keys (rw-------) |
| `chown user:group file` | Change owner:group |

Firewall & Scheduling

| Command | Purpose |
| --- | --- |
| `ufw allow 443/tcp` / `ufw status` | Ubuntu/Debian firewall |
| `firewall-cmd --permanent --add-port=8080/tcp && firewall-cmd --reload` | RHEL/CentOS firewall |
| `crontab -e` / `crontab -l` | Edit/list cron jobs (`0 2 * * *` = daily 2 AM) |
| `systemctl list-timers` | List systemd timers (a cron alternative whose output lands in the journal) |

Cron gotcha: Minimal $PATH — use absolute paths.
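
One way to catch PATH problems before the job is scheduled is to run it under a stripped-down environment. This is a sketch only: `run_like_cron` is a hypothetical helper, and the PATH below is an assumption, since cron's real defaults vary by distribution.

```shell
# Run a command with roughly the environment cron provides: a short PATH,
# /bin/sh as the shell, and none of your login-shell variables.
# run_like_cron is a hypothetical helper; real cron defaults vary by distro.
run_like_cron() {
  env -i HOME="${HOME:-/}" PATH=/usr/bin:/bin SHELL=/bin/sh \
    /bin/sh -c "$1"
}

# Usage (script path is an example):
#   run_like_cron '/usr/local/bin/backup.sh'
```

If the script works in your shell but fails here with "command not found", it will fail the same way under cron.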

Disk & Kernel Tuning

[Linux kernel docs]
| Command | Purpose |
| --- | --- |
| `lsblk -f` / `df -h` | Disk tree / filesystem usage |
| `du -sh /path` | Directory size (find bloated dirs) |
| `mount /dev/sdb /mnt` / `mount -a` | Mount disk / mount all from fstab |
| `sysctl net.core.somaxconn` / `sysctl -w net.core.somaxconn=65535` | View/set kernel parameter |
| `ulimit -n` / `cat /proc/<PID>/limits` | Check max open files / per-process limits |

Key kernel tunables: net.core.somaxconn=65535, fs.file-max=2097152 (about 2M; sysctl takes a plain integer, not a suffix), vm.swappiness=10

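Note: Open-file limits are enforced at several layers, and the lowest applicable one wins. /etc/security/limits.conf is applied by PAM to login sessions only; systemd services take their limit from LimitNOFILE in the unit file. fs.nr_open caps what any single process may request, and fs.file-max caps open file handles system-wide.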
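
To see which limit a running process actually got, read it back from /proc rather than trusting the config files. A minimal sketch, assuming the standard Linux /proc/<PID>/limits layout; `nofile_limits` is a hypothetical helper and PID 1234 is a placeholder.

```shell
# Print "soft hard" for the open-files limit from a /proc/<PID>/limits dump on stdin.
# nofile_limits is a hypothetical helper name.
nofile_limits() {
  awk '/^Max open files/ { print $4, $5 }'
}

# Usage on Linux (PID 1234 is a placeholder):
#   nofile_limits < /proc/1234/limits
#   ls /proc/1234/fd | wc -l    # descriptors currently open, to compare with the soft limit
```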

Advanced Debugging

[Linux kernel docs]
| Command | Purpose |
| --- | --- |
| `strace -p <PID>` / `strace -e trace=network` | Trace syscalls / network only |
| `strace -c -p <PID>` | Count syscalls (Ctrl+C for summary) |
| `lsof -p <PID>` / `lsof -i :8080` | What's open / which process owns port? |
| `perf record -g -p <PID> -- sleep 30 && perf report` | CPU profiling with call graphs (flame graphs need the separate FlameGraph scripts) |
| `tcpdump -i any port 5432 -w /tmp/cap.pcap` | Capture packets (open in Wireshark) |
| `openssl s_client -connect host:443 </dev/null \| openssl x509 -noout -dates` | Check TLS cert expiry |
| `tar -czf backup.tar.gz /data` / `tar -xzf backup.tar.gz` | Archive / extract |
| `set -euo pipefail` | Shell safety (use at top of Bash scripts) |

When to use: strace = syscalls. lsof = open state. perf = CPU hotspots. tcpdump = packets.

When to use what

| Task | Modern Choice | Legacy Alternative | When Legacy Wins |
| --- | --- | --- | --- |
| Search files for text | ripgrep (`rg`) | `grep -r` | grep is built-in; rg requires installation; grep is sufficient for small codebases |
| Find files by name | `fd` | `find` | fd is faster; find is built-in; find works when fd not installed |
| Locate file by name (indexed) | `locate` | `updatedb && locate` | locate is fast if DB updated; doesn't work if file added today |
| Monitor processes live | `htop` or `btop` | `top` | htop/btop have better UI; top is always available; use top in scripts |
| Monitor resources interactively | `btop` | `top` | btop prettier, more intuitive; top is universally available |
| HTTP requests from CLI | httpie (`http`) | `curl` | httpie syntax cleaner; curl is almost always installed; use curl for scripts and automation |
| Manage systemd services | `systemctl` | `service` | systemctl standard; service still works (translates to systemctl); both equivalent |
| View service logs | `journalctl -u service -f` | `tail -f /var/log/...` | journalctl structured; older systems use syslog files; journalctl is standard on modern Linux |
| Connection inspection | `ss -tnp` | `netstat -tulnp` | ss is faster, netstat deprecated; netstat still works on legacy systems |
| Process inspection by port | `lsof -i :8080` | `netstat -tulnp \| grep 8080` | lsof is clearer; netstat more portable to BSD; both work |

fd note: On Debian/Ubuntu the apt package is fd-find and the binary installs as fdfind (the fd name collides with an existing package). Run it as fdfind, or symlink it: ln -s "$(command -v fdfind)" ~/.local/bin/fd (make sure ~/.local/bin is on your $PATH).

Gotchas that bite in production

  1. rm -rf "$VAR"/* with an empty variable expands to rm -rf /* (deletes everything)

    • You're cleaning up build output: rm -rf "$BUILD_DIR"/*. BUILD_DIR is unset or empty, so the command becomes rm -rf /*. GNU rm refuses a literal rm -rf / unless given --no-preserve-root, but /* expands to every top-level directory and nothing stops it. Within seconds the server is a brick. (A bare rm -rf $VAR with an empty variable is harmless: it expands to rm -rf with no arguments.)
    • Fix: Use rm -rf "${VAR:?}"/*, which aborts with an error if the variable is unset or empty. set -u at the top of scripts catches unset variables but not empty ones. Or check before deletion: test -n "$VAR" && rm -rf "/tmp/build-$VAR".
  2. OOM-killed process is silent; exits with code 137 but logs nothing

    • App is leaking memory, kernel OOM-killer terminates it. No stack trace, no "OutOfMemory" message. Process just dies. Systemd restarts it silently. You see exit code 137 in logs but assume it's a crash.
    • Fix: Check kernel logs: dmesg -T | grep -i oom or journalctl -k | grep -i oom. Cap the service with a cgroup limit so a leak is contained to that unit and systemd records the kill (Result: oom-kill in systemctl status): systemctl set-property myapp.service MemoryMax=512M (MemoryLimit= is the deprecated cgroup v1 name). Monitor free -h and ps aux --sort=-%mem during load testing to catch leaks early.
  3. Cron jobs fail silently because $PATH is minimal (missing /usr/local/bin)

    • You add a cron: 0 2 * * * /usr/local/bin/backup.sh. Cron's $PATH is /usr/bin:/bin. Script calls jq (installed at /usr/local/bin/jq). Cron job silently fails. Backups never run for 2 weeks. You only notice during audit.
    • Fix: Use absolute paths in scripts: /usr/local/bin/jq not jq. Or set PATH at the top of crontab: PATH=/usr/local/bin:/usr/bin:/bin. Test under a cron-like environment before scheduling: env -i HOME="$HOME" PATH=/usr/bin:/bin SHELL=/bin/sh /bin/sh -c '/usr/local/bin/backup.sh'.
  4. ulimit changes don't stick; limits come from different layers (limits.conf, systemd, sysctl)

    • You set ulimit -n 65535 in a shell. It works. You restart the service. Back to default 1024. You edit /etc/security/limits.conf, restart, still 1024. limits.conf is applied by PAM at login, so a systemd service never reads it; the service's limit comes from LimitNOFILE in its unit, capped by the kernel's fs.nr_open, with fs.file-max bounding the system-wide total.
    • Fix: Set each layer that applies. For the service: run systemctl edit myapp.service, add LimitNOFILE=65535 under [Service], then systemctl restart myapp (systemctl set-property only handles resource-control settings, not LimitNOFILE). For login shells: add app soft nofile 65535 and app hard nofile 65535 to /etc/security/limits.conf. For the system-wide ceiling: echo "fs.file-max = 2097152" >> /etc/sysctl.conf && sysctl -p. Verify: cat /proc/<PID>/limits shows actual limits in effect.
  5. SSH key with permissions looser than 600 is rejected

    • You copy a private key to a new machine with default umask 0022. Key is -rw-r--r-- (644). SSH warns "UNPROTECTED PRIVATE KEY FILE!" and "Permissions 0644 for '/home/user/.ssh/id_ed25519' are too open", ignores the key, and moves on to the next authentication method. If the warning scrolls past, you spend an hour thinking there's a key mismatch or agent issue.
    • Fix: Always chmod 600 the private key after copying it (for example chmod 600 ~/.ssh/id_ed25519). Or set umask before copying: umask 077 && cp key ~/.ssh/. SSH won't use private keys that group or other users can read.
  6. Port is listening but unreachable from outside; firewall check fails in one direction

    • Service listens on port 8080: ss -tlnp | grep 8080 shows it. External nc -zv host 8080 times out. You think it's a network issue. Actually, you checked the inbound firewall on the target but not the outbound rules on the source machine (or the service is bound to 127.0.0.1 only).
    • Fix: In the ss output, confirm the local address is 0.0.0.0:8080, [::]:8080, or *:8080 rather than 127.0.0.1:8080. On the target, allow inbound: sudo ufw allow in 8080/tcp (Ubuntu) or firewall-cmd (RHEL). On the source, if outbound traffic is filtered, allow it there: sudo ufw allow out 8080/tcp. Then retest from the source: nc -zv -w 5 host 8080. If it still times out, recheck both the target (inbound rule) and the source (outbound rule); many networks restrict outbound ports.
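
The `${VAR:?}` guard from the first gotcha can be wrapped so that an empty value can never reach rm. A minimal sketch: `clean_build` and the build-<id> directory layout are illustrative assumptions, and the function body runs in a subshell so that the guard's failure does not exit the calling script.

```shell
# Delete one build directory, refusing to run when the ID is unset or empty.
# clean_build and the build-<id> layout are illustrative assumptions.
clean_build() (
  build_id=${1:?refusing to delete: build id is empty}
  base=${2:-/tmp}
  # ${base:?} also guards the prefix, so the path can never collapse to "/build-..."
  rm -rf -- "${base:?}/build-${build_id}"
)

# Usage:
#   clean_build "$BUILD_ID"             # removes /tmp/build-$BUILD_ID
#   clean_build "$BUILD_ID" /var/tmp    # same, under another base directory
```

With an empty ID the function prints the refusal message to stderr and returns non-zero without calling rm at all.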

Common Gotchas

  • rm -rf "$DIR"/* with an empty variable: expands to rm -rf /*. Guard with "${DIR:?}".
  • OOM-killed process (silent): check dmesg | grep -i oom. The kernel does not write to application logs.
  • Cron "command not found": minimal $PATH in cron. Use absolute paths.
  • ulimit doesn't stick: systemd services ignore limits.conf. Set LimitNOFILE in the unit, then check fs.nr_open and fs.file-max.
  • SSH key permissions looser than 600: SSH refuses them. chmod 600 ~/.ssh/id_ed25519.
  • Port open but unreachable: check the firewall in both directions (inbound on the target, outbound on the source).
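
Exit code 137 is easier to recognise once it is decoded: by shell convention, a status above 128 means the process was terminated by signal (status - 128). The sketch below applies that convention; `describe_exit` is a hypothetical helper name.

```shell
# Translate a shell exit status into a readable cause.
# Statuses 129-192 conventionally mean "terminated by signal (status - 128)".
describe_exit() {
  status=$1
  if [ "$status" -gt 128 ] && [ "$status" -le 192 ]; then
    sig=$((status - 128))
    echo "killed by signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "exited with status $status"
  fi
}

# Usage:
#   ./myapp; describe_exit $?
```

A SIGKILL result with no matching entry in the application log is the cue to check the kernel log for the OOM-killer.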

Performance triage in 90 seconds

When the on-call alert hits, you have time for one connect, one paste, one decision. The five commands below answer the questions "is the box healthy" and "where is the bottleneck" before you've finished reading the alert. Run them top-to-bottom, look at the output once, and route to the right deep dive.

# 1. top: what is hot right now (one batch-mode snapshot, sorted by CPU)
top -b -n 1 -o %CPU | head -20

# 2. vmstat: run-queue depth, context switches, swap activity (5 samples, 1s apart)
vmstat 1 5

# 3. iostat: per-device queue depth, await, %util (the disk truth)
iostat -xz 1 3

# 4. ss: socket counts by state, then a sample of established connections with owners
ss -s && ss -tnp state established | head

# 5. dmesg: last 30 kernel warnings and errors (OOM-killer, segfaults, NIC resets, EXT4 errors)
dmesg -T --level=err,warn | tail -30

What each line tells you:

  • top reveals the heavy process and the load average. Load above the core count suggests CPU saturation (on Linux, load also counts tasks blocked on I/O, so confirm with vmstat).
  • vmstat's r column is the run-queue depth; anything sustained over nproc is CPU-starved. Ignore the first sample, which is an average since boot. Non-zero si/so means the box is swapping: stop reading and kill the leaker, or reboot if you must.
  • iostat's %util near 100 with await over 50ms is disk-bound; r_await > w_await points at read-heavy hotspots.
  • ss -s shows the socket summary. If synrecv or timewait keeps climbing, you are TCP-saturated, not application-slow.
  • dmesg (or journalctl -k) is the only place the kernel admits it killed your process.
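
The two vmstat rules above (run queue over the core count, any swap traffic) are simple enough to check mechanically. A minimal sketch, assuming the default procps `vmstat` column order (r first, si and so in columns 7 and 8); `vmstat_flags` is a hypothetical helper.

```shell
# Flag CPU saturation and swapping in `vmstat 1 N` output read from stdin.
# vmstat_flags is a hypothetical helper; pass the core count as $1.
vmstat_flags() {
  awk -v cores="${1:-1}" '
    NR <= 3 { next }    # two header rows, then the since-boot average row
    $1 > cores  { print "sample " (NR - 3) ": run queue " $1 " exceeds " cores " cores" }
    $7 + $8 > 0 { print "sample " (NR - 3) ": swapping (si=" $7 ", so=" $8 ")" }
  '
}

# Usage:
#   vmstat 1 5 | vmstat_flags "$(nproc)"
```

No output means none of the live samples crossed either threshold.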

perf for production CPU profiles

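When top blames a process but you need to know which function inside it is burning CPU, perf samples the running stacks at 99 Hz and the FlameGraph scripts turn those samples into a flame graph. Read it bottom-up: each frame sits on top of its caller, width is the share of samples, and the top edge shows what was actually on-CPU. The workflow below profiles PID 1234 for 30 seconds, then renders an SVG you can open in any browser. It assumes the FlameGraph repository is checked out at /opt/FlameGraph.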

# Capture stack samples (-g = call graphs, -F 99 = 99 Hz, --all-user = skip kernel time)
sudo perf record -F 99 -g --all-user -p 1234 -- sleep 30
 
# Fold stacks into the format flamegraph.pl expects
sudo perf script | /opt/FlameGraph/stackcollapse-perf.pl > out.folded
 
# Render the SVG (one self-contained file, no server needed)
/opt/FlameGraph/flamegraph.pl out.folded > flame.svg

Wide plateaus along the top edge are the functions actually consuming CPU; tall narrow towers are deep call chains that are not hot. Add -e cycles:pp for more precise attribution on Intel CPUs where sample skid blurs tight loops, and --call-graph dwarf if your binaries lack frame pointers (C and C++ builds compiled with -fomit-frame-pointer are the usual offender; Go has kept frame pointers on amd64 since 1.7).
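
When there is no browser at hand, the folded file already contains the answer in text form. Each line of out.folded is a semicolon-separated stack followed by a sample count, and the last frame is the one that was on-CPU. The sketch below sums samples by that leaf frame; `hottest_leaves` is a hypothetical helper, not part of the FlameGraph repository.

```shell
# Sum sample counts by leaf frame from folded stacks ("a;b;c 42") on stdin.
# hottest_leaves is a hypothetical helper; $1 limits the number of rows.
hottest_leaves() {
  awk '{
         count = $NF                  # last field is the sample count
         sub(/ [0-9]+$/, "")          # strip the count, leaving only the stack
         n = split($0, frames, ";")
         total[frames[n]] += count    # the last frame was on-CPU
       }
       END { for (f in total) print total[f], f }' |
    sort -rn | head -n "${1:-10}"
}

# Usage:
#   hottest_leaves 5 < out.folded
```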

eBPF tracing without recompiling

bpftrace attaches probes to live kernel and userspace functions without restarting the process — so you can answer "which file is the slow read" or "which syscalls is PID 1234 actually making" on a production box that you cannot recompile. The two one-liners below cover the most common questions.

# Histogram of read() latency by PID — find the slow reader
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
  tracepoint:syscalls:sys_exit_read /@start[tid]/
  { @ns[pid, comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
 
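# Block I/O latency by device, issue to completion (keyed by device + sector so
# concurrent requests on one device do not overwrite each other's start time)
sudo bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
  tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/
  { @us[args->dev] = hist((nsecs - @start[args->dev, args->sector]) / 1000);
    delete(@start[args->dev, args->sector]); }'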

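Press Ctrl+C and bpftrace prints power-of-two histograms. The bucket where most events land is your typical latency, which the average hides. Overhead is usually low because aggregation happens in the kernel, though probes on very hot paths such as read() add measurable cost on busy boxes, so keep sessions short. The probes detach cleanly on exit, and unlike strace -c there is no ptrace stop on every syscall, which is why bpftrace is preferred for production tracing. [Linux kernel docs]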

Frequently Asked Questions

What's filling the disk?

Run du -sh /* to find the fat directory, then du -sh /path/* recursively to drill in. Typical culprits are /var/log, /var/cache, and Docker's /var/lib/docker.
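
The drill-down can be scripted so each step shows only the largest children. A minimal sketch, assuming a du that supports -d (GNU coreutils, BusyBox, and BSD all do); `biggest_dirs` is a hypothetical helper and sizes are in KiB.

```shell
# List the largest immediate children of a directory, biggest first (sizes in KiB).
# biggest_dirs is a hypothetical helper; -x keeps du on a single filesystem.
biggest_dirs() {
  dir=${1:-.}
  count=${2:-10}
  du -xk -d 1 "$dir" 2>/dev/null |
    awk -F '\t' -v top="$dir" '$2 != top' |    # drop the total line for the directory itself
    sort -rn |
    head -n "$count"
}

# Usage:
#   biggest_dirs /var 5
#   biggest_dirs /var/lib 5    # repeat on the largest entry to drill in
```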

How do I kill a zombie process?

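You can't kill a zombie directly: it is already dead and is only waiting for its parent to reap it. Find the parent with ps -o ppid= -p <ZOMBIE>. Try kill -s CHLD <PARENT> first to prompt a reap; if the parent ignores it, restarting or killing the parent (kill <PARENT>, then kill -9 as a last resort) makes init adopt and reap the zombie.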
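
To find every zombie and its parent in one pass, filter on the process state column. A minimal sketch, assuming `ps -eo pid,ppid,stat,comm` output where a state starting with Z marks a zombie; `zombie_parents` is a hypothetical helper.

```shell
# List zombie processes with their parent PID.
# Reads `ps -eo pid,ppid,stat,comm` output on stdin; zombie_parents is a hypothetical helper.
zombie_parents() {
  awk 'NR > 1 && $3 ~ /^Z/ { print "zombie " $1 " (" $4 ") parent " $2 }'
}

# Usage:
#   ps -eo pid,ppid,stat,comm | zombie_parents
```

Many zombies sharing one parent PID point at a parent that never calls wait(), which is the bug to fix.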

Is a port open from outside the host?

On the server: ss -tlnp | grep :8080 shows whether anything is listening. From outside: nc -zv host 8080 probes connectivity. If the server listens but the probe fails, check the firewall in both directions.
