Latency spiked on a production service. What's the first command to run?

Run ss -tnp | wc -l to count TCP connections. If the count is abnormally high (10k+ when expecting 1k), you likely have a connection leak. Then lsof -a -i -p <PID> shows which sockets the suspect process holds.

How do I find which process is using a specific port?

lsof -i :8080 shows all processes listening on or connected to port 8080. ss -tanp | grep :8080 is faster (the -a includes listening sockets, which plain ss -tnp omits). Use kill -9 only after confirming the process is safe to kill.
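When a script needs just the PIDs that hold a port, the grep output still has to be picked apart. The sketch below assumes the usual column layout of ss -tlnp from iproute2 (local address in the fourth column, owners listed as pid=N in the last one); pids_on_port is a hypothetical helper name, not part of any tool.

```shell
# pids_on_port PORT: reads `ss -tlnp` output on stdin, prints the owning PIDs.
pids_on_port() {
  awk -v port="$1" '
    # Skip the header; match the port exactly at the end of the local address.
    NR > 1 && $4 ~ (":" port "$") {
      rest = $0
      # A socket can be shared by several processes, so collect every pid=N.
      while (match(rest, /pid=[0-9]+/)) {
        print substr(rest, RSTART + 4, RLENGTH - 4)
        rest = substr(rest, RSTART + RLENGTH)
      }
    }' | sort -un
}
```

Typical use would be sudo ss -tlnp | pids_on_port 8080; without root, ss only shows owners for your own processes.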

My application logs are clean but the process crashed. Where are the error logs?

Check journalctl -u myapp -n 50 for systemd logs. If the host rebooted after the crash, use journalctl -u myapp -b -1 to see logs from the previous boot (this needs a persistent journal). dmesg | grep -i oom catches kernel OOM-killer events that applications never log.

How do I track down which file is consuming all disk space?

Use du -sh /* to list top-level directory sizes. Then drill down: du -sh /var/log/* to find the culprit. find / -size +500M 2>/dev/null lists large files system-wide (slow on large disks).
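The drill-down is easier to repeat when the output is already ranked. A minimal sketch, assuming a du that accepts -d (GNU coreutils, BSD, and busybox do); biggest_children is a hypothetical helper name.

```shell
# biggest_children DIR [N]: the N largest subdirectories of DIR, sizes in KiB.
biggest_children() {
  local dir="${1:?usage: biggest_children DIR [N]}"
  local n="${2:-10}"
  # -x stays on one filesystem, -k reports KiB, -d 1 stops at direct children.
  # The first output line is DIR itself (the total), hence n + 1 lines.
  du -xk -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$((n + 1))"
}
```

Run it on /var, then again on whichever child tops the list, until a single directory accounts for the growth.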

I need to debug an API call that's taking too long. How do I measure each component?

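Pass curl a format on stdin and it prints one timer per line:

curl -w '@-' -o /dev/null -s <URL> <<'EOF'
time_namelookup:    %{time_namelookup}
time_connect:       %{time_connect}
time_appconnect:    %{time_appconnect}
time_starttransfer: %{time_starttransfer}
time_total:         %{time_total}
EOF

This breaks the request into DNS lookup, TCP connect, TLS handshake, and server response time. Each value is cumulative, measured in seconds from the start of the request.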
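Because curl's timers are cumulative, the time spent in each phase is the difference between neighbouring timers. The sketch below does that subtraction; curl_phases is a hypothetical helper, and the "server" label is shorthand for the wait between the connection being ready and the first response byte.

```shell
# curl_phases NAMELOOKUP CONNECT APPCONNECT STARTTRANSFER TOTAL
# Arguments are the cumulative seconds printed by curl -w; output is per-phase milliseconds.
curl_phases() {
  awk -v dns="$1" -v conn="$2" -v tls="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
    # For plain HTTP curl reports time_appconnect as 0, so the TLS phase is empty.
    tls_end = (tls + 0 > 0) ? tls : conn
    printf "dns=%.1fms tcp=%.1fms tls=%.1fms server=%.1fms transfer=%.1fms\n",
      dns * 1000, (conn - dns) * 1000, (tls_end - conn) * 1000,
      (ttfb - tls_end) * 1000, (total - ttfb) * 1000
  }'
}
```

For example, curl_phases 0.010 0.030 0.210 0.250 0.260 prints dns=10.0ms tcp=20.0ms tls=180.0ms server=40.0ms transfer=10.0ms, which makes the handshake the obvious suspect.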

Essential Linux Commands: A Backend Engineer's Cheat Sheet

BackendBytes Engineering Team

Feb 10, 2026

Key Takeaways

→`ss -tnp | wc -l` revealed 48,000 orphaned TCP connections when latency spiked from 50ms to 2,000ms — higher than expected connection counts hide exhausted pools and leaked sockets
→`lsof -a -i -p <PID>` shows what file descriptors and sockets a process holds — a missing `Close()` call leaks connections; seeing 5,000+ sockets per PID confirms it
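→`curl -w "%{time_total}s"` reports total request time; adding `time_namelookup`, `time_connect`, `time_appconnect`, and `time_starttransfer` splits it into DNS + TLS + server time. When the dashboard shows 200ms and curl shows 180ms of it in the TLS handshake, the handshake is the real bottleneck
→`journalctl -u <service> -b -1` shows logs from the previous boot: when a failure ended in a reboot, the current boot's logs start after the error, and boot -1 contains it (requires a persistent journal)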
→`dmesg | grep -i oom` or `journalctl -k` catches kernel OOM-killer activity that application logs never mention — the process was forcibly killed, not error-logged

Production Debugging: When You SSH Into a Burning Server

An e-commerce API's latency spiked 50ms → 2,000ms. Dashboards showed nothing wrong. One ss -tnp | wc -l: 48,000 TCP connections. One lsof -i command revealed a missing Close() call in a recent deploy. Finding it required knowing which Linux commands to reach for.
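A raw connection count says something is wrong but not what. One plausible next step is to group the same ss output by TCP state and by peer address. The helpers below are a sketch with hypothetical names, assuming the default ss -tan columns (state first, peer address fifth).

```shell
# Both helpers read `ss -tan` output on stdin.

# count_by_state: prints "<count> <state>", most common state first.
count_by_state() {
  awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}

# top_peers [N]: prints "<count> <peer address>" for the N busiest peers.
top_peers() {
  awk 'NR > 1 { peer = $5; sub(/:[^:]*$/, "", peer); n[peer]++ }  # strip the port
       END { for (p in n) print n[p], p }' | sort -rn | head -n "${1:-10}"
}
```

A pile of ESTAB sockets toward one backend points at a pool that never releases connections; a pile of TIME-WAIT or CLOSE-WAIT points at connections opened and dropped per request.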

TL;DR

ss -tnp: TCP connections with process owner (reveals hung clients, exhausted pools)
lsof -a -i -p <PID>: which sockets is the process holding?
top -o %CPU: which process is the CPU hog?
journalctl -u <service> -n 50: what did the service log?
curl -w "%{time_total}s": total API latency (add time_namelookup, time_appconnect, and time_starttransfer for the DNS + TLS + server breakdown)

Triage by Symptom

When you SSH into a burning server, the question is "where is the bottleneck" — not "what does each command do." Route by symptom, not by command name:

graph TD
    Burning[Server is on fire] --> Sym{What is<br/>the symptom?}
    Sym -->|High CPU| CPU[top -o %CPU<br/>+ pidstat -u 1<br/>+ perf top]
    Sym -->|High memory| Mem[free -h<br/>+ ps aux --sort=-%mem<br/>+ smem]
    Sym -->|Disk full or slow| Disk[df -h<br/>+ du -sh */<br/>+ iostat -x 1]
    Sym -->|Network slow or stuck| Net[ss -tnp<br/>+ ss -s<br/>+ tcpdump -i any port X]
    Sym -->|Process stuck| Proc[strace -p PID<br/>+ lsof -p PID<br/>+ cat /proc/PID/wchan]
    Sym -->|Service down| Svc[systemctl status<br/>+ journalctl -u svc -b<br/>+ journalctl -p err]
    CPU -->|Single hot core| Hot[mpstat -P ALL 1<br/>find which thread]
    Mem -->|OOMKilled| OOM["dmesg | grep -i kill<br/>find which proc died"]
    Net -->|TCP exhaustion| TCP[ss -s shows TIME-WAIT spike<br/>fix: close, pool, or kernel knobs]
    style Burning fill:#fdd
    style OOM fill:#fdd
    style TCP fill:#fdd
    style Hot fill:#ffd

The diagram is the entire production-debugging discipline in one picture: classify the symptom, pick three commands, never start with "let me run htop and see what's happening."

The Essential Dozen

Command	Use Case
`ss -tnp`	TCP connections by process; `ss -tnp state established` filters by state (ss prints the state as ESTAB, so grepping for ESTABLISHED matches nothing)
`lsof -i :8080`	What process owns a port?
`top -o %CPU` / `htop`	CPU/memory per process
`journalctl -u nginx -n 50`	Service logs
`grep "ERROR" app.log | wc -l`	Count matching lines
`awk '{print $1}' file | sort | uniq -c`	Frequency analysis
`find / -size +100M 2>/dev/null`	Large files
`du -sh /path` / `df -h`	Disk usage
`curl -w "%{time_total}s"`	API latency
`nc -zv host port`	Firewall test
`ssh -L 5432:db:5432 jump`	Port forward
`dmesg | grep -i oom`	OOM kill check

File & Text Inspection

^{[Linux kernel docs]}

Command	Example
`ls -lt` / `stat /path`	`ls -lt` → sort by modification time; `stat /etc/nginx.conf` → metadata
`find . -name "*.log" -mtime -1` / `find . -size +100M`	Find files by name/date/size
`tail -f /var/log/app.log` / `less`	Follow logs / pagewise view
`grep "ERROR" app.log | wc -l`	Count lines matching pattern
`awk '{print $1, $3}'` / `cut -d',' -f1,3`	Extract columns from text
`sort | uniq -c | sort -rn`	Deduplicate and rank by frequency
`sed 's/old/new/g'`	Stream replace (use `sed -i.bak` to edit in-place)
`jq '.field'` / `jq '.data[] | .id'`	Parse JSON on CLI
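The awk, sort, and uniq rows above are most often chained into one frequency ranking. A small sketch of that chain, with a hypothetical helper name:

```shell
# top_values FIELD [N]: rank the N most frequent values of a
# whitespace-separated field read from stdin.
top_values() {
  awk -v f="${1:?usage: top_values FIELD [N]}" '{ print $f }' |
    sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

For an access log with the client address in the first column, top_values 1 5 < access.log lists the five noisiest clients.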

Process & Performance Monitoring

^{[Linux kernel docs]}

The pid → process tree → resource view in one diagram — every triage starts at one of these vertices:

graph TB
    PID[PID 1234] --> Top[top / htop<br/>CPU + memory ranking]
    PID --> Files[lsof -p 1234<br/>open fds, sockets, files]
    PID --> Stack[strace / py-spy / pprof<br/>syscall + stack profile]
    PID --> Tree[pstree -p 1234<br/>parent + children + threads]
    Tree --> Children[Child PIDs<br/>recursive triage]
    System[System-wide<br/>aggregates] --> VM[vmstat 1<br/>cpu, mem, swap, io]
    System --> Free[free -h<br/>RAM + swap totals]
    System --> Disk[df -h<br/>du -sh path<br/>iostat -x]
    System --> Kern[dmesg + journalctl -k<br/>OOM-killer, page faults,<br/>kernel panics]
    Kern -.->|grep -i 'killed process'| OOM[OOM-killer victim<br/>+ rss at kill time]
    style OOM fill:#fdd
    style Top fill:#dfd
    style VM fill:#dfd

Command	Example
`top -o %CPU` / `htop` / `ps aux --sort=-%cpu`	Find CPU hogs; `htop` has better UI
`pgrep nginx` / `pidof nginx`	Find process by name for killing/tracing
`lsof -p <PID>` / `lsof -i :8080`	What's open / which process owns a port?
`vmstat 1 10` / `free -h` / `iostat -x`	Memory pressure, swap, I/O load
`df -h` / `du -sh /path`	Filesystem full? Which directory is large?
`dmesg | grep -i oom` / `journalctl -k`	Kernel logs (OOM-killer, crashes)

Service Management (systemd & journalctl)

^{[systemd manual]}

Command	Example
`systemctl status <service>`	Status and recent logs; `systemctl restart nginx` to cycle
`systemctl enable/disable <service>`	Start on boot / don't start on boot
`journalctl -u <service> -n 50` / `journalctl -u <service> -f`	Last 50 lines / follow logs live
`journalctl -u <service> -b -1`	Logs from previous boot (diagnose failures that preceded a reboot)
`journalctl -k` / `dmesg`	Kernel messages (OOM, driver crashes)
`systemctl --failed`	List crashed services

Crashed service: check systemctl status app, then journalctl -u app -n 100, then journalctl -u app -b -1 if the host has rebooted since the failure.

Networking & Connectivity

^{[iproute2 (ss/ip)]}

Command	Example
`ss -tulnp` / `ss -tnp`	Listening ports / all TCP connections
`ping host` / `traceroute host`	Basic connectivity / show route to destination
`dig +short host` / `dig host MX`	DNS lookup / specific record types
`nc -zv host port`	Firewall test (is port open/reachable?)
`curl -w "%{time_total}s" -o /dev/null`	API latency (DNS + TLS + server); `curl -w "@-"` for breakdown
`curl -X POST -H 'Content-Type: application/json' -d '{"key":"val"}'`	POST JSON to API (without the header, -d sends form-encoded data)
`ip addr show`	Local IPs and network interfaces

SSH & Permissions

Command	Example
`ssh -L 5432:db:5432 jump`	Local port forward through jump host
`ssh -J jump prod` / `ssh -D 1080 jump`	Jump directly / SOCKS5 proxy
`ssh-keygen -t ed25519`	Generate modern SSH key
`ssh-copy-id -i ~/.ssh/key.pub host`	Install public key (passwordless login)
`chmod 755` / `chmod 644` / `chmod 600`	exec / read-write / private key permissions
`chown user:group file`	Change owner:group

Firewall & Scheduling

Command	Example
`ufw allow 443/tcp` / `ufw status`	Ubuntu/Debian firewall
`firewall-cmd --permanent --add-port=8080/tcp && firewall-cmd --reload`	RHEL/CentOS firewall
`crontab -e` / `crontab -l`	Edit/list cron jobs (`0 2 * * *` = daily 2 AM)
`systemctl list-timers`	Modern systemd timers (better than cron)

Cron gotcha: Minimal $PATH — use absolute paths.
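One way to catch the $PATH problem before the schedule does is to run the job under an environment as bare as cron's. The sketch below assumes cron's PATH is /usr/bin:/bin, which varies by distribution; run_like_cron is a hypothetical helper name.

```shell
# run_like_cron 'COMMAND': run COMMAND with a stripped-down, cron-like environment.
run_like_cron() {
  # env -i clears everything; only the variables listed here are passed through.
  # PATH value is an assumption; check your cron's documentation for the real default.
  env -i HOME="${HOME:-/}" LOGNAME="${LOGNAME:-$(id -un)}" \
    PATH=/usr/bin:/bin SHELL=/bin/sh \
    /bin/sh -c "$1"
}
```

If run_like_cron '/usr/local/bin/backup.sh' fails with "command not found", the scheduled job will fail the same way.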

Disk & Kernel Tuning

^{[Linux kernel docs]}

Command	Example
`lsblk -f` / `df -h`	Disk tree / filesystem usage
`du -sh /path`	Directory size (find bloated dirs)
`mount /dev/sdb /mnt` / `mount -a`	Mount disk / mount all from fstab
`sysctl net.core.somaxconn` / `sysctl -w net.core.somaxconn=65535`	View/set kernel parameter
`ulimit -n` / `cat /proc/<PID>/limits`	Check max open files / per-process limits

Key kernel tunables: net.core.somaxconn=65535, fs.file-max=2097152, vm.swappiness=10

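Note: File-descriptor limits are set at 3 layers, each with a different scope: /etc/security/limits.conf applies to PAM login sessions (not to systemd services), systemd's LimitNOFILE applies to the service, and sysctl fs.file-max caps open files system-wide. Verify the limit actually in effect with cat /proc/<PID>/limits.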
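Since the running process is the only reliable source of truth, it helps to pull just the open-files row out of /proc/<PID>/limits. A sketch, assuming the standard Linux layout of that file; nofile_limits is a hypothetical helper name.

```shell
# nofile_limits: reads the contents of /proc/<PID>/limits on stdin,
# prints "soft=<n> hard=<n>", and fails if the row is missing.
nofile_limits() {
  awk '/^Max open files/ { printf "soft=%s hard=%s\n", $4, $5; found = 1 }
       END { exit !found }'
}
```

Usage would be nofile_limits < /proc/<PID>/limits, run after every restart that was supposed to raise the limit.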

Advanced Debugging

^{[Linux kernel docs]}

Command	Example
`strace -p <PID>` / `strace -e trace=network`	Trace syscalls / network only
`strace -c -p <PID>`	Count syscalls (Ctrl+C for summary)
`lsof -p <PID>` / `lsof -i :8080`	What's open / which process owns port?
`perf record -g -p <PID> -- sleep 30 && perf report`	CPU profiling with call graphs (pipe perf script into FlameGraph for a flame graph)
`tcpdump -i any port 5432 -w /tmp/cap.pcap`	Capture packets (open in Wireshark)
`openssl s_client -connect host:443 </dev/null 2>/dev/null | openssl x509 -noout -dates`	Check TLS cert expiry
`tar -czf backup.tar.gz /data` / `tar -xzf backup.tar.gz`	Archive / extract
`set -euo pipefail`	Shell safety (use at top of all scripts)

When to use: strace = syscalls. lsof = open state. perf = CPU hotspots. tcpdump = packets.

When to use what

Task	Modern Choice	Legacy Alternative	When Legacy Wins
Search files for text	`ripgrep` (rg)	`grep -r`	grep is built-in; rg requires installation; grep is sufficient for small codebases
Find files by name	`fd`	`find`	fd is faster; find is built-in; find works when fd not installed
Locate file by name (indexed)	`locate`	`updatedb && locate`	locate is fast if DB updated; doesn't work if file added today
Monitor processes live	`htop` or `btop`	`top`	htop/btop have better UI; top is always available; use top in scripts
Monitor resources interactively	`btop`	`top`	btop prettier, more intuitive; top is universally available
HTTP requests from CLI	`httpie` (http)	`curl`	httpie syntax cleaner; curl is always installed; use curl for scripts and automation
Manage systemd services	`systemctl`	`service`	systemctl standard; service still works (translates to systemctl); both equivalent
View service logs	`journalctl -u service -f`	`tail -f /var/log/...`	journalctl structured; older systems use syslog files; journalctl is standard on modern Linux
Connection inspection	`ss -tnp`	`netstat -tulnp`	ss is faster, netstat deprecated; netstat still works on legacy systems
Process inspection by port	`lsof -i :8080`	`netstat -tulnp | grep 8080`	lsof is clearer; netstat more portable to BSD; both work

fd note: On Debian/Ubuntu the apt package is fd-find and the binary installs as fdfind (the fd name collides with an existing package). Run it as fdfind, or symlink it: ln -s "$(command -v fdfind)" ~/.local/bin/fd.

Gotchas that bite in production

rm -rf $VAR/ when the variable is empty expands to rm -rf / (deletes everything)
- You're cleaning up a build workspace with rm -rf $BUILD_ROOT/. Variable BUILD_ROOT is empty or unset, so the command expands to rm -rf /. GNU rm refuses a bare / unless --no-preserve-root is given, but rm -rf $BUILD_ROOT/* expands to rm -rf /*, which nothing stops. In seconds the filesystem is gone. Server is a brick. (The subtler cousin: rm -rf /tmp/build-$BUILD_ID with an unset BUILD_ID targets only a directory literally named /tmp/build-, harmless until someone moves the variable to the front of the path.)
- Fix: Always use rm -rf "${VAR:?error if unset}", which aborts with "error if unset" if the variable is unset or empty. Or use set -u at the top of scripts to error on unset variables (it does not catch empty ones). Check before deletion: test -n "$VAR" && rm -rf "/tmp/build-$VAR".
OOM-killed process is silent; exits with code 137 but logs nothing
- App is leaking memory, kernel OOM-killer terminates it. No stack trace, no "OutOfMemory" message. Process just dies. Systemd restarts it silently. You see exit code 137 in logs but assume it's a crash.
- Fix: Check kernel logs: dmesg | grep -i oom or journalctl -k | grep -i oom. Set a memory limit lower than peak usage so the app fails fast and visibly: systemctl set-property myapp.service MemoryMax=512M (MemoryLimit= is the deprecated cgroup-v1 name). Monitor free -h and ps aux --sort=-%mem during load testing to catch leaks early.
Cron jobs fail silently because $PATH is minimal (missing /usr/local/bin)
- You add a cron: 0 2 * * * /usr/local/bin/backup.sh. Cron's $PATH is /usr/bin:/bin. Script calls jq (installed at /usr/local/bin/jq). Cron job silently fails. Backups never run for 2 weeks. You only notice during audit.
- Fix: Use absolute paths in scripts: /usr/local/bin/jq not jq. Or set PATH at the top of crontab: PATH=/usr/local/bin:/usr/bin:/bin. Test under cron's minimal environment before trusting the schedule: env -i HOME="$HOME" PATH=/usr/bin:/bin /bin/sh -c '/usr/local/bin/backup.sh'.
ulimit changes don't stick; limits are set at 3 layers (limits.conf, systemd, sysctl)
- You set ulimit -n 65535 in a shell. It works. You restart the service. Back to default 1024. You edit /etc/security/limits.conf, restart, still 1024. A ulimit call only affects that shell and its children, and limits.conf is applied by PAM to login sessions, so a systemd service sees neither. It takes its limit from LimitNOFILE, under the system-wide ceiling in sysctl fs.file-max.
- Fix: Set the layer that applies to the process and keep the others consistent. System-wide: echo "fs.file-max = 2097152" >> /etc/sysctl.conf && sysctl -p (sysctl needs a plain integer; 2M is rejected). Login sessions, in /etc/security/limits.conf: app soft nofile 65535 and app hard nofile 65535 on separate lines. systemd services: run systemctl edit myapp.service, add LimitNOFILE=65535 under [Service], then systemctl restart myapp. Verify: cat /proc/<PID>/limits shows actual limits in effect.
SSH private key with permissions looser than 600 is rejected
- You copy private key to a new machine with default umask 0022. Key is -rw-r--r-- (644). SSH warns "Permissions 0644 for '/home/user/.ssh/id_ed25519' are too open" and ignores the key. If you miss the warning, you spend 1 hour thinking there's a key mismatch or agent issue.
- Fix: Always chmod 600 ~/.ssh/id_* after copying keys. Or set umask before copying: umask 077 && cp key ~/.ssh/. SSH is paranoid — it won't use keys that anyone else can read (group/world readable).
Port is listening but unreachable from outside; firewall check fails in one direction
- Service listens on port 8080: ss -tlnp | grep 8080 shows it. External nc -zv host 8080 times out. You think it's a network issue. Actually, you checked the inbound firewall but not the outbound allow on the source machine.
- Fix: Check both directions: sudo ufw allow in 8080/tcp && sudo ufw allow out 8080/tcp (Ubuntu) or firewall-cmd (RHEL). From another machine: nc -zv -w 5 host 8080. If it times out, check the firewall on BOTH the target (inbound rule) and source (outbound rule); many networks restrict outbound ports.
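For the first gotcha in this list, the guard can live in a wrapper so that individual call sites cannot forget it. The sketch below is one conservative policy (absolute paths only, never the root directory); safe_rm_rf is a hypothetical name, and a real script might also want an allow-list of parent directories.

```shell
# safe_rm_rf PATH: rm -rf that refuses empty, relative, and root paths.
safe_rm_rf() {
  local target="${1:-}"
  # Strip trailing slashes so "$EMPTY_VAR/" and "//" both reduce to "/".
  while [ "$target" != "/" ] && [ "${target%/}" != "$target" ]; do
    target="${target%/}"
  done
  case "$target" in
    "") echo "safe_rm_rf: empty path, refusing" >&2; return 1 ;;
    /)  echo "safe_rm_rf: refusing to remove /" >&2; return 1 ;;
    /*) ;;  # absolute path below root: allowed
    *)  echo "safe_rm_rf: '$target' is not absolute, refusing" >&2; return 1 ;;
  esac
  rm -rf -- "$target"
}
```

With this in place, safe_rm_rf "$BUILD_ROOT/" with an empty BUILD_ROOT returns an error instead of walking the filesystem.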

Common Gotchas

rm -rf $DIR/ with empty variable: expands to rm -rf /. Always verify the variable.
OOM-killed process (silent): check dmesg | grep -i oom. The kernel doesn't write to app logs.
Cron "command not found": minimal $PATH in cron. Use absolute paths.
ulimit doesn't stick: check 3 layers: limits.conf (login sessions), systemd LimitNOFILE (services), sysctl fs.file-max (system-wide).
SSH key permissions looser than 600: SSH refuses them. chmod 600 ~/.ssh/id_ed25519.
Port open but unreachable: check firewall both directions (inbound and outbound).
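The exit code 137 from the OOM gotcha follows the shell convention of 128 plus the signal number, so 137 is signal 9 (SIGKILL). A short sketch that decodes any status this way; describe_exit is a hypothetical helper name.

```shell
# describe_exit STATUS: interpret a shell exit status.
describe_exit() {
  local status="$1" sig name
  if [ "$status" -gt 128 ] && [ "$status" -le 192 ]; then
    sig=$((status - 128))
    # kill -l N prints the name of signal N without the SIG prefix.
    name="$(kill -l "$sig" 2>/dev/null || echo unknown)"
    echo "killed by signal $sig ($name)"
  elif [ "$status" -eq 0 ]; then
    echo "exited cleanly"
  else
    echo "exited with error status $status"
  fi
}
```

A SIGKILL only shows that something killed the process; the kernel log is still what confirms it was the OOM-killer rather than an operator or a supervisor timeout.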

Performance triage in 90 seconds

When the on-call alert hits, you have time for one connect, one paste, one decision. The five commands below answer the questions "is the box healthy" and "where is the bottleneck" before you've finished reading the alert. Run them top-to-bottom, look at the output once, and route to the right deep dive.

# 1. top: what is hot right now (one batch-mode snapshot, sorted by CPU)
top -b -n 1 -o %CPU | head -20

# 2. vmstat: run-queue depth, context switches, swap activity (5 samples, 1s apart)
vmstat 1 5

# 3. iostat: per-device queue depth, await, %util (the disk truth)
iostat -xz 1 3

# 4. ss: socket summary (established/time-wait counts), then a sample of established connections
ss -s && ss -tnp state established | head

# 5. dmesg: last 30 kernel warnings and errors (OOM-killer, segfaults, NIC resets, EXT4 errors)
dmesg -T --level=err,warn | tail -30

What each line tells you: top reveals the heavy process and load average. Load above the core count suggests CPU saturation, though on Linux load also counts tasks blocked in uninterruptible I/O, so confirm with vmstat. vmstat's r column is the run-queue depth (anything sustained over nproc is CPU-starved); ignore its first sample, which averages since boot. Non-zero si/so columns mean the box is swapping, and you should stop reading and kill the leaker or reboot. iostat's %util near 100 with await over 50ms is disk-bound; r_await > w_await points at read-heavy hotspots. ss -s shows the socket summary: if synrecv or timewait keeps growing you are TCP-saturated, not application-slow. dmesg is the only place the kernel admits it killed your process.
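The two vmstat rules of thumb above can be checked mechanically. The sketch below assumes the default procps column order (r first, si and so seventh and eighth) and reads "sustained" as "true in every sample after the first"; vmstat_flags is a hypothetical helper name.

```shell
# vmstat_flags CORES: reads `vmstat 1 N` output on stdin and flags
# CPU starvation (run queue above CORES in every sample) and swap activity.
vmstat_flags() {
  awk -v cores="$1" '
    $1 ~ /^[0-9]+$/ {                 # data rows only; both header rows are skipped
      n++
      if (n == 1) next                # first sample is the since-boot average
      if (!have || $1 + 0 < min_r) { min_r = $1 + 0; have = 1 }
      if ($7 + 0 > 0 || $8 + 0 > 0) swap = 1
    }
    END {
      if (!have) { print "need at least 2 samples"; exit 1 }
      printf "min_runq=%d cpu_starved=%s swapping=%s\n", min_r,
        (min_r > cores + 0 ? "yes" : "no"), (swap ? "yes" : "no")
    }'
}
```

Usage would be vmstat 1 5 | vmstat_flags "$(nproc)".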

perf for production CPU profiles

When top blames a process but you need to know which function inside it is burning CPU, perf samples the running stack at 99 Hz, and the FlameGraph scripts turn those samples into a flame graph: each box is a function, its width is the share of samples it appears in, and stacks grow upward from the entry point. The workflow below profiles PID 1234 for 30 seconds, then renders an SVG you can open in any browser.

# Capture stack samples (-g = call graphs, -F 99 = 99 Hz, --all-user = skip kernel time)
sudo perf record -F 99 -g --all-user -p 1234 -- sleep 30
 
# Fold stacks into the format flamegraph.pl expects
sudo perf script | /opt/FlameGraph/stackcollapse-perf.pl > out.folded
 
# Render the SVG (a single self-contained file, no server needed)
/opt/FlameGraph/flamegraph.pl out.folded > flame.svg

Wide plateaus along the top edge are the functions actually consuming CPU; tall narrow towers are deep call chains that are not hot. Add -e cycles:pp for precise sampling on Intel skid-prone loops, and --call-graph dwarf if your binaries lack frame pointers (C and C++ builds compiled with -fomit-frame-pointer, which GCC enables at most optimization levels, are the usual offender).

eBPF tracing without recompiling

bpftrace attaches probes to live kernel and userspace functions without restarting the process — so you can answer "which file is the slow read" or "which syscalls is PID 1234 actually making" on a production box that you cannot recompile. The two one-liners below cover the most common questions.

# Histogram of read() latency by PID — find the slow reader
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
  tracepoint:syscalls:sys_exit_read /@start[tid]/
  { @ns[pid, comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
 
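# Block I/O latency by device: is the SSD actually slow, or is queueing the problem?
# Keyed by (device, sector) so concurrent requests on one device do not overwrite each other.
sudo bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
  tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/
  { @us[args->dev] = hist((nsecs - @start[args->dev, args->sector]) / 1000);
    delete(@start[args->dev, args->sector]); }'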

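Press Ctrl+C and bpftrace prints power-of-two histograms. The bucket where most events land is your typical latency, which the average hides. Overhead is typically low (often sub-1% on 5.4+ kernels, more for very hot probes) because the probes run in-kernel instead of stopping the process on every syscall the way ptrace-based strace does, and they detach cleanly on exit, which is why this replaced strace -c for production tracing. ^{[Linux kernel docs]}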

Frequently Asked Questions

What's filling the disk?

Run du -sh /* to find the fat directory, then du -sh /path/* recursively to drill in. Typical culprits are /var/log, /var/cache, and Docker's /var/lib/docker.

How do I kill a zombie process?

You can't kill a zombie directly: it has already exited and only waits for its parent to reap it. Find the parent with ps -o ppid= -p <ZOMBIE>. Try kill -s CHLD <PARENT> first to prompt the parent to reap; if that fails, restart the parent or, as a last resort, kill -9 <PARENT> so that init adopts and reaps the zombie.
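When there are many zombies, it is quicker to rank parents than to look them up one at a time. A sketch, assuming ps output with PID, PPID, and state in the first three columns; zombie_parents is a hypothetical helper name.

```shell
# zombie_parents: reads `ps -eo pid=,ppid=,stat=,comm=` output on stdin,
# prints "<zombie count> <parent PID>", busiest parent first.
zombie_parents() {
  # A state starting with Z marks a zombie (defunct) process.
  awk '$3 ~ /^Z/ { n[$2]++ } END { for (p in n) print n[p], p }' | sort -rn
}
```

Usage would be ps -eo pid=,ppid=,stat=,comm= | zombie_parents; the PID at the top is the process that is failing to reap its children.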

Is a port open from outside the host?

On the server: ss -tlnp | grep :8080 shows whether anything is listening. From outside: nc -zv host 8080 probes connectivity. If the server listens but the probe fails, check the firewall in both directions.

Keep Reading

Essential Docker Commands Cheat Sheet — Container lifecycle and debugging.
Essential Kubernetes Commands Cheat Sheet — When the burning service is running on Kubernetes, route through kubectl first.
The 3 Pillars of Observability — Metrics, logs, traces when CLI tools aren't enough.
SRE: SLOs, SLIs, and Error Budgets — When triage is over, the next question is "did this burn an error budget."
Go Graceful Shutdown in Production — The application-side counterpart to journalctl -u svc -b after a crash loop.
