- Debug in a loop of hypothesizing and experimenting.
- Mentally model the system; hypothesize what's wrong if you can.
- Imagine what data would most precisely confirm, disprove, or refine our theory.
- Get that data.
- Confirm, reject, or refine the model.
- In the most orderly debugging processes, we steadily refine our hypothesis by collecting data of intuitive value.
- Logs
- e.g., syslog, systemd journal, Splunk, Kibana, /var/log, docker logs, database logs, kubectl logs.
- try filtering by process name, service name, hostname, screen, username, entity ID, time.
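As a rough illustration of filtering by service, entity ID, and time, here is a minimal Python sketch. The log layout (`timestamp host service: message`), the sample lines, and the `filter_logs` helper are all assumptions made up for this example; real log formats differ, and tools such as Splunk or Kibana have their own query languages.

```python
from datetime import datetime

# Hypothetical log lines in an assumed "ISO-timestamp host service: message" layout.
SAMPLE_LOG = """\
2024-05-01T10:00:01 web-1 nginx: GET /health 200
2024-05-01T10:00:05 web-1 checkout: order 1842 failed: timeout
2024-05-01T10:03:12 web-2 checkout: order 1843 ok
2024-05-01T10:07:45 web-1 checkout: order 1842 retried
"""

def filter_logs(text, service=None, host=None, contains=None, start=None, end=None):
    """Return the lines that match every filter that was supplied."""
    matches = []
    for line in text.splitlines():
        parts = line.split(" ", 3)
        if len(parts) < 4:
            continue  # skip lines that do not fit the assumed layout
        stamp, line_host, line_service, message = parts
        when = datetime.fromisoformat(stamp)
        if service and line_service.rstrip(":") != service:
            continue
        if host and line_host != host:
            continue
        if contains and contains not in message:
            continue
        if start and when < start:
            continue
        if end and when > end:
            continue
        matches.append(line)
    return matches

# Narrow down to one service, one entity ID, and a time window.
hits = filter_logs(SAMPLE_LOG, service="checkout", contains="1842",
                   end=datetime(2024, 5, 1, 10, 5))
for line in hits:
    print(line)
```

Each added filter shrinks the result set, which is the point: start broad (service), then narrow by entity ID and time until only the relevant lines remain.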
- Metrics
- e.g., counts, latencies, rates like on-disk bytes, response time, error rate.
- => describe a system's aggregate behavior, and capture general systemic degradation.
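To make "aggregate behavior" concrete, the sketch below derives an error rate and two latency percentiles from a handful of request records. The records are invented sample data, and `percentile` is a simple nearest-rank helper written for this example rather than any monitoring system's definition.

```python
import math

# Hypothetical request records: (latency in milliseconds, HTTP status code).
requests = [(120, 200), (95, 200), (410, 500), (130, 200), (2200, 504),
            (101, 200), (99, 200), (150, 200), (87, 200), (143, 200)]

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

latencies = [latency for latency, _ in requests]
error_rate = sum(1 for _, status in requests if status >= 500) / len(requests)
p50 = percentile(latencies, 50)
p90 = percentile(latencies, 90)

print(f"error rate: {error_rate:.0%}, p50: {p50} ms, p90: {p90} ms")
```

Note how the median looks healthy while the error rate and the tail latency reveal the degradation; no single log line would show that as clearly.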
- Chrome dev tools
- e.g., missing UI behavior, failing requests to the backend, or weird data.
- Just grep the code
- Add new instrumentation
- => add logs on your laptop, in staging, or in a production environment.
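One low-effort way to add instrumentation is a decorator that logs each call's arguments, result, and duration. This is only a sketch using Python's standard `logging` module; the `instrumented` decorator and the `apply_discount` function are hypothetical names chosen for illustration.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")
log = logging.getLogger("debug.instrumentation")

def instrumented(func):
    """Log arguments, outcome, and elapsed time for each call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        log.debug("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # log.exception records the traceback along with the message.
            log.exception("%s raised after %.3fs", func.__name__,
                          time.perf_counter() - started)
            raise
        log.debug("%s returned %r in %.3fs", func.__name__, result,
                  time.perf_counter() - started)
        return result
    return wrapper

@instrumented
def apply_discount(price, percent):  # hypothetical function under investigation
    return round(price * (1 - percent / 100), 2)

total = apply_discount(80.0, 25)
```

Because the decorator only wraps the function, it can be added on a laptop or in staging and removed again without touching the logic being investigated.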
- Local system monitors
- e.g., top, htop, iotop, iostat.
- git bisect
- => check out one commit at a time, and test it until you find the earliest broken one.
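The search that `git bisect` automates can be sketched as a binary search over an ordered history. The code below does not call git at all; the commit names and the `is_broken` test are stand-ins, and it assumes the first commit is known good, the last is known bad, and the bug persists once introduced.

```python
def first_bad_commit(commits, is_broken):
    """Binary-search an ordered history for the earliest commit where is_broken is true.

    Assumes commits[0] is known good, commits[-1] is known bad, and that once
    the bug appears it stays in every later commit.
    """
    good, bad = 0, len(commits) - 1
    tested = []
    while bad - good > 1:
        mid = (good + bad) // 2
        tested.append(commits[mid])  # "check out" and test this one commit
        if is_broken(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad], tested

# Hypothetical history of 16 commits in which the bug first appears in c11.
history = [f"c{i:02d}" for i in range(16)]
culprit, tested = first_bad_commit(history, lambda commit: commit >= "c11")

print(f"earliest broken commit: {culprit}, commits tested: {tested}")
```

Here only four commits are tested out of sixteen. With real git, the same loop is driven by `git bisect start`, `git bisect good`, and `git bisect bad`, or automated with `git bisect run` and a test command.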
- Debugger
- printing variables, stepping through functions, and scripting data structure traversals.
- CPU profiler
- provides a view of where a program spends its CPU cycles over time => understand performance problems and crashes.
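As one example of a profiler, Python ships `cProfile` and `pstats` in its standard library. The sketch below profiles a deliberately slow, made-up `handle_request` function and prints the functions with the highest cumulative time. Note that `cProfile` is a deterministic profiler that times every function call, so it approximates CPU usage for CPU-bound code rather than sampling cycles directly.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n):
    """Deliberately naive loop so it dominates the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def handle_request():  # hypothetical entry point being investigated
    return slow_sum_of_squares(200_000) + sum(range(1_000))

profiler = cProfile.Profile()
profiler.enable()
result = handle_request()
profiler.disable()

# Render the five entries with the highest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time shows which call trees are expensive; sorting by `"tottime"` instead would highlight the individual functions doing the work.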
- Heap introspection
- => for memory leaks.
- tcpdump/ngrep/wireshark
- TCP interactions, encoding issues, or network issues.
- Tracing frameworks
- e.g., Linux's strace, Linux perf, DTrace.
- The steps to run an experiment:
- installing builds, restarting systems, changing configuration, hooking up external instruments, monitoring during the run, and finally collecting the right data at the end.
- Two methods:
- Making a detailed runbook (checklist) to ensure that all dashboards and data-collection docs are prepared in advance.
- Recording everything done during the experiment.
- with screenshots, copy-pastes, and links in a single Google Doc; time-stamping the completion of stages of the checklist.
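The two methods above can be combined in a small script: a checklist that must be completed in order, with a timestamp and a note recorded as each stage finishes. This is a minimal sketch; the `ExperimentLog` class and the example notes are hypothetical, and the output would still need to be pasted into whatever shared document the team uses.

```python
from datetime import datetime, timezone

class ExperimentLog:
    """Walk a runbook in order and record when each step finished, plus a note."""

    def __init__(self, steps):
        self.pending = list(steps)
        self.entries = []

    def complete(self, step, note=""):
        # Enforce the runbook order so that no stage is silently skipped.
        if not self.pending or self.pending[0] != step:
            raise ValueError(f"expected step {self.pending[:1]}, got {step!r}")
        self.pending.pop(0)
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.entries.append((stamp, step, note))

    def render(self):
        lines = [f"[{stamp}] DONE {step}" + (f" ({note})" if note else "")
                 for stamp, step, note in self.entries]
        lines += [f"[ pending ] {step}" for step in self.pending]
        return "\n".join(lines)

runbook = ExperimentLog([
    "install build",
    "restart system",
    "change configuration",
    "hook up external instruments",
    "monitor during the run",
    "collect data",
])
runbook.complete("install build", note="hypothetical build under test")
runbook.complete("restart system")
print(runbook.render())
```

Rejecting out-of-order steps is a deliberate choice here: a skipped restart or a forgotten instrument is exactly the kind of mistake that invalidates an experiment after the fact.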