tests/regress.py: prevent WiFiDriverDemo process leaks#62

Merged: josephnef merged 1 commit into master from fix-regress-process-leak on Jun 1, 2026

@josephnef
Collaborator

Why

Long-running Popens spawned by run_cell can outlive the script if:

  • an outer timeout / SIGKILL hits Python before run_cell's finally block executes (orphan reparents to init, PPID=1)
  • an unhandled exception aborts the run

Leaked WiFiDriverDemo processes keep the USB device claimed, which manifests in the kernel as:

usbfs: process N (WiFiDriverDemo) did not claim interface 0 before use

This message floods dmesg, and the held claim prevents aircrack-ng/88XXau from binding in the VM on subsequent cells (can't set config #1, error -32).

A leaked process during the 2026-05-30/31 8821AU 5GHz UNII-2 investigation ran for 2 h 58 min against the 8812AU and wedged the chip for the whole session; every "hardware-wedge" symptom was downstream of that.

What changes

Defense in depth (two independent layers):

  1. atexit + SIGTERM/SIGINT/SIGHUP handler walks a module-level _ACTIVE_LOCAL_PROCS set and SIGKILLs every Popen that's still running. Re-raises the signal at default disposition so the exit code reflects how the script died.

  2. preexec_fn for each spawned Popen calls prctl(PR_SET_PDEATHSIG, SIGKILL) + os.setsid(). The PDEATHSIG asks the kernel to kill the child the moment its parent dies — catches the pathological SIGKILL-from-outer-timeout case where Python has zero opportunity to run cleanup code. Replaces the previous start_new_session=True (which only handled the setsid half).
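A minimal sketch of the second layer, assuming a Linux host and calling `prctl` through `ctypes`. The helper names `make_preexec` and `spawn_local` are hypothetical and not taken from tests/regress.py; only `PR_SET_PDEATHSIG`, `os.setsid()`, and `preexec_fn` come from the description above.

```python
import ctypes
import os
import signal
import subprocess
import sys

PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>

# Load libc in the parent so the child does no library loading after fork().
_LIBC = ctypes.CDLL(None, use_errno=True) if sys.platform.startswith("linux") else None


def make_preexec():
    """Build a preexec_fn for subprocess.Popen (hypothetical helper)."""
    parent_pid = os.getpid()  # captured in the parent, before fork()

    def _preexec():
        # Runs in the child between fork() and exec().
        os.setsid()  # the half that start_new_session=True already covered
        if _LIBC is not None:
            # Ask the kernel to SIGKILL this child when its parent dies.
            if _LIBC.prctl(PR_SET_PDEATHSIG, int(signal.SIGKILL), 0, 0, 0) != 0:
                os._exit(127)
            # If the parent died before prctl() took effect, no signal
            # will ever arrive, so give up right away.
            if os.getppid() != parent_pid:
                os._exit(127)

    return _preexec


def spawn_local(argv):
    """Spawn a local child with both setsid and PDEATHSIG applied."""
    return subprocess.Popen(argv, preexec_fn=make_preexec())
```

Two caveats apply to this approach in general: the Python documentation warns that `preexec_fn` is not safe when the parent has multiple threads, and per prctl(2) the "parent" is the thread that created the child, so the spawn should happen on a thread that lives as long as the harness.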

Wired into _spawn_devourer_rx, _spawn_devourer_tx, _spawn_sniffer (all local Popens). _spawn_kernel_rx / _spawn_kernel_tx remain untouched — those run remotely via ssh and die naturally when the ssh tunnel breaks.

_install_cleanup_handlers() is called once at the top of main().

Test plan

  • PR_SET_PDEATHSIG works: kill -KILL $(python_pid) mid-spawn; WiFiDriverDemo child dies within milliseconds (verified by ps --ppid).
  • Signal-handler path works: kill -TERM $(python_pid) mid-spawn; handler runs _kill_all_local_procs, child gone before harness exit.
  • End-to-end matrix: full 4-cell matrix at ch100 — zero leaks after the run (ps -ef | grep WiFiDriver | wc -l == 0).
  • CI matrix
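For an automated version of the leak check, a small Linux-only helper that reads /proc directly may be more robust than a `ps | grep` pipeline, which can count the `grep` process itself. The function name `find_processes` is hypothetical; this is a sketch, not part of the harness.

```python
import os


def find_processes(name):
    """Return (pid, ppid) pairs for processes whose comm matches `name`.

    Linux-only: relies on /proc/<pid>/comm and /proc/<pid>/stat.
    """
    wanted = name[:15]  # the kernel truncates comm to 15 characters
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as fh:
                if fh.read().strip() != wanted:
                    continue
            with open(f"/proc/{entry}/stat") as fh:
                # After the closing ")" the fields are: state, ppid, ...
                ppid = int(fh.read().rsplit(")", 1)[1].split()[1])
        except (OSError, ValueError, IndexError):
            continue  # the process exited while it was being inspected
        matches.append((int(entry), ppid))
    return matches
```

A post-run assertion could then be `assert find_processes("WiFiDriverDemo") == []`, and filtering the result for `ppid == 1` would single out children that were reparented to init.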

🤖 Generated with Claude Code

Long-running Popens spawned by run_cell can outlive the script if
either (a) an outer `timeout` / SIGKILL hits Python before run_cell's
finally block executes, or (b) an unhandled exception aborts the run.
The orphaned child reparents to init (PPID=1) and keeps the USB device
claimed, causing kernel-side `usbfs: process N (WiFiDriverDemo) did
not claim interface 0 before use` spam and preventing
`aircrack-ng/88XXau` from binding in the VM on subsequent cells.

A leaked process from yesterday's debugging session ran for 2:58 hours
against the 8812AU and wedged the chip for the entire investigation —
every "can't set config #1, error -32" symptom was downstream of that.

Defense in depth (two independent layers):

1. atexit + SIGTERM/SIGINT/SIGHUP handler walks a module-level
   _ACTIVE_LOCAL_PROCS set and SIGKILLs every Popen that's still
   running. Re-raises the signal at default disposition so the exit
   code reflects how the script died.

2. preexec_fn for each spawned Popen calls
   prctl(PR_SET_PDEATHSIG, SIGKILL) + os.setsid(). The PDEATHSIG asks
   the kernel to kill the child the moment its parent dies — catches
   the pathological SIGKILL-from-outer-timeout case where Python has
   zero opportunity to run cleanup code. Replaces the previous
   `start_new_session=True` (which only handled the setsid half).

Wired into _spawn_devourer_rx, _spawn_devourer_tx, _spawn_sniffer
(all local Popens). _spawn_kernel_rx / _spawn_kernel_tx remain
untouched — those run remotely via ssh and die naturally when the
ssh tunnel breaks.

_install_cleanup_handlers() is called once at the top of main().

Verified:
- `kill -KILL $(python_pid)` mid-spawn: PR_SET_PDEATHSIG kills the
  WiFiDriverDemo child within ~ms (verified by `ps --ppid`).
- `kill -TERM $(python_pid)` mid-spawn: signal handler runs
  _kill_all_local_procs, child gone before harness exit.
- Full 4-cell matrix at ch100: zero leaks after the run
  (`ps -ef | grep WiFiDriver | wc -l` == 0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@josephnef josephnef force-pushed the fix-regress-process-leak branch from e443eaf to a46d6b4 Compare June 1, 2026 10:24
@josephnef josephnef merged commit 9615d92 into master Jun 1, 2026
5 checks passed
@josephnef josephnef deleted the fix-regress-process-leak branch June 1, 2026 10:26