tests/regress.py: prevent WiFiDriverDemo process leaks #62
Merged
Long-running Popens spawned by run_cell can outlive the script if either (a) an outer `timeout` / SIGKILL hits Python before run_cell's finally block executes, or (b) an unhandled exception aborts the run. The orphaned child reparents to init (PPID=1) and keeps the USB device claimed, causing kernel-side `usbfs: process N (WiFiDriverDemo) did not claim interface 0 before use` spam and preventing `aircrack-ng/88XXau` from binding in the VM on subsequent cells.

A leaked process from yesterday's debugging session ran for 2:58 hours against the 8812AU and wedged the chip for the entire investigation: every "can't set config #1, error -32" symptom was downstream of that.

Defense in depth (two independent layers):

1. atexit + SIGTERM/SIGINT/SIGHUP handler walks a module-level _ACTIVE_LOCAL_PROCS set and SIGKILLs every Popen that's still running. Re-raises the signal at default disposition so the exit code reflects how the script died.
2. preexec_fn for each spawned Popen calls prctl(PR_SET_PDEATHSIG, SIGKILL) + os.setsid(). The PDEATHSIG asks the kernel to kill the child the moment its parent dies, which catches the pathological SIGKILL-from-outer-timeout case where Python has zero opportunity to run cleanup code. Replaces the previous `start_new_session=True` (which only handled the setsid half).

Wired into _spawn_devourer_rx, _spawn_devourer_tx, _spawn_sniffer (all local Popens). _spawn_kernel_rx / _spawn_kernel_tx remain untouched; those run remotely via ssh and die naturally when the ssh tunnel breaks.

_install_cleanup_handlers() is called once at the top of main().

Verified:
- `kill -KILL $(python_pid)` mid-spawn: PR_SET_PDEATHSIG kills the WiFiDriverDemo child within ~ms (verified by `ps --ppid`).
- `kill -TERM $(python_pid)` mid-spawn: signal handler runs _kill_all_local_procs, child gone before harness exit.
- Full 4-cell matrix at ch100: zero leaks after the run (`ps -ef | grep WiFiDriver | wc -l` == 0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e443eaf to a46d6b4
Why
Long-running Popens spawned by `run_cell` can outlive the script if:

- `timeout` / SIGKILL hits Python before `run_cell`'s `finally` block executes (orphan reparents to init, PPID=1)

Leaked `WiFiDriverDemo` processes keep the USB device claimed, which manifests in the kernel as:

`usbfs: process N (WiFiDriverDemo) did not claim interface 0 before use`

flooding dmesg, and prevents `aircrack-ng/88XXau` from binding in the VM on subsequent cells (`can't set config #1, error -32`).

A leaked process during the 2026-05-30 / -31 8821AU 5 GHz UNII-2 investigation ran for 2:58 hours against the 8812AU and wedged the chip for the whole session; every "hardware-wedge" symptom was downstream of that.
What changes
Defense in depth (two independent layers):
1. `atexit` + SIGTERM/SIGINT/SIGHUP handler walks a module-level `_ACTIVE_LOCAL_PROCS` set and SIGKILLs every Popen that's still running. Re-raises the signal at default disposition so the exit code reflects how the script died.
2. `preexec_fn` for each spawned Popen calls `prctl(PR_SET_PDEATHSIG, SIGKILL)` + `os.setsid()`. The PDEATHSIG asks the kernel to kill the child the moment its parent dies, which catches the pathological SIGKILL-from-outer-timeout case where Python has zero opportunity to run cleanup code. Replaces the previous `start_new_session=True` (which only handled the setsid half).

Wired into `_spawn_devourer_rx`, `_spawn_devourer_tx`, `_spawn_sniffer` (all local Popens). `_spawn_kernel_rx` / `_spawn_kernel_tx` remain untouched; those run remotely via ssh and die naturally when the ssh tunnel breaks.

`_install_cleanup_handlers()` is called once at the top of `main()`.

Test plan

- `kill -KILL $(python_pid)` mid-spawn: `WiFiDriverDemo` child dies within ms (verified by `ps --ppid`).
- `kill -TERM $(python_pid)` mid-spawn: handler runs `_kill_all_local_procs`, child gone before harness exit.
- Full 4-cell matrix at ch100: zero leaks after the run (`ps -ef | grep WiFiDriver | wc -l` == 0).

🤖 Generated with Claude Code
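The second layer, the one exercised by the `kill -KILL` check, might look roughly like the sketch below. It is Linux-only and is not the code from this PR: the helper names `_child_preexec` and `spawn_with_pdeathsig` are assumptions, and `prctl` is reached through `ctypes` because the Python standard library has no wrapper for it. The constant `PR_SET_PDEATHSIG = 1` is taken from `<linux/prctl.h>`.

```python
import ctypes
import os
import signal
import subprocess
import sys

PR_SET_PDEATHSIG = 1  # value from <linux/prctl.h>

# Load libc before forking so the child only has to make the call.
_LIBC = ctypes.CDLL(None, use_errno=True)


def _child_preexec():
    """Runs in the child between fork() and exec(). Linux only."""
    # Ask the kernel to SIGKILL this process when its parent dies.
    if _LIBC.prctl(PR_SET_PDEATHSIG, int(signal.SIGKILL), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")
    # Own session, as start_new_session=True would have provided.
    os.setsid()


def spawn_with_pdeathsig(argv, **popen_kwargs):
    """Hypothetical helper: Popen whose child dies with its parent."""
    return subprocess.Popen(argv, preexec_fn=_child_preexec, **popen_kwargs)
```

Two caveats apply to this sketch. The setting survives a normal `exec` but is cleared when the child execs a set-user-ID or set-group-ID binary, and the kernel treats the thread that forked the child as the "parent", so children spawned from a short-lived worker thread are signalled when that thread exits rather than when the whole process does.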