🐛 fix(run): break deadlock in execution interrupt chain #3869
gaborbernat merged 2 commits into tox-dev:main
Conversation
Force-pushed from 300cef3 to 1939c55.
We should make a similar change to …
Force-pushed from 53486d3 to 9eddf5e.
On Windows CI (~1/40 runs), a subprocess can hang indefinitely during environment setup, either in virtualenv's interpreter discovery or during package installation/provisioning. This created an unbreakable deadlock:

- `thread.join()` blocked the main thread, so signals could not be delivered;
- `as_completed()` blocked the interrupt thread, so it could never check the interrupt event;
- `executor.shutdown(wait=True)` prevented `done.set()` from ever firing.

The fix:

- replace the blocking `as_completed()` with a polling `_next_completed()` that checks the interrupt event every second;
- make the interrupt thread a daemon so the process can exit even if the thread is stuck;
- use timeout loops for `thread.join()` so signals can be delivered;
- skip waiting for stuck workers on shutdown when interrupted.

This affected 18 flaky timeouts across 9 different tests in the last 30 days (89% Windows, 11% macOS).
That's not needed. Let me explain why: `status.wait()` at `api.py:462` blocks in `process.wait()` (which is `WaitForSingleObject` on Windows), but it gets unblocked when `status.interrupt()` is called, which kills the process.
Both paths now properly call `interrupt()`, which terminates the subprocess, which makes `WaitForSingleObject` return immediately. Polling with `wait(timeout=1)` would just add CPU overhead, checking every second when …
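A minimal sketch of the point being made: a thread blocked in `Popen.wait()` returns as soon as another thread terminates the process, so no polling loop is needed there. The child command and the timer delay are illustrative, not taken from the tox code:

```python
import subprocess
import sys
import threading

# Child process that would otherwise run for a minute.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Simulates status.interrupt(): kill the process from another thread.
threading.Timer(0.5, proc.terminate).start()

# Blocks in WaitForSingleObject on Windows; returns immediately once
# the process is terminated -- no wait(timeout=1) polling required.
proc.wait()
```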
On Windows CI (~1/40 runs), a subprocess can hang indefinitely during environment setup, either in virtualenv's interpreter discovery (Pattern A) or during package installation/provisioning (Pattern B). Analysis of the last 30 days of CI revealed 18 flaky timeout failures across 9 different tests: 89% on `windows-2025` and 11% on `macos-15`. 🪟 The affected tests are not specific; any test that runs tox in-process where a subprocess hangs triggers the same deadlock.

The root cause is an unbreakable deadlock chain in `common.py`:

- `thread.join()` blocks the main thread indefinitely, so signals from `pytest-timeout` can never be delivered.
- `as_completed()` blocks the `tox-interrupt` thread, so it can never check the `interrupt` event.
- `executor.shutdown(wait=True)` prevents `done.set()` from firing even after an interrupt is acknowledged.

For Pattern B, `tox_env.interrupt()` would kill the hung subprocess, since it is tracked in `_execute_statuses`, but it can never fire because `KeyboardInterrupt` can't reach the blocked main thread.

The changes:

- `thread.join(timeout=1)` loop
- `_next_completed` with interrupt check
- `executor.shutdown(wait=not interrupted)`
- `daemon=True` on thread
- `done.wait(timeout=5)` ⏱️

The blocking `as_completed()` is replaced with a polling `_next_completed()` that checks the `interrupt` event every second via `concurrent.futures.wait(timeout=1, return_when=FIRST_COMPLETED)`. The interrupt thread is made a daemon so the process can exit if it's stuck. `thread.join()` uses a timeout loop so signals can be delivered on Windows (where `lock.acquire()` without a timeout ignores signals). The interrupt handler gets bounded waits so cleanup doesn't hang forever.

For Pattern A, the upstream fix is in tox-dev/python-discovery#42, which adds a 5s timeout to `process.communicate()` in `_run_subprocess`. Together, these changes eliminate both hang patterns.
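The `thread.join()` timeout loop mentioned above can be sketched like this. The helper name `join_interruptibly` is hypothetical (the PR's actual loop lives inline in `common.py`), but the mechanism is the one described: rejoin in bounded slices so the interpreter can deliver signals between them.

```python
import threading


def join_interruptibly(thread: threading.Thread, poll: float = 1.0) -> None:
    """Join a thread in bounded slices instead of one indefinite join().

    On Windows, an untimed join() blocks in lock.acquire() without a
    timeout, which ignores signals. Returning to the interpreter between
    slices lets KeyboardInterrupt (e.g. raised by pytest-timeout) propagate.
    """
    while thread.is_alive():
        thread.join(timeout=poll)
```

Each pass through the loop is a point where a pending `KeyboardInterrupt` can fire, which is exactly what the blocking `join()` was preventing.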