🐛 fix(run): break deadlock in execution interrupt chain#3869

Merged
gaborbernat merged 2 commits into tox-dev:main from gaborbernat:flaky
Mar 9, 2026

Conversation

@gaborbernat
Member

On Windows CI (~1/40 runs), a subprocess can hang indefinitely during environment setup — either in virtualenv's interpreter discovery (Pattern A) or during package installation/provisioning (Pattern B). Analysis of the last 30 days of CI revealed 18 flaky timeout failures across 9 different tests, 89% on windows-2025 and 11% on macos-15. 🪟 The affected tests are not specific — any test that runs tox in-process where a subprocess hangs triggers the same deadlock.

The root cause is an unbreakable deadlock chain in common.py: thread.join() blocks the main thread indefinitely, so signals from pytest-timeout can never be delivered; as_completed() blocks the tox-interrupt thread, so it can never check the interrupt event; and executor.shutdown(wait=True) prevents done.set() from firing even after an interrupt is acknowledged. For Pattern B, tox_env.interrupt() would kill the hung subprocess (it is tracked in _execute_statuses), but it can never fire because KeyboardInterrupt cannot reach the blocked main thread.

| Fix | Pattern A (discovery) | Pattern B (install/provision) |
| --- | --- | --- |
| `thread.join(timeout=1)` loop | Allows signal delivery | Allows signal delivery |
| `_next_completed` with interrupt check | Breaks out of wait | Breaks out of wait |
| `executor.shutdown(wait=not interrupted)` | Skips stuck worker | Skips stuck worker |
| `daemon=True` on thread | Process can exit | Process can exit |
| `done.wait(timeout=5)` | Bounded cleanup | Bounded cleanup |
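The join-with-timeout loop from the table can be sketched as follows (the helper name `join_interruptibly` is illustrative, not the actual tox code):

```python
import threading
import time


def join_interruptibly(thread: threading.Thread) -> None:
    # A bare thread.join() on Windows ends up in lock.acquire() with no
    # timeout, which cannot be interrupted by signals. Joining in short
    # slices returns control to the interpreter loop between waits, so a
    # KeyboardInterrupt (e.g. from pytest-timeout) can be delivered.
    while thread.is_alive():
        thread.join(timeout=1)


if __name__ == "__main__":
    worker = threading.Thread(target=lambda: time.sleep(0.2))
    worker.start()
    join_interruptibly(worker)  # returns promptly once the worker exits
```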

⏱️ The blocking as_completed() is replaced with a polling _next_completed() that checks the interrupt event every second via concurrent.futures.wait(timeout=1, return_when=FIRST_COMPLETED). The interrupt thread is made daemon so the process can exit if it's stuck. thread.join() uses a timeout loop so signals can be delivered on Windows (where lock.acquire() without timeout ignores signals). The interrupt handler gets bounded waits so cleanup doesn't hang forever.
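A sketch of the polling replacement described above (the name `next_completed` and its exact signature are assumptions; the real `_next_completed` in tox may differ):

```python
import threading
from concurrent import futures


def next_completed(pending, interrupt: threading.Event):
    """Yield futures as they finish, re-checking the interrupt event each second."""
    pending = set(pending)
    while pending:
        # Unlike as_completed() with no timeout, this wakes up at least once
        # per second even if no future completes, so an interrupt is noticed.
        done, pending = futures.wait(
            pending, timeout=1, return_when=futures.FIRST_COMPLETED
        )
        yield from done
        if interrupt.is_set():
            # Stop waiting on stuck workers; the caller can then shut the
            # executor down with wait=False and let the process exit.
            return
```

A caller iterates it exactly like `as_completed()`, but the generator returns early once the interrupt event is set instead of blocking on a hung worker.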

For Pattern A, the upstream fix is in tox-dev/python-discovery#42 which adds a 5s timeout to process.communicate() in _run_subprocess. Together, these changes eliminate both hang patterns.

@gaborbernat gaborbernat added bug:normal affects many people or has quite an impact os:windows os:macOS labels Mar 7, 2026
@gaborbernat gaborbernat marked this pull request as draft March 7, 2026 03:08
@gaborbernat gaborbernat force-pushed the flaky branch 3 times, most recently from 300cef3 to 1939c55 Compare March 9, 2026 15:56
@rahuldevikar
Collaborator

We should make a similar change to LocalSubprocessExecuteStatus.wait() in __init__.py so it periodically yields control and checks for interrupts instead of blocking indefinitely on WaitForSingleObject.

On Windows CI (~1/40 runs), a subprocess can hang indefinitely during
environment setup — either in virtualenv's interpreter discovery or
during package installation/provisioning. This created an unbreakable
deadlock: thread.join() blocked the main thread so signals couldn't be
delivered, as_completed() blocked the interrupt thread so it couldn't
check the interrupt event, and executor.shutdown(wait=True) prevented
done.set() from ever firing.

Replace the blocking as_completed() with a polling _next_completed()
that checks the interrupt event every second, make the interrupt thread
a daemon so the process can exit if it's stuck, use timeout loops for
thread.join() so signals can be delivered, and skip waiting for stuck
workers on shutdown when interrupted. This affected 18 flaky timeouts
across 9 different tests in the last 30 days (89% Windows, 11% macOS).
@gaborbernat gaborbernat marked this pull request as ready for review March 9, 2026 19:14
@gaborbernat gaborbernat enabled auto-merge (squash) March 9, 2026 19:14
@gaborbernat
Member Author

> We should make a similar change to LocalSubprocessExecuteStatus.wait() in __init__.py so it periodically yields control and checks for interrupts instead of blocking indefinitely on WaitForSingleObject.

That's not needed. Let me explain why:

status.wait() at api.py:462 blocks in process.wait() (which is WaitForSingleObject on Windows). But it gets unblocked when status.interrupt() is called, which kills the process:

  • Ctrl+C path: execute()'s handler calls tox_env.interrupt() → status.interrupt() → process killed → wait() returns
  • Fail-fast path (with the fix we just added): pending_env.interrupt() → status.interrupt() → process killed → wait() returns

Both paths now properly call interrupt(), which terminates the subprocess, which in turn makes WaitForSingleObject return immediately. Polling with wait(timeout=1) would just add CPU overhead, checking every second while the process is running normally -- and it still wouldn't help, because there's nothing to check for: the interrupt is delivered by killing the process, not by checking a flag.
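A small illustration of why kill-based interruption suffices (the names here are illustrative, not tox internals): a thread blocked in process.wait() returns as soon as the process is killed, with no polling needed.

```python
import subprocess
import sys
import threading
import time

# Spawn a subprocess that would otherwise run for a minute.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def waiter() -> None:
    # On Windows this blocks in WaitForSingleObject; on POSIX in waitpid.
    proc.wait()


t = threading.Thread(target=waiter)
t.start()
time.sleep(0.5)   # let the waiter block
proc.kill()       # the "interrupt": terminate the hung subprocess
t.join(timeout=10)

assert not t.is_alive()            # wait() returned once the process died
assert proc.returncode is not None
```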

@gaborbernat gaborbernat merged commit e3876f3 into tox-dev:main Mar 9, 2026
27 checks passed
@gaborbernat gaborbernat deleted the flaky branch March 9, 2026 23:30