🐛 fix(run): break deadlock in execution interrupt chain#3869

Merged
gaborbernat merged 2 commits into tox-dev:main from gaborbernat:flaky
Mar 9, 2026

Conversation

@gaborbernat
Member

On Windows CI (~1/40 runs), a subprocess can hang indefinitely during environment setup — either in virtualenv's interpreter discovery (Pattern A) or during package installation/provisioning (Pattern B). Analysis of the last 30 days of CI revealed 18 flaky timeout failures across 9 different tests, 89% on windows-2025 and 11% on macos-15. 🪟 The affected tests are not specific — any test that runs tox in-process where a subprocess hangs triggers the same deadlock.

The root cause is an unbreakable deadlock chain in common.py: thread.join() blocks the main thread indefinitely, so signals from pytest-timeout can never be delivered; as_completed() blocks the tox-interrupt thread, so it can never check the interrupt event; and executor.shutdown(wait=True) prevents done.set() from firing even after an interrupt is acknowledged. For Pattern B, tox_env.interrupt() would kill the hung subprocess (it is tracked in _execute_statuses), but it can never fire because KeyboardInterrupt cannot reach the blocked main thread.

| Fix | Pattern A (discovery) | Pattern B (install/provision) |
| --- | --- | --- |
| `thread.join(timeout=1)` loop | Allows signal delivery | Allows signal delivery |
| `_next_completed` with interrupt check | Breaks out of wait | Breaks out of wait |
| `executor.shutdown(wait=not interrupted)` | Skips stuck worker | Skips stuck worker |
| `daemon=True` on thread | Process can exit | Process can exit |
| `done.wait(timeout=5)` | Bounded cleanup | Bounded cleanup |
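The join-with-timeout loop from the table can be sketched as follows (the helper name `join_interruptibly` is illustrative, not the actual tox code):

```python
import threading
import time


def join_interruptibly(thread: threading.Thread) -> None:
    # A bare thread.join() on Windows ends up in lock.acquire() with no
    # timeout, which cannot be interrupted by signals. Joining in short
    # slices returns control to the interpreter loop between waits, so a
    # KeyboardInterrupt (e.g. from pytest-timeout) can be delivered.
    while thread.is_alive():
        thread.join(timeout=1)


if __name__ == "__main__":
    worker = threading.Thread(target=lambda: time.sleep(0.2))
    worker.start()
    join_interruptibly(worker)  # returns promptly once the worker exits
```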

⏱️ The blocking as_completed() is replaced with a polling _next_completed() that checks the interrupt event every second via concurrent.futures.wait(timeout=1, return_when=FIRST_COMPLETED). The interrupt thread is made daemon so the process can exit if it's stuck. thread.join() uses a timeout loop so signals can be delivered on Windows (where lock.acquire() without timeout ignores signals). The interrupt handler gets bounded waits so cleanup doesn't hang forever.
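A sketch of the polling replacement described above (the name `next_completed` and its exact signature are assumptions; the real `_next_completed` in tox may differ):

```python
import threading
from concurrent import futures


def next_completed(pending, interrupt: threading.Event):
    """Yield futures as they finish, re-checking the interrupt event each second."""
    pending = set(pending)
    while pending:
        # Unlike as_completed() with no timeout, this wakes up at least once
        # per second even if no future completes, so an interrupt is noticed.
        done, pending = futures.wait(
            pending, timeout=1, return_when=futures.FIRST_COMPLETED
        )
        yield from done
        if interrupt.is_set():
            # Stop waiting on stuck workers; the caller can then shut the
            # executor down with wait=False and let the process exit.
            return
```

A caller iterates it exactly like `as_completed()`, but the generator returns early once the interrupt event is set instead of blocking on a hung worker.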

For Pattern A, the upstream fix is in tox-dev/python-discovery#42 which adds a 5s timeout to process.communicate() in _run_subprocess. Together, these changes eliminate both hang patterns.

@gaborbernat gaborbernat added bug:normal affects many people or has quite an impact os:windows os:macOS labels Mar 7, 2026
@gaborbernat gaborbernat marked this pull request as draft March 7, 2026 03:08
@gaborbernat gaborbernat force-pushed the flaky branch 3 times, most recently from 300cef3 to 1939c55 Compare March 9, 2026 15:56
@rahuldevikar
Collaborator

We should make a similar change to LocalSubprocessExecuteStatus.wait() in __init__.py so it periodically yields control and checks for interrupts instead of blocking indefinitely on WaitForSingleObject.

On Windows CI (~1/40 runs), a subprocess can hang indefinitely during
environment setup — either in virtualenv's interpreter discovery or
during package installation/provisioning. This created an unbreakable
deadlock: thread.join() blocked the main thread so signals couldn't be
delivered, as_completed() blocked the interrupt thread so it couldn't
check the interrupt event, and executor.shutdown(wait=True) prevented
done.set() from ever firing.

Replace the blocking as_completed() with a polling _next_completed()
that checks the interrupt event every second, make the interrupt thread
a daemon so the process can exit if it's stuck, use timeout loops for
thread.join() so signals can be delivered, and skip waiting for stuck
workers on shutdown when interrupted. This affected 18 flaky timeouts
across 9 different tests in the last 30 days (89% Windows, 11% macOS).
@gaborbernat gaborbernat marked this pull request as ready for review March 9, 2026 19:14
@gaborbernat gaborbernat enabled auto-merge (squash) March 9, 2026 19:14
@gaborbernat
Member Author

> We should make a similar change to LocalSubprocessExecuteStatus.wait() in __init__.py so it periodically yields control and checks for interrupts instead of blocking indefinitely on WaitForSingleObject.

That's not needed. Let me explain why:

status.wait() at api.py:462 blocks in process.wait() (which is WaitForSingleObject on Windows). But it gets unblocked when status.interrupt() is called, which kills the process:

  • Ctrl+C path: execute()'s handler calls tox_env.interrupt() → status.interrupt() → process killed → wait() returns
  • Fail-fast path (with the fix we just added): pending_env.interrupt() → status.interrupt() → process killed → wait() returns

Both paths now properly call interrupt(), which terminates the subprocess, which in turn makes WaitForSingleObject return immediately. Polling with wait(timeout=1) would just add CPU overhead, checking every second while the process is running normally -- and it still wouldn't help, because there's nothing to check for: the interrupt is delivered by killing the process, not by checking a flag.
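A small illustration of why kill-based interruption suffices (the names here are illustrative, not tox internals): a thread blocked in process.wait() returns as soon as the process is killed, with no polling needed.

```python
import subprocess
import sys
import threading
import time

# Spawn a subprocess that would otherwise run for a minute.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def waiter() -> None:
    # On Windows this blocks in WaitForSingleObject; on POSIX in waitpid.
    proc.wait()


t = threading.Thread(target=waiter)
t.start()
time.sleep(0.5)   # let the waiter block
proc.kill()       # the "interrupt": terminate the hung subprocess
t.join(timeout=10)

assert not t.is_alive()            # wait() returned once the process died
assert proc.returncode is not None
```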

@gaborbernat gaborbernat merged commit e3876f3 into tox-dev:main Mar 9, 2026
27 checks passed
@gaborbernat gaborbernat deleted the flaky branch March 9, 2026 23:30