@bendrucker bendrucker commented Jan 23, 2026

Fixes a KeyError crash when a replacement worker's test collection doesn't match the original. This can occur with --dist=loadgroup when a worker crashes and its replacement collects a different set of tests (e.g., because of a race condition or test files changing during the run).

Changes

  • Add a guard in _reschedule() to skip nodes that aren't in registered_collections
  • Add a regression test that simulates the crash scenario

Root Cause

When a worker crashes and is replaced:

  1. The replacement worker collects tests via add_node_collection()
  2. If the collection doesn't match the original, the method returns early without adding the node to registered_collections
  3. However, the node was already added to assigned_work via add_node()
  4. Later, schedule() iterates over self.nodes (from assigned_work.keys()) and calls _reschedule() on each
  5. _reschedule() calls _assign_work_unit(), which crashes with KeyError when it accesses registered_collections[node]

The fix adds a simple guard to skip nodes that aren't properly registered.
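
For illustration, the sequence above can be reproduced with a small toy model. This is a minimal sketch and not the actual pytest-xdist scheduler: the ToyScheduler class, the guard flag, the run() helper, and the node names gw0/gw1 are all hypothetical. Only the method and attribute names (add_node, add_node_collection, schedule, _reschedule, _assign_work_unit, assigned_work, registered_collections) are taken from the description, and their bodies here are simplified assumptions.

```python
class ToyScheduler:
    """Hypothetical stand-in for the scheduler; not pytest-xdist's real class."""

    def __init__(self, expected_collection, guard):
        self.expected_collection = list(expected_collection)
        self.guard = guard  # hypothetical switch: True enables the proposed check
        self.assigned_work = {}           # node -> work assigned so far
        self.registered_collections = {}  # node -> validated collection

    @property
    def nodes(self):
        # Nodes are derived from assigned_work, not registered_collections.
        return list(self.assigned_work.keys())

    def add_node(self, node):
        self.assigned_work[node] = {}

    def add_node_collection(self, node, collection):
        if list(collection) != self.expected_collection:
            return  # mismatch: early return, node is never registered
        self.registered_collections[node] = list(collection)

    def _assign_work_unit(self, node):
        # Raises KeyError for a node that was added but never registered.
        collection = self.registered_collections[node]
        self.assigned_work[node]["unit"] = collection

    def _reschedule(self, node):
        # The guard: skip nodes that are not in registered_collections.
        if self.guard and node not in self.registered_collections:
            return
        self._assign_work_unit(node)

    def schedule(self):
        for node in self.nodes:
            self._reschedule(node)


def run(guard):
    """Simulate a replacement worker with a mismatched collection."""
    sched = ToyScheduler(["test_a", "test_b"], guard=guard)
    sched.add_node("gw0")
    sched.add_node_collection("gw0", ["test_a", "test_b"])
    sched.add_node("gw1")                         # replacement worker
    sched.add_node_collection("gw1", ["test_a"])  # collection differs
    try:
        sched.schedule()
    except KeyError:
        return "KeyError"
    return "ok"
```

In this model, run(guard=False) hits the KeyError because gw1 is in assigned_work but not in registered_collections, while run(guard=True) completes and simply leaves gw1 without work.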

Related

When a worker crashes and is replaced, if the replacement worker's
collection doesn't match the original, add_node_collection() returns
early without adding the node to registered_collections. Later, when
schedule() calls _reschedule() on all nodes, _assign_work_unit() crashes
with KeyError accessing registered_collections[node].

Add a guard in _reschedule() to skip nodes that aren't in
registered_collections.

Fixes pytest-dev#1189
Fixes pytest-dev#714
@bendrucker bendrucker force-pushed the fix-replacement-worker-keyerror branch from 9a9e365 to 78bec67 Compare January 23, 2026 04:55
@bendrucker bendrucker changed the title Fix KeyError when replacement worker has mismatched collection Fix KeyError when replacement worker has mismatched collection Jan 23, 2026