fix: worker restart collection mismatch in loadscope/loadgroup scheduler #1348

Open
C1-BA-B1-F3 wants to merge 1 commit into pytest-dev:master from C1-BA-B1-F3:fix/worker-restart-collection-mismatch

Conversation

@C1-BA-B1-F3

Description

This PR fixes two related issues when a worker crashes and is restarted with --dist loadscope or --dist loadgroup:

Issue #1189: KeyError when replacement worker's collection doesn't match

When a worker crashes, its entry in registered_collections was not cleaned up. This caused:

  1. collection_is_completed to remain True even though the crashed node was gone
  2. Replacement workers whose collections didn't match the original would have their collections silently dropped (the return statement in add_node_collection())
  3. When schedule() tried to assign work to the replacement worker, it would raise KeyError because the worker wasn't in registered_collections
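
The failure sequence above can be reproduced with a toy model. This is a minimal sketch, not the actual xdist code: `ToyScheduler`, `numnodes`, `remove_node_buggy`, and the worker names are hypothetical stand-ins that only mirror the bookkeeping described in this PR.

```python
class ToyScheduler:
    """Hypothetical stand-in for the scheduler's collection bookkeeping."""

    def __init__(self, numnodes):
        self.numnodes = numnodes
        self.registered_collections = {}

    @property
    def collection_is_completed(self):
        # True once as many collections are registered as workers are expected.
        return len(self.registered_collections) >= self.numnodes

    def add_node_collection(self, node, collection):
        if self.collection_is_completed:
            # Late (replacement) worker: compare against a known collection.
            other = next(iter(self.registered_collections.values()))
            if collection != other:
                return  # mismatch: the collection is silently dropped
        self.registered_collections[node] = list(collection)

    def remove_node_buggy(self, node):
        # Old behaviour as described: the stale entry is left in place.
        pass

    def schedule(self, node):
        # Raises KeyError if the node never got registered.
        return self.registered_collections[node]


sched = ToyScheduler(numnodes=2)
sched.add_node_collection("gw0", ["test_a", "test_b"])
sched.add_node_collection("gw1", ["test_a", "test_b"])

sched.remove_node_buggy("gw1")                 # gw1 crashes
still_completed = sched.collection_is_completed

sched.add_node_collection("gw2", ["test_a"])   # replacement, different collection
try:
    sched.schedule("gw2")
    raised_key_error = False
except KeyError:
    raised_key_error = True
```

In this model the stale `gw1` entry keeps `collection_is_completed` true, so the replacement `gw2` takes the mismatch branch and is never registered, which is why the later lookup fails.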

Issue #1323: Hang when completed work units are requeued

When a worker crashed, ALL work units (including completed ones) were added back to the workqueue via self.workqueue.update(workload). When these completed work units were later assigned to a new worker:

  1. The filtering in _assign_work_unit() would produce an empty nodeids_indexes list (all items marked as completed)
  2. node.send_runtest_some([]) would be called with an empty list
  3. The worker would hang waiting for work that never arrives
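
The empty-assignment case can be sketched as follows. The work-unit shape (a mapping from test id to a completed flag), `FakeNode`, and `assign_work_unit` are assumptions made for illustration; only the `send_runtest_some` name comes from the description above.

```python
class FakeNode:
    """Hypothetical worker stub that records what it is asked to run."""

    def __init__(self):
        self.sent = []

    def send_runtest_some(self, indexes):
        self.sent.append(list(indexes))


def assign_work_unit(node, work_unit, collection):
    # Keep only the indexes of tests that have not completed yet.
    nodeids_indexes = [
        collection.index(nodeid)
        for nodeid, completed in work_unit.items()
        if not completed
    ]
    node.send_runtest_some(nodeids_indexes)
    return nodeids_indexes


collection = ["test_a.py::test_1", "test_a.py::test_2"]

# A work unit that was already finished before the crash.
finished_unit = {"test_a.py::test_1": True, "test_a.py::test_2": True}
# A work unit that still has one pending test.
partial_unit = {"test_a.py::test_1": True, "test_a.py::test_2": False}

node_a = FakeNode()
empty_result = assign_work_unit(node_a, finished_unit, collection)

node_b = FakeNode()
partial_result = assign_work_unit(node_b, partial_unit, collection)
```

Requeuing `finished_unit` yields an empty index list, so the worker is told to run nothing, while `partial_unit` yields only the pending index.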

Changes

src/xdist/scheduler/loadscope.py

  1. remove_node(): Now removes the crashed node from registered_collections using self.registered_collections.pop(node, None). This allows replacement workers to properly register their collections.

  2. remove_node(): Only requeues work units that have pending (uncompleted) items. Completed work units are filtered out before adding to the workqueue.

  3. add_node_collection(): Always registers the collection even if it doesn't match the existing collection. The mismatch will be properly reported by _check_nodes_have_same_collection() when schedule() is called.
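
As a rough illustration of the first two changes, the sketch below shows one way a `remove_node()` could drop the stale registration and requeue only work units with pending items. It is a simplified model under assumed data shapes (`assigned_work` maps node to scope to work unit, and a work unit maps test id to a completed flag); `ToyScopeScheduler` is hypothetical and is not the diff in this PR.

```python
from collections import OrderedDict


class ToyScopeScheduler:
    """Hypothetical, simplified model of the scheduler state."""

    def __init__(self):
        self.registered_collections = {}
        self.assigned_work = {}
        self.workqueue = OrderedDict()

    def remove_node(self, node):
        # Change 1: forget the crashed node's collection so that a
        # replacement worker can register its own.
        self.registered_collections.pop(node, None)

        # Change 2: requeue only work units that still have pending tests.
        workload = self.assigned_work.pop(node, {})
        for scope, work_unit in workload.items():
            if any(not completed for completed in work_unit.values()):
                self.workqueue[scope] = work_unit


sched = ToyScopeScheduler()
sched.registered_collections["gw0"] = ["a.py::t1", "b.py::t1", "b.py::t2"]
sched.assigned_work["gw0"] = OrderedDict(
    [
        ("a.py", {"a.py::t1": True}),                     # fully completed
        ("b.py", {"b.py::t1": True, "b.py::t2": False}),  # one test pending
    ]
)

sched.remove_node("gw0")
requeued_scopes = list(sched.workqueue)
```

Here only `b.py` returns to the queue, and `gw0` no longer counts towards the registered collections.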

Testing

All existing tests pass (220 passed, 6 skipped, 10 xfailed).

Related Issues

Fixes #1189
Fixes #1323

Fix two related issues when a worker crashes and is restarted:

1. **Issue pytest-dev#1189**: When a worker crashes, its entry in registered_collections
   was not cleaned up. This caused collection_is_completed to remain True,
   and replacement workers whose collections didn't match the original would
   have their collections silently dropped, leading to KeyError when trying
   to assign work.

2. **Issue pytest-dev#1323**: When a worker crashed, ALL work units (including completed
   ones) were added back to the workqueue. When these completed work units
   were assigned to a new worker, nodeids_indexes would be empty, causing
   the worker to hang waiting for work that never arrives.

Changes:
- remove_node(): Now removes the crashed node from registered_collections
- remove_node(): Only requeues work units that have pending (uncompleted) items
- add_node_collection(): Always registers collection even if it doesn't match,
  allowing _check_nodes_have_same_collection() to report the mismatch properly

Fixes pytest-dev#1189, pytest-dev#1323