fix: worker restart collection mismatch in loadscope/loadgroup scheduler#1348
Open
C1-BA-B1-F3 wants to merge 1 commit into
Open
fix: worker restart collection mismatch in loadscope/loadgroup scheduler#1348C1-BA-B1-F3 wants to merge 1 commit into
C1-BA-B1-F3 wants to merge 1 commit into
Conversation
Fix two related issues when a worker crashes and is restarted: 1. **Issue pytest-dev#1189**: When a worker crashes, its entry in registered_collections was not cleaned up. This caused collection_is_completed to remain True, and replacement workers whose collections didn't match the original would have their collections silently dropped, leading to KeyError when trying to assign work. 2. **Issue pytest-dev#1323**: When a worker crashed, ALL work units (including completed ones) were added back to the workqueue. When these completed work units were assigned to a new worker, nodeids_indexes would be empty, causing the worker to hang waiting for work that never arrives. Changes: - remove_node(): Now removes the crashed node from registered_collections - remove_node(): Only requeues work units that have pending (uncompleted) items - add_node_collection(): Always registers collection even if it doesn't match, allowing _check_nodes_have_same_collection() to report the mismatch properly Fixes pytest-dev#1189, pytest-dev#1323
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR fixes two related issues when a worker crashes and is restarted with
--dist loadscopeor--dist loadgroup:Issue #1189: KeyError when replacement worker's collection doesn't match
When a worker crashes, its entry in
registered_collectionswas not cleaned up. This caused:collection_is_completedto remain True even though the crashed node was gonereturnstatement inadd_node_collection())schedule()tried to assign work to the replacement worker, it would raiseKeyErrorbecause the worker wasn't inregistered_collectionsIssue #1323: Hang when completed work units are requeued
When a worker crashed, ALL work units (including completed ones) were added back to the workqueue via
self.workqueue.update(workload). When these completed work units were later assigned to a new worker:_assign_work_unit()would produce an emptynodeids_indexeslist (all items marked as completed)node.send_runtest_some([])would be called with an empty listChanges
src/xdist/scheduler/loadscope.pyremove_node(): Now removes the crashed node fromregistered_collectionsusingself.registered_collections.pop(node, None). This allows replacement workers to properly register their collections.remove_node(): Only requeues work units that have pending (uncompleted) items. Completed work units are filtered out before adding to the workqueue.add_node_collection(): Always registers the collection even if it doesn't match the existing collection. The mismatch will be properly reported by_check_nodes_have_same_collection()whenschedule()is called.Testing
All existing tests pass (220 passed, 6 skipped, 10 xfailed).
Related Issues
Fixes #1189
Fixes #1323