fix: download gets permanently stuck after cancellation#1614
Draft
AlexCheema wants to merge 1 commit intomainfrom
Draft
fix: download gets permanently stuck after cancellation#1614AlexCheema wants to merge 1 commit intomainfrom
AlexCheema wants to merge 1 commit intomainfrom
Conversation
_cancel_download() cancelled the asyncio task but never cleared the coordinator's local download_status or emitted an event to update global state. Both remained stuck on DownloadOngoing, causing any new StartDownload for the same model to be silently skipped. Also fixes apply_node_download_progress to match by model_id instead of full shard_metadata equality, preventing duplicate download entries when instance type changes (e.g. MlxJaccl -> MlxRing). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
When a model instance is deleted and recreated (or changed between instance types like MlxJaccl ↔ MlxRing), the download for that model gets permanently stuck. The state shows both
DownloadPendingandDownloadOngoingentries for the same model but with different shard metadata types. Deleting and recreating the instance never triggers a new download — it's stuck forever.Changes
1. Fix
_cancel_downloadto reset state (src/exo/download/coordinator.py)download_statusso coordinator accepts newStartDownloadcommandsDownloadPendingevent to reset global state (only if status wasDownloadOngoing)_last_progress_timethrottle tracking2. Handle
CancelledErrorindownload_wrapper(src/exo/download/coordinator.py)asyncio.CancelledErroris aBaseException, notException— was silently bypassing cleanupdownload_statuson cancellation3. Fix
apply_node_download_progressto match bymodel_id(src/exo/shared/apply.py)shard_metadataequality — caused duplicate entries when shard type changedmodel_card.model_id, consistent with_model_needs_downloadand coordinator dict key4. Add test for model_id-based replacement (
src/exo/shared/tests/)Why It Works
_cancel_download()was cancelling the asyncio task but never updating either the coordinator's localdownload_statusdict or the global event-sourced state. After cancellation:_start_download()sees staleDownloadOngoingindownload_status→ returns early ("already in progress")_model_needs_download()sees staleDownloadOngoingin global state → doesn't create new taskThe fix mirrors the existing cleanup pattern from
_delete_download()(which already handles this correctly).Test Plan
Manual Testing
Automated Testing
test_apply_download_progress_replaces_when_shard_metadata_type_changestest