Skip to content

fix: download gets permanently stuck after cancellation#1614

Draft
AlexCheema wants to merge 1 commit intomainfrom
alexcheema/fix-stuck-download-after-cancel
Draft

fix: download gets permanently stuck after cancellation#1614
AlexCheema wants to merge 1 commit intomainfrom
alexcheema/fix-stuck-download-after-cancel

Conversation

@AlexCheema
Copy link
Contributor

Motivation

When a model instance is deleted and recreated (or changed between instance types like MlxJaccl ↔ MlxRing), the download for that model gets permanently stuck. The state shows both DownloadPending and DownloadOngoing entries for the same model but with different shard metadata types. Deleting and recreating the instance never triggers a new download — it's stuck forever.

Changes

1. Fix _cancel_download to reset state (src/exo/download/coordinator.py)

  • Clear download_status so coordinator accepts new StartDownload commands
  • Emit DownloadPending event to reset global state (only if status was DownloadOngoing)
  • Clean up _last_progress_time throttle tracking
  • Split the AND condition into two independent ifs so each cleanup runs regardless of the other

2. Handle CancelledError in download_wrapper (src/exo/download/coordinator.py)

  • asyncio.CancelledError is a BaseException, not Exception — was silently bypassing cleanup
  • Now defensively clears download_status on cancellation

3. Fix apply_node_download_progress to match by model_id (src/exo/shared/apply.py)

  • Was matching by full shard_metadata equality — caused duplicate entries when shard type changed
  • Now matches by model_card.model_id, consistent with _model_needs_download and coordinator dict key

4. Add test for model_id-based replacement (src/exo/shared/tests/)

  • Test verifying that download progress with different shard types for the same model replaces rather than appends

Why It Works

_cancel_download() was cancelling the asyncio task but never updating either the coordinator's local download_status dict or the global event-sourced state. After cancellation:

  1. _start_download() sees stale DownloadOngoing in download_status → returns early ("already in progress")
  2. Worker's _model_needs_download() sees stale DownloadOngoing in global state → doesn't create new task
  3. Result: permanently stuck — download never restarts

The fix mirrors the existing cleanup pattern from _delete_download() (which already handles this correctly).

Test Plan

Manual Testing

Automated Testing

  • Added test_apply_download_progress_replaces_when_shard_metadata_type_changes test
  • All 223 existing tests continue to pass

_cancel_download() cancelled the asyncio task but never cleared the
coordinator's local download_status or emitted an event to update global
state. Both remained stuck on DownloadOngoing, causing any new
StartDownload for the same model to be silently skipped.

Also fixes apply_node_download_progress to match by model_id instead of
full shard_metadata equality, preventing duplicate download entries when
instance type changes (e.g. MlxJaccl -> MlxRing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant