fix: download gets permanently stuck after cancellation by AlexCheema · Pull Request #1614 · exo-explore/exo

AlexCheema · 2026-02-24T19:14:20Z

Motivation

When a model instance is deleted and recreated (or changed between instance types like MlxJaccl ↔ MlxRing), the download for that model gets permanently stuck. The state shows both DownloadPending and DownloadOngoing entries for the same model but with different shard metadata types. Deleting and recreating the instance never triggers a new download — it's stuck forever.

Changes

1. Fix _cancel_download to reset state (src/exo/download/coordinator.py)

Clear download_status so coordinator accepts new StartDownload commands
Emit DownloadPending event to reset global state (only if status was DownloadOngoing)
Clean up _last_progress_time throttle tracking
Split the AND condition into two independent ifs so each cleanup runs regardless of the other

2. Handle CancelledError in download_wrapper (src/exo/download/coordinator.py)

asyncio.CancelledError is a BaseException, not Exception — was silently bypassing cleanup
Now defensively clears download_status on cancellation

3. Fix apply_node_download_progress to match by model_id (src/exo/shared/apply.py)

Was matching by full shard_metadata equality — caused duplicate entries when shard type changed
Now matches by model_card.model_id, consistent with _model_needs_download and coordinator dict key

4. Add test for model_id-based replacement (src/exo/shared/tests/)

Test verifying that download progress with different shard types for the same model replaces rather than appends

Why It Works

_cancel_download() was cancelling the asyncio task but never updating either the coordinator's local download_status dict or the global event-sourced state. After cancellation:

_start_download() sees stale DownloadOngoing in download_status → returns early ("already in progress")
Worker's _model_needs_download() sees stale DownloadOngoing in global state → doesn't create new task
Result: permanently stuck — download never restarts

The fix mirrors the existing cleanup pattern from _delete_download() (which already handles this correctly).

Test Plan

Manual Testing

Automated Testing

Added test_apply_download_progress_replaces_when_shard_metadata_type_changes test
All 223 existing tests continue to pass

_cancel_download() cancelled the asyncio task but never cleared the coordinator's local download_status or emitted an event to update global state. Both remained stuck on DownloadOngoing, causing any new StartDownload for the same model to be silently skipped. Also fixes apply_node_download_progress to match by model_id instead of full shard_metadata equality, preventing duplicate download entries when instance type changes (e.g. MlxJaccl -> MlxRing). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: download gets permanently stuck after cancellation#1614

fix: download gets permanently stuck after cancellation#1614
AlexCheema wants to merge 1 commit intomainfrom
alexcheema/fix-stuck-download-after-cancel

AlexCheema commented Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

AlexCheema commented Feb 24, 2026

Motivation

Changes

Why It Works

Test Plan

Manual Testing

Automated Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant