Skip to content

[flytekit]: Fix checkpoint chain break by walking back through previous attempts#51

Open
devin-ai-integration[bot] wants to merge 1 commit intomasterfrom
devin/1776044045-checkpoint-walkback
Open

[flytekit]: Fix checkpoint chain break by walking back through previous attempts#51
devin-ai-integration[bot] wants to merge 1 commit intomasterfrom
devin/1776044045-checkpoint-walkback

Conversation

@devin-ai-integration
Copy link
Copy Markdown

Why are the changes needed?

Flytekit's checkpoint system uses a single-link chain: when Flyte Propeller retries a failed pod (attempt N), it passes --prev-checkpoint pointing only at attempt N-1. If N-1 was killed before writing a checkpoint (e.g. during container pull, OOM at init, or before ckpt.write() ran), restore() fails with FileNotFoundError and the new pod starts from scratch — even though earlier attempts (N-2, N-3, …) may have valid checkpoints on S3.

Since pod N also starts from step 0, it takes a long time before it writes its own checkpoint, and if it too is killed early, the chain stays broken indefinitely.

What changes were proposed in this pull request?

Two files changed:

flytekit/core/checkpointer.py

  • SyncCheckpoint.__init__ now accepts checkpoint_src as either a single path (backwards-compatible) or a list of paths.
  • restore() iterates through candidates in order (most-recent-attempt first). For each candidate it attempts download_directory; if the download succeeds and the directory contains files, it returns immediately. Empty directories and download errors are caught and the next candidate is tried.
  • _auto_forward() (new): on successful restore, copies the checkpoint to checkpoint_dest so the next retry can find it at attempt N's path without walking back. This is best-effort — failures don't block the restore.

flytekit/bin/entrypoint.py

  • _build_checkpoint_src_list() (new): parses the deterministic checkpoint path pattern (<prefix>-dn0-{attempt}/_flytecheckpoints/) to extract the current attempt number, then generates paths for attempts N-1 through 0. The Propeller-provided prev_checkpoint is always placed first (its prefix/hash is authoritative for N-1); generated candidates fill in N-2 … 0.
  • setup_execution() now passes the full candidate list to SyncCheckpoint instead of just the single prev_checkpoint.

Key design notes for reviewers

  1. Hash prefix caveat: The hash in the checkpoint path may differ per attempt since Propeller computes it. Generated candidates for N-2, N-3, etc. reuse the current attempt's prefix as a best guess. If the hash does differ, those candidates will 404 gracefully and be skipped. The Propeller-provided prev_checkpoint is prioritized for N-1 since its prefix is guaranteed correct.
  2. Fallback to single path: If the checkpoint path doesn't match the expected dn0-{N} pattern, the code falls back to [prev_checkpoint] — identical to the old behavior.
  3. Auto-forward is best-effort: _auto_forward wraps the upload in a try/except so a forwarding failure never blocks the restore itself.

How was this patch tested?

  • All 13 existing checkpoint unit tests pass (test_checkpointer.py + test_checkpoint.py).
  • _build_checkpoint_src_list was verified with inline tests covering: multiple attempts, attempt 0, unparseable paths, None inputs, and differing hash prefixes between prev_checkpoint and generated candidates.
  • SyncCheckpoint was tested with list input (empty-then-good candidates) to confirm walkback works end-to-end locally.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Link to Devin session: https://app.devin.ai/sessions/bf3c3ef41dff4ee3a3e27cb81ba3d2a1

…checkpoint

When a pod is killed before writing a checkpoint, the single-link chain
breaks and subsequent retries start from scratch. This change makes
SyncCheckpoint accept a list of previous checkpoint paths and try them
in reverse order (N-1, N-2, ..., 0) until it finds one with data.

On success, the checkpoint is auto-forwarded to the current attempt's
dest path so the next retry can find it without walking back again.

The entrypoint now computes the full list of previous attempt paths from
the deterministic checkpoint path pattern (dn0-{attempt}) and passes
them to SyncCheckpoint.

Co-Authored-By: unknown <>
@devin-ai-integration
Copy link
Copy Markdown
Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants