[flytekit]: Fix checkpoint chain break by walking back through previous attempts#51
Open
devin-ai-integration[bot] wants to merge 1 commit intomasterfrom
Open
[flytekit]: Fix checkpoint chain break by walking back through previous attempts#51devin-ai-integration[bot] wants to merge 1 commit intomasterfrom
devin-ai-integration[bot] wants to merge 1 commit intomasterfrom
Conversation
…checkpoint
When a pod is killed before writing a checkpoint, the single-link chain
breaks and subsequent retries start from scratch. This change makes
SyncCheckpoint accept a list of previous checkpoint paths and try them
in reverse order (N-1, N-2, ..., 0) until it finds one with data.
On success, the checkpoint is auto-forwarded to the current attempt's
dest path so the next retry can find it without walking back again.
The entrypoint now computes the full list of previous attempt paths from
the deterministic checkpoint path pattern (dn0-{attempt}) and passes
them to SyncCheckpoint.
Co-Authored-By: unknown <>
Author
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why are the changes needed?
Flytekit's checkpoint system uses a single-link chain: when Flyte Propeller retries a failed pod (attempt N), it passes
--prev-checkpointpointing only at attempt N-1. If N-1 was killed before writing a checkpoint (e.g. during container pull, OOM at init, or beforeckpt.write()ran),restore()fails withFileNotFoundErrorand the new pod starts from scratch — even though earlier attempts (N-2, N-3, …) may have valid checkpoints on S3.Since pod N also starts from step 0, it takes a long time before it writes its own checkpoint, and if it too is killed early, the chain stays broken indefinitely.
What changes were proposed in this pull request?
Two files changed:
flytekit/core/checkpointer.pySyncCheckpoint.__init__now acceptscheckpoint_srcas either a single path (backwards-compatible) or a list of paths.restore()iterates through candidates in order (most-recent-attempt first). For each candidate it attemptsdownload_directory; if the download succeeds and the directory contains files, it returns immediately. Empty directories and download errors are caught and the next candidate is tried._auto_forward()(new): on successful restore, copies the checkpoint tocheckpoint_destso the next retry can find it at attempt N's path without walking back. This is best-effort — failures don't block the restore.flytekit/bin/entrypoint.py_build_checkpoint_src_list()(new): parses the deterministic checkpoint path pattern (<prefix>-dn0-{attempt}/_flytecheckpoints/) to extract the current attempt number, then generates paths for attempts N-1 through 0. The Propeller-providedprev_checkpointis always placed first (its prefix/hash is authoritative for N-1); generated candidates fill in N-2 … 0.setup_execution()now passes the full candidate list toSyncCheckpointinstead of just the singleprev_checkpoint.Key design notes for reviewers
prev_checkpointis prioritized for N-1 since its prefix is guaranteed correct.dn0-{N}pattern, the code falls back to[prev_checkpoint]— identical to the old behavior._auto_forwardwraps the upload in a try/except so a forwarding failure never blocks the restore itself.How was this patch tested?
test_checkpointer.py+test_checkpoint.py)._build_checkpoint_src_listwas verified with inline tests covering: multiple attempts, attempt 0, unparseable paths,Noneinputs, and differing hash prefixes betweenprev_checkpointand generated candidates.SyncCheckpointwas tested with list input (empty-then-good candidates) to confirm walkback works end-to-end locally.Check all the applicable boxes
Link to Devin session: https://app.devin.ai/sessions/bf3c3ef41dff4ee3a3e27cb81ba3d2a1