Handle curl / xz / dd failures and perform retries#696
Handle curl / xz / dd failures and perform retries#696mangelajo merged 2 commits intojumpstarter-dev:mainfrom
Conversation
WalkthroughAdds a FlashError hierarchy, implements top-level retryable flash with exponential backoff, centralizes flash orchestration into _perform_flash_operation, expands progress monitoring and completion detection, and exposes a Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant U as User
participant CLI as CLI (flash)
participant FC as BaseFlasherClient
participant OP as _perform_flash_operation
participant FW as _flash_with_progress
participant DD as dd process
participant MON as _monitor_flash_progress
U->>CLI: flash --retries=N ...
CLI->>FC: flash(path, ..., retries=N)
loop retry up to N (exponential backoff)
FC->>OP: orchestrate setup, network, preflash, SSL, etc.
OP->>FW: start flashing (spawn dd / transfer)
FW->>DD: run dd / transfer process
par monitor progress
DD-->>FW: stdout/stderr/events
FW->>MON: stream logs / accumulated output
MON-->>FW: progress stats & completion markers
end
alt FLASH_COMPLETE
OP-->>FC: success
FC-->>CLI: return success
else ERROR
OP-->>FC: raise error
alt retryable
FC->>FC: wait(backoff) then retry
else non-retryable
FC-->>CLI: raise FlashNonRetryableError / FlashError
end
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used🧬 Code graph analysis (1)packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
🔇 Additional comments (10)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
✅ Deploy Preview for jumpstarter-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (3)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (3)
204-204: Consider re-raising the original exception instead of wrapping in RuntimeError.Wrapping unexpected exceptions in
RuntimeErrorobscures the original exception type, which may be important for debugging or for callers who want to handle specific exception types.Consider this alternative:
- raise RuntimeError(f"Flash operation failed: {e}") from e + # Unexpected error type, re-raise as-is to preserve type information + raise
216-282: Consider refactoring to reduce code duplication.The
_get_retryable_errorand_get_non_retryable_errormethods have nearly identical logic. Consider extracting a generic helper to improve maintainability.Here's a suggested refactor:
def _find_exception_of_type(self, exception: Exception, target_type: type) -> Exception | None: """Find an exception of a specific type in an exception or its causes. Args: exception: The exception to check target_type: The exception type to search for Returns: The matching exception if found, None otherwise """ # Check if this is an ExceptionGroup and look through its exceptions if hasattr(exception, 'exceptions'): for sub_exc in exception.exceptions: result = self._find_exception_of_type(sub_exc, target_type) if result is not None: return result # Check the current exception if isinstance(exception, target_type): return exception # Check the cause chain current = getattr(exception, '__cause__', None) while current is not None: if isinstance(current, target_type): return current # Also check if the cause is an ExceptionGroup if hasattr(current, 'exceptions'): for sub_exc in current.exceptions: result = self._find_exception_of_type(sub_exc, target_type) if result is not None: return result current = getattr(current, '__cause__', None) return None def _get_retryable_error(self, exception: Exception) -> FlashRetryableError | None: """Find a retryable error in an exception (or any of its causes).""" return self._find_exception_of_type(exception, FlashRetryableError) def _get_non_retryable_error(self, exception: Exception) -> FlashNonRetryableError | None: """Find a non-retryable error in an exception (or any of its causes).""" return self._find_exception_of_type(exception, FlashNonRetryableError)
479-496: Use an empty buffer and fallback on missing dd PID
- change
accumulated_output = pidof_outputtoaccumulated_output = ""to avoid false-positive marker matches- instead of raising
FlashNonRetryableErrorwhendd_pidisn’t found, log a warning, setdd_pid = "0", and rely on the existing completion‐marker fallback
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py(9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)
packages/jumpstarter/jumpstarter/common/exceptions.py (2)
ArgumentError(54-57)JumpstarterException(4-39)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Redirect rules - jumpstarter-docs
- GitHub Check: Header rules - jumpstarter-docs
- GitHub Check: Pages changed - jumpstarter-docs
- GitHub Check: build
- GitHub Check: pytest-matrix (macos-15, 3.13)
- GitHub Check: pytest-matrix (macos-15, 3.12)
- GitHub Check: pytest-matrix (macos-15, 3.11)
- GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
- GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
- GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
- GitHub Check: e2e
🔇 Additional comments (6)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (6)
28-37: LGTM! Clean exception hierarchy.The exception design properly extends
JumpstarterExceptionand provides clear separation between retryable and non-retryable errors. This will enable precise error handling in the retry logic.
25-25: LGTM! Proper imports for exception handling.Correctly imports the base exception classes needed for the new exception hierarchy.
284-349: LGTM! Well-structured refactoring.The extraction of flash orchestration into
_perform_flash_operationcleanly separates the retryable operations from the retry logic, making the code easier to understand and maintain.
607-609: LGTM! Correct error classification.Properly classifies curl failures as retryable errors, which makes sense for network-related issues.
1016-1021: LGTM! CLI integration is complete and correct.The
--retriesoption is properly exposed through the CLI with appropriate defaults, and correctly wired through to the flash method.Also applies to: 1035-1035, 1056-1056
460-465: LGTM! Robust pipeline construction.The flash command properly uses
set -o pipefailto ensure any failure in the curl/decompress/dd pipeline is detected. The split string markers ("F""LASH_COMPLETE") cleverly prevent premature matching during command assembly.
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py
Show resolved
Hide resolved
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py
Outdated
Show resolved
Hide resolved
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (3)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (3)
190-190: Cap the exponential backoff delay.The unbounded exponential backoff
2 ** attemptcan grow very large if users specify high retry counts (e.g.,retries=10would cause a ~8.5 minute delay before the final attempt).Apply this diff to cap the backoff at a reasonable maximum:
- time.sleep(2 ** attempt) # Exponential backoff + # Exponential backoff, capped at 60 seconds + time.sleep(min(2 ** attempt, 60))
206-209: Remove unreachable code.This
last_exceptioncheck is unreachable because:
- Retryable errors on the last attempt raise
FlashErrorat lines 194-196- Non-retryable errors raise
FlashErrorimmediately at line 200- Unexpected errors raise
RuntimeErrorat line 204- The only path that reaches line 206 is a successful flash (break at line 176), where
last_exceptionwould beNoneApply this diff to remove the dead code:
- if last_exception: - raise FlashError( - f"Flash operation failed after {retries + 1} attempts. Last error: {last_exception}" - ) from last_exception -
558-564: Fix incorrect comment about buffer size.The comment says "Keep only the last 128 bytes" but the code retains 64KB (64*1024 bytes).
Apply this diff:
def _update_accumulated_output(self, accumulated_output, data): """Update accumulated output with new data, keeping only last 64KB.""" accumulated_output += data - # Keep only the last 128 bytes to prevent memory growth + # Keep only the last 64KB to prevent memory growth if len(accumulated_output) > 64*1024: accumulated_output = accumulated_output[-64*1024:] return accumulated_output
🧹 Nitpick comments (3)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (3)
216-282: Consider extracting common error-finding logic.Both
_get_retryable_errorand_get_non_retryable_errorhave nearly identical implementations, differing only in the exception type they check.Consider refactoring to reduce duplication:
+ def _find_exception_of_type(self, exception: Exception, target_type: type) -> Exception | None: + """Find an exception of a specific type in an exception (or any of its causes). + + Args: + exception: The exception to check + target_type: The exception type to find + + Returns: + The exception of target_type if found, None otherwise + """ + # Check if this is an ExceptionGroup and look through its exceptions + if hasattr(exception, 'exceptions'): + for sub_exc in exception.exceptions: + result = self._find_exception_of_type(sub_exc, target_type) + if result is not None: + return result + + # Check the current exception + if isinstance(exception, target_type): + return exception + + # Check the cause chain + current = getattr(exception, '__cause__', None) + while current is not None: + if isinstance(current, target_type): + return current + # Also check if the cause is an ExceptionGroup + if hasattr(current, 'exceptions'): + for sub_exc in current.exceptions: + result = self._find_exception_of_type(sub_exc, target_type) + if result is not None: + return result + current = getattr(current, '__cause__', None) + return None + def _get_retryable_error(self, exception: Exception) -> FlashRetryableError | None: - """Find a retryable error in an exception (or any of its causes). - - Args: - exception: The exception to check - - Returns: - The FlashRetryableError if found, None otherwise - """ - # Check if this is an ExceptionGroup and look through its exceptions - if hasattr(exception, 'exceptions'): - for sub_exc in exception.exceptions: - result = self._get_retryable_error(sub_exc) - if result is not None: - return result - - # Check the current exception - if isinstance(exception, FlashRetryableError): - return exception - - # Check the cause chain - current = getattr(exception, '__cause__', None) - while current is not None: - if isinstance(current, FlashRetryableError): - return current - # Also check if the cause is an ExceptionGroup - if hasattr(current, 'exceptions'): - for sub_exc in current.exceptions: - result = self._get_retryable_error(sub_exc) - if result is not None: - return result - current = getattr(current, '__cause__', None) - return None + """Find a retryable error in an exception (or any of its causes).""" + return self._find_exception_of_type(exception, FlashRetryableError) def _get_non_retryable_error(self, exception: Exception) -> FlashNonRetryableError | None: - """Find a non-retryable error in an exception (or any of its causes). - - Args: - exception: The exception to check - - Returns: - The FlashNonRetryableError if found, None otherwise - """ - # Check if this is an ExceptionGroup and look through its exceptions - if hasattr(exception, 'exceptions'): - for sub_exc in exception.exceptions: - result = self._get_non_retryable_error(sub_exc) - if result is not None: - return result - - # Check the current exception - if isinstance(exception, FlashNonRetryableError): - return exception - - # Check the cause chain - current = getattr(exception, '__cause__', None) - while current is not None: - if isinstance(current, FlashNonRetryableError): - return current - # Also check if the cause is an ExceptionGroup - if hasattr(current, 'exceptions'): - for sub_exc in current.exceptions: - result = self._get_non_retryable_error(sub_exc) - if result is not None: - return result - current = getattr(current, '__cause__', None) - return None + """Find a non-retryable error in an exception (or any of its causes).""" + return self._find_exception_of_type(exception, FlashNonRetryableError)
482-482: Clarify the initialization comment.The comment says "just in case we get the FLASH_COMPLETE or FLASH_FAILED markers soon", but these markers are emitted by the flash command's echo statements (lines 464), not by
pidof. The initialization withpidof_outputseems to be for continuity of the accumulated buffer rather than expecting markers.Consider clarifying the comment:
- accumulated_output = pidof_output # just in case we get the FLASH_COMPLETE or FLASH_FAILED markers soon + accumulated_output = pidof_output # Initialize buffer with pidof output for continuity
1016-1021: Add validation for retries parameter.The
--retriesoption accepts any integer value, including negative numbers or excessively large values that could cause issues (e.g., waiting hours due to exponential backoff).Consider adding validation:
@click.option( "--retries", type=int, default=3, help="Number of retry attempts for flash operation (default: 3)", ) + @click.option( + "--retries", + type=click.IntRange(min=0, max=10), + default=3, + help="Number of retry attempts for flash operation (default: 3, max: 10)", + )Or add manual validation at the start of the
flashcommand handler.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py(9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)
packages/jumpstarter/jumpstarter/common/exceptions.py (2)
ArgumentError(54-57)JumpstarterException(4-39)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Redirect rules - jumpstarter-docs
- GitHub Check: Header rules - jumpstarter-docs
- GitHub Check: Pages changed - jumpstarter-docs
- GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
- GitHub Check: pytest-matrix (macos-15, 3.13)
- GitHub Check: pytest-matrix (macos-15, 3.12)
- GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
- GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
- GitHub Check: pytest-matrix (macos-15, 3.11)
- GitHub Check: e2e
🔇 Additional comments (11)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (11)
28-37: LGTM!The exception hierarchy is well-structured and follows the pattern established by
JumpstarterException. The distinction between retryable and non-retryable errors is clear and will aid in proper error handling.
284-349: LGTM!The centralization of flash orchestration into
_perform_flash_operationis well-structured. The method properly encapsulates all retry-able operations and maintains clear separation of concerns.
494-496: Consider making "dd process not found" retryable.If
ddfails to start due to a transient issue (e.g., temporary resource exhaustion), raisingFlashNonRetryableErrorprevents retrying what might be a recoverable situation.Should this be
FlashRetryableErrorinstead? Or add logic to distinguish between "dd never started" (retryable) vs. "dd command not found" (non-retryable)?
538-546: LGTM!The completion marker detection is clear and correctly classifies
FLASH_FAILEDas a retryable error, which is appropriate for pipeline failures.
548-556: LGTM!The 5-second grace period for completion markers is reasonable, and classifying the absence of markers as non-retryable is appropriate since it likely indicates a configuration or script issue rather than a transient failure.
566-584: LGTM!The progress tracking logic is clean and efficient, with appropriate 5-second intervals to avoid excessive logging while still providing useful feedback.
607-609: LGTM!Changing to
FlashRetryableErrorfor URL access failures is the right choice, as network issues and temporary unavailability are typically transient conditions that benefit from retry.
459-467: LGTM!The use of
pipefailensures proper error propagation through the pipeline, and splitting the echo strings (e.g.,"F""LASH_COMPLETE") is a good technique to avoid premature pattern matching.
1023-1057: LGTM!The
retriesparameter is properly threaded from the CLI option through to theflash()method call, maintaining consistency throughout the call chain.
135-156: LGTM!Correctly initializing
storage_threadanderror_queuebefore the retry loop ensures that the background transfer only happens once and prevents duplicate downloads on retry attempts.
211-214: LGTM!The timing information with formatted output (minutes:seconds) provides valuable observability for flash operations.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)
190-190: Cap the exponential backoff delay.The unbounded exponential backoff
2 ** attemptcan grow very large if users specify high retry counts (e.g.,retries=10would cause a ~8.5 minute delay before the final attempt).Apply this diff to cap the backoff at a reasonable maximum:
- time.sleep(2 ** attempt) # Exponential backoff + # Exponential backoff, capped at 60 seconds + time.sleep(min(2 ** attempt, 60))
🧹 Nitpick comments (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)
544-552: Consider making the completion marker timeout configurable.The 5-second wait after dd finishes (line 548) may be insufficient on slower systems or under heavy load, potentially causing false negatives when the echo command is delayed.
Consider either:
- Increasing the timeout to 10-15 seconds for more margin, or
- Making it configurable via a manifest or constant
For example:
+# Timeout in seconds to wait for completion marker after dd finishes +DD_COMPLETION_MARKER_TIMEOUT = 10 + def _handle_dd_finished(self, dd_finished_time): """Handle case when dd process has finished but no completion marker found yet.""" if dd_finished_time is None: dd_finished_time = time.time() - elif time.time() - dd_finished_time > 5: # Wait up to 5 seconds for echo + elif time.time() - dd_finished_time > DD_COMPLETION_MARKER_TIMEOUT: raise FlashNonRetryableError("Flash operation completed without success/failure marker")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py(9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)
packages/jumpstarter/jumpstarter/common/exceptions.py (2)
ArgumentError(54-57)JumpstarterException(4-39)
🪛 GitHub Actions: Lint
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py
[error] 183-183: Ruff: F841 Local variable 'last_exception' is assigned to but never used.
🪛 GitHub Check: ruff
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py
[failure] 183-183: Ruff (F841)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py:183:25: F841 Local variable last_exception is assigned to but never used
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Redirect rules - jumpstarter-docs
- GitHub Check: Header rules - jumpstarter-docs
- GitHub Check: Pages changed - jumpstarter-docs
- GitHub Check: pytest-matrix (macos-15, 3.11)
- GitHub Check: pytest-matrix (macos-15, 3.13)
- GitHub Check: pytest-matrix (macos-15, 3.12)
- GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
- GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
- GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
- GitHub Check: e2e
🔇 Additional comments (5)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (5)
28-37: LGTM! Well-designed exception hierarchy.The exception hierarchy cleanly separates retryable errors (network, timeout) from non-retryable errors (configuration, filesystem), which enables precise error handling in the retry logic.
98-98: LGTM! Reasonable default retry count.The default of 3 retries (4 total attempts including the initial one) strikes a good balance between resilience and avoiding excessive delays for persistent failures.
212-278: LGTM! Thorough error inspection logic.The recursive error inspection correctly handles both exception chains (
__cause__) andExceptionGrouphierarchies, ensuring retryable/non-retryable errors are detected even when nested or wrapped.
280-345: LGTM! Well-structured flash orchestration.Centralizing all console operations into a single method enables clean retry semantics at the highest level, ensuring that transient console issues (network glitches, DHCP timeouts) can be recovered by retrying the entire setup sequence.
1012-1017: LGTM! Clean CLI integration.The
--retriesoption is properly exposed with clear help text, and the parameter is correctly forwarded from the CLI handler to the core flash method.Also applies to: 1031-1032, 1052-1052
| non_retryable_error = self._get_non_retryable_error(e) | ||
|
|
||
| if retryable_error is not None: | ||
| last_exception = e |
There was a problem hiding this comment.
Remove unused variable.
The variable last_exception is assigned but never used, as flagged by static analysis and pipeline failures. The actual exception e is correctly used in the raise statement at line 196.
Apply this diff:
- last_exception = e
if attempt < retries:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| last_exception = e |
🧰 Tools
🪛 GitHub Actions: Lint
[error] 183-183: Ruff: F841 Local variable 'last_exception' is assigned to but never used.
🪛 GitHub Check: ruff
[failure] 183-183: Ruff (F841)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py:183:25: F841 Local variable last_exception is assigned to but never used
🤖 Prompt for AI Agents
In packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py
around line 183, remove the unused assignment "last_exception = e" since the
variable is never referenced elsewhere and the raised exception uses "e"
directly; simply delete that line to satisfy static analysis and pipeline
checks.
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py
Show resolved
Hide resolved
f7fe892 to
31a1d34
Compare
|
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin release-0.7
git worktree add -d .worktree/backport-696-to-release-0.7 origin/release-0.7
cd .worktree/backport-696-to-release-0.7
git switch --create backport-696-to-release-0.7
git cherry-pick -x 2b30889b9b7513b12c2ae69dcb7313298b928b07 31a1d346d9ec355a580528160a2612dbe43850f2 |
|
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin release-0.7
git worktree add -d .worktree/backport-696-to-release-0.7 origin/release-0.7
cd .worktree/backport-696-to-release-0.7
git switch --create backport-696-to-release-0.7
git cherry-pick -x 2b30889b9b7513b12c2ae69dcb7313298b928b07 31a1d346d9ec355a580528160a2612dbe43850f2 |
|
ok, we need this backported first https://github.com/jumpstarter-dev/jumpstarter/pull/697 for this backport to apply |
|
Successfully created backport PR for |
Summary by CodeRabbit
New Features
Refactor
Reliability