Handle curl / xz / dd failures and perform retries by mangelajo · Pull Request #696 · jumpstarter-dev/jumpstarter-python

mangelajo · 2025-10-07T14:20:49Z

Summary by CodeRabbit

New Features
- Added a --retries option to the flash CLI (default 3) so flash operations automatically retry with exponential backoff for transient failures.
- Flash command now accepts and forwards retries to the core flash flow.
Refactor
- Centralized and streamlined the flash flow for consistent setup, execution, progress monitoring, and cleanup.
- Improved error classification (retryable vs non-retryable), added corresponding public exception types, and enhanced progress/completion handling.
Reliability
- More robust waiting for background storage transfers and improved handling of edge-case failures.

coderabbitai · 2025-10-07T14:21:00Z

Walkthrough

Adds a FlashError hierarchy, implements top-level retryable flash with exponential backoff, centralizes flash orchestration into _perform_flash_operation, expands progress monitoring and completion detection, and exposes a --retries CLI option and updated flash(..., retries=...) signature.

Changes

Cohort / File(s)	Summary
Flasher client `packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py`	Added `FlashError`, `FlashRetryableError`, `FlashNonRetryableError`; introduced top-level retry loop with exponential backoff; centralized orchestration into `_perform_flash_operation`; added `_flash_with_progress`, `_monitor_flash_progress`, `_update_accumulated_output`, `_update_progress_stats`, `_check_completion_markers`, `_handle_dd_finished`; added error-inspection helpers for retry classification (including ExceptionGroup); integrated retries into curl/storage flows; updated `flash(...)` signature to accept `retries`.
CLI wiring `.../cli/*` (flash command wiring)	Exposed `--retries` CLI option; updated CLI flash handler signature to accept and forward `retries` to the flasher client's `flash` call.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant CLI as CLI (flash)
  participant FC as BaseFlasherClient
  participant OP as _perform_flash_operation
  participant FW as _flash_with_progress
  participant DD as dd process
  participant MON as _monitor_flash_progress

  U->>CLI: flash --retries=N ...
  CLI->>FC: flash(path, ..., retries=N)
  loop retry up to N (exponential backoff)
    FC->>OP: orchestrate setup, network, preflash, SSL, etc.
    OP->>FW: start flashing (spawn dd / transfer)
    FW->>DD: run dd / transfer process
    par monitor progress
      DD-->>FW: stdout/stderr/events
      FW->>MON: stream logs / accumulated output
      MON-->>FW: progress stats & completion markers
    end
    alt FLASH_COMPLETE
      OP-->>FC: success
      FC-->>CLI: return success
    else ERROR
      OP-->>FC: raise error
      alt retryable
        FC->>FC: wait(backoff) then retry
      else non-retryable
        FC-->>CLI: raise FlashNonRetryableError / FlashError
      end
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I press the flash, ears tuned to try,
Retries hop thrice beneath a patient sky.
I count the bytes and watch dd fleet,
Mark COMPLETE, decide retry or retreat—
Then thump! the image settles, snug and spry. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title succinctly captures the main enhancement in this changeset, which adds retry logic around failures in curl, xz, and dd operations, and it directly reflects the central focus on handling those failures and performing retries. It is clear, specific, and related to the implemented retry loop and error handling improvements without extraneous detail.
Docstring Coverage	✅ Passed	Docstring coverage is 86.67% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f7fe892 and 31a1d34.

📒 Files selected for processing (1)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (9 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)

packages/jumpstarter/jumpstarter/common/exceptions.py (2)

ArgumentError (54-57)

JumpstarterException (4-39)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: Redirect rules - jumpstarter-docs
GitHub Check: Header rules - jumpstarter-docs
GitHub Check: Pages changed - jumpstarter-docs
GitHub Check: pytest-matrix (macos-15, 3.12)
GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
GitHub Check: pytest-matrix (macos-15, 3.11)
GitHub Check: pytest-matrix (macos-15, 3.13)
GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
GitHub Check: e2e

🔇 Additional comments (10)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (10)

28-37: Well-designed exception hierarchy.

The three-tier exception hierarchy (FlashError → FlashRetryableError/FlashNonRetryableError) properly inherits from JumpstarterException and provides clear semantics for retry logic. This matches the existing exception patterns in the codebase (see jumpstarter.common.exceptions).

167-202: Solid retry orchestration.

The retry loop correctly implements the retry logic with proper error classification. The loop runs retries + 1 total attempts, breaks on success, and appropriately handles retryable vs non-retryable errors. The error messages clearly indicate the attempt count and failure reason.

Note: The exponential backoff at line 188 should be capped as previously flagged.

210-276: Thorough error classification logic.

The recursive inspection of exception chains (including ExceptionGroup and __cause__) ensures that retryable/non-retryable errors are correctly identified even when wrapped by multiple layers of exception handling. The symmetric implementation for both error types is maintainable.

278-343: Clean extraction of retryable flash operations.

Centralizing the console setup and flash sequence into _perform_flash_operation provides a clean boundary for retry logic. This keeps the retry orchestration simple while maintaining the detailed flash workflow.

470-530: Comprehensive progress monitoring.

The progress monitoring logic correctly handles multiple states: active dd process, finished dd process, and completion markers. The accumulation of output ensures completion markers are detected even if they arrive in separate reads.

Note: The dd PID extraction logic (lines 478-492) has a known issue with space-separated PIDs that was previously flagged.

532-540: Appropriate error classification for pipeline failures.

Treating FLASH_FAILED as retryable aligns with the PR's objective to handle transient curl/xz/dd failures. While not all pipeline failures are truly retryable (e.g., corrupted image), the bounded retry count provides a reasonable safety net without over-engineering error classification.

542-578: Well-factored helper methods.

The three helper methods cleanly separate concerns: handling dd completion, managing memory with bounded output accumulation, and rate-limiting progress logs. Each method has a single, clear responsibility.

601-603: Improved error classification for URL access failures.

Raising FlashRetryableError for URL access failures (instead of a generic RuntimeError) correctly enables retries for transient network issues. This is a key improvement for resilience.

1010-1051: Clean CLI integration.

The --retries option is well-integrated with clear help text and a reasonable default. The parameter is correctly wired through from CLI to the core flash method.

454-458: Verify pipefail support on target busybox shell
Busybox ash may not support set -o pipefail, so curl or decompression failures could be ignored. Confirm your DUT’s shell supports pipefail or add explicit error checks between each command.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

netlify · 2025-10-07T14:21:37Z

✅ Deploy Preview for jumpstarter-docs ready!

Name	Link
🔨 Latest commit	`31a1d34`
🔍 Latest deploy log	https://app.netlify.com/projects/jumpstarter-docs/deploys/68e6412c6c279d0008b27199
😎 Deploy Preview	https://deploy-preview-696--jumpstarter-docs.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (3)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (3)

204-204: Consider re-raising the original exception instead of wrapping in RuntimeError.

Wrapping unexpected exceptions in RuntimeError obscures the original exception type, which may be important for debugging or for callers who want to handle specific exception types.

Consider this alternative:

-                    raise RuntimeError(f"Flash operation failed: {e}") from e
+                    # Unexpected error type, re-raise as-is to preserve type information
+                    raise

216-282: Consider refactoring to reduce code duplication.

The _get_retryable_error and _get_non_retryable_error methods have nearly identical logic. Consider extracting a generic helper to improve maintainability.

Here's a suggested refactor:

def _find_exception_of_type(self, exception: Exception, target_type: type) -> Exception | None:
    """Find an exception of a specific type in an exception or its causes.
    
    Args:
        exception: The exception to check
        target_type: The exception type to search for
    
    Returns:
        The matching exception if found, None otherwise
    """
    # Check if this is an ExceptionGroup and look through its exceptions
    if hasattr(exception, 'exceptions'):
        for sub_exc in exception.exceptions:
            result = self._find_exception_of_type(sub_exc, target_type)
            if result is not None:
                return result
    
    # Check the current exception
    if isinstance(exception, target_type):
        return exception
    
    # Check the cause chain
    current = getattr(exception, '__cause__', None)
    while current is not None:
        if isinstance(current, target_type):
            return current
        # Also check if the cause is an ExceptionGroup
        if hasattr(current, 'exceptions'):
            for sub_exc in current.exceptions:
                result = self._find_exception_of_type(sub_exc, target_type)
                if result is not None:
                    return result
        current = getattr(current, '__cause__', None)
    return None

def _get_retryable_error(self, exception: Exception) -> FlashRetryableError | None:
    """Find a retryable error in an exception (or any of its causes)."""
    return self._find_exception_of_type(exception, FlashRetryableError)

def _get_non_retryable_error(self, exception: Exception) -> FlashNonRetryableError | None:
    """Find a non-retryable error in an exception (or any of its causes)."""
    return self._find_exception_of_type(exception, FlashNonRetryableError)

479-496: Use an empty buffer and fallback on missing dd PID

change accumulated_output = pidof_output to accumulated_output = "" to avoid false-positive marker matches
instead of raising FlashNonRetryableError when dd_pid isn’t found, log a warning, set dd_pid = "0", and rely on the existing completion‐marker fallback

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96077c6 and db95e3b.

📒 Files selected for processing (1)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (9 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)

packages/jumpstarter/jumpstarter/common/exceptions.py (2)

ArgumentError (54-57)

JumpstarterException (4-39)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)

GitHub Check: Redirect rules - jumpstarter-docs
GitHub Check: Header rules - jumpstarter-docs
GitHub Check: Pages changed - jumpstarter-docs
GitHub Check: build
GitHub Check: pytest-matrix (macos-15, 3.13)
GitHub Check: pytest-matrix (macos-15, 3.12)
GitHub Check: pytest-matrix (macos-15, 3.11)
GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
GitHub Check: e2e

🔇 Additional comments (6)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (6)

28-37: LGTM! Clean exception hierarchy.

The exception design properly extends JumpstarterException and provides clear separation between retryable and non-retryable errors. This will enable precise error handling in the retry logic.

25-25: LGTM! Proper imports for exception handling.

Correctly imports the base exception classes needed for the new exception hierarchy.

284-349: LGTM! Well-structured refactoring.

The extraction of flash orchestration into _perform_flash_operation cleanly separates the retryable operations from the retry logic, making the code easier to understand and maintain.

607-609: LGTM! Correct error classification.

Properly classifies curl failures as retryable errors, which makes sense for network-related issues.

1016-1021: LGTM! CLI integration is complete and correct.

The --retries option is properly exposed through the CLI with appropriate defaults, and correctly wired through to the flash method.

Also applies to: 1035-1035, 1056-1056

460-465: LGTM! Robust pipeline construction.

The flash command properly uses set -o pipefail to ensure any failure in the curl/decompress/dd pipeline is detected. The split string markers ("F""LASH_COMPLETE") cleverly prevent premature matching during command assembly.

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (3)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (3)
190-190: Cap the exponential backoff delay.

The unbounded exponential backoff 2 ** attempt can grow very large if users specify high retry counts (e.g., retries=10 would cause a ~8.5 minute delay before the final attempt).

Apply this diff to cap the backoff at a reasonable maximum:
-                        time.sleep(2 ** attempt)  # Exponential backoff
+                        # Exponential backoff, capped at 60 seconds
+                        time.sleep(min(2 ** attempt, 60))
206-209: Remove unreachable code.

This last_exception check is unreachable because:

Retryable errors on the last attempt raise FlashError at lines 194-196

Non-retryable errors raise FlashError immediately at line 200

Unexpected errors raise RuntimeError at line 204

The only path that reaches line 206 is a successful flash (break at line 176), where last_exception would be None

Apply this diff to remove the dead code:
-        if last_exception:
-            raise FlashError(
-                f"Flash operation failed after {retries + 1} attempts. Last error: {last_exception}"
-            ) from last_exception
-
558-564: Fix incorrect comment about buffer size.

The comment says "Keep only the last 128 bytes" but the code retains 64KB (64*1024 bytes).

Apply this diff:
     def _update_accumulated_output(self, accumulated_output, data):
         """Update accumulated output with new data, keeping only last 64KB."""
         accumulated_output += data
-        # Keep only the last 128 bytes to prevent memory growth
+        # Keep only the last 64KB to prevent memory growth
         if len(accumulated_output) > 64*1024:
             accumulated_output = accumulated_output[-64*1024:]
         return accumulated_output

🧹 Nitpick comments (3)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (3)

216-282: Consider extracting common error-finding logic.

Both _get_retryable_error and _get_non_retryable_error have nearly identical implementations, differing only in the exception type they check.

Consider refactoring to reduce duplication:

+    def _find_exception_of_type(self, exception: Exception, target_type: type) -> Exception | None:
+        """Find an exception of a specific type in an exception (or any of its causes).
+
+        Args:
+            exception: The exception to check
+            target_type: The exception type to find
+
+        Returns:
+            The exception of target_type if found, None otherwise
+        """
+        # Check if this is an ExceptionGroup and look through its exceptions
+        if hasattr(exception, 'exceptions'):
+            for sub_exc in exception.exceptions:
+                result = self._find_exception_of_type(sub_exc, target_type)
+                if result is not None:
+                    return result
+
+        # Check the current exception
+        if isinstance(exception, target_type):
+            return exception
+
+        # Check the cause chain
+        current = getattr(exception, '__cause__', None)
+        while current is not None:
+            if isinstance(current, target_type):
+                return current
+            # Also check if the cause is an ExceptionGroup
+            if hasattr(current, 'exceptions'):
+                for sub_exc in current.exceptions:
+                    result = self._find_exception_of_type(sub_exc, target_type)
+                    if result is not None:
+                        return result
+            current = getattr(current, '__cause__', None)
+        return None
+
     def _get_retryable_error(self, exception: Exception) -> FlashRetryableError | None:
-        """Find a retryable error in an exception (or any of its causes).
-
-        Args:
-            exception: The exception to check
-
-        Returns:
-            The FlashRetryableError if found, None otherwise
-        """
-        # Check if this is an ExceptionGroup and look through its exceptions
-        if hasattr(exception, 'exceptions'):
-            for sub_exc in exception.exceptions:
-                result = self._get_retryable_error(sub_exc)
-                if result is not None:
-                    return result
-
-        # Check the current exception
-        if isinstance(exception, FlashRetryableError):
-            return exception
-
-        # Check the cause chain
-        current = getattr(exception, '__cause__', None)
-        while current is not None:
-            if isinstance(current, FlashRetryableError):
-                return current
-            # Also check if the cause is an ExceptionGroup
-            if hasattr(current, 'exceptions'):
-                for sub_exc in current.exceptions:
-                    result = self._get_retryable_error(sub_exc)
-                    if result is not None:
-                        return result
-            current = getattr(current, '__cause__', None)
-        return None
+        """Find a retryable error in an exception (or any of its causes)."""
+        return self._find_exception_of_type(exception, FlashRetryableError)
 
     def _get_non_retryable_error(self, exception: Exception) -> FlashNonRetryableError | None:
-        """Find a non-retryable error in an exception (or any of its causes).
-
-        Args:
-            exception: The exception to check
-
-        Returns:
-            The FlashNonRetryableError if found, None otherwise
-        """
-        # Check if this is an ExceptionGroup and look through its exceptions
-        if hasattr(exception, 'exceptions'):
-            for sub_exc in exception.exceptions:
-                result = self._get_non_retryable_error(sub_exc)
-                if result is not None:
-                    return result
-
-        # Check the current exception
-        if isinstance(exception, FlashNonRetryableError):
-            return exception
-
-        # Check the cause chain
-        current = getattr(exception, '__cause__', None)
-        while current is not None:
-            if isinstance(current, FlashNonRetryableError):
-                return current
-            # Also check if the cause is an ExceptionGroup
-            if hasattr(current, 'exceptions'):
-                for sub_exc in current.exceptions:
-                    result = self._get_non_retryable_error(sub_exc)
-                    if result is not None:
-                        return result
-            current = getattr(current, '__cause__', None)
-        return None
+        """Find a non-retryable error in an exception (or any of its causes)."""
+        return self._find_exception_of_type(exception, FlashNonRetryableError)

482-482: Clarify the initialization comment.

The comment says "just in case we get the FLASH_COMPLETE or FLASH_FAILED markers soon", but these markers are emitted by the flash command's echo statements (lines 464), not by pidof. The initialization with pidof_output seems to be for continuity of the accumulated buffer rather than expecting markers.

Consider clarifying the comment:

-        accumulated_output = pidof_output # just in case we get the FLASH_COMPLETE or FLASH_FAILED markers soon
+        accumulated_output = pidof_output  # Initialize buffer with pidof output for continuity

1016-1021: Add validation for retries parameter.

The --retries option accepts any integer value, including negative numbers or excessively large values that could cause issues (e.g., waiting hours due to exponential backoff).

Consider adding validation:

         @click.option(
             "--retries",
             type=int,
             default=3,
             help="Number of retry attempts for flash operation (default: 3)",
         )
+        @click.option(
+            "--retries",
+            type=click.IntRange(min=0, max=10),
+            default=3,
+            help="Number of retry attempts for flash operation (default: 3, max: 10)",
+        )

Or add manual validation at the start of the flash command handler.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db95e3b and 299313b.

📒 Files selected for processing (1)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (9 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)

packages/jumpstarter/jumpstarter/common/exceptions.py (2)

ArgumentError (54-57)

JumpstarterException (4-39)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: Redirect rules - jumpstarter-docs
GitHub Check: Header rules - jumpstarter-docs
GitHub Check: Pages changed - jumpstarter-docs
GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
GitHub Check: pytest-matrix (macos-15, 3.13)
GitHub Check: pytest-matrix (macos-15, 3.12)
GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
GitHub Check: pytest-matrix (macos-15, 3.11)
GitHub Check: e2e

🔇 Additional comments (11)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (11)

28-37: LGTM!

The exception hierarchy is well-structured and follows the pattern established by JumpstarterException. The distinction between retryable and non-retryable errors is clear and will aid in proper error handling.

284-349: LGTM!

The centralization of flash orchestration into _perform_flash_operation is well-structured. The method properly encapsulates all retry-able operations and maintains clear separation of concerns.

494-496: Consider making "dd process not found" retryable.

If dd fails to start due to a transient issue (e.g., temporary resource exhaustion), raising FlashNonRetryableError prevents retrying what might be a recoverable situation.

Should this be FlashRetryableError instead? Or add logic to distinguish between "dd never started" (retryable) vs. "dd command not found" (non-retryable)?

538-546: LGTM!

The completion marker detection is clear and correctly classifies FLASH_FAILED as a retryable error, which is appropriate for pipeline failures.

548-556: LGTM!

The 5-second grace period for completion markers is reasonable, and classifying the absence of markers as non-retryable is appropriate since it likely indicates a configuration or script issue rather than a transient failure.

566-584: LGTM!

The progress tracking logic is clean and efficient, with appropriate 5-second intervals to avoid excessive logging while still providing useful feedback.

607-609: LGTM!

Changing to FlashRetryableError for URL access failures is the right choice, as network issues and temporary unavailability are typically transient conditions that benefit from retry.

459-467: LGTM!

The use of pipefail ensures proper error propagation through the pipeline, and splitting the echo strings (e.g., "F""LASH_COMPLETE") is a good technique to avoid premature pattern matching.

1023-1057: LGTM!

The retries parameter is properly threaded from the CLI option through to the flash() method call, maintaining consistency throughout the call chain.

135-156: LGTM!

Correctly initializing storage_thread and error_queue before the retry loop ensures that the background transfer only happens once and prevents duplicate downloads on retry attempts.

211-214: LGTM!

The timing information with formatted output (minutes:seconds) provides valuable observability for flash operations.

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)
190-190: Cap the exponential backoff delay.

The unbounded exponential backoff 2 ** attempt can grow very large if users specify high retry counts (e.g., retries=10 would cause a ~8.5 minute delay before the final attempt).

Apply this diff to cap the backoff at a reasonable maximum:
-                            time.sleep(2 ** attempt)  # Exponential backoff
+                            # Exponential backoff, capped at 60 seconds
+                            time.sleep(min(2 ** attempt, 60))

🧹 Nitpick comments (1)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)
544-552: Consider making the completion marker timeout configurable.

The 5-second wait after dd finishes (line 548) may be insufficient on slower systems or under heavy load, potentially causing false negatives when the echo command is delayed.

Consider either:

Increasing the timeout to 10-15 seconds for more margin, or

Making it configurable via a manifest or constant

For example:
+# Timeout in seconds to wait for completion marker after dd finishes
+DD_COMPLETION_MARKER_TIMEOUT = 10
+
 def _handle_dd_finished(self, dd_finished_time):
     """Handle case when dd process has finished but no completion marker found yet."""
     if dd_finished_time is None:
         dd_finished_time = time.time()
-    elif time.time() - dd_finished_time > 5:  # Wait up to 5 seconds for echo
+    elif time.time() - dd_finished_time > DD_COMPLETION_MARKER_TIMEOUT:
         raise FlashNonRetryableError("Flash operation completed without success/failure marker")

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 299313b and f7fe892.

📒 Files selected for processing (1)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (9 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)

packages/jumpstarter/jumpstarter/common/exceptions.py (2)

ArgumentError (54-57)

JumpstarterException (4-39)

🪛 GitHub Actions: Lint

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py

[error] 183-183: Ruff: F841 Local variable 'last_exception' is assigned to but never used.

🪛 GitHub Check: ruff

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py

[failure] 183-183: Ruff (F841)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py:183:25: F841 Local variable last_exception is assigned to but never used

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: Redirect rules - jumpstarter-docs
GitHub Check: Header rules - jumpstarter-docs
GitHub Check: Pages changed - jumpstarter-docs
GitHub Check: pytest-matrix (macos-15, 3.11)
GitHub Check: pytest-matrix (macos-15, 3.13)
GitHub Check: pytest-matrix (macos-15, 3.12)
GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
GitHub Check: e2e

🔇 Additional comments (5)

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (5)

28-37: LGTM! Well-designed exception hierarchy.

The exception hierarchy cleanly separates retryable errors (network, timeout) from non-retryable errors (configuration, filesystem), which enables precise error handling in the retry logic.

98-98: LGTM! Reasonable default retry count.

The default of 3 retries (4 total attempts including the initial one) strikes a good balance between resilience and avoiding excessive delays for persistent failures.

212-278: LGTM! Thorough error inspection logic.

The recursive error inspection correctly handles both exception chains (__cause__) and ExceptionGroup hierarchies, ensuring retryable/non-retryable errors are detected even when nested or wrapped.

280-345: LGTM! Well-structured flash orchestration.

Centralizing all console operations into a single method enables clean retry semantics at the highest level, ensuring that transient console issues (network glitches, DHCP timeouts) can be recovered by retrying the entire setup sequence.

1012-1017: LGTM! Clean CLI integration.

The --retries option is properly exposed with clear help text, and the parameter is correctly forwarded from the CLI handler to the core flash method.

Also applies to: 1031-1032, 1052-1052

coderabbitai · 2025-10-08T10:46:19Z

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py

+                    non_retryable_error = self._get_non_retryable_error(e)
+
+                    if retryable_error is not None:
+                        last_exception = e


⚠️ Potential issue | 🟡 Minor

Remove unused variable.

The variable last_exception is assigned but never used, as flagged by static analysis and pipeline failures. The actual exception e is correctly used in the raise statement at line 196.

Apply this diff:

- last_exception = e if attempt < retries:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

last_exception = e

🧰 Tools

🪛 GitHub Actions: Lint

[error] 183-183: Ruff: F841 Local variable 'last_exception' is assigned to but never used.

🪛 GitHub Check: ruff

[failure] 183-183: Ruff (F841)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py:183:25: F841 Local variable last_exception is assigned to but never used

🤖 Prompt for AI Agents

In packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py around line 183, remove the unused assignment "last_exception = e" since the variable is never referenced elsewhere and the raised exception uses "e" directly; simply delete that line to satisfy static analysis and pipeline checks.

packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py

jumpstarter-backport-bot · 2025-10-08T12:24:19Z

Backport failed for release-0.7, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release-0.7
git worktree add -d .worktree/backport-696-to-release-0.7 origin/release-0.7
cd .worktree/backport-696-to-release-0.7
git switch --create backport-696-to-release-0.7
git cherry-pick -x 2b30889b9b7513b12c2ae69dcb7313298b928b07 31a1d346d9ec355a580528160a2612dbe43850f2

jumpstarter-backport-bot · 2025-10-08T12:42:11Z

Backport failed for release-0.7, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release-0.7
git worktree add -d .worktree/backport-696-to-release-0.7 origin/release-0.7
cd .worktree/backport-696-to-release-0.7
git switch --create backport-696-to-release-0.7
git cherry-pick -x 2b30889b9b7513b12c2ae69dcb7313298b928b07 31a1d346d9ec355a580528160a2612dbe43850f2

mangelajo · 2025-10-08T12:51:56Z

ok, we need this backported first https://github.com/jumpstarter-dev/jumpstarter/pull/697 for this backport to apply

jumpstarter-backport-bot · 2025-10-09T11:21:57Z

Successfully created backport PR for release-0.7:

[Backport release-0.7] Handle curl / xz / dd failures and perform retries #703

Implement more robust flash pipeline

2b30889

coderabbitai bot reviewed Oct 7, 2025

View reviewed changes

kirkbrauer approved these changes Oct 7, 2025

View reviewed changes

coderabbitai bot reviewed Oct 7, 2025

View reviewed changes

mangelajo enabled auto-merge October 8, 2025 10:42

coderabbitai bot reviewed Oct 8, 2025

View reviewed changes

Handle retries

31a1d34

mangelajo force-pushed the curl-can-fail branch from f7fe892 to 31a1d34 Compare October 8, 2025 10:47

mangelajo merged commit 6373b76 into jumpstarter-dev:main Oct 8, 2025
18 checks passed

mangelajo added the backport release-0.7 label Oct 8, 2025

mangelajo added backport release-0.7 and removed backport release-0.7 labels Oct 8, 2025

mangelajo added backport release-0.7 and removed backport release-0.7 labels Oct 9, 2025

jumpstarter-backport-bot bot mentioned this pull request Oct 9, 2025

[Backport release-0.7] Handle curl / xz / dd failures and perform retries #703

Merged

coderabbitai bot mentioned this pull request Oct 15, 2025

refactor exception display #707

Open

coderabbitai bot mentioned this pull request Nov 4, 2025

flashers: make dhcp failure flash retryable #730

Merged

coderabbitai bot mentioned this pull request Nov 14, 2025

flasher: make unknown exceptions retriable #746

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle curl / xz / dd failures and perform retries#696

Handle curl / xz / dd failures and perform retries#696
mangelajo merged 2 commits intojumpstarter-dev:mainfrom
mangelajo:curl-can-fail

mangelajo commented Oct 7, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 7, 2025 •

edited

Loading

Uh oh!

netlify bot commented Oct 7, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Oct 8, 2025

Uh oh!

Uh oh!

Uh oh!

jumpstarter-backport-bot bot commented Oct 8, 2025

Uh oh!

jumpstarter-backport-bot bot commented Oct 8, 2025

Uh oh!

mangelajo commented Oct 8, 2025

Uh oh!

jumpstarter-backport-bot bot commented Oct 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mangelajo commented Oct 7, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

netlify bot commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for jumpstarter-docs ready!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jumpstarter-backport-bot bot commented Oct 8, 2025

Uh oh!

jumpstarter-backport-bot bot commented Oct 8, 2025

Uh oh!

mangelajo commented Oct 8, 2025

Uh oh!

jumpstarter-backport-bot bot commented Oct 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mangelajo commented Oct 7, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 7, 2025 •

edited

Loading

netlify bot commented Oct 7, 2025 •

edited

Loading