Skip to content
This repository was archived by the owner on Jan 23, 2026. It is now read-only.

Handle curl / xz / dd failures and perform retries#696

Merged
mangelajo merged 2 commits intojumpstarter-dev:mainfrom
mangelajo:curl-can-fail
Oct 8, 2025
Merged

Handle curl / xz / dd failures and perform retries#696
mangelajo merged 2 commits intojumpstarter-dev:mainfrom
mangelajo:curl-can-fail

Conversation

@mangelajo
Copy link
Copy Markdown
Member

@mangelajo mangelajo commented Oct 7, 2025

Summary by CodeRabbit

  • New Features

    • Added a --retries option to the flash CLI (default 3) so flash operations automatically retry with exponential backoff for transient failures.
    • Flash command now accepts and forwards retries to the core flash flow.
  • Refactor

    • Centralized and streamlined the flash flow for consistent setup, execution, progress monitoring, and cleanup.
    • Improved error classification (retryable vs non-retryable), added corresponding public exception types, and enhanced progress/completion handling.
  • Reliability

    • More robust waiting for background storage transfers and improved handling of edge-case failures.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Oct 7, 2025

Walkthrough

Adds a FlashError hierarchy, implements top-level retryable flash with exponential backoff, centralizes flash orchestration into _perform_flash_operation, expands progress monitoring and completion detection, and exposes a --retries CLI option and updated flash(..., retries=...) signature.

Changes

Cohort / File(s) Summary
Flasher client
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py
Added FlashError, FlashRetryableError, FlashNonRetryableError; introduced top-level retry loop with exponential backoff; centralized orchestration into _perform_flash_operation; added _flash_with_progress, _monitor_flash_progress, _update_accumulated_output, _update_progress_stats, _check_completion_markers, _handle_dd_finished; added error-inspection helpers for retry classification (including ExceptionGroup); integrated retries into curl/storage flows; updated flash(...) signature to accept retries.
CLI wiring
.../cli/* (flash command wiring)
Exposed --retries CLI option; updated CLI flash handler signature to accept and forward retries to the flasher client's flash call.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant CLI as CLI (flash)
  participant FC as BaseFlasherClient
  participant OP as _perform_flash_operation
  participant FW as _flash_with_progress
  participant DD as dd process
  participant MON as _monitor_flash_progress

  U->>CLI: flash --retries=N ...
  CLI->>FC: flash(path, ..., retries=N)
  loop retry up to N (exponential backoff)
    FC->>OP: orchestrate setup, network, preflash, SSL, etc.
    OP->>FW: start flashing (spawn dd / transfer)
    FW->>DD: run dd / transfer process
    par monitor progress
      DD-->>FW: stdout/stderr/events
      FW->>MON: stream logs / accumulated output
      MON-->>FW: progress stats & completion markers
    end
    alt FLASH_COMPLETE
      OP-->>FC: success
      FC-->>CLI: return success
    else ERROR
      OP-->>FC: raise error
      alt retryable
        FC->>FC: wait(backoff) then retry
      else non-retryable
        FC-->>CLI: raise FlashNonRetryableError / FlashError
      end
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I press the flash, ears tuned to try,
Retries hop thrice beneath a patient sky.
I count the bytes and watch dd fleet,
Mark COMPLETE, decide retry or retreat—
Then thump! the image settles, snug and spry. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly captures the main enhancement in this changeset, which adds retry logic around failures in curl, xz, and dd operations, and it directly reflects the central focus on handling those failures and performing retries. It is clear, specific, and related to the implemented retry loop and error handling improvements without extraneous detail.
Docstring Coverage ✅ Passed Docstring coverage is 86.67% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f7fe892 and 31a1d34.

📒 Files selected for processing (1)
  • packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)
packages/jumpstarter/jumpstarter/common/exceptions.py (2)
  • ArgumentError (54-57)
  • JumpstarterException (4-39)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Redirect rules - jumpstarter-docs
  • GitHub Check: Header rules - jumpstarter-docs
  • GitHub Check: Pages changed - jumpstarter-docs
  • GitHub Check: pytest-matrix (macos-15, 3.12)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
  • GitHub Check: pytest-matrix (macos-15, 3.11)
  • GitHub Check: pytest-matrix (macos-15, 3.13)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
  • GitHub Check: e2e
🔇 Additional comments (10)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (10)

28-37: Well-designed exception hierarchy.

The three-tier exception hierarchy (FlashError → FlashRetryableError/FlashNonRetryableError) properly inherits from JumpstarterException and provides clear semantics for retry logic. This matches the existing exception patterns in the codebase (see jumpstarter.common.exceptions).


167-202: Solid retry orchestration.

The retry loop correctly implements the retry logic with proper error classification. The loop runs retries + 1 total attempts, breaks on success, and appropriately handles retryable vs non-retryable errors. The error messages clearly indicate the attempt count and failure reason.

Note: The exponential backoff at line 188 should be capped as previously flagged.


210-276: Thorough error classification logic.

The recursive inspection of exception chains (including ExceptionGroup and __cause__) ensures that retryable/non-retryable errors are correctly identified even when wrapped by multiple layers of exception handling. The symmetric implementation for both error types is maintainable.


278-343: Clean extraction of retryable flash operations.

Centralizing the console setup and flash sequence into _perform_flash_operation provides a clean boundary for retry logic. This keeps the retry orchestration simple while maintaining the detailed flash workflow.


470-530: Comprehensive progress monitoring.

The progress monitoring logic correctly handles multiple states: active dd process, finished dd process, and completion markers. The accumulation of output ensures completion markers are detected even if they arrive in separate reads.

Note: The dd PID extraction logic (lines 478-492) has a known issue with space-separated PIDs that was previously flagged.


532-540: Appropriate error classification for pipeline failures.

Treating FLASH_FAILED as retryable aligns with the PR's objective to handle transient curl/xz/dd failures. While not all pipeline failures are truly retryable (e.g., corrupted image), the bounded retry count provides a reasonable safety net without over-engineering error classification.


542-578: Well-factored helper methods.

The three helper methods cleanly separate concerns: handling dd completion, managing memory with bounded output accumulation, and rate-limiting progress logs. Each method has a single, clear responsibility.


601-603: Improved error classification for URL access failures.

Raising FlashRetryableError for URL access failures (instead of a generic RuntimeError) correctly enables retries for transient network issues. This is a key improvement for resilience.


1010-1051: Clean CLI integration.

The --retries option is well-integrated with clear help text and a reasonable default. The parameter is correctly wired through from CLI to the core flash method.


454-458: Verify pipefail support on target busybox shell
Busybox ash may not support set -o pipefail, so curl or decompression failures could be ignored. Confirm your DUT’s shell supports pipefail or add explicit error checks between each command.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@netlify
Copy link
Copy Markdown

netlify bot commented Oct 7, 2025

Deploy Preview for jumpstarter-docs ready!

Name Link
🔨 Latest commit 31a1d34
🔍 Latest deploy log https://app.netlify.com/projects/jumpstarter-docs/deploys/68e6412c6c279d0008b27199
😎 Deploy Preview https://deploy-preview-696--jumpstarter-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (3)

204-204: Consider re-raising the original exception instead of wrapping in RuntimeError.

Wrapping unexpected exceptions in RuntimeError obscures the original exception type, which may be important for debugging or for callers who want to handle specific exception types.

Consider this alternative:

-                    raise RuntimeError(f"Flash operation failed: {e}") from e
+                    # Unexpected error type, re-raise as-is to preserve type information
+                    raise

216-282: Consider refactoring to reduce code duplication.

The _get_retryable_error and _get_non_retryable_error methods have nearly identical logic. Consider extracting a generic helper to improve maintainability.

Here's a suggested refactor:

def _find_exception_of_type(self, exception: Exception, target_type: type) -> Exception | None:
    """Find an exception of a specific type in an exception or its causes.
    
    Args:
        exception: The exception to check
        target_type: The exception type to search for
    
    Returns:
        The matching exception if found, None otherwise
    """
    # Check if this is an ExceptionGroup and look through its exceptions
    if hasattr(exception, 'exceptions'):
        for sub_exc in exception.exceptions:
            result = self._find_exception_of_type(sub_exc, target_type)
            if result is not None:
                return result
    
    # Check the current exception
    if isinstance(exception, target_type):
        return exception
    
    # Check the cause chain
    current = getattr(exception, '__cause__', None)
    while current is not None:
        if isinstance(current, target_type):
            return current
        # Also check if the cause is an ExceptionGroup
        if hasattr(current, 'exceptions'):
            for sub_exc in current.exceptions:
                result = self._find_exception_of_type(sub_exc, target_type)
                if result is not None:
                    return result
        current = getattr(current, '__cause__', None)
    return None

def _get_retryable_error(self, exception: Exception) -> FlashRetryableError | None:
    """Find a retryable error in an exception (or any of its causes)."""
    return self._find_exception_of_type(exception, FlashRetryableError)

def _get_non_retryable_error(self, exception: Exception) -> FlashNonRetryableError | None:
    """Find a non-retryable error in an exception (or any of its causes)."""
    return self._find_exception_of_type(exception, FlashNonRetryableError)

479-496: Use an empty buffer and fallback on missing dd PID

  • change accumulated_output = pidof_output to accumulated_output = "" to avoid false-positive marker matches
  • instead of raising FlashNonRetryableError when dd_pid isn’t found, log a warning, set dd_pid = "0", and rely on the existing completion‐marker fallback
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96077c6 and db95e3b.

📒 Files selected for processing (1)
  • packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)
packages/jumpstarter/jumpstarter/common/exceptions.py (2)
  • ArgumentError (54-57)
  • JumpstarterException (4-39)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Redirect rules - jumpstarter-docs
  • GitHub Check: Header rules - jumpstarter-docs
  • GitHub Check: Pages changed - jumpstarter-docs
  • GitHub Check: build
  • GitHub Check: pytest-matrix (macos-15, 3.13)
  • GitHub Check: pytest-matrix (macos-15, 3.12)
  • GitHub Check: pytest-matrix (macos-15, 3.11)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
  • GitHub Check: e2e
🔇 Additional comments (6)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (6)

28-37: LGTM! Clean exception hierarchy.

The exception design properly extends JumpstarterException and provides clear separation between retryable and non-retryable errors. This will enable precise error handling in the retry logic.


25-25: LGTM! Proper imports for exception handling.

Correctly imports the base exception classes needed for the new exception hierarchy.


284-349: LGTM! Well-structured refactoring.

The extraction of flash orchestration into _perform_flash_operation cleanly separates the retryable operations from the retry logic, making the code easier to understand and maintain.


607-609: LGTM! Correct error classification.

Properly classifies curl failures as retryable errors, which makes sense for network-related issues.


1016-1021: LGTM! CLI integration is complete and correct.

The --retries option is properly exposed through the CLI with appropriate defaults, and correctly wired through to the flash method.

Also applies to: 1035-1035, 1056-1056


460-465: LGTM! Robust pipeline construction.

The flash command properly uses set -o pipefail to ensure any failure in the curl/decompress/dd pipeline is detected. The split string markers ("F""LASH_COMPLETE") cleverly prevent premature matching during command assembly.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (3)

190-190: Cap the exponential backoff delay.

The unbounded exponential backoff 2 ** attempt can grow very large if users specify high retry counts (e.g., retries=10 would cause a ~8.5 minute delay before the final attempt).

Apply this diff to cap the backoff at a reasonable maximum:

-                        time.sleep(2 ** attempt)  # Exponential backoff
+                        # Exponential backoff, capped at 60 seconds
+                        time.sleep(min(2 ** attempt, 60))

206-209: Remove unreachable code.

This last_exception check is unreachable because:

  • Retryable errors on the last attempt raise FlashError at lines 194-196
  • Non-retryable errors raise FlashError immediately at line 200
  • Unexpected errors raise RuntimeError at line 204
  • The only path that reaches line 206 is a successful flash (break at line 176), where last_exception would be None

Apply this diff to remove the dead code:

-        if last_exception:
-            raise FlashError(
-                f"Flash operation failed after {retries + 1} attempts. Last error: {last_exception}"
-            ) from last_exception
-

558-564: Fix incorrect comment about buffer size.

The comment says "Keep only the last 128 bytes" but the code retains 64KB (64*1024 bytes).

Apply this diff:

     def _update_accumulated_output(self, accumulated_output, data):
         """Update accumulated output with new data, keeping only last 64KB."""
         accumulated_output += data
-        # Keep only the last 128 bytes to prevent memory growth
+        # Keep only the last 64KB to prevent memory growth
         if len(accumulated_output) > 64*1024:
             accumulated_output = accumulated_output[-64*1024:]
         return accumulated_output
🧹 Nitpick comments (3)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (3)

216-282: Consider extracting common error-finding logic.

Both _get_retryable_error and _get_non_retryable_error have nearly identical implementations, differing only in the exception type they check.

Consider refactoring to reduce duplication:

+    def _find_exception_of_type(self, exception: Exception, target_type: type) -> Exception | None:
+        """Find an exception of a specific type in an exception (or any of its causes).
+
+        Args:
+            exception: The exception to check
+            target_type: The exception type to find
+
+        Returns:
+            The exception of target_type if found, None otherwise
+        """
+        # Check if this is an ExceptionGroup and look through its exceptions
+        if hasattr(exception, 'exceptions'):
+            for sub_exc in exception.exceptions:
+                result = self._find_exception_of_type(sub_exc, target_type)
+                if result is not None:
+                    return result
+
+        # Check the current exception
+        if isinstance(exception, target_type):
+            return exception
+
+        # Check the cause chain
+        current = getattr(exception, '__cause__', None)
+        while current is not None:
+            if isinstance(current, target_type):
+                return current
+            # Also check if the cause is an ExceptionGroup
+            if hasattr(current, 'exceptions'):
+                for sub_exc in current.exceptions:
+                    result = self._find_exception_of_type(sub_exc, target_type)
+                    if result is not None:
+                        return result
+            current = getattr(current, '__cause__', None)
+        return None
+
     def _get_retryable_error(self, exception: Exception) -> FlashRetryableError | None:
-        """Find a retryable error in an exception (or any of its causes).
-
-        Args:
-            exception: The exception to check
-
-        Returns:
-            The FlashRetryableError if found, None otherwise
-        """
-        # Check if this is an ExceptionGroup and look through its exceptions
-        if hasattr(exception, 'exceptions'):
-            for sub_exc in exception.exceptions:
-                result = self._get_retryable_error(sub_exc)
-                if result is not None:
-                    return result
-
-        # Check the current exception
-        if isinstance(exception, FlashRetryableError):
-            return exception
-
-        # Check the cause chain
-        current = getattr(exception, '__cause__', None)
-        while current is not None:
-            if isinstance(current, FlashRetryableError):
-                return current
-            # Also check if the cause is an ExceptionGroup
-            if hasattr(current, 'exceptions'):
-                for sub_exc in current.exceptions:
-                    result = self._get_retryable_error(sub_exc)
-                    if result is not None:
-                        return result
-            current = getattr(current, '__cause__', None)
-        return None
+        """Find a retryable error in an exception (or any of its causes)."""
+        return self._find_exception_of_type(exception, FlashRetryableError)
 
     def _get_non_retryable_error(self, exception: Exception) -> FlashNonRetryableError | None:
-        """Find a non-retryable error in an exception (or any of its causes).
-
-        Args:
-            exception: The exception to check
-
-        Returns:
-            The FlashNonRetryableError if found, None otherwise
-        """
-        # Check if this is an ExceptionGroup and look through its exceptions
-        if hasattr(exception, 'exceptions'):
-            for sub_exc in exception.exceptions:
-                result = self._get_non_retryable_error(sub_exc)
-                if result is not None:
-                    return result
-
-        # Check the current exception
-        if isinstance(exception, FlashNonRetryableError):
-            return exception
-
-        # Check the cause chain
-        current = getattr(exception, '__cause__', None)
-        while current is not None:
-            if isinstance(current, FlashNonRetryableError):
-                return current
-            # Also check if the cause is an ExceptionGroup
-            if hasattr(current, 'exceptions'):
-                for sub_exc in current.exceptions:
-                    result = self._get_non_retryable_error(sub_exc)
-                    if result is not None:
-                        return result
-            current = getattr(current, '__cause__', None)
-        return None
+        """Find a non-retryable error in an exception (or any of its causes)."""
+        return self._find_exception_of_type(exception, FlashNonRetryableError)

482-482: Clarify the initialization comment.

The comment says "just in case we get the FLASH_COMPLETE or FLASH_FAILED markers soon", but these markers are emitted by the flash command's echo statements (lines 464), not by pidof. The initialization with pidof_output seems to be for continuity of the accumulated buffer rather than expecting markers.

Consider clarifying the comment:

-        accumulated_output = pidof_output # just in case we get the FLASH_COMPLETE or FLASH_FAILED markers soon
+        accumulated_output = pidof_output  # Initialize buffer with pidof output for continuity

1016-1021: Add validation for retries parameter.

The --retries option accepts any integer value, including negative numbers or excessively large values that could cause issues (e.g., waiting hours due to exponential backoff).

Consider adding validation:

         @click.option(
             "--retries",
             type=int,
             default=3,
             help="Number of retry attempts for flash operation (default: 3)",
         )
+        @click.option(
+            "--retries",
+            type=click.IntRange(min=0, max=10),
+            default=3,
+            help="Number of retry attempts for flash operation (default: 3, max: 10)",
+        )

Or add manual validation at the start of the flash command handler.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db95e3b and 299313b.

📒 Files selected for processing (1)
  • packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)
packages/jumpstarter/jumpstarter/common/exceptions.py (2)
  • ArgumentError (54-57)
  • JumpstarterException (4-39)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Redirect rules - jumpstarter-docs
  • GitHub Check: Header rules - jumpstarter-docs
  • GitHub Check: Pages changed - jumpstarter-docs
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
  • GitHub Check: pytest-matrix (macos-15, 3.13)
  • GitHub Check: pytest-matrix (macos-15, 3.12)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
  • GitHub Check: pytest-matrix (macos-15, 3.11)
  • GitHub Check: e2e
🔇 Additional comments (11)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (11)

28-37: LGTM!

The exception hierarchy is well-structured and follows the pattern established by JumpstarterException. The distinction between retryable and non-retryable errors is clear and will aid in proper error handling.


284-349: LGTM!

The centralization of flash orchestration into _perform_flash_operation is well-structured. The method properly encapsulates all retry-able operations and maintains clear separation of concerns.


494-496: Consider making "dd process not found" retryable.

If dd fails to start due to a transient issue (e.g., temporary resource exhaustion), raising FlashNonRetryableError prevents retrying what might be a recoverable situation.

Should this be FlashRetryableError instead? Or add logic to distinguish between "dd never started" (retryable) vs. "dd command not found" (non-retryable)?


538-546: LGTM!

The completion marker detection is clear and correctly classifies FLASH_FAILED as a retryable error, which is appropriate for pipeline failures.


548-556: LGTM!

The 5-second grace period for completion markers is reasonable, and classifying the absence of markers as non-retryable is appropriate since it likely indicates a configuration or script issue rather than a transient failure.


566-584: LGTM!

The progress tracking logic is clean and efficient, with appropriate 5-second intervals to avoid excessive logging while still providing useful feedback.


607-609: LGTM!

Changing to FlashRetryableError for URL access failures is the right choice, as network issues and temporary unavailability are typically transient conditions that benefit from retry.


459-467: LGTM!

The use of pipefail ensures proper error propagation through the pipeline, and splitting the echo strings (e.g., "F""LASH_COMPLETE") is a good technique to avoid premature pattern matching.


1023-1057: LGTM!

The retries parameter is properly threaded from the CLI option through to the flash() method call, maintaining consistency throughout the call chain.


135-156: LGTM!

Correctly initializing storage_thread and error_queue before the retry loop ensures that the background transfer only happens once and prevents duplicate downloads on retry attempts.


211-214: LGTM!

The timing information with formatted output (minutes:seconds) provides valuable observability for flash operations.

@mangelajo mangelajo enabled auto-merge October 8, 2025 10:42
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)

190-190: Cap the exponential backoff delay.

The unbounded exponential backoff 2 ** attempt can grow very large if users specify high retry counts (e.g., retries=10 would cause a ~8.5 minute delay before the final attempt).

Apply this diff to cap the backoff at a reasonable maximum:

-                            time.sleep(2 ** attempt)  # Exponential backoff
+                            # Exponential backoff, capped at 60 seconds
+                            time.sleep(min(2 ** attempt, 60))
🧹 Nitpick comments (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)

544-552: Consider making the completion marker timeout configurable.

The 5-second wait after dd finishes (line 548) may be insufficient on slower systems or under heavy load, potentially causing false negatives when the echo command is delayed.

Consider either:

  1. Increasing the timeout to 10-15 seconds for more margin, or
  2. Making it configurable via a manifest or constant

For example:

+# Timeout in seconds to wait for completion marker after dd finishes
+DD_COMPLETION_MARKER_TIMEOUT = 10
+
 def _handle_dd_finished(self, dd_finished_time):
     """Handle case when dd process has finished but no completion marker found yet."""
     if dd_finished_time is None:
         dd_finished_time = time.time()
-    elif time.time() - dd_finished_time > 5:  # Wait up to 5 seconds for echo
+    elif time.time() - dd_finished_time > DD_COMPLETION_MARKER_TIMEOUT:
         raise FlashNonRetryableError("Flash operation completed without success/failure marker")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 299313b and f7fe892.

📒 Files selected for processing (1)
  • packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (1)
packages/jumpstarter/jumpstarter/common/exceptions.py (2)
  • ArgumentError (54-57)
  • JumpstarterException (4-39)
🪛 GitHub Actions: Lint
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py

[error] 183-183: Ruff: F841 Local variable 'last_exception' is assigned to but never used.

🪛 GitHub Check: ruff
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py

[failure] 183-183: Ruff (F841)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py:183:25: F841 Local variable last_exception is assigned to but never used

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Redirect rules - jumpstarter-docs
  • GitHub Check: Header rules - jumpstarter-docs
  • GitHub Check: Pages changed - jumpstarter-docs
  • GitHub Check: pytest-matrix (macos-15, 3.11)
  • GitHub Check: pytest-matrix (macos-15, 3.13)
  • GitHub Check: pytest-matrix (macos-15, 3.12)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
  • GitHub Check: e2e
🔇 Additional comments (5)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py (5)

28-37: LGTM! Well-designed exception hierarchy.

The exception hierarchy cleanly separates retryable errors (network, timeout) from non-retryable errors (configuration, filesystem), which enables precise error handling in the retry logic.


98-98: LGTM! Reasonable default retry count.

The default of 3 retries (4 total attempts including the initial one) strikes a good balance between resilience and avoiding excessive delays for persistent failures.


212-278: LGTM! Thorough error inspection logic.

The recursive error inspection correctly handles both exception chains (__cause__) and ExceptionGroup hierarchies, ensuring retryable/non-retryable errors are detected even when nested or wrapped.


280-345: LGTM! Well-structured flash orchestration.

Centralizing all console operations into a single method enables clean retry semantics at the highest level, ensuring that transient console issues (network glitches, DHCP timeouts) can be recovered by retrying the entire setup sequence.


1012-1017: LGTM! Clean CLI integration.

The --retries option is properly exposed with clear help text, and the parameter is correctly forwarded from the CLI handler to the core flash method.

Also applies to: 1031-1032, 1052-1052

non_retryable_error = self._get_non_retryable_error(e)

if retryable_error is not None:
last_exception = e
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove unused variable.

The variable last_exception is assigned but never used, as flagged by static analysis and pipeline failures. The actual exception e is correctly used in the raise statement at line 196.

Apply this diff:

-                        last_exception = e
                         if attempt < retries:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
last_exception = e
🧰 Tools
🪛 GitHub Actions: Lint

[error] 183-183: Ruff: F841 Local variable 'last_exception' is assigned to but never used.

🪛 GitHub Check: ruff

[failure] 183-183: Ruff (F841)
packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py:183:25: F841 Local variable last_exception is assigned to but never used

🤖 Prompt for AI Agents
In packages/jumpstarter-driver-flashers/jumpstarter_driver_flashers/client.py
around line 183, remove the unused assignment "last_exception = e" since the
variable is never referenced elsewhere and the raised exception uses "e"
directly; simply delete that line to satisfy static analysis and pipeline
checks.

@mangelajo mangelajo merged commit 6373b76 into jumpstarter-dev:main Oct 8, 2025
18 checks passed
@jumpstarter-backport-bot
Copy link
Copy Markdown

Backport failed for release-0.7, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release-0.7
git worktree add -d .worktree/backport-696-to-release-0.7 origin/release-0.7
cd .worktree/backport-696-to-release-0.7
git switch --create backport-696-to-release-0.7
git cherry-pick -x 2b30889b9b7513b12c2ae69dcb7313298b928b07 31a1d346d9ec355a580528160a2612dbe43850f2

@jumpstarter-backport-bot
Copy link
Copy Markdown

Backport failed for release-0.7, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release-0.7
git worktree add -d .worktree/backport-696-to-release-0.7 origin/release-0.7
cd .worktree/backport-696-to-release-0.7
git switch --create backport-696-to-release-0.7
git cherry-pick -x 2b30889b9b7513b12c2ae69dcb7313298b928b07 31a1d346d9ec355a580528160a2612dbe43850f2

@mangelajo
Copy link
Copy Markdown
Member Author

ok, we need this backported first https://github.com/jumpstarter-dev/jumpstarter/pull/697 for this backport to apply

@jumpstarter-backport-bot
Copy link
Copy Markdown

Successfully created backport PR for release-0.7:

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants