@munishchouhan
Collaborator

@munishchouhan munishchouhan commented Nov 24, 2025

Summary

This PR adds tracking and reporting of spot/preemptible instance interruptions for cloud batch executors (AWS Batch and Google Batch). When tasks are retried due to spot instance interruptions, the number of interruptions is now captured and exposed via the numSpotInterruptions field in trace records.

Motivation

Spot/preemptible instances can be reclaimed by cloud providers at any time, causing tasks to retry on new instances. Understanding how often this happens is important for:

  • Workflow optimization and cost analysis
  • Identifying tasks that frequently experience spot interruptions
  • Monitoring the reliability of spot instance usage
  • Debugging workflow issues related to instance interruptions

Changes

Core Framework

  • TraceRecord (modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy)
    • Added numSpotInterruptions transient field with getter/setter methods
    • Field is accessible in trace records and can be consumed by trace observers

AWS Batch Plugin (nf-amazon)

  • AwsBatchTaskHandler.groovy

    • Added getNumSpotInterruptions(String jobId) method that examines job attempts for spot interruption patterns
    • Detects AWS Batch spot interruptions by checking if statusReason starts with "Host EC2"
    • Returns count of spot interruptions or null if unavailable
    • Updates getTraceRecord() to populate numSpotInterruptions field
  • Tests (AwsBatchTaskHandlerTest.groovy)

    • Added comprehensive test coverage for getNumSpotInterruptions() with various scenarios:
      • No interruptions (0 attempts, empty attempts)
      • Single interruption
      • Multiple interruptions
      • Mixed with non-spot failures
    • Added test verifying trace record integration

Google Batch Plugin (nf-google)

  • GoogleBatchTaskHandler.groovy

    • Added getNumSpotInterruptions(String jobId) method that examines task status events
    • Detects Google Batch spot preemptions by checking for exit code 50001 in status events
    • Returns count of spot preemptions or null if unavailable
    • Updates getTraceRecord() to populate numSpotInterruptions field
    • Implements a maxSpotAttempts() helper that uses the FusionConfig defaults when Fusion snapshots are enabled
  • Tests (GoogleBatchTaskHandlerTest.groovy)

    • Added parameterized test for getNumSpotInterruptions() covering multiple scenarios
    • Added test verifying trace record integration
    • Verified count correctly extracted from status events

Technical Details

Detection Mechanisms

AWS Batch:

  • Examines JobDetail.attempts() list
  • Identifies spot reclamations by checking if attempt.statusReason() starts with "Host EC2"
  • Example pattern: "Host EC2 (instance i-xxx) terminated."
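
For illustration only, the prefix check described above can be sketched in plain Java. This is not the handler code from the PR: `Attempt` is a hypothetical stand-in for the AWS SDK attempt detail, `AwsSpotInterruptions` is a made-up name, and returning null for a missing attempts list is an assumption about what "unavailable" means.

```java
import java.util.List;

// Hypothetical stand-in for the AWS SDK attempt detail; only the status reason is modeled.
record Attempt(String statusReason) {}

final class AwsSpotInterruptions {
    // Prefix quoted in the PR description as the marker of a reclaimed host.
    static final String SPOT_PREFIX = "Host EC2";

    // Assumption: a missing attempts list means "unknown" (null), an empty list means zero.
    static Integer count(List<Attempt> attempts) {
        if (attempts == null) {
            return null;
        }
        int n = 0;
        for (Attempt a : attempts) {
            String reason = a.statusReason();
            if (reason != null && reason.startsWith(SPOT_PREFIX)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = List.of(
            new Attempt("Host EC2 (instance i-xxx) terminated."),
            new Attempt("Essential container in task exited"),
            new Attempt("Host EC2 (instance i-xxx) terminated."));
        System.out.println(count(attempts)); // prints 2
    }
}
```

Attempts that failed for other reasons are simply not counted, which matches the "mixed with non-spot failures" test scenario listed earlier.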

Google Batch:

  • Examines TaskStatus.statusEventsList()
  • Identifies spot preemptions by checking for exitCode == 50001 in task execution events
  • Exit code 50001 is Google Batch's special code for spot preemption
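
As a rough sketch of the exit-code check (again not the code from this PR): `StatusEvent` below is a hypothetical record standing in for the Google Batch status event, with a nullable `exitCode` for events that carry no task execution, and `GoogleSpotPreemptions` is an invented name. Treating a missing event list as null is an assumption.

```java
import java.util.List;

// Hypothetical stand-in for a task status event; exitCode is null when the
// event does not describe a task execution.
record StatusEvent(String description, Integer exitCode) {}

final class GoogleSpotPreemptions {
    // Exit code named in the PR description as the spot preemption marker.
    static final int SPOT_PREEMPTION_EXIT_CODE = 50001;

    // Assumption: a missing event list means "unknown" (null).
    static Integer count(List<StatusEvent> events) {
        if (events == null) {
            return null;
        }
        return (int) events.stream()
            .filter(e -> e.exitCode() != null && e.exitCode() == SPOT_PREEMPTION_EXIT_CODE)
            .count();
    }

    public static void main(String[] args) {
        List<StatusEvent> events = List.of(
            new StatusEvent("state change", null),
            new StatusEvent("task preempted", 50001),
            new StatusEvent("task finished", 0));
        System.out.println(count(events)); // prints 1
    }
}
```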

Implementation Approach

The numSpotInterruptions field is:

  1. Stored in TraceRecord as a transient field (not serialized to .command.trace files)
  2. Computed on-demand from cloud provider APIs when getTraceRecord() is called
  3. Available to trace observers for reporting and metrics collection
  4. Returns null if the count cannot be determined (e.g., job not found, API error)
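
A minimal sketch of points 2 and 4, assuming the provider query is only attempted for completed tasks and that any failure maps to null. `SpotCountLookup` and `safeCount` are hypothetical names, and the `Supplier` stands in for the real cloud API call.

```java
import java.util.function.Supplier;

final class SpotCountLookup {
    // Runs the (hypothetical) provider query only for completed tasks and
    // maps any runtime failure to null, meaning "count unknown".
    static Integer safeCount(boolean completed, Supplier<Integer> providerQuery) {
        if (!completed) {
            return null;
        }
        try {
            return providerQuery.get();
        } catch (RuntimeException e) {
            // A real handler would log the failure at debug level before returning.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(safeCount(true, () -> 2));   // prints 2
        System.out.println(safeCount(false, () -> 2));  // prints null
        System.out.println(safeCount(true, () -> {
            throw new IllegalStateException("job not found");
        }));                                            // prints null
    }
}
```

Keeping null distinct from 0 lets a consumer tell "no interruptions" apart from "could not be determined".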

This approach queries the cloud provider's job/task status to detect spot interruptions based on provider-specific indicators:

  • AWS Batch: Status reasons starting with "Host EC2"
  • Google Batch: Status events with exit code 50001

The field will be available to trace observers that consume TraceRecord objects, allowing workflows to track and report spot interruption rates.
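
To make the reporting use case concrete, here is one way a consumer might aggregate the per-task counts. This is a sketch rather than part of the PR: `SpotInterruptionStats` is a hypothetical helper, and skipping null counts (tasks whose count is unknown) is an assumption about how a report would treat them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class SpotInterruptionStats {
    // Sum of interruptions over tasks with a known count; null entries are skipped.
    static int total(List<Integer> counts) {
        return counts.stream()
            .filter(Objects::nonNull)
            .mapToInt(Integer::intValue)
            .sum();
    }

    // Fraction of tasks with a known count that were interrupted at least once.
    // Returns NaN when no task has a known count.
    static double interruptedFraction(List<Integer> counts) {
        long known = counts.stream().filter(Objects::nonNull).count();
        if (known == 0) {
            return Double.NaN;
        }
        long interrupted = counts.stream().filter(c -> c != null && c > 0).count();
        return (double) interrupted / known;
    }

    public static void main(String[] args) {
        // Arrays.asList is used because List.of rejects null elements.
        List<Integer> counts = Arrays.asList(0, 2, null, 1);
        System.out.println(total(counts));               // prints 3
        System.out.println(interruptedFraction(counts)); // 2 of 3 known tasks
    }
}
```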

Testing

  • ✅ All existing tests pass
  • ✅ New unit tests for spot reclamation counting logic
  • ✅ Integration tests for trace record generation
  • ✅ Verified trace file format compatibility

@netlify

netlify bot commented Nov 24, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 90fb949
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/69414be29580ee000858917f

@munishchouhan munishchouhan requested review from stefanoboriero and removed request for fntlnz, jordeu and stefanoboriero November 27, 2025 14:45
@pditommaso pditommaso marked this pull request as draft November 28, 2025 09:22
@munishchouhan munishchouhan marked this pull request as ready for review December 11, 2025 21:21
@pditommaso
Member

@claude review please

@claude

claude bot commented Dec 15, 2025

Claude finished @pditommaso's task

Reviewing PR: Added number of reclamations in TraceRecord

Review Tasks:

  • Review core TraceRecord changes
  • Examine AWS Batch plugin implementation
  • Examine Google Batch plugin implementation
  • Review Tower plugin integration
  • Check test coverage and quality
  • Assess overall design and implementation
  • Provide comprehensive feedback

🔍 Code Review Summary

This PR successfully implements spot instance interruption tracking for AWS Batch and Google Batch. The implementation is well-designed and follows established patterns in the codebase. Here's my detailed analysis:

Strengths

  1. Clean Architecture: The implementation properly uses transient fields in TraceRecord rather than adding to the core FIELDS map, following the established pattern for machine metadata.

  2. Robust Error Handling: Both AWS and Google implementations include proper exception handling with debug logging when spot interruption counting fails.

  3. Comprehensive Test Coverage: Both plugins have extensive unit tests covering edge cases like null inputs, exceptions, and various interruption scenarios.

  4. Consistent Naming: The field was renamed from num_reclamations to numSpotInterruptions, which is more descriptive and technically accurate.

  5. Proper Detection Logic:

    • AWS: Correctly identifies spot interruptions using "Host EC2" prefix in statusReason
    • Google: Accurately detects preemptions using exit code 50001

🔧 Technical Implementation Review

Core TraceRecord Changes (modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy:124)

transient private Integer numSpotInterruptions

  • ✅ Correctly uses Integer (nullable) rather than int, so that "no data" (null) can be distinguished from "zero interruptions" (0)
  • ✅ Proper transient field following machine info pattern
  • ✅ Includes appropriate getter/setter methods
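
For readers less familiar with the `transient` modifier, the following sketch shows its effect under standard Java serialization. It assumes nothing about how TraceRecord is actually persisted; `TraceSketch` and `TransientDemo` are hypothetical classes written for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical, heavily simplified stand-in for a trace record.
final class TraceSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    String name;
    // Skipped by Java serialization, so it comes back as null after a round trip.
    transient Integer numSpotInterruptions;

    TraceSketch(String name, Integer numSpotInterruptions) {
        this.name = name;
        this.numSpotInterruptions = numSpotInterruptions;
    }
}

final class TransientDemo {
    // Serializes the record to a byte array and reads it back.
    static TraceSketch roundTrip(TraceSketch in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream oin =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TraceSketch) oin.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TraceSketch copy = roundTrip(new TraceSketch("task-1", 3));
        System.out.println(copy.name);                 // prints task-1
        System.out.println(copy.numSpotInterruptions); // prints null
    }
}
```

The in-memory object keeps its value, so anything that reads the live record still sees the count; only the serialized copy loses it.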

AWS Batch Implementation (plugins/nf-amazon/src/main/nextflow/cloud/aws/batch/AwsBatchTaskHandler.groovy:927)

protected Integer getNumSpotInterruptions(String jobId) {
    if (jobId && isCompleted()) {
        // Examines job.attempts() for "Host EC2" status reasons
    }
}

  • ✅ Isolated into separate method as requested in review feedback
  • ✅ Proper null checks and completion validation
  • ✅ Accurate detection pattern for AWS Batch spot interruptions

Google Batch Implementation (plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy:681)

protected Integer getNumSpotInterruptions(String jobId) {
    if (jobId && taskId && isCompleted()) {
        // Examines statusEventsList for exit code 50001
    }
}

  • ✅ Correctly checks for Google Batch's specific preemption exit code (50001)
  • ✅ Proper validation of required parameters (jobId, taskId)
  • ✅ Follows same pattern as AWS implementation

Tower Integration (plugins/nf-tower/src/main/io/seqera/tower/plugin/TowerClient.groovy:663)

record.numSpotInterruptions = trace.getNumSpotInterruptions()

  • ✅ Simple, clean integration following existing pattern for transient fields

🧪 Test Quality Assessment

Both AWS and Google plugins have excellent test coverage:

  • Edge Cases: Tests handle null inputs, exceptions, incomplete tasks
  • Detection Logic: Tests verify accurate counting with various scenarios
  • Integration: Tests confirm proper TraceRecord integration

⚠️ Minor Observations

  1. Documentation: The PR description mentions trace file persistence for Google Batch, but I notice the current implementation only sets the field on the TraceRecord object. This seems intentional based on reviewer feedback.

  2. Field Availability: As noted in review comments, this field is primarily intended for Platform/Tower reporting rather than general trace file output.

  3. Performance: Both implementations only execute when isCompleted() returns true, minimizing performance impact during task execution.

🎯 Final Assessment

APPROVED

This is a well-implemented feature that:

  • Addresses a real operational need for spot instance monitoring
  • Follows established codebase patterns and conventions
  • Includes comprehensive error handling and testing
  • Has been refined through multiple review cycles
  • Properly integrates with the Tower platform for metrics reporting

The implementation demonstrates good engineering practices and should provide valuable operational insights for users running workloads on spot instances.


@munishchouhan
Collaborator Author

updated pr summary

@munishchouhan munishchouhan changed the title from "Added number of reclamations in TraceRecord" to "Added number of spot interruptions in TraceRecord for aws and google batch" Dec 15, 2025
@munishchouhan munishchouhan changed the title from "Added number of spot interruptions in TraceRecord for aws and google batch" to "Feat: Added number of spot interruptions in TraceRecord for aws and google batch" Dec 15, 2025
@munishchouhan munishchouhan changed the title from "Feat: Added number of spot interruptions in TraceRecord for aws and google batch" to "feat: Added number of spot interruptions in TraceRecord for aws and google batch" Dec 15, 2025
munishchouhan and others added 2 commits December 16, 2025 12:39
- Use guard clauses in AWS Batch handler for cleaner flow
- Add clarifying comment in Google Batch handler

Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso pditommaso changed the title from "feat: Added number of spot interruptions in TraceRecord for aws and google batch" to "feat: Added number of spot interruptions to Tower/Platform telemetry" Dec 16, 2025
@pditommaso pditommaso merged commit eecd816 into master Dec 16, 2025
14 checks passed
@pditommaso pditommaso deleted the add-num-reclamations-trace branch December 16, 2025 12:41
6 participants