Skip to content

Conversation

jorgee
Copy link
Contributor

@jorgee jorgee commented Sep 18, 2025

When executing tasks with AWS Batch, the SIGTERM is not forwarded to the .command.run because another trap is defined in the container command script. It disables the traps in the nested scripts, so when running a task with Nextflow executions inside (like in https://github.com/bentsherman/nf-head-job-benchmark). The Nextflow process inside the task was abruptly killed, leaving zombie jobs.

This PR includes the management of the SIGTERM trap at the container command level to avoid zombie jobs.
1- Captures the pid of the .command.run
2- Adds a trap to forward SIGTERM to the captured pid. The Nextflow process is aborted and cancels the submitted jobs.


Note

Enhances AWS Batch launch command to forward SIGTERM to the task process and run bash in background with real-time tee logging; updates tests accordingly.

  • AWS Batch launcher (AwsBatchExecutor.s3Cmd/s5Cmd):
    • Add TERM trap to forward signals: trap "[[ -n \$pid ]] && kill -TERM \$pid" TERM.
    • Execute task via background bash, capture pid, and wait to return correct status.
    • Switch to process-substitution logging: > >(tee .command.log) 2>&1 while preserving EXIT trap S3 upload.
  • Tests:
    • Update expectations in plugins/nf-amazon/src/test/nextflow/cloud/aws/batch/AwsBatchTaskHandlerTest.groovy and plugins/nf-amazon/src/test/nextflow/executor/AwsBatchExecutorTest.groovy to reflect new command strings.

Written by Cursor Bugbot for commit fcec7dc. This will update automatically on new commits. Configure here.

Copy link

netlify bot commented Sep 18, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit fcec7dc
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/68dba58bff45160008ac9140

@jorgee jorgee changed the title Fix signal forwarding in task managed with AWS Batch Fix SIGTERM forwarding in AWS Batch jobs Sep 18, 2025
cursor[bot]

This comment was marked as outdated.

@pditommaso
Copy link
Member

Has this change been validated?

@jorgee
Copy link
Contributor Author

jorgee commented Sep 19, 2025

I manually tested when I was doing the benchmarks using https://github.com/bentsherman/nf-head-job-benchmark pipeline that executes nextflow inside processes. I didn't find an easy way to add a unit or integration test to validate it, as we should wait until a job is running cancelling and checking the sigterm has been received. I will try again when addressing the changes proposed by cursor.

Signed-off-by: jorgee <[email protected]>
@pditommaso
Copy link
Member

I've tried this with rnaseq-nf and it's working file. However I didn't validated the trap on failure / termination.

I'd suggest to add an echo in the trap and validate locally using a long running single job pipeline, and killing it manually

@jorgee
Copy link
Contributor Author

jorgee commented Sep 23, 2025

I have added the echo and tested manually the execution of nf-sleep. The echo message has appeared in a job logs and .command.log.

cursor[bot]

This comment was marked as outdated.

@pditommaso
Copy link
Member

Ok, for clarity I was adding echo only for local debugging. Not to include in the final implementation

@jorgee
Copy link
Contributor Author

jorgee commented Sep 24, 2025

Ah , Ok, I thought it was good to keep the echo to log the TERM forwarding to help on debugging future problems. But, I can remove if you think it is not required.

Signed-off-by: jorgee <[email protected]>
@jorgee
Copy link
Contributor Author

jorgee commented Sep 25, 2025

echo removed from trap

Adds comprehensive comments to s3Cmd and s5Cmd methods explaining the
rationale and implementation details of the enhanced signal handling
mechanism introduced to fix nested Nextflow execution issues in AWS Batch.

The documentation covers:
- TERM signal trap for proper signal forwarding
- EXIT signal trap for cleanup and log uploading
- Background execution pattern for signal-responsive task execution
- References to PR #6414 for historical context

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso
Copy link
Member

This is the kind of comments we should aim for fcec7dc 😎

@pditommaso
Copy link
Member

Merging this. @jorgee have a look also to Azure and Google Batch. there could be a similar problem

@pditommaso pditommaso merged commit abbef79 into master Sep 30, 2025
11 checks passed
@pditommaso pditommaso deleted the fix-nested-nextflow-execution-aws branch September 30, 2025 09:59
@jorgee
Copy link
Contributor Author

jorgee commented Sep 30, 2025

I think it is not happening in Azure but a similar script than AWS is in google cloud. I will check with the same application

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants