-
Notifications
You must be signed in to change notification settings - Fork 735
Fix SIGTERM forwarding in AWS Batch jobs #6414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: jorgee <[email protected]>
✅ Deploy Preview for nextflow-docs-staging canceled.
|
Has this change been validated? |
I manually tested when I was doing the benchmarks using https://github.com/bentsherman/nf-head-job-benchmark pipeline that executes nextflow inside processes. I didn't find an easy way to add a unit or integration test to validate it, as we should wait until a job is running cancelling and checking the sigterm has been received. I will try again when addressing the changes proposed by cursor. |
Signed-off-by: jorgee <[email protected]>
I've tried this with rnaseq-nf and it's working file. However I didn't validated the trap on failure / termination. I'd suggest to add an |
I have added the echo and tested manually the execution of nf-sleep. The echo message has appeared in a job logs and .command.log. |
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Ok, for clarity I was adding |
Ah , Ok, I thought it was good to keep the echo to log the TERM forwarding to help on debugging future problems. But, I can remove if you think it is not required. |
Signed-off-by: jorgee <[email protected]>
echo removed from trap |
Adds comprehensive comments to s3Cmd and s5Cmd methods explaining the rationale and implementation details of the enhanced signal handling mechanism introduced to fix nested Nextflow execution issues in AWS Batch. The documentation covers: - TERM signal trap for proper signal forwarding - EXIT signal trap for cleanup and log uploading - Background execution pattern for signal-responsive task execution - References to PR #6414 for historical context 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> Signed-off-by: Paolo Di Tommaso <[email protected]>
This is the kind of comments we should aim for fcec7dc 😎 |
Merging this. @jorgee have a look also to Azure and Google Batch. there could be a similar problem |
I think it is not happening in Azure but a similar script than AWS is in google cloud. I will check with the same application |
When executing tasks with AWS Batch, the SIGTERM is not forwarded to the .command.run because another trap is defined in the container command script. It disables the traps in the nested scripts, so when running a task with Nextflow executions inside (like in https://github.com/bentsherman/nf-head-job-benchmark). The Nextflow process inside the task was abruptly killed, leaving zombie jobs.
This PR includes the management of the SIGTERM trap at the container command level to avoid zombie jobs.
1- Captures the pid of the .command.run
2- Adds a trap to forward SIGTERM to the captured pid. The Nextflow process is aborted and cancels the submitted jobs.
Note
Enhances AWS Batch launch command to forward SIGTERM to the task process and run bash in background with real-time tee logging; updates tests accordingly.
AwsBatchExecutor.s3Cmd
/s5Cmd
):trap "[[ -n \$pid ]] && kill -TERM \$pid" TERM
.bash
, capturepid
, andwait
to return correct status.> >(tee .command.log) 2>&1
while preserving EXIT trap S3 upload.plugins/nf-amazon/src/test/nextflow/cloud/aws/batch/AwsBatchTaskHandlerTest.groovy
andplugins/nf-amazon/src/test/nextflow/executor/AwsBatchExecutorTest.groovy
to reflect new command strings.Written by Cursor Bugbot for commit fcec7dc. This will update automatically on new commits. Configure here.