feat: Azure Batch eagerly terminates jobs after all tasks have been submitted #6159
Conversation
Integration tests failing, looks unrelated.
Related issue on Slack: https://nfcore.slack.com/archives/C02T98A23U7/p1753954588096009
Hi! If we don't care about resuming a workflow, then since task outputs are stored outside of the workdir, a simple flag to delete the job once the last task has finished its lifecycle would be all we need. The downside is that to "resume" a workflow, we must find a way to tell Nextflow that task outputs already exist and are stored at location X, as well as supply them. Which means the easiest solution would be to relaunch the whole workflow should an error arise somewhere during processing.
@ghislaindemael I'm not sure I understand this: task outputs have to be in the working directory. Even if they're published to a new location, they are copied out by Nextflow itself after the task has completed.
My error; I meant that since we publish the results outside of the workdir (e.g. in Blob Storage), we can query them from there and thus delete them from the Batch VMs to remove the load and free up jobs for the quota.
To clarify the flow here:
This PR strictly refers to steps 1 and 9 and does not interact with any files. If you are having issues with file storage, running out of space, etc., that would be a different issue.
@adamrtalbot we are also looking for a solution for this. We have executions that sometimes have hundreds of jobs, so even in proper executions without errors we are limited to 2 or 3 parallel executions per Batch account. In our case, at any given moment, we would have something like:
Run1:
Run2:
So, from Azure's perspective, we have 200+2+57+4 = 263 jobs ongoing. As the runs progress, we have more and more jobs open and we reach the limit very quickly. We are seeing if we can modify / extend the …
Not resume, but retry with an errorStrategy: https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy
Here is the flow that may cause issues: a task fails, the retry errorStrategy resubmits it, but the Azure Batch job it belonged to has already been terminated, so the resubmission cannot go ahead.
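For reference, a minimal sketch of the retry setting mentioned above; errorStrategy and maxRetries are the standard Nextflow directives from the linked docs, while the process name and script are placeholders.

```groovy
// Sketch: retry a failed task instead of resuming the whole run.
// If the Azure Batch job backing this process has already been terminated,
// it is this resubmission step that runs into trouble.
process EXAMPLE_TASK {          // placeholder process name
    errorStrategy 'retry'       // resubmit the task when it fails
    maxRetries 3                // give up after three attempts

    script:
    """
    echo 'hello'
    """
}
```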
So using a cron job that validates when all tasks have completed successfully would work, right? In this case, the tasks have already completed. Or can we add it to the bit that runs and "decides" that all the tasks have been completed?
You might have the same issue, in that you terminate a job before you can resubmit a task.
Note none of this will help you if you just have too many active jobs. A job needs to be active to run a task, so if you simply have a lot of work to do this won't help. Really, the issue is with the terrible design by Azure, but given they just fired most of their genomics staff I doubt they will bother to help 🤷.
I have too many active jobs because Nextflow does not close them, not because they are actually active. Any job that has been marked by Nextflow with a ✔ is, in my opinion, finished and will never be re-used for anything again; however, Nextflow does not close it until the full execution of the run is finished. This is the behaviour that I think is incorrect. If the run takes 2 days, the job of the first task, which finished 47 hours ago, is still marked as "active" in Batch because Nextflow does not close it, even though it will never be used again. I think it is Nextflow that is not using the Batch account properly.
Right - so with especially long-running pipelines you have many jobs in the active state which do not do anything. Unfortunately, there isn't a way for Nextflow to know the future and determine whether another task will be submitted to the job, which makes it tricky to know when to close a job. Here's an alternative implementation (which has the added benefit of making @pditommaso happy because it won't use the trace observer!):
This should eagerly terminate jobs while still allowing users to submit all tasks as normal.
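A minimal sketch of the eager idea, assuming the job can simply be created so that the Batch service terminates it once all of its tasks complete; JobApi is a stand-in for the real Azure SDK client, not the plugin's actual code.

```groovy
// Hedged sketch: create the Batch job so the service terminates it automatically
// as soon as every task in it has completed.
interface JobApi {
    String createJob(String poolId, String onAllTasksComplete)   // stand-in for the SDK call
}

String createEagerJob(JobApi api, String poolId) {
    // 'TERMINATE_JOB' mirrors the OnAllBatchTasksComplete.TERMINATE_JOB setting
    // referenced later in this thread; the job frees its active-job quota slot
    // the moment its last task finishes, without waiting for pipeline shutdown.
    api.createJob(poolId, 'TERMINATE_JOB')
}
```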
Then how does it "decide" to add the ✔?
The check is shown when no more tasks for that process need to be executed, i.e. the process execution is complete.
Excellent, can we use that logic to terminate the Azure job?
It could be done with a TraceObserver(V2). If I'm not wrong you already made a PR for that.
I have, it's this one but I've modified the behaviour now. @luanjot I've updated the behaviour so now:
The advantage here is:
…ubmitted

Azure Batch "job leak" is still an issue. This commit fixes #5839 by allowing Nextflow to set jobs to auto-terminate when all tasks have been submitted. This means that jobs will eventually move into the terminated state even if something prevents Nextflow from reaching a graceful shutdown. Very early implementation that needs some refinement.

Signed-off-by: adamrtalbot <[email protected]>
…sOnCompletion; refactor task submission; reduce logging

- Remove Azure Batch process observer:
  - Delete AzBatchProcessObserver, its factory, and tests
  - Remove registration from META-INF/extensions.idx
- Reuse `azure.batch.terminateJobsOnCompletion` to set Azure Batch job `OnAllTasksComplete` (eager auto-termination)
- Refactor task submission flow in AzBatchService:
  - Extract helpers: `submitTaskToJob`, `recreateJobForTask`, `setAutoTerminateIfEnabled`
  - Handle 409 conflicts by recreating the job and retrying submission

Signed-off-by: adamrtalbot <[email protected]>
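A hedged sketch of the submission flow this commit describes: the name submitTaskToJob comes from the commit message, but the body, the stand-in API interface, and the conflict detection are assumptions rather than the plugin's real code.

```groovy
// Sketch only: submit a task; if the target job has already auto-terminated
// (an HTTP 409 conflict), recreate the job once and retry the submission.
class ConflictException extends RuntimeException { }   // stands in for a 409 response

interface BatchApi {
    void addTask(String jobId, String taskSpec)
    String createJob(String poolId)                     // new job reuses the same settings
}

String submitTaskToJob(BatchApi api, String jobId, String poolId, String taskSpec) {
    try {
        api.addTask(jobId, taskSpec)
        return jobId
    }
    catch (ConflictException ignored) {
        // mirrors recreateJobForTask from the commit: the old job is gone, so make a new one
        String newJobId = api.createJob(poolId)
        api.addTask(newJobId, taskSpec)
        return newJobId
    }
}
```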
@luanjot and @ghislaindemael, I would appreciate it if you could give this development branch some extra testing, provided you can do a local build of Nextflow. I've run it through a bunch of pipelines but more UAT is always helpful!
It looks neat but I'm not getting the general logic. Can you provide some hints?
…ogic

- Add class-level documentation explaining the three-layer auto-termination strategy
- Document the eager auto-termination approach in setAutoTerminateIfEnabled()
- Explain 409 conflict resolution in submitTaskToJob()
- Document job recreation logic in recreateJobForTask()
- Add comments explaining the relationship between the eager and graceful shutdown approaches
- Add test coverage for auto-termination scenarios while avoiding unmockable Azure SDK classes
- Keep comments concise but comprehensive to improve code maintainability

Addresses Paolo Di Tommaso's feedback in PR #6159 requesting an explanation of the Azure Batch job auto-termination logic to prevent job leak and improve quota cleanup.

Signed-off-by: adamrtalbot <[email protected]>
Added some comments in df102da, referring to #6159 (comment).
…rvice.groovy [ci skip]

Signed-off-by: Paolo Di Tommaso <[email protected]>
@adamrtalbot I've updated the PR description for you 😎
…rvice.groovy [ci skip]

Co-authored-by: Adam Talbot <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
One thing that I'm not getting: if the job is created with the OnAllBatchTasksComplete.TERMINATE_JOB flag at creation, why is it still necessary to update the job with the same flag on completion?
Good point - it's old code from the previous approach hanging around. On one hand, it's a good idea to go around afterwards and definitely mark the jobs as complete, because doing so has zero risk and skipping it can cause many issues. On the other hand, it's pointlessly doing more API calls. This is the only situation I can think of where a job wouldn't be set to auto-terminate:
In which case, leaving it in will help.
From my point of view it would be better to keep only the cleanup on job creation (why should it fail?). The update on terminate does not scale well with a large number of tasks.
…, which may error

Signed-off-by: adamrtalbot <[email protected]>
Remove redundant job termination logic that was executed during pipeline shutdown. Jobs are now configured for auto-termination at creation time, making the shutdown termination unnecessary.

Changes:
- Remove terminateJobs() method and its call from close()
- Remove setAutoTerminateIfEnabled() method
- Remove redundant auto-termination calls after task submission
- Jobs now terminate automatically when all tasks complete

This simplifies the codebase and prevents unnecessary API calls while ensuring jobs don't consume quota after completion.

Signed-off-by: adamrtalbot <[email protected]>
Summary
Fixes the Azure Batch "job leak" issue where jobs remain in the Active state even after task completion, causing quota exhaustion and preventing multiple pipelines from running simultaneously.
Problem: Jobs consume quota slots unnecessarily, blocking other workflows
Solution: Leverage Azure Batch's native auto-termination to release quota immediately when tasks complete
How Azure Batch Eager Job Termination Works
Problem Addressed
Azure Batch has a limitation where jobs remain in an "Active" state even after all their tasks complete. This causes:
- Active-job quota exhaustion on the Batch account
- A hard limit on how many pipelines can run simultaneously against the same account
Solution Implementation
Job Auto-Termination Configuration
Default behavior: terminateJobsOnCompletion = true (enabled by default).
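A minimal sketch of the corresponding Nextflow configuration; azure.batch.terminateJobsOnCompletion is the option named in this PR, and the rest of the config file is omitted.

```groovy
// nextflow.config (sketch): let Azure Batch auto-terminate jobs once their tasks finish.
azure {
    batch {
        terminateJobsOnCompletion = true   // default per this PR; jobs release quota eagerly
    }
}
```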
Job Termination Mechanism
The service implements a two-phase termination approach, sketched below:
Phase 1: Set Jobs to Auto-Terminate
Phase 2: Cleanup on Workflow Completion
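A hedged sketch of the two phases as described in this section; the method names and stand-in interface are illustrative assumptions, not the plugin's actual code.

```groovy
// Sketch of the two-phase approach described above (all names are stand-ins).
interface BatchJobs {
    Collection<String> allJobIds()
    void setAutoTerminate(String jobId)    // apply OnAllTasksComplete = TERMINATE_JOB
}

// Phase 1: as tasks are submitted, each job is marked to auto-terminate,
// so the Batch service closes it the moment its last task completes.
void onTaskSubmitted(BatchJobs jobs, String jobId) {
    jobs.setAutoTerminate(jobId)
}

// Phase 2: at workflow completion, sweep the jobs created by the run and
// re-apply the setting, catching any job that missed phase 1.
void onWorkflowComplete(BatchJobs jobs) {
    jobs.allJobIds().each { jobs.setAutoTerminate(it) }
}
```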
Azure Batch Native Feature Integration
How Auto-Termination Works
Relies on Azure Batch's native OnAllBatchTasksComplete.TERMINATE_JOB setting
Eager Termination Flow
Jobs are configured with OnAllTasksComplete = TERMINATE_JOB, so the Batch service moves each job to the terminated state as soon as its last task completes.
Key Benefits
Resource Management
Operational Improvements
Configuration Options
Users can control the behavior:
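For example, a user who prefers the previous behaviour could switch the option off; this sketch uses the same azure.batch.terminateJobsOnCompletion option named above.

```groovy
// nextflow.config (sketch): opt out of eager job termination.
azure {
    batch {
        terminateJobsOnCompletion = false   // do not auto-terminate jobs when their tasks complete
    }
}
```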
Technical Implementation Details
Job Lifecycle Management
Uses the updateJob API to apply the termination setting
Error Handling
Impact
This implementation provides an elegant solution to Azure Batch's job quota problem by leveraging Azure's native auto-termination feature. It ensures that jobs automatically terminate when their tasks complete, preventing quota exhaustion while maintaining full compatibility with existing workflows.
Related