Enhance TaskArrayCollector to support configurable timeout-based array submission #6647
Draft: johnoooh wants to merge 15 commits into nextflow-io:master from johnoooh:feature/slurm_array_timer
+75 −9
Conversation
Add defensive copy of task array before passing to createTaskArray() to prevent concurrent modification during iteration in prepareLauncher(). The race condition occurred when:
1. The timer thread flushes the array and calls submitAndReset()
2. createTaskArray() iterates over the tasks outside the lock
3. prepareLauncher() modifies the collection during the iteration
This fix ensures createTaskArray() works on its own copy of the list.
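For illustration, here is a minimal sketch of the defensive-copy pattern the commit describes. The method names follow the commit message; the task type, lock field, and helper body are simplified placeholders, not the actual TaskArrayCollector source.

```groovy
import java.util.concurrent.locks.ReentrantLock

class TaskArrayCollectorSketch {
    private final ReentrantLock sync = new ReentrantLock()
    private List<Object> array = []

    void submitAndReset() {
        List<Object> snapshot
        sync.lock()
        try {
            // copy inside the lock so a concurrent prepareLauncher()
            // cannot mutate the list while it is iterated later
            snapshot = new ArrayList<Object>(array)
            array.clear()
        }
        finally {
            sync.unlock()
        }
        // createTaskArray() now iterates only over its own copy,
        // outside the lock, so the timer thread and the submitting
        // thread no longer race on the same list
        createTaskArray(snapshot)
    }

    private void createTaskArray(List<Object> tasks) {
        tasks.each { task -> /* build the array job spec */ }
    }
}
```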
Signed-off-by: Ben Sherman <[email protected]>
Member
@johnoooh I went ahead and cleaned up the PR to get it ready for merging. I also pushed a temporary fix that may resolve the error you are getting; let me know if it works.
Currently, TaskArrayCollector submits a job array only when the array reaches its configured size. In pipelines with slow task generation or variable workloads, small batches may remain in memory indefinitely if the array never fills, delaying execution. This is especially a problem in pipelines with never-closing channels, such as those created by watchPath (a minimal example is sketched below). We have a few pipelines that run forever, watching a samplesheet for new entries and processing them as they come in. We would like to use array jobs to reduce strain on the Slurm scheduler, since scheduler load has been a problem on our cluster.
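For context, here is a minimal example of the kind of never-ending pipeline described. Channel.watchPath and the array directive are real Nextflow features; the process name and file glob are made up.

```groovy
process PROCESS_ENTRY {
    array 10            // group tasks into Slurm job arrays of 10

    input:
    path sample

    script:
    """
    echo "processing ${sample}"
    """
}

workflow {
    // watchPath keeps emitting as new files appear, so the channel
    // never closes and a partially filled task array never flushes
    Channel.watchPath('samples/*.csv', 'create') | PROCESS_ENTRY
}
```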
My proposal is to add a configurable maximum amount of time to wait for more jobs before submitting a partially filled array. The pipeline would then work like this: with an array size of 10, the first 10 jobs are submitted as an array job as expected. The next 6 jobs are added to the task array. Then the time specified by executorSubmitTimeout elapses, and at that point the remaining 6 jobs are put into a job array and submitted to the cluster.
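As a sketch of how this might look in nextflow.config: the description calls the option executorSubmitTimeout, which is here assumed to map to a submitTimeout setting under the executor scope; the exact name, scope, and value are illustrative and may differ in the final implementation.

```groovy
process {
    executor = 'slurm'
    array    = 10          // submit tasks in job arrays of (up to) 10
}

executor {
    // proposed: flush a partially filled task array after this long
    // instead of waiting indefinitely for it to reach the full size
    submitTimeout = '5 min'
}
```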
Discussion of this feature originally started in #5924, and I described the issue further in #6465.
I'm currently running into a java.util.ConcurrentModificationException when running hundreds of jobs (the race condition that the defensive-copy commit above is meant to address).