Expose squeue polling interval to SlurmExecutor parameter and allow setting via env variable #1143
Hi,
we regularly run into problems with our SLURM cluster while running (very) large downsampling jobs on webknossos datasets, either via the CLI or via chunk-wise processing of large datasets with the cluster-tools Python API. In particular, we get the error message:

Our cluster team traced the error down to the SLURM controller being overwhelmed by the number of `squeue` requests. Technically, we could reduce the number of concurrent downsampling jobs, but that would negatively impact overall cluster utilization as well as throughput.

Alternatively, we searched for `squeue` commands in the cluster-tools API. We noticed the line `self.executor.get_pending_tasks()` in `file_wait_thread.py`. It seems like you already implemented a polling throttle there via the `interval` parameter, but never exposed the parameter to `ClusterExecutor` or `SlurmExecutor` to reduce the number of `squeue` calls.

Therefore, I would like to propose a change where `SlurmExecutor` users can set the polling interval (in seconds) programmatically in their Python program or, alternatively, via an environment variable.

I am happy to make any additional changes to this pull request and add documentation if necessary.
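
To illustrate the intended behaviour, here is a minimal, standalone sketch. It is not the code in this pull request and not the cluster-tools API: `resolve_check_interval`, `ThrottledPoller`, and `DEFAULT_CHECK_INTERVAL` are hypothetical names, and the precedence (explicit argument, then environment variable, then default) as well as the default value are assumptions. Only the variable name `SLURM_QUEUE_CHECK_INTERVAL` is taken from this proposal.

```python
import os
import time
from typing import Callable, Mapping, Optional

# Assumed default in seconds; the actual default in cluster-tools may differ.
DEFAULT_CHECK_INTERVAL = 2.0
ENV_VAR = "SLURM_QUEUE_CHECK_INTERVAL"


def resolve_check_interval(
    explicit: Optional[float] = None,
    env: Optional[Mapping[str, str]] = None,
) -> float:
    """Return the polling interval in seconds.

    Assumed precedence: explicit argument > environment variable > default.
    """
    env = os.environ if env is None else env
    if explicit is not None:
        interval = float(explicit)
    elif ENV_VAR in env:
        # float() raises ValueError if the variable is not numeric.
        interval = float(env[ENV_VAR])
    else:
        interval = DEFAULT_CHECK_INTERVAL
    if interval <= 0:
        raise ValueError(f"polling interval must be positive, got {interval}")
    return interval


class ThrottledPoller:
    """Run `poll` at most once per `interval` seconds.

    `poll` stands in for the call that ends up querying squeue.
    """

    def __init__(
        self,
        poll: Callable[[], object],
        interval: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._poll = poll
        self._interval = interval
        self._clock = clock
        self._last: Optional[float] = None  # time of the last poll, if any

    def maybe_poll(self) -> bool:
        """Poll if the interval has elapsed; return True if a poll happened."""
        now = self._clock()
        if self._last is None or now - self._last >= self._interval:
            self._poll()
            self._last = now
            return True
        return False
```

With something along these lines, a CLI-only user could export `SLURM_QUEUE_CHECK_INTERVAL` before starting a job, while API users could pass the value directly; a larger interval means proportionally fewer `squeue` requests per waiting thread.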
Best wishes,
Eric
Issues:
- Expose `FileWaitThread`'s `interval` parameter to `SlurmExecutor`.
- Allow setting `SLURM_QUEUE_CHECK_INTERVAL` via environment variable to provide the same functionality in a CLI-only setting.

Todos:
Make sure to delete unnecessary points or to check all before merging: