Expose squeue polling interval to SlurmExecutor parameter and allow setting via env variable #1143

Open
wants to merge 1 commit into master
Conversation

@erjel (Contributor) commented Jul 24, 2024

Hi,

We regularly run into problems with our SLURM cluster while running (very) large downsampling jobs on webknossos datasets, either via the CLI or via chunk-wise processing of large datasets through the cluster-tools Python API. In particular, we get the error message:

 error: slurm_receive_msgs: [[$hostname]:$port] failed: Socket timed out on send/recv operation

Our cluster team traced the error back to the SLURM controller being overwhelmed by the number of squeue requests. Technically, we could scale down the number of concurrent downsampling jobs, but that would negatively impact overall cluster utilization as well as throughput.

As an alternative, we searched for squeue commands in the cluster-tools code and noticed the line self.executor.get_pending_tasks() in file_wait_thread.py. It seems you already implement a polling throttle there via the interval parameter, but this parameter is never exposed on ClusterExecutor or SlurmExecutor, so it cannot be used to reduce the number of squeue calls.
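
For context, the throttle in question boils down to a loop like the one sketched below. This is only an illustrative sketch, not the actual cluster-tools implementation; everything apart from get_pending_tasks() and the interval parameter is made up for illustration:

```python
import threading
import time


class PollingThreadSketch(threading.Thread):
    """Illustrative stand-in for the FileWaitThread polling loop (hypothetical)."""

    def __init__(self, executor, interval=2):
        super().__init__(daemon=True)
        self.executor = executor
        self.interval = interval  # seconds to wait between polls

    def run(self):
        while True:
            # Every poll ultimately issues an squeue call against the SLURM controller.
            pending_tasks = self.executor.get_pending_tasks()
            # ... compare pending_tasks against the tracked jobs and mark finished ones ...
            time.sleep(self.interval)  # a larger interval means fewer squeue calls
```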

Therefore, I would like to propose a change that lets SlurmExecutor users set the polling interval (in seconds) programmatically in their Python program or, alternatively, via an environment variable, as sketched below.
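
To make the proposal concrete, here is a minimal usage sketch. The keyword argument name and the exact environment variable name are assumptions on my side and open for discussion:

```python
import os

# Option A: configure the polling interval via an environment variable,
# mirroring the proposed SLURM_QUEUE_CHECK_INTERVAL global (name assumed here).
os.environ["SLURM_QUEUE_CHECK_INTERVAL"] = "60"  # poll squeue once per minute

import cluster_tools


def square(n):
    return n * n


if __name__ == "__main__":
    # Option B: configure the interval programmatically (keyword name is a placeholder).
    with cluster_tools.get_executor("slurm", job_check_interval=60) as executor:
        results = list(executor.map(square, [2, 3, 4]))
```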

I am happy to make any additional changes to this pull request and add documentation if necessary.

Best wishes,
Eric

Issues:

  • Expose the FileWaitThread's interval parameter to SlurmExecutor
  • Set the global variable SLURM_QUEUE_CHECK_INTERVAL via an environment variable to provide the same functionality in a CLI-only setting.

Todos:

Make sure to delete unnecessary points or to check all of them before merging:

  • Updated Changelog
  • Updated Documentation
  • Added / Updated Tests

@philippotto (Member) commented

Hi @erjel,

Thank you for your contribution! Before talking about your proposed solution, I would like to understand the problem a bit better.

We regularly run into problems with our SLURM cluster while running (very) large downsampling jobs on webknossos datasets, either via the CLI or via chunk-wise processing of large datasets through the cluster-tools Python API.

How many datasets do you downsample in parallel? There should only be one SlurmExecutor instance per dataset being downsampled and, therefore, only one polling party per dataset.

Technically, we could scale down the number of concurrent downsampling jobs, but [...]

By "number of concurrent downsampling jobs" you mean number of datasets being conurrently downsampled, right?

Our cluster team traced the error back to the SLURM controller being overwhelmed by the number of squeue requests.

How many squeue requests are we talking about and what interval do you want to configure to mitigate the issue?

Thank you!
