Expose squeue polling interval to SlurmExecutor parameter and allow setting via env variable #1143

Open
wants to merge 1 commit into master
Conversation

@erjel (Contributor) commented Jul 24, 2024

Hi,

We regularly run into problems with our SLURM cluster while running (very) large downsampling jobs on webknossos datasets, either via the CLI or via chunk-wise processing of large datasets through the cluster-tools Python API. In particular, we get the error message:

 error: slurm_receive_msgs: [[$hostname]:$port] failed: Socket timed out on send/recv operation

Our cluster team traced the error back to the SLURM controller being overwhelmed by the number of squeue requests. Technically, we could scale down the number of concurrent downsampling jobs, but that would negatively impact overall cluster utilization as well as throughput.

As an alternative, we searched for squeue commands in the cluster-tools code and noticed the line self.executor.get_pending_tasks() in file_wait_thread.py. It seems you already implement a polling throttle there via the interval parameter, but this parameter is never exposed on ClusterExecutor or SlurmExecutor, so it cannot be used to reduce the number of squeue calls.
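
For context, the throttle in question boils down to a loop like the one sketched below. This is only an illustrative sketch, not the actual cluster-tools implementation; everything apart from get_pending_tasks() and the interval parameter is made up for illustration:

```python
import threading
import time


class PollingThreadSketch(threading.Thread):
    """Illustrative stand-in for the FileWaitThread polling loop (hypothetical)."""

    def __init__(self, executor, interval=2):
        super().__init__(daemon=True)
        self.executor = executor
        self.interval = interval  # seconds to wait between polls

    def run(self):
        while True:
            # Every poll ultimately issues an squeue call against the SLURM controller.
            pending_tasks = self.executor.get_pending_tasks()
            # ... compare pending_tasks against the tracked jobs and mark finished ones ...
            time.sleep(self.interval)  # a larger interval means fewer squeue calls
```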

Therefore, I would like to propose a change that lets SlurmExecutor users set the polling interval (in seconds) programmatically in their Python program or, alternatively, via an environment variable, as sketched below.
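
To make the proposal concrete, here is a minimal usage sketch. The keyword argument name and the exact environment variable name are assumptions on my side and open for discussion:

```python
import os

# Option A: configure the polling interval via an environment variable,
# mirroring the proposed SLURM_QUEUE_CHECK_INTERVAL global (name assumed here).
os.environ["SLURM_QUEUE_CHECK_INTERVAL"] = "60"  # poll squeue once per minute

import cluster_tools


def square(n):
    return n * n


if __name__ == "__main__":
    # Option B: configure the interval programmatically (keyword name is a placeholder).
    with cluster_tools.get_executor("slurm", job_check_interval=60) as executor:
        results = list(executor.map(square, [2, 3, 4]))
```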

I am happy to make any additional changes to this pull request and add documentation if necessary.

Best wishes,
Eric

Issues:

  • Expose the FileWaitThread's interval parameter to SlurmExecutor
  • Set the global variable SLURM_QUEUE_CHECK_INTERVAL via an environment variable to provide the same functionality in a CLI-only setting.

Todos:

Make sure to delete unnecessary points or to check all of them before merging:

  • Updated Changelog
  • Updated Documentation
  • Added / Updated Tests

@philippotto (Member) commented

Hi @erjel,

Thank you for your contribution! Before talking about your proposed solution, I would like to understand the problem a bit better.

We regularly run into problems with our SLURM cluster while running (very) large downsampling jobs on webknossos datasets, either via the CLI or via chunk-wise processing of large datasets through the cluster-tools Python API.

How many datasets do you downsample in parallel? There should only be one SlurmExecutor instance per dataset being downsampled and, therefore, only one polling party per dataset.

Technically, we could scale down the number of concurrent downsampling jobs, but [...]

By "number of concurrent downsampling jobs" you mean number of datasets being conurrently downsampled, right?

Our cluster team traced the error back to the SLURM controller being overwhelmed by the number of squeue requests.

How many squeue requests are we talking about and what interval do you want to configure to mitigate the issue?

Thank you!
