Unable to requeue a job after sigterm signal on slurm #20542
Comments
Hi, I see you use So what you should do is:
and it would be alright. Hope this helps! |
Hello, I tried both, but I use the same signal for both; I just copied two different runs. |
Hi, hmm. I was checking your repository and saw the batch files here. Are you using them to run your jobs? Or do you use the command that you wrote in |
I used the command I wrote in "How to reproduce", not the file in the git repo. I matched the SIGUSR1@90 with a requeue_signal: SIGUSR1 in the .yaml. Apart from the Slurm sbatch files, what you see in the main branch of the repo is what I use. |
OK, I see. Can you maybe try increasing the signal lead time from 90 seconds to a higher number, maybe 500, by doing this |
I have also found out there is a --requeue option on sbatch, I have tried it but it didn't seem to change anything. I will try the 500s wait :) |
No, it is still not requeuing. Still getting:
While I see the SIGTERM at the end of the job, I don't see any print of the SIGUSR1 signal... But I don't know if there should be one. |
Hello, so I found a solution. I need to put the command in an sbatch script and add exit 99 at the end. It seems to be true for my Slurm cluster and others too. Best, |
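One way to check whether a process actually receives SIGUSR1 is to install a handler that logs it. The sketch below is a minimal, standalone illustration and not part of Lightning or scPRINT: the names `on_usr1` and `got_usr1` are hypothetical, and the shell sends the signal to itself to exercise the handler. Note that, per the sbatch documentation, `--signal` without the `B:` prefix signals the job steps and not the batch shell, so a trap like this in a batch script would only fire with `--signal=B:SIGUSR1@90`.

```shell
#!/bin/bash
# Hypothetical flag and handler, used only for this illustration.
got_usr1=0
on_usr1() {
  got_usr1=1
  echo "received SIGUSR1"
}

# Install the handler for SIGUSR1.
trap on_usr1 USR1

# Send SIGUSR1 to this shell to exercise the handler.
kill -USR1 $$

echo "got_usr1=$got_usr1"
```

If the message never appears in a real job, the signal is probably being delivered to a different process than the one expected.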
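As a small illustration of what the trailing `exit 99` does: the batch script's exit status becomes 99 no matter how the preceding command ended. The sketch below uses `false` as a stand-in for the training command. Whether Slurm requeues a job on that status depends on the cluster's configuration (possibly the `RequeueExit` setting in slurm.conf, which is an assumption here, not something confirmed in this thread).

```shell
#!/bin/bash
# Stand-in for a batch script: the "training" command fails, then the script exits 99.
status=0
bash -c 'false; exit 99' || status=$?

# The script's exit status is 99, regardless of the failed command before it.
echo "batch script exit status: $status"
```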
Hello, Amazing that you found the solution. |
Hey @jkobject, this is great, and thanks @arijit-hub for the interaction! Should we get this into the docs so it's useful for others? Are any of you up for sending a quick PR that adds this as a common use case in the docs? |
Hmm, I don't know. For me it just works without adding any extra things like exit 99. Maybe it's specific to the repository that @jkobject is using? Of course the sbatch file makes sense, but I believe the docs show examples using an sbatch file. @jkobject, can I have a look at your sbatch file? Don't forget to remove sensitive information like your account ;) |
What I am using now is this command: Where the part in "" is my model's config, and the first part is the Slurm config. The submit.sh is just this:
#!/bin/bash
#SBATCH --cpus-per-task=24
#SBATCH --hint=nomultithread
#SBATCH --signal=SIGUSR1@180
#SBATCH --requeue
# run script from above
eval "srun scprint fit $1" --trainer.default_root_dir ./$SLURM_JOB_ID
exit 99
The sbatch_tail is just to tail -F the output of my sbatch script; in my .bashrc:
function sbatch_tail() {
  tail -F slurm-$(sbatch "$@" | awk '{print $4}').out
}
|
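The `awk '{print $4}'` in sbatch_tail relies on sbatch printing a line of the form "Submitted batch job <id>", so the job ID is the fourth whitespace-separated field. A minimal sketch of that step, using a simulated sbatch output line (the job ID 123456 is made up for illustration):

```shell
#!/bin/bash
# Simulated sbatch output; in real use this would come from: sbatch "$@"
line="Submitted batch job 123456"

# The job ID is the fourth whitespace-separated field.
job_id=$(echo "$line" | awk '{print $4}')

# Slurm's default output file name for a batch job is slurm-<jobid>.out.
log_file="slurm-${job_id}.out"
echo "$log_file"
```

This only matches the default output file name; a script that sets `--output` would need the path adjusted accordingly.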
Bug description
When running a model fit on a Slurm cluster, everything runs correctly, but when the time limit is reached I receive
Unfortunately the model never requeues and doesn't even save a checkpoint...
It seems I don't have to add anything here in my config.yml, but even when adding
it doesn't change anything. I have also specified
--signal=SIGUSR1@90
in my sbatch command. Is there a solution?
What version are you seeing the problem on?
v2.4
How to reproduce the bug
git clone https://github.com/cantinilab/scPRINT
follow the installation instructions
sbatch -p gpu -q gpu --gres=gpu:A100:1,gmem:80G --cpus-per-task 20 --mem-per-gpu 80G --ntasks-per-node=1 --signal=SIGUSR1@90 scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml
cc @lantiga