On jobs that repeatedly re-submit, I tend to get this error:
sh: warning: shell level (1000) too high, resetting to 1
This appears to come from the SHLVL environment variable. Although the warning says the shell level is being reset to 1, the reset does not seem to persist: the next run hits the same warning. This leads to some odd/undefined behavior later in the job, such as:
Beginning SBCAST of executable from /lustre/orion/...../every-node/hpl-nompi/1724648164.1819696/build_directory/pre-built/bin/rochpl-6.0.0
sbcast: error: Can't open `/usr/bin/bash:`: No such file or directory
sbcast: error: Broadcast of '/usr/bin/bash:' failed
SBCAST failed (exit code: 255). Ending job and resubmitting if specified.
sh: warning: shell level (1000) too high, resetting to 1
I'm not sure why sbcast is trying to broadcast /usr/bin/bash. The print statement immediately before it uses the same variable that is then passed straight to sbcast, so the bug is not in the Slurm script.
We should look into how to fix this, rather than just canceling & resubmitting jobs.
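As a first diagnostic step, the job could log the SHLVL it inherits on every run, which would show whether the value really climbs across resubmissions. This is only a sketch: the function name `log_shell_level` is hypothetical, and it assumes the script runs inside a Slurm job where `SLURM_JOB_ID` is set (it falls back to `unknown` otherwise).

```shell
# Hypothetical helper: report the shell level this job started with.
log_shell_level() {
    # SLURM_JOB_ID is set by Slurm inside a job; "unknown" covers runs outside Slurm.
    echo "job ${SLURM_JOB_ID:-unknown}: inherited SHLVL=${SHLVL:-unset}"
}

# Call once near the top of the job script, before anything else spawns shells.
log_shell_level
```

If the logged value grows by a steady amount per resubmission, that would support the idea that the level is being carried from one job into the next.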
I'm going to try experimenting with export SHLVL=1 in the setup_env.sh script itself. This may just be something we want to include in the template. This could also be a bug in Bash: on reaching 1000 it should reset SHLVL to 1, but right now it is not doing so.
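A minimal sketch of what that experiment might look like inside setup_env.sh. The function name `reset_shell_level` is hypothetical and only there to make the intent obvious; a bare `export SHLVL=1` would do the same thing.

```shell
# Hypothetical excerpt from setup_env.sh.
reset_shell_level() {
    # Discard whatever shell level was inherited from the submitting job
    # and start counting from 1 again.
    export SHLVL=1
}

reset_shell_level
```

This only affects the current shell and the processes it starts afterwards, so it would need to run early in the job, before anything that spawns further shells.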
The job that is failing is a single-node job running near-constantly (so about 1000x per day), which is why I was seeing failures more often than I expected. I have other jobs that run about 1x per day, and I thought it was that job failing every time.
But I am thinking that jobs should possibly reset SHLVL=1 when they resubmit. I don't have any particular reason why we wouldn't want to do that.