Shell level too high error after many re-submitted jobs #181

Open
hagertnl opened this issue Aug 26, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@hagertnl
Contributor

On jobs that repeatedly re-submit, I tend to get this error:

sh: warning: shell level (1000) too high, resetting to 1

This appears to come from the SHLVL environment variable. Although the warning says the shell level is being reset to 1, the reset does not seem to stick: the next run hits the same warning. This leads to some odd/undefined behavior later in the job, such as:

Beginning SBCAST of executable from /lustre/orion/...../every-node/hpl-nompi/1724648164.1819696/build_directory/pre-built/bin/rochpl-6.0.0
sbcast: error: Can't open `/usr/bin/bash:`: No such file or directory
sbcast: error: Broadcast of '/usr/bin/bash:' failed
SBCAST failed (exit code: 255). Ending job and resubmitting if specified.
sh: warning: shell level (1000) too high, resetting to 1

I'm not sure why sbcast is trying to broadcast /usr/bin/bash. The print statement immediately before it uses the same variable that is then passed to sbcast, so this does not look like a bug in the Slurm script.

We should look into how to fix this, rather than just canceling & resubmitting jobs.

@hagertnl hagertnl added the bug Something isn't working label Aug 26, 2024
@hagertnl
Contributor Author

I'm going to experiment with adding export SHLVL=1 to the setup_env.sh script itself. This may just be something we want to include in the template. It could also be a bug in Bash: on reaching 1000, it should reset SHLVL to 1, but right now it is not doing so.
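A minimal sketch of what that experiment might look like. The contents of setup_env.sh are not shown in this issue, so the fragment below is hypothetical and only illustrates the single line being proposed.

```shell
# Hypothetical fragment for setup_env.sh (the real file is not shown in this issue).
# Pin the shell nesting counter so it cannot accumulate across resubmissions.
export SHLVL=1

echo "SHLVL is now $SHLVL"
```

Placing this in the environment setup script would mean every job that sources it starts counting from 1 again, regardless of what it inherited.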

@hagertnl
Contributor Author

Ahh.

The job that is failing is a single-node job running near-constantly (so running about 1000x per day), which is why I was seeing jobs failing more frequently than I thought was expected. I have other jobs that run about 1x per day, and I thought it was that job failing every time.

But I am thinking that jobs should possibly reset SHLVL=1 when they resubmit. I don't have any particular reason why we wouldn't want to do that.
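One way the reset-on-resubmit idea could be sketched is a small wrapper that launches the resubmission command with SHLVL forced to 1 in its environment. The function name `resubmit_with_clean_shlvl` and the script name in the usage comment are hypothetical, and this assumes resubmission goes through a command (such as `sbatch`) that propagates the caller's environment to the new job.

```shell
# Hypothetical helper: run any command with SHLVL pinned to 1 in its environment.
# The caller's own SHLVL is left untouched; only the child process sees the reset value.
resubmit_with_clean_shlvl() {
    env SHLVL=1 "$@"
}

# Usage sketch (placeholder script name, not from this project):
#   resubmit_with_clean_shlvl sbatch my_job.sh
```

Doing the reset at submission time, rather than inside the job, keeps the change in one place for every job that resubmits itself.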
