Check error handling during Della maintenance downtime #45

Closed
cmroughan opened this issue Feb 3, 2025 · 5 comments

@cmroughan
Collaborator

The next scheduled monthly downtime is February 11 -- take that day to test error messaging and handling when users try to submit jobs while the cluster is down. Confirm that the error message (+ the regular email that I would send out to warn users about the downtime) is sufficient to keep users informed.

@cmroughan
Collaborator Author

Ran a couple of tests while Della was down. No training tasks are left hanging, which is good, but the fact that the train tasks are failing is unclear to users unless they read the task report messages. The tasks are marked as "Finished" and no other error notifications are presented to the user.

Image

The preliminary message appears as usual.

Image

The job runs for one second and then errors out, with its state being changed to "Finished".

Image

The error is hit when trying to run code on Della, in the following section of tasks.py:

            with conn.cd(working_dir):
                result = conn.run(
                    f"module load anaconda3/2024.6 && conda run -n htr2hpc {train_cmd}",
                    env={"ESCRIPTORIUM_API_TOKEN": api_token},
                    warn=True,  # don't throw unexpected error on exit != 0
                )
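
Because `warn=True` suppresses the exception on a non-zero exit, the failure only surfaces if the returned result is inspected. A minimal sketch of that inspection follows, assuming the result exposes the usual `ok`, `exited`, `stdout`, and `stderr` attributes; `TrainOutcome` and `summarize_result` are hypothetical names and not part of tasks.py.

```python
from dataclasses import dataclass


@dataclass
class TrainOutcome:
    """Hypothetical container for an explicit success/failure status."""
    ok: bool
    message: str


def summarize_result(result) -> TrainOutcome:
    """Turn a command result into an explicit outcome.

    Assumes `result` has `ok`, `exited`, `stdout`, and `stderr`
    attributes, as the object returned by `conn.run(..., warn=True)` does.
    """
    if result.ok:
        return TrainOutcome(True, "Training command completed.")
    # Prefer stderr for the detail, fall back to stdout if stderr is empty.
    detail = (result.stderr or result.stdout or "").strip()
    return TrainOutcome(
        False,
        f"Training command failed with exit code {result.exited}: {detail}",
    )
```

The caller could then set the task state from `outcome.ok` rather than marking every completed run as "Finished".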

@cmroughan
Collaborator Author

I wonder if an appropriate way of checking for the connection might be to just first run a test command on /scratch to see if the filesystem is accessible -- something basic like ls /scratch/gpfs/USER -- and if this fails then cancel the job and send a "Cannot connect to Della at this time" notification.
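
A rough sketch of that pre-check, assuming `conn` is the same connection object used in tasks.py and that `notify` is whatever callable delivers a message to the user. The function names and the `notify` parameter are hypothetical, and catching every exception is a deliberate simplification.

```python
import shlex

UNAVAILABLE_MESSAGE = "Cannot connect to Della at this time"


def scratch_is_accessible(conn, scratch_dir):
    """Return True if a basic `ls` of the scratch directory succeeds."""
    try:
        # warn=True: non-zero exit returns a result instead of raising.
        # hide=True: keep the directory listing out of the logs.
        result = conn.run(f"ls {shlex.quote(scratch_dir)}", warn=True, hide=True)
    except Exception:
        # Assumption: any connection-level error means the cluster is unreachable.
        return False
    return result.ok


def precheck_or_notify(conn, scratch_dir, notify):
    """Return True if the job may proceed; otherwise notify and return False."""
    if scratch_is_accessible(conn, scratch_dir):
        return True
    notify(UNAVAILABLE_MESSAGE)
    return False
```

The training task would call `precheck_or_notify` before submitting anything and cancel the job when it returns `False`.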

@rlskoeser
Contributor

@cmroughan Your proposed behavior sounds reasonable to me. Is there also a phase of maintenance when the filesystem is accessible but slurm is not?

I just checked, fabric (the library we're using for remote access) has a method to check if a path exists:
https://docs.fabfile.org/en/1.13/api/contrib/files.html#fabric.contrib.files.exists
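
Note that the linked page documents Fabric 1.13. If the project uses the newer Connection-based API, as the `conn.run(..., warn=True)` call in tasks.py suggests, that helper may not be importable from the same location. A minimal equivalent can be sketched with a plain `test -e` over the existing connection; `remote_path_exists` is a hypothetical name.

```python
import shlex


def remote_path_exists(conn, path):
    """Return True if `path` exists on the remote host.

    Relies only on `conn.run`; `test -e` exits 0 when the path exists.
    """
    result = conn.run(f"test -e {shlex.quote(path)}", warn=True, hide=True)
    return result.ok
```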

Do you need me to do anything for this issue?

@rlskoeser
Contributor

@cmroughan I discussed this with @mnaydan: it is low priority from our perspective and I don't have capacity. If you want to adjust this behavior, you can use the fabric file exists check that I shared, and I'd be glad to review a pull request when/if you get to this.

It would probably make sense to create a new issue for the proposed solution, which you can prioritize separately; the work of checking what happens has been completed.

@cmroughan
Collaborator Author

Sounds good. Future work will be tracked in #50.
