Check error handling during Della maintenance downtime #45
Comments
Ran a couple of tests while Della was down. No training tasks were left hanging, which is good, but users cannot tell that the training tasks are failing unless they read the message in the task report. The tasks are marked as "Finished" and no error notification is otherwise presented to the user.
The error is hit in the below section of tasks.py:
I wonder if an appropriate way of checking for the connection might be to just first run a test command on […]
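A minimal sketch of the pre-flight idea, assuming the connection object behaves like a fabric `Connection` (a `run()` method that accepts `hide` and `warn` and returns a result with an `ok` attribute). The names `ClusterUnavailable`, `check_connection`, and `run_training`, and the `"Error"` status string, are hypothetical and are not taken from tasks.py; the point is only that a failed check produces a non-"Finished" status with a readable message.

```python
class ClusterUnavailable(Exception):
    """Raised when the pre-flight check cannot reach the cluster."""


def check_connection(conn):
    """Run a trivial remote command; raise ClusterUnavailable on failure.

    `conn` is assumed to be fabric-Connection-like: conn.run(cmd, **kwargs)
    returns an object with a boolean `ok` attribute.
    """
    try:
        # `true` does nothing and exits 0, so any failure here is about
        # the connection itself rather than the command.
        result = conn.run("true", hide=True, warn=True)
    except Exception as err:  # refused connection, timeout, auth failure, ...
        raise ClusterUnavailable(f"could not connect to cluster: {err}") from err
    if not result.ok:
        raise ClusterUnavailable("cluster connection test command failed")


def run_training(conn, submit):
    """Hypothetical task wrapper returning (status, message) for the report."""
    try:
        check_connection(conn)
    except ClusterUnavailable as err:
        # Report the failure explicitly instead of letting the task
        # end up marked as "Finished".
        return ("Error", str(err))
    submit(conn)
    return ("Finished", "training job submitted")
```

With a real fabric connection this might be called as `run_training(Connection(host), submit_job)`, where `submit_job` stands in for whatever the task currently does after connecting.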
@cmroughan Your proposed behavior sounds reasonable to me. Is there also a phase of maintenance when the filesystem is accessible but slurm is not? I just checked: fabric (the library we're using for remote access) has a method to check whether a path exists. Do you need me to do anything for this issue?
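To separate the two cases raised here (filesystem down versus slurm down), a check could probe each in turn. The sketch below is an assumption-laden illustration, not the fabric method referred to above: it uses plain shell commands through a fabric-like `run()`, the function name `diagnose_cluster` and its return strings are hypothetical, and it assumes `scontrol ping` exits non-zero when the slurm controller is not responding, which should be verified on the cluster before relying on it.

```python
import shlex


def diagnose_cluster(conn, path):
    """Return a short string describing which part of the cluster is down.

    `conn` is assumed to be fabric-Connection-like; `path` is a remote
    path expected to exist when the filesystem is mounted.
    """
    try:
        # `test -e` exits 0 only if the path exists on the remote host.
        fs = conn.run(f"test -e {shlex.quote(path)}", hide=True, warn=True)
    except Exception:
        return "unreachable"
    if not fs.ok:
        return "filesystem_unavailable"

    try:
        # Assumption: a non-zero exit means the controller did not respond.
        sched = conn.run("scontrol ping", hide=True, warn=True)
    except Exception:
        return "unreachable"
    if not sched.ok:
        return "scheduler_unavailable"
    return "ok"
```

Returning a distinct value per failure mode lets the task report say which part of the cluster is unavailable instead of showing a generic failure.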
@cmroughan I discussed with @mnaydan: this is low priority from our perspective and I don't have capacity. If you want to adjust this behavior, you can use the fabric file exists check that I shared, and I'd be glad to review a pull request when/if you get to this. It would probably make sense to create a new issue for the proposed solution, which you can prioritize separately; the work of checking what happens has been completed.
Sounds good. Future work will be tracked in #50.
The next scheduled monthly downtime is February 11 -- take that day to test error messaging and handling when users try to submit jobs while the cluster is down. Confirm that the error message (+ the regular email that I would send out to warn users about the downtime) is sufficient to keep users informed.