Check error handling during Della maintenance downtime #45

cmroughan · 2025-02-03T18:11:44Z

The next scheduled monthly downtime is February 11 -- take that day to test error messaging and handling when users try to submit jobs while the cluster is down. Confirm that the error message (+ the regular email that I would send out to warn users about the downtime) is sufficient to keep users informed.

cmroughan · 2025-02-11T14:39:31Z

Ran a couple of tests while Della was down. No training tasks are left hanging, which is good, but the failure of the train tasks is not apparent to users unless they read the message in the task report. The tasks are marked as "Finished" and no error notification is otherwise shown to the user.

The preliminary message appears as usual.

The job runs for one second and then errors out, with its state being changed to "Finished".

The error is hit when trying to run code on Della, in the following section of tasks.py:

            with conn.cd(working_dir):
                result = conn.run(
                    f"module load anaconda3/2024.6 && conda run -n htr2hpc {train_cmd}",
                    env={"ESCRIPTORIUM_API_TOKEN": api_token},
                    warn=True,  # don't throw unexpected error on exit != 0
                )
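Because the snippet above passes warn=True, a non-zero exit does not raise an exception, so the failure only becomes visible if the returned result is inspected. A minimal sketch of that inspection follows, assuming the result exposes `ok`, `exited`, and `stderr` as Fabric/Invoke results do. The helper name `summarize_run` and the state labels are hypothetical, and whether tasks.py already does something similar is not shown in this thread.

```python
def summarize_run(result):
    """Map a command result to a (state, message) pair.

    `result` is assumed to look like the object returned by conn.run():
    it exposes `ok`, `exited`, and `stderr`.
    """
    if result.ok:
        return ("success", "")
    # With warn=True a non-zero exit is not raised, so it has to be
    # detected here and surfaced explicitly.
    detail = result.stderr.strip() or "no error output"
    return ("error", f"Training command exited with status {result.exited}: {detail}")
```

The returned state and message could then drive both the task state and a user-facing notification, rather than leaving the task marked as "Finished".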

cmroughan · 2025-02-11T14:44:09Z

One way to check the connection might be to first run a test command on /scratch to see whether the filesystem is accessible, something basic like ls /scratch/gpfs/USER. If that fails, cancel the job and send a "Cannot connect to Della at this time" notification.
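The check proposed above could be sketched roughly as follows, assuming `conn` is the same kind of Fabric connection object used in tasks.py. The helper name `della_is_reachable` is hypothetical, `test -d` stands in for `ls` so that no listing output is produced, and the broad `except` is a simplification for this sketch rather than a recommendation.

```python
import shlex


def della_is_reachable(conn, scratch_dir):
    """Return True if scratch_dir is an accessible directory on the remote host.

    `conn` is assumed to behave like a fabric Connection: run() accepts
    `warn` and `hide` keyword arguments and returns a result with `ok`.
    """
    try:
        # warn=True: a non-zero exit is reported via result.ok, not raised.
        # hide=True: suppress remote output, only the exit status matters.
        result = conn.run(f"test -d {shlex.quote(scratch_dir)}", warn=True, hide=True)
    except Exception:
        # Connection-level failures (e.g. the login node is unreachable)
        # surface as exceptions rather than as a failed result.
        return False
    return bool(result.ok)
```

The training task could call this before submitting anything and, when it returns False, cancel the job and send the "Cannot connect to Della at this time" notification.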

rlskoeser · 2025-02-11T18:34:01Z

@cmroughan Your proposed behavior sounds reasonable to me. Is there also a phase of maintenance when the filesystem is accessible but Slurm is not?

I just checked: fabric (the library we're using for remote access) has a method to check if a path exists:
https://docs.fabfile.org/en/1.13/api/contrib/files.html#fabric.contrib.files.exists

Do you need me to do anything for this issue?

rlskoeser · 2025-02-14T17:17:14Z

@cmroughan I discussed this with @mnaydan: it is low priority from our perspective and I don't have capacity. If you want to adjust this behavior, you can use the fabric file exists check that I shared, and I'd be glad to review a pull request when/if you get to this.

It would probably make sense to create a new issue for the proposed solution, which you can prioritize separately; the work of checking what happens has been completed.

cmroughan · 2025-02-14T18:07:17Z

Sounds good. Future work will be tracked in #50.

cmroughan self-assigned this Feb 3, 2025

cmroughan mentioned this issue Feb 3, 2025

improved error handling for htr2hpc train script #36

Closed

12 tasks

cmroughan mentioned this issue Feb 14, 2025

improve error handling for Della downtime #50

Open

cmroughan closed this as completed Feb 14, 2025
