-
Notifications
You must be signed in to change notification settings - Fork 94
fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe() #460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
1d14fa8
b5ce8fe
548d797
2b347ab
dff6c0d
739409a
568a242
c0a50fa
e614699
b85c87e
52911c6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,3 +18,6 @@ | |
|
|
||
|
|
||
| class UnknownStatusError(Exception): ... | ||
|
|
||
|
|
||
| class PersistentSacctFailure(Exception): ... | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -59,10 +59,13 @@ | |
| from nemo_run.core.execution.base import Executor | ||
| from nemo_run.core.execution.slurm import SlurmBatchRequest, SlurmExecutor, SlurmJobDetails | ||
| from nemo_run.core.tunnel.client import LocalTunnel, PackagingJob, SSHTunnel, Tunnel | ||
| from nemo_run.exceptions import PersistentSacctFailure | ||
| from nemo_run.run import experiment as run_experiment | ||
| from nemo_run.run.ray.slurm import SlurmRayRequest | ||
| from nemo_run.run.torchx_backend.schedulers.api import SchedulerMixin | ||
|
|
||
| MAX_CONSECUTIVE_SACCT_FAILURES = 30 | ||
|
|
||
| log: logging.Logger = logging.getLogger(__name__) | ||
| SLURM_JOB_DIRS = os.path.join(get_nemorun_home(), ".slurm_jobs") | ||
|
|
||
|
|
@@ -74,6 +77,7 @@ def __init__( | |
| self.tunnel: Optional[Tunnel] = None | ||
| super().__init__(session_name) | ||
| self.experiment = experiment | ||
| self._consecutive_sacct_failures: dict[str, int] = {} | ||
|
|
||
| # TODO: Move this into the SlurmExecutor | ||
| def _initialize_tunnel(self, tunnel: SSHTunnel | LocalTunnel): | ||
|
|
@@ -240,9 +244,23 @@ def describe(self, app_id: str) -> Optional[DescribeAppResponse]: | |
| return None | ||
|
|
||
| assert self.tunnel, "Tunnel is None." | ||
| p = self.tunnel.run( | ||
| f"sacct --parsable2 -j {app_id}", | ||
| ) | ||
| try: | ||
| p = self.tunnel.run( | ||
| f"sacct --parsable2 -j {app_id}", | ||
| ) | ||
| except Exception as e: | ||
| count = self._consecutive_sacct_failures.get(app_id, 0) + 1 | ||
| self._consecutive_sacct_failures[app_id] = count | ||
| if count >= MAX_CONSECUTIVE_SACCT_FAILURES: | ||
| raise PersistentSacctFailure( | ||
| f"sacct failed {count} consecutive times for job {app_id}: {e}" | ||
| ) from e | ||
| log.warning( | ||
| f"Failed to query sacct for job {app_id} ({count}/{MAX_CONSECUTIVE_SACCT_FAILURES}): " | ||
| f"{e}. Treating as transient." | ||
| ) | ||
| return DescribeAppResponse(app_id=app_id, state=AppState.UNKNOWN) | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. meaning, this will never be bubbled up. Is that fine?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes i think this should be fine.. it doesn't happen very often that sacct doesn't run. But when it fails, it's unfortuante to loose the Run session (and have a dangling job). I think this should be safe |
||
| self._consecutive_sacct_failures.pop(app_id, None) | ||
| output = p.stdout.strip().split("\n") | ||
|
|
||
| if len(output) <= 1: | ||
|
|
||
Check notice
Code scanning / CodeQL
Statement has no effect Note
Copilot Autofix
AI about 1 month ago
In general, to fix "statement has no effect" issues, remove or replace expression statements that do not produce side effects with an appropriate construct (e.g.,
passfor empty blocks, or a proper method/attribute definition if behavior was intended). For empty class bodies,passis the standard way to indicate that no additional attributes or methods are defined.Here, the best minimal fix is to replace the
...expression in the body ofPersistentSacctFailurewithpass, matching common Python style and avoiding any behavior change: the class remains a simple subclass ofExceptionwith no extra logic. Given the consistent pattern in this file, we can also replace the...bodies ofSetValueErrorandUnknownStatusErrorwithpassfor consistency and to prevent the same warning there as well, all withinnemo_run/exceptions.py. No new methods, imports, or definitions are required.