
fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe()#460

Open
ko3n1g wants to merge 11 commits into main from ko3n1g/fix/dgxcloud-unknown-state-pending

Conversation

@ko3n1g (Contributor) commented Mar 12, 2026

When jobs sit in the queue for a long time, there's a chance that sacct fails transiently. This takes down the whole torchx pipeline and leaves a dangling job in the Slurm queue that we have to kill manually.

If sacct fails repeatedly, we cancel the job before giving up.
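A minimal sketch of this retry-then-cancel behavior (the names `run_sacct`, `cancel_job`, and `MAX_SACCT_FAILURES` are illustrative, not the actual nemo_run API):

```python
# Hedged sketch: tolerate a few transient sacct failures, then cancel the
# Slurm job instead of leaving it dangling when we give up.
MAX_SACCT_FAILURES = 3


def describe_with_retry(scheduler, app_id, failures=None):
    """Return the job state, or "UNKNOWN" while sacct is flaky."""
    failures = failures if failures is not None else {}
    try:
        state = scheduler.run_sacct(app_id)  # may raise on daemon hiccups
    except Exception:
        failures[app_id] = failures.get(app_id, 0) + 1
        if failures[app_id] >= MAX_SACCT_FAILURES:
            scheduler.cancel_job(app_id)  # avoid a dangling Slurm job
            raise
        return "UNKNOWN"  # non-terminal: the wait loop keeps polling
    failures.pop(app_id, None)  # reset the counter on success
    return state
```

The key design choice mirrors the PR: a single failure never tears down the pipeline, but persistent failure cancels the job before the error propagates.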

@ko3n1g changed the title from "Ko3n1g/fix/dgxcloud unknown state pending" to "@claude fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe()" on Mar 12, 2026
@ko3n1g changed the title from "@claude fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe()" to "fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe()" on Mar 12, 2026
ko3n1g and others added 7 commits March 12, 2026 10:33
…e failures

When a job is submitted to DGXCloud, the API may transiently return a
non-200 response or an "Unknown" phase before the workload is fully
registered. Previously this was mapped to AppState.FAILED, causing
wait_and_exit() to treat the job as terminated immediately while the
pod was still starting up on the cluster.

- DGXCloudState.UNKNOWN now maps to AppState.PENDING in DGX_STATES
- executor.status() returns None (instead of DGXCloudState.UNKNOWN)
  on non-200 HTTP responses so transient API errors don't look like
  a real "Unknown" phase reported by the scheduler
- describe() fallback for unknown keys in DGX_STATES changed to PENDING
- Tests updated and added to cover all three code paths
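The mapping change could look roughly like this (the enum members and the shape of `DGX_STATES` are simplified for illustration and may not match the real definitions):

```python
# Hedged sketch of the UNKNOWN -> PENDING remapping described above.
from enum import Enum


class AppState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


DGX_STATES = {
    "Running": AppState.RUNNING,
    "Failed": AppState.FAILED,
    "Unknown": AppState.PENDING,  # was FAILED; the phase is transient, keep waiting
}


def to_app_state(phase):
    # The describe() fallback for unseen keys is also PENDING now,
    # so a phase we have never heard of does not terminate the wait loop.
    return DGX_STATES.get(phase, AppState.PENDING)
```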

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
The status() method was calling GET /workloads/{job_id} (generic endpoint)
which returns 403 for distributed and training workloads. The correct
endpoints match the create paths: /workloads/distributed/{job_id} for
multi-node jobs and /workloads/trainings/{job_id} for single-node jobs.
This is consistent with how cancel() already uses /workloads/distributed/.

Adds test_status_distributed to verify the correct URL is used for
multi-node executors.
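A sketch of the endpoint selection (the `is_distributed` flag and the base-URL handling are assumptions for illustration):

```python
# Hedged sketch: pick the status URL that matches the create/cancel paths,
# since the generic /workloads/{id} endpoint returns 403 for these types.
def status_url(base: str, job_id: str, is_distributed: bool) -> str:
    kind = "distributed" if is_distributed else "trainings"
    return f"{base}/workloads/{kind}/{job_id}"
```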

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
The /workloads/distributed/{id} and /workloads/trainings/{id} endpoints
return actualPhase, not phase (which was the field on the generic
/workloads/{id} endpoint). This caused a KeyError crash immediately
after the 403 fix landed.

Now reads actualPhase first, falls back to phase for compatibility,
and returns None (PENDING) if neither field is present.
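The fallback parsing can be sketched in one hypothetical helper:

```python
# Hedged sketch: prefer actualPhase (distributed/trainings endpoints),
# fall back to phase (old generic endpoint), and return None when
# neither field is present so the caller treats the job as PENDING.
def extract_phase(payload: dict):
    return payload.get("actualPhase", payload.get("phase"))
```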

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…arsing

When a role name ends with '_', the app_id string looks like:
  experiment___role_name____job_id
Splitting on '___' produces job_id = '_job_id' (spurious leading '_'),
causing the status/cancel/log_iter calls to use a wrong ID and get 404.

Fix: _save_job_dir now stores the actual job_id in the JSON record.
describe(), log_iter(), and _cancel_existing() all read job_id from
the stored record, falling back to app_id.split('___')[-1] for
backwards compatibility with existing saved jobs.
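The lookup order can be sketched as follows (the record layout is an assumption based on this message, not the real `_save_job_dir` format):

```python
# Hedged sketch: prefer the job_id stored at submit time; fall back to the
# legacy split for records saved before this fix. The split is exactly what
# breaks when a role name ends with '_': 'exp___role____123' -> '_123'.
def resolve_job_id(record: dict, app_id: str) -> str:
    if "job_id" in record:
        return record["job_id"]
    return app_id.split("___")[-1]
```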

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
After long-running jobs (hours), a transient sacct failure (daemon hiccup,
invoke.UnexpectedExit from non-zero exit code, etc.) would propagate
uncaught through describe() → runner.wait() → wait_and_exit(), killing
the wait loop and reporting EXIT_CODE_TRAINING=1 even though the Slurm
job was still running.

Wrap the sacct call in a try/except and return AppState.UNKNOWN on
failure. UNKNOWN is non-terminal in torchx so polling continues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g force-pushed the ko3n1g/fix/dgxcloud-unknown-state-pending branch from 5987cb6 to 568a242 on March 12, 2026 10:33
ko3n1g added 2 commits March 12, 2026 10:34
Signed-off-by: oliver könig <okoenig@nvidia.com>
```python
            )
        except Exception as e:
            log.warning(f"Failed to query sacct for job {app_id}: {e}. Treating as transient.")
            return DescribeAppResponse(app_id=app_id, state=AppState.UNKNOWN)
```
Collaborator

meaning, this will never be bubbled up. Is that fine?

Contributor Author

Yes, I think this should be fine. It doesn't happen very often that sacct fails to run. But when it does, it's unfortunate to lose the Run session (and have a dangling job). I think this should be safe.

Signed-off-by: oliver könig <okoenig@nvidia.com>
```python
class UnknownStatusError(Exception): ...


class PersistentSacctFailure(Exception): ...
```
Check notice — Code scanning / CodeQL

Statement has no effect (Note)

This statement has no effect.

Copilot Autofix


In general, to fix "statement has no effect" issues, remove or replace expression statements that do not produce side effects with an appropriate construct (e.g., pass for empty blocks, or a proper method/attribute definition if behavior was intended). For empty class bodies, pass is the standard way to indicate that no additional attributes or methods are defined.

Here, the best minimal fix is to replace the ... expression in the body of PersistentSacctFailure with pass, matching common Python style and avoiding any behavior change: the class remains a simple subclass of Exception with no extra logic. Given the consistent pattern in this file, we can also replace the ... bodies of SetValueError and UnknownStatusError with pass for consistency and to prevent the same warning there as well, all within nemo_run/exceptions.py. No new methods, imports, or definitions are required.

Suggested changeset 1
nemo_run/exceptions.py

Autofix patch
Run the following command in your local git repository to apply this patch
```shell
cat << 'EOF' | git apply
diff --git a/nemo_run/exceptions.py b/nemo_run/exceptions.py
--- a/nemo_run/exceptions.py
+++ b/nemo_run/exceptions.py
@@ -14,10 +14,13 @@
 # limitations under the License.
 
 
-class SetValueError(ValueError): ...
+class SetValueError(ValueError):
+    pass
 
 
-class UnknownStatusError(Exception): ...
+class UnknownStatusError(Exception):
+    pass
 
 
-class PersistentSacctFailure(Exception): ...
+class PersistentSacctFailure(Exception):
+    pass
EOF
```
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g added the r0.8.0 Cherry-pick PR to the r0.8.0 release branch label Mar 12, 2026