
fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe()#460

Open
ko3n1g wants to merge 11 commits into main from ko3n1g/fix/dgxcloud-unknown-state-pending

Conversation

@ko3n1g (Contributor) commented Mar 12, 2026

When jobs sit in the queue for a long time, there's a chance that sacct fails transiently. This takes down the whole torchx pipeline and leaves a dangling job in the Slurm queue that we have to kill manually.

If sacct fails repeatedly, we cancel the job before giving up.
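A minimal sketch of this retry-then-cancel behavior (the names `run_sacct`, `cancel_job`, and `MAX_SACCT_FAILURES` are illustrative, not the actual nemo_run API):

```python
# Hedged sketch: tolerate a few transient sacct failures, then cancel the
# Slurm job instead of leaving it dangling when we give up.
MAX_SACCT_FAILURES = 3


def describe_with_retry(scheduler, app_id, failures=None):
    """Return the job state, or "UNKNOWN" while sacct is flaky."""
    failures = failures if failures is not None else {}
    try:
        state = scheduler.run_sacct(app_id)  # may raise on daemon hiccups
    except Exception:
        failures[app_id] = failures.get(app_id, 0) + 1
        if failures[app_id] >= MAX_SACCT_FAILURES:
            scheduler.cancel_job(app_id)  # avoid a dangling Slurm job
            raise
        return "UNKNOWN"  # non-terminal: the wait loop keeps polling
    failures.pop(app_id, None)  # reset the counter on success
    return state
```

The key design choice mirrors the PR: a single failure never tears down the pipeline, but persistent failure cancels the job before the error propagates.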

@ko3n1g changed the title from "Ko3n1g/fix/dgxcloud unknown state pending" to "@claude fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe()" on Mar 12, 2026
@ko3n1g changed the title from "@claude fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe()" to "fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe()" on Mar 12, 2026
ko3n1g and others added 7 commits March 12, 2026 10:33
…e failures

When a job is submitted to DGXCloud, the API may transiently return a
non-200 response or an "Unknown" phase before the workload is fully
registered. Previously this was mapped to AppState.FAILED, causing
wait_and_exit() to treat the job as terminated immediately while the
pod was still starting up on the cluster.

- DGXCloudState.UNKNOWN now maps to AppState.PENDING in DGX_STATES
- executor.status() returns None (instead of DGXCloudState.UNKNOWN)
  on non-200 HTTP responses so transient API errors don't look like
  a real "Unknown" phase reported by the scheduler
- describe() fallback for unknown keys in DGX_STATES changed to PENDING
- Tests updated and added to cover all three code paths
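The mapping change could look roughly like this (the enum members and the shape of `DGX_STATES` are simplified for illustration and may not match the real definitions):

```python
# Hedged sketch of the UNKNOWN -> PENDING remapping described above.
from enum import Enum


class AppState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


DGX_STATES = {
    "Running": AppState.RUNNING,
    "Failed": AppState.FAILED,
    "Unknown": AppState.PENDING,  # was FAILED; the phase is transient, keep waiting
}


def to_app_state(phase):
    # The describe() fallback for unseen keys is also PENDING now,
    # so a phase we have never heard of does not terminate the wait loop.
    return DGX_STATES.get(phase, AppState.PENDING)
```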

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
The status() method was calling GET /workloads/{job_id} (generic endpoint)
which returns 403 for distributed and training workloads. The correct
endpoints match the create paths: /workloads/distributed/{job_id} for
multi-node jobs and /workloads/trainings/{job_id} for single-node jobs.
This is consistent with how cancel() already uses /workloads/distributed/.

Adds test_status_distributed to verify the correct URL is used for
multi-node executors.
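A sketch of the endpoint selection (the `is_distributed` flag and the base-URL handling are assumptions for illustration):

```python
# Hedged sketch: pick the status URL that matches the create/cancel paths,
# since the generic /workloads/{id} endpoint returns 403 for these types.
def status_url(base: str, job_id: str, is_distributed: bool) -> str:
    kind = "distributed" if is_distributed else "trainings"
    return f"{base}/workloads/{kind}/{job_id}"
```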

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
The /workloads/distributed/{id} and /workloads/trainings/{id} endpoints
return actualPhase, not phase (which was the field on the generic
/workloads/{id} endpoint). This caused a KeyError crash immediately
after the 403 fix landed.

Now reads actualPhase first, falls back to phase for compatibility,
and returns None (PENDING) if neither field is present.
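The fallback parsing can be sketched in one hypothetical helper:

```python
# Hedged sketch: prefer actualPhase (distributed/trainings endpoints),
# fall back to phase (old generic endpoint), and return None when
# neither field is present so the caller treats the job as PENDING.
def extract_phase(payload: dict):
    return payload.get("actualPhase", payload.get("phase"))
```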

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…arsing

When a role name ends with '_', the app_id string looks like:
  experiment___role_name____job_id
Splitting on '___' produces job_id = '_job_id' (spurious leading '_'),
causing the status/cancel/log_iter calls to use a wrong ID and get 404.

Fix: _save_job_dir now stores the actual job_id in the JSON record.
describe(), log_iter(), and _cancel_existing() all read job_id from
the stored record, falling back to app_id.split('___')[-1] for
backwards compatibility with existing saved jobs.
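The lookup order can be sketched as follows (the record layout is an assumption based on this message, not the real `_save_job_dir` format):

```python
# Hedged sketch: prefer the job_id stored at submit time; fall back to the
# legacy split for records saved before this fix. The split is exactly what
# breaks when a role name ends with '_': 'exp___role____123' -> '_123'.
def resolve_job_id(record: dict, app_id: str) -> str:
    if "job_id" in record:
        return record["job_id"]
    return app_id.split("___")[-1]
```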

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
After long-running jobs (hours), a transient sacct failure (daemon hiccup,
invoke.UnexpectedExit from non-zero exit code, etc.) would propagate
uncaught through describe() → runner.wait() → wait_and_exit(), killing
the wait loop and reporting EXIT_CODE_TRAINING=1 even though the Slurm
job was still running.

Wrap the sacct call in a try/except and return AppState.UNKNOWN on
failure. UNKNOWN is non-terminal in torchx so polling continues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g force-pushed the ko3n1g/fix/dgxcloud-unknown-state-pending branch from 5987cb6 to 568a242 on March 12, 2026 10:33
ko3n1g added 2 commits March 12, 2026 10:34
Signed-off-by: oliver könig <okoenig@nvidia.com>
```python
            )
        except Exception as e:
            log.warning(f"Failed to query sacct for job {app_id}: {e}. Treating as transient.")
            return DescribeAppResponse(app_id=app_id, state=AppState.UNKNOWN)
```
Collaborator

meaning, this will never be bubbled up. Is that fine?

Contributor Author

Yes, I think this should be fine. It doesn't happen very often that sacct fails to run. But when it does, it's unfortunate to lose the Run session (and have a dangling job). I think this should be safe.

Signed-off-by: oliver könig <okoenig@nvidia.com>
```python
class UnknownStatusError(Exception): ...


class PersistentSacctFailure(Exception): ...
```
Check notice — Code scanning / CodeQL

Statement has no effect (Note)

This statement has no effect.

Copilot Autofix


In general, to fix "statement has no effect" issues, remove or replace expression statements that do not produce side effects with an appropriate construct (e.g., pass for empty blocks, or a proper method/attribute definition if behavior was intended). For empty class bodies, pass is the standard way to indicate that no additional attributes or methods are defined.

Here, the best minimal fix is to replace the ... expression in the body of PersistentSacctFailure with pass, matching common Python style and avoiding any behavior change: the class remains a simple subclass of Exception with no extra logic. Given the consistent pattern in this file, we can also replace the ... bodies of SetValueError and UnknownStatusError with pass for consistency and to prevent the same warning there as well, all within nemo_run/exceptions.py. No new methods, imports, or definitions are required.

Suggested changeset 1
nemo_run/exceptions.py

Autofix patch
Run the following command in your local git repository to apply this patch
```shell
cat << 'EOF' | git apply
diff --git a/nemo_run/exceptions.py b/nemo_run/exceptions.py
--- a/nemo_run/exceptions.py
+++ b/nemo_run/exceptions.py
@@ -14,10 +14,13 @@
 # limitations under the License.
 
 
-class SetValueError(ValueError): ...
+class SetValueError(ValueError):
+    pass
 
 
-class UnknownStatusError(Exception): ...
+class UnknownStatusError(Exception):
+    pass
 
 
-class PersistentSacctFailure(Exception): ...
+class PersistentSacctFailure(Exception):
+    pass
EOF
```
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g added the r0.8.0 Cherry-pick PR to the r0.8.0 release branch label Mar 12, 2026