fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe()#460
Conversation
…e failures

When a job is submitted to DGXCloud, the API may transiently return a non-200 response or an "Unknown" phase before the workload is fully registered. Previously this was mapped to AppState.FAILED, causing wait_and_exit() to treat the job as terminated immediately while the pod was still starting up on the cluster.

- DGXCloudState.UNKNOWN now maps to AppState.PENDING in DGX_STATES
- executor.status() returns None (instead of DGXCloudState.UNKNOWN) on non-200 HTTP responses, so transient API errors don't look like a real "Unknown" phase reported by the scheduler
- describe() fallback for unknown keys in DGX_STATES changed to PENDING
- Tests updated and added to cover all three code paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
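The mapping change can be sketched as follows. The enum members and the DGX_STATES name follow the commit message, but the stand-in classes here are illustrative, not the actual NeMo-Run definitions:

```python
from enum import Enum

# Hypothetical stand-ins for the real torchx/NeMo-Run enums.
class AppState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"

class DGXCloudState(Enum):
    CREATING = "Creating"
    RUNNING = "Running"
    FAILED = "Failed"
    UNKNOWN = "Unknown"

# UNKNOWN now maps to PENDING instead of FAILED, so a transient
# "Unknown" phase no longer terminates wait_and_exit().
DGX_STATES = {
    DGXCloudState.CREATING: AppState.PENDING,
    DGXCloudState.RUNNING: AppState.RUNNING,
    DGXCloudState.FAILED: AppState.FAILED,
    DGXCloudState.UNKNOWN: AppState.PENDING,
}

def to_app_state(phase) -> AppState:
    # describe() falls back to PENDING for unrecognized phases as well.
    return DGX_STATES.get(phase, AppState.PENDING)
```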
The status() method was calling GET /workloads/{job_id} (generic endpoint)
which returns 403 for distributed and training workloads. The correct
endpoints match the create paths: /workloads/distributed/{job_id} for
multi-node jobs and /workloads/trainings/{job_id} for single-node jobs.
This is consistent with how cancel() already uses /workloads/distributed/.
Adds test_status_distributed to verify the correct URL is used for
multi-node executors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
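The endpoint selection described above can be sketched as a small helper. The URL paths come from the commit message; the function name and the is_multi_node flag are illustrative assumptions:

```python
def status_url(base_url: str, job_id: str, is_multi_node: bool) -> str:
    """Pick the status endpoint matching the create path.

    The generic /workloads/{job_id} endpoint returns 403 for
    distributed and training workloads, so the type-specific
    endpoints must be used instead.
    """
    if is_multi_node:
        return f"{base_url}/workloads/distributed/{job_id}"
    return f"{base_url}/workloads/trainings/{job_id}"
```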
The /workloads/distributed/{id} and /workloads/trainings/{id} endpoints
return actualPhase, not phase (which was the field on the generic
/workloads/{id} endpoint). This caused a KeyError crash immediately
after the 403 fix landed.
Now reads actualPhase first, falls back to phase for compatibility,
and returns None (PENDING) if neither field is present.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
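The field fallback can be sketched like this (the helper name is illustrative; the field names and the None-means-PENDING behavior come from the commit message):

```python
from typing import Optional

def extract_phase(payload: dict) -> Optional[str]:
    """Read the workload phase from a status response body.

    The type-specific endpoints report "actualPhase"; the old generic
    /workloads/{id} endpoint used "phase". If neither field is present,
    the workload isn't registered yet and callers treat it as PENDING.
    """
    phase = payload.get("actualPhase")
    if phase is None:
        phase = payload.get("phase")  # backwards compatibility
    return phase
```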
…arsing
When a role name ends with '_', the app_id string looks like:
experiment___role_name____job_id
Splitting on '___' produces job_id = '_job_id' (spurious leading '_'),
causing the status/cancel/log_iter calls to use a wrong ID and get 404.
Fix: _save_job_dir now stores the actual job_id in the JSON record.
describe(), log_iter(), and _cancel_existing() all read job_id from
the stored record, falling back to app_id.split('___')[-1] for
backwards compatibility with existing saved jobs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
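The parsing bug and the record-based fix can be demonstrated in a few lines. The function names here are illustrative; only the '___' separator and the stored job_id record come from the commit message:

```python
def job_id_from_app_id(app_id: str) -> str:
    # Legacy parsing: split on '___'. When the role name itself ends
    # with '_', the separator region has four underscores and the last
    # piece picks up a spurious leading '_'.
    return app_id.split("___")[-1]

def resolve_job_id(record: dict, app_id: str) -> str:
    # Preferred: read the job_id that _save_job_dir stored in the JSON
    # record; fall back to string parsing only for jobs saved before
    # the fix.
    return record.get("job_id") or app_id.split("___")[-1]

# Role name ending in '_' produces a four-underscore separator:
app_id = "experiment___role_name____job_id"
assert job_id_from_app_id(app_id) == "_job_id"          # wrong ID -> 404
assert resolve_job_id({"job_id": "job_id"}, app_id) == "job_id"
```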
After long-running jobs (hours), a transient sacct failure (daemon hiccup, invoke.UnexpectedExit from a non-zero exit code, etc.) would propagate uncaught through describe() → runner.wait() → wait_and_exit(), killing the wait loop and reporting EXIT_CODE_TRAINING=1 even though the Slurm job was still running.

Wrap the sacct call in a try/except and return AppState.UNKNOWN on failure. UNKNOWN is non-terminal in torchx, so polling continues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
5987cb6 to 568a242
…nknown-state-pending
    )
except Exception as e:
    log.warning(f"Failed to query sacct for job {app_id}: {e}. Treating as transient.")
    return DescribeAppResponse(app_id=app_id, state=AppState.UNKNOWN)
meaning, this will never be bubbled up. Is that fine?
yes, i think this should be fine.. it doesn't happen very often that sacct doesn't run. But when it fails, it's unfortunate to lose the Run session (and have a dangling job). I think this should be safe
Signed-off-by: oliver könig <okoenig@nvidia.com>
class UnknownStatusError(Exception): ...


class PersistentSacctFailure(Exception): ...
Check notice (Code scanning / CodeQL): Statement has no effect
Copilot Autofix
In general, to fix "statement has no effect" issues, remove or replace expression statements that do not produce side effects with an appropriate construct (e.g., pass for empty blocks, or a proper method/attribute definition if behavior was intended). For empty class bodies, pass is the standard way to indicate that no additional attributes or methods are defined.
Here, the best minimal fix is to replace the ... expression in the body of PersistentSacctFailure with pass, matching common Python style and avoiding any behavior change: the class remains a simple subclass of Exception with no extra logic. Given the consistent pattern in this file, we can also replace the ... bodies of SetValueError and UnknownStatusError with pass for consistency and to prevent the same warning there as well, all within nemo_run/exceptions.py. No new methods, imports, or definitions are required.
@@ -14,10 +14,13 @@
 # limitations under the License.


-class SetValueError(ValueError): ...
+class SetValueError(ValueError):
+    pass


-class UnknownStatusError(Exception): ...
+class UnknownStatusError(Exception):
+    pass


-class PersistentSacctFailure(Exception): ...
+class PersistentSacctFailure(Exception):
+    pass
When jobs sit in the queue for a really long time, there's a chance that sacct randomly fails. This takes down the whole torchx pipeline and leads to a dangling job in the Slurm queue that we need to kill manually. In case sacct repeatedly fails, we cancel the job before giving up.
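The cancel-before-giving-up behavior can be sketched as a consecutive-failure counter. PersistentSacctFailure appears in the diff above; the function names, injected callables, and the max_failures threshold are illustrative assumptions, not the actual implementation:

```python
class PersistentSacctFailure(Exception):
    pass

def describe_with_retry(run_sacct, cancel_job, app_id, max_failures=3):
    """Tolerate transient sacct failures; on persistent failure,
    cancel the Slurm job so it doesn't dangle, then give up.

    run_sacct and cancel_job are injected callables for testability.
    """
    failures = 0
    while True:
        try:
            return run_sacct(app_id)
        except Exception as e:
            failures += 1
            if failures >= max_failures:
                # Avoid a dangling job in the Slurm queue: cancel it
                # before surfacing the persistent failure.
                cancel_job(app_id)
                raise PersistentSacctFailure(
                    f"sacct failed {failures} times for {app_id}"
                ) from e
```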