🚀[FEA]: Upstream signal handling and graceful termination utilities from GLOBE AirFRANS example to DistributedManager.

### Is this a new feature, an improvement, or a change to existing functionality?

New Feature

### How would you describe the priority of this feature request

Medium

### Please provide a clear description of problem you would like to solve.

The GLOBE AirFRANS example added in https://github.com/NVIDIA/physicsnemo/pull/1401 includes some utilities for intercepting signals and shutting down gracefully during distributed training. @laserkelvin suggested we upstream these, which I think is a great idea. I'm creating this issue to track the work until we can land the PR for it.

Related threads:
https://github.com/NVIDIA/physicsnemo/pull/1401#discussion_r2834581983
https://github.com/NVIDIA/physicsnemo/pull/1401#discussion_r2834728418

Getting this right is probably worth something like an instant +5% to +10% in effective training speed on our internal clusters, across all models that use it. Our internal clusters run sbatch scripts with a 4-hour maximum, so after setup you might be looking at about 3h30m of actual training per job. Being able to write the checkpoint at the very last epoch that fits, rather than somewhere "near the last epoch", is a big improvement.
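For readers who haven't followed the linked threads, here is a rough, single-process sketch of the general pattern being discussed: a signal handler that only sets a flag, and a training loop that checks the flag at epoch boundaries and checkpoints before exiting. This is not the code from the PR and not an existing `DistributedManager` API; `GracefulTerminator`, `train`, `run_epoch`, and `save_checkpoint` are hypothetical names, and the choice of `SIGTERM` plus `SIGUSR1` is an assumption about what the scheduler sends.

```python
import signal
import threading


class GracefulTerminator:
    """Hypothetical helper: records that a termination signal arrived."""

    def __init__(self, signals=(signal.SIGTERM, signal.SIGUSR1)):
        self._stop = threading.Event()
        self._previous = {}
        for sig in signals:
            # signal.signal returns the old handler; keep it for restore().
            self._previous[sig] = signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the training loop decides when it is safe to stop.
        self._stop.set()

    @property
    def should_stop(self):
        return self._stop.is_set()

    def restore(self):
        # Put back whatever handlers were installed before this object.
        for sig, handler in self._previous.items():
            signal.signal(sig, handler)


def train(terminator, max_epochs, run_epoch, save_checkpoint):
    """Run epochs until finished or a signal is seen; return the last completed epoch."""
    last_done = -1
    for epoch in range(max_epochs):
        run_epoch(epoch)
        last_done = epoch
        if terminator.should_stop:
            # Stop at an epoch boundary so the checkpoint is consistent.
            break
    save_checkpoint(last_done)
    return last_done
```

In a real distributed run, every rank would presumably need to agree on the stop decision before breaking out of the loop (for example by reducing the flag across ranks), which this sketch leaves out. On Slurm, the `--signal` option to `sbatch` can be used to request a signal some time before the job's time limit; whether that fits our cluster setup is something to confirm in the PR.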

### Describe any alternatives you have considered

_No response_
