🚀[FEA]: Upstream signal handling and graceful termination utilities from GLOBE AirFRANS example to DistributedManager. #1444

@peterdsharpe

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

The GLOBE AirFRANS example, added as part of #1401, includes some utilities for gracefully intercepting signals and shutting down during distributed training. @laserkelvin suggested we upstream these, which I think is a great idea. I'm creating this issue to track the work until we can land a PR for it.

Related threads:
#1401 (comment)
#1401 (comment)

Getting this right is probably worth something like an instant +5% to +10% in training speed on our internal clusters, across all models that use it. Those clusters run on sbatch scripts with a 4-hour maximum, so after setup you might be looking at about 3h30m of actual training per job. Being able to write your checkpoint at the very last epoch that fits, rather than somewhere "near the last epoch", is a big improvement.
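For anyone reading without the #1401 context, here is a minimal sketch of the general pattern. It is not the code from the GLOBE AirFRANS example, and it is not an existing `DistributedManager` API. `GracefulTerminator`, `train`, `run_epoch`, and `save_checkpoint` are all hypothetical names invented for illustration. The sketch assumes a Unix system and that the scheduler delivers a warning signal such as `SIGTERM` or `SIGUSR1` shortly before the time limit (with Slurm this can be requested through sbatch's `--signal` option).

```python
import signal
from typing import Callable, Iterable


class GracefulTerminator:
    """Hypothetical helper: records that a termination signal has arrived.

    The handler only sets a flag; the training loop decides when it is
    safe to stop, so a checkpoint is never written mid-step.
    """

    def __init__(
        self, signals: Iterable[int] = (signal.SIGTERM, signal.SIGUSR1)
    ) -> None:
        self.signals = tuple(signals)
        self.stop_requested = False
        self._previous = {}

    def _handler(self, signum, frame) -> None:
        # Keep the handler trivial: just remember that we were asked to stop.
        self.stop_requested = True

    def __enter__(self) -> "GracefulTerminator":
        # signal.signal must be called from the main thread.
        for sig in self.signals:
            self._previous[sig] = signal.signal(sig, self._handler)
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Restore whatever handlers were installed before we took over.
        for sig, prev in self._previous.items():
            signal.signal(sig, prev if prev is not None else signal.SIG_DFL)
        self._previous.clear()
        return False


def train(
    num_epochs: int,
    run_epoch: Callable[[int], None],
    save_checkpoint: Callable[[int], None],
    terminator: GracefulTerminator,
) -> int:
    """Run epochs until done or until a signal arrives; return the last epoch run."""
    last_epoch = -1
    for epoch in range(num_epochs):
        run_epoch(epoch)
        last_epoch = epoch
        if terminator.stop_requested:
            # Checkpoint the epoch that just finished, then exit cleanly.
            break
    if last_epoch >= 0:
        save_checkpoint(last_epoch)
    return last_epoch


if __name__ == "__main__":
    with GracefulTerminator() as terminator:
        train(
            num_epochs=3,
            run_epoch=lambda epoch: print(f"epoch {epoch}"),
            save_checkpoint=lambda epoch: print(f"checkpoint after epoch {epoch}"),
            terminator=terminator,
        )
```

In a multi-process run, each rank may see the signal at a slightly different time, so an upstreamed version would presumably need the ranks to agree on the stop decision (for example by reducing the flag across ranks) before any of them leaves the loop. That part is left out here to keep the sketch free of third-party dependencies.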

Describe any alternatives you have considered

No response

Labels

? - Needs Triage (Need team to review and classify), enhancement (New feature or request)
