Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
Medium
Please provide a clear description of problem you would like to solve.
The GLOBE AirFRANS example as part of #1401 adds some utilities for gracefully intercepting signals and shutting down during distributed training. @laserkelvin suggested we upstream these, which I think is a great idea. Creating this issue until we can land the PR for this.
Related threads:
#1401 (comment)
#1401 (comment)
Getting this right is probably something like an instant +5% to +10% training speed on our internal clusters across all models that use this. Our internal clusters run on 4-hour-max sbatch scripts, so after setup you might be looking at 3h30m. Being able to catch the very-last-epoch for your checkpoint rather than "near the last epoch" is a big improvement.
Describe any alternatives you have considered
No response