Rank local checkpointing in DCP internal without collectives #989

Open: saumishr wants to merge 1 commit into base: master

saumishr (Contributor) commented Apr 8, 2025

Summary:

Context

DCP metadata collectives become prohibitively expensive as job scale grows. This PR introduces rank-local checkpointing (XLFormers-style checkpointing), in which each rank saves and loads its own checkpoint without any collective. The trade-off for now is that deduplication and re-sharding are not supported; support for both will be introduced soon.
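
To make the idea concrete, here is a minimal sketch of rank-local save and load, written with the Python standard library only. It is not the implementation in this PR and does not use the DCP API; the function names, the `rank_{rank}.ckpt` file layout, and the use of `pickle` are assumptions chosen for illustration. Each rank touches only its own file, so no rank needs to coordinate with any other.

```python
import os
import pickle
import tempfile
from typing import Any, Dict


def rank_local_path(checkpoint_dir: str, rank: int) -> str:
    # Hypothetical layout: one file per rank, named by rank id.
    return os.path.join(checkpoint_dir, f"rank_{rank}.ckpt")


def save_rank_local(state: Dict[str, Any], checkpoint_dir: str, rank: int) -> str:
    """Write this rank's state to its own file; no collective is involved."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = rank_local_path(checkpoint_dir, rank)
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    # Rename into place so a reader never sees a partially written file.
    os.replace(tmp_path, path)
    return path


def load_rank_local(checkpoint_dir: str, rank: int) -> Dict[str, Any]:
    """Read back exactly what this rank wrote; no metadata exchange."""
    with open(rank_local_path(checkpoint_dir, rank), "rb") as f:
        return pickle.load(f)


def simulate(world_size: int) -> Dict[int, Dict[str, Any]]:
    """Simulate world_size ranks in one process, for illustration only."""
    with tempfile.TemporaryDirectory() as checkpoint_dir:
        for rank in range(world_size):
            # "replicated" is identical on every rank and is written
            # world_size times, because nothing deduplicates it.
            state = {"replicated": [0.5, 0.5], "shard": [rank] * 2}
            save_rank_local(state, checkpoint_dir, rank)
        return {r: load_rank_local(checkpoint_dir, r) for r in range(world_size)}
```

The sketch also shows both trade-offs named above: replicated values are written once per rank because nothing deduplicates them, and a checkpoint can only be loaded by a job with the same rank layout because nothing re-shards it.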

Differential Revision: D72390326

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D72390326

saumishr added a commit to saumishr/tnt that referenced this pull request Apr 16, 2025
…#989)

…#989)

Summary:
Pull Request resolved: pytorch#989
