Skip to content

manager: Add option to skip initial sync #159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 9, 2025
Merged

manager: Add option to skip initial sync #159

merged 6 commits into from
Apr 9, 2025

Conversation

d4l3k
Copy link
Member

@d4l3k d4l3k commented Apr 8, 2025

This is a version of #127 with a few fixes to support init_sync.

Fixes #117

Test plan:

cargo test
pytest torchft/manager_test.py
pytest torchft/manager_integ_test.py

dl541 added 5 commits April 8, 2025 14:28
Summary:
We currently always heal on step 0 to avoid synchronization issues. We want an option to support skipping this sync for users who set the PyTorch seed so all ranks are initialized with the same values.

This diff added a init_sync boolean flag that can be passed from the manager client in python to the manager service in rust. If the manager service skips the sync depending on whether the init_sync is true.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 8, 2025
@d4l3k d4l3k force-pushed the d4l3k/init_sync branch from 4ff4641 to e06c6f2 Compare April 8, 2025 22:00
@d4l3k d4l3k requested review from fduwjj and H-Huang April 8, 2025 22:03
@d4l3k d4l3k merged commit dc1037e into main Apr 9, 2025
8 checks passed
@d4l3k d4l3k deleted the d4l3k/init_sync branch April 9, 2025 01:01
WarrenZhu050413 pushed a commit to guodong/torchft that referenced this pull request Apr 14, 2025
* add option to skip initial sync in Manager

Summary:
We currently always heal on step 0 to avoid synchronization issues. We want an option to support skipping this sync for users who set the PyTorch seed so all ranks are initialized with the same values.

This diff added a init_sync boolean flag that can be passed from the manager client in python to the manager service in rust. If the manager service skips the sync depending on whether the init_sync is true.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Fixed some rust files and tests

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Fix init_sync logic

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Add tests for manager.rs

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Added skip init_sync tests to python client

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* manager: final changes for init_sync

---------

Co-authored-by: Dave Lei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add option to skip initial sync in Manager
4 participants