
Added example training scripts for localsgd, DiLoCo, Live Checkpoint Recovery, and proactive failure detection with DDP (#198) #200


Open: wants to merge 2 commits into base: main


WarrenZhu050413
Contributor

TorchFT Examples

This PR adds a set of examples demonstrating TorchFT's fault tolerance features and distributed training approaches. Each example includes its own README with step-by-step instructions and sample outputs to help users understand these features and incorporate them into their training scripts. All the examples build on train_ddp.py.

This PR grew out of my own experience learning TorchFT's features. At first I found it hard to run anything beyond the provided train_ddp.py, which made it difficult to get a sense of what TorchFT offers.

@d4l3k provided useful feedback on how to structure the examples.

Examples Included:

  1. DDP with Proactive Failure Recovery (examples/ddp_proactive)

    • Demonstrates how to enable proactive detection and response to worker failures
    • Includes detailed explanation of recovery mechanism with annotated logs
    • Shows significant reduction in recovery time compared to timeout-based approaches
  2. DiLoCo (Distributed Low-Communication) (examples/diloco)

    • Implements DiLoCo training methodology
    • Shows how to configure and optimize local convergence parameters
    • Documents performance characteristics and tradeoffs
  3. LocalSGD (examples/localsgd)

    • Demonstrates LocalSGD with periodic synchronization strategy
    • Provides guidance on setting appropriate synchronization frequency
    • Includes performance comparison considerations
  4. Live Checkpoint Recovery (examples/live_checkpoint_recovery)

    • Shows how to implement checkpoint-based recovery for fault tolerance
    • Documents the checkpoint storage and retrieval process
    • Includes recovery time analysis and optimization tips
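To make the difference between items 2 and 3 concrete, here is a minimal, framework-free sketch of the two synchronization schemes on a toy scalar problem. It does not use TorchFT's API and is not taken from the examples in this PR; the function names, the toy loss, and the hyperparameter values are all assumptions chosen for illustration. LocalSGD averages the replicas' parameters every `sync_every` steps, while the DiLoCo-style variant treats the difference between the global parameters and that average as a pseudo-gradient for an outer optimizer (Nesterov momentum is assumed here).

```python
from statistics import mean


def local_steps(w, target, lr, steps):
    """Run `steps` plain SGD steps on the toy loss 0.5 * (w - target) ** 2."""
    for _ in range(steps):
        grad = w - target  # gradient of the toy loss
        w -= lr * grad
    return w


def local_sgd(targets, sync_every, rounds, lr=0.1):
    """LocalSGD sketch: replicas train independently, then average parameters."""
    w_global = 0.0
    for _ in range(rounds):
        # Each replica starts from the shared weights and sees only its own data.
        replicas = [local_steps(w_global, t, lr, sync_every) for t in targets]
        # Periodic synchronization: replace the global weights with the average.
        w_global = mean(replicas)
    return w_global


def diloco(targets, sync_every, rounds, inner_lr=0.1, outer_lr=0.7, momentum=0.9):
    """DiLoCo-style sketch: the averaged change feeds an outer optimizer."""
    w_global, velocity = 0.0, 0.0
    for _ in range(rounds):
        replicas = [local_steps(w_global, t, inner_lr, sync_every) for t in targets]
        # Pseudo-gradient: how far the replicas moved away from the global weights.
        pseudo_grad = w_global - mean(replicas)
        # Outer step with Nesterov momentum (an assumption of this sketch).
        velocity = momentum * velocity + pseudo_grad
        w_global -= outer_lr * (pseudo_grad + momentum * velocity)
    return w_global


if __name__ == "__main__":
    targets = [1.0, 2.0, 6.0]  # one toy "dataset" per replica
    print("LocalSGD:", local_sgd(targets, sync_every=10, rounds=20))
    print("DiLoCo:  ", diloco(targets, sync_every=10, rounds=60))
```

In this sketch, setting `outer_lr=1.0` and `momentum=0.0` makes the DiLoCo-style update collapse to plain parameter averaging, which is one way to see LocalSGD as a special case. A larger `sync_every` means fewer synchronization rounds for the same number of local steps, which is the tradeoff the synchronization-frequency guidance refers to.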

@facebook-github-bot added the CLA Signed label May 22, 2025
Member

@H-Huang left a comment


Wow! This is really awesome, thank you for your contributions. I will take a look more closely soon.

One thing I think would be a good idea is to run these examples as part of CI as well so we know we aren't regressing anything. We might have to create a separate CI workflow for this.

@WarrenZhu050413
Contributor Author

@d4l3k also commented on this. Working on the CI rn!

@WarrenZhu050413 force-pushed the torchft_examples branch 2 times, most recently from 089bcd5 to 478c162 on May 24, 2025 05:47
@WarrenZhu050413
Contributor Author

Added CI.

This is the output when I run examples/test_examples.py locally. The tests are currently CPU only, using torchx.py's default.

For the .yaml file, the environment is set up identically to torchft/.github/workflows/unittest.yaml.

> pytest examples/test_examples.py
======================================== test session starts =========================================
platform linux -- Python 3.11.11, pytest-8.3.5, pluggy-1.5.0
rootdir: /srv/apps/torchft
configfile: pyproject.toml
plugins: typeguard-2.13.3, timeout-2.3.1, anyio-4.9.0
timeout: 60.0s
timeout method: thread
timeout func_only: False
collected 12 items                                                                                   

examples/test_examples.py ............                                                         [100%]

=================================== 12 passed in 99.86s (0:01:39) ====================================
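For readers who want a feel for what an example smoke test can look like, here is a minimal sketch. It is not the contents of examples/test_examples.py; the helper name `run_example` and the stand-in commands are hypothetical, and the 60-second default only mirrors the timeout shown in the log above. The idea is simply to launch a script as a subprocess, bound its runtime, and assert on the exit code.

```python
import subprocess
import sys


def run_example(cmd, timeout=60.0):
    """Run one example command; return (exit code, combined stdout/stderr)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so failures show up in one place
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if the run hangs
    )
    return proc.returncode, proc.stdout


# Hypothetical stand-ins for real example scripts; a real suite would point
# these at the training scripts under examples/ instead.
EXAMPLE_COMMANDS = {
    "passing": [sys.executable, "-c", "print('step 0 ok')"],
    "failing": [sys.executable, "-c", "raise SystemExit(3)"],
}


def test_passing_example_exits_cleanly():
    code, output = run_example(EXAMPLE_COMMANDS["passing"])
    assert code == 0, output


def test_failing_example_is_detected():
    code, _ = run_example(EXAMPLE_COMMANDS["failing"])
    assert code != 0
```

Running each example in its own process keeps a crash or hang in one script from taking down the rest of the test session, at the cost of slower startup per test.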

…Recovery, and proactive failure detection with DDP, along with CI (pytorch#198)
3 participants