Added example training scripts for localsgd, DiLoCo, Live Checkpoint Recovery, and proactive failure detection with DDP (#198) #200


Open: wants to merge 2 commits into base: main
58 changes: 58 additions & 0 deletions .github/workflows/examples.yaml
@@ -0,0 +1,58 @@
name: Examples

on:
  push:
    branches:
      - main
  pull_request:

jobs:
  unittest:
    strategy:
      fail-fast: false
      matrix:
        include:
          - runs-on: "linux.2xlarge"
            gpu-arch-type: "cpu"
            gpu-arch-version: ""
            torch-version: "stable"
          - runs-on: "linux.g5.12xlarge.nvidia.gpu"
            gpu-arch-type: "cuda"
            gpu-arch-version: "12.4"
            torch-version: "stable"
          - runs-on: "linux.g5.12xlarge.nvidia.gpu"
            gpu-arch-type: "cuda"
            gpu-arch-version: "12.4"
            torch-version: "nightly"

    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
    with:
      timeout: 120
      runner: ${{ matrix.runs-on }}
      gpu-arch-type: ${{ matrix.gpu-arch-type }}
      gpu-arch-version: ${{ matrix.gpu-arch-version }}
      script: |
        set -ex

        # install python and protobuf
        conda create -n venv python=3.12 libprotobuf -y
        conda activate venv
        python -m pip install --upgrade pip

        # install recent version of Rust via rustup
        curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain=stable --profile=default -y
        . "$HOME/.cargo/env"

        # Optionally install torch nightly, pulls latest CUDA from pip otherwise
        if [ "${{ matrix.torch-version }}" = "nightly" ]; then
          pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
        fi
        if [ "${{ matrix.torch-version }}" = "test" ]; then
          pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu128
        fi

        # Install dependencies
        pip install -e .[dev] -v

        # Run tests
        pytest examples/test_examples.py
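
The `torch-version` handling in the workflow script amounts to a small lookup from matrix value to an optional pre-release install. The sketch below restates it in Python purely for illustration. The function name `torch_install_command` is hypothetical and not part of this PR; the index URLs are the ones that appear in the workflow.

```python
from typing import List, Optional

# Index URLs copied from the workflow script; nothing else is assumed.
PRERELEASE_INDEX = {
    "nightly": "https://download.pytorch.org/whl/nightly/cu128",
    "test": "https://download.pytorch.org/whl/test/cu128",
}


def torch_install_command(torch_version: str) -> Optional[List[str]]:
    """Return the extra pip command for a matrix torch-version, or None.

    Hypothetical helper: it mirrors the two `if` branches in the script.
    Any other value (for example "stable") installs nothing extra here.
    """
    index_url = PRERELEASE_INDEX.get(torch_version)
    if index_url is None:
        return None
    return [
        "pip", "install", "--pre",
        "torch", "torchvision", "torchaudio",
        "--index-url", index_url,
    ]
```

For `"stable"` the helper returns `None`, which matches the workflow comment: no explicit torch install happens and the later `pip install -e .[dev]` step resolves torch from pip.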
2 changes: 2 additions & 0 deletions Cargo.toml
@@ -21,7 +21,9 @@ slog-stdlog = "4.1.1"
stderrlog = "0.6.0"
structopt = "0.3.26"
tokio = {version = "1.40.0", features = ["full", "test-util", "tracing", "macros", "rt-multi-thread"] }
tokio-stream = {version = "0.1.14", features = ["sync"]}
tonic = "0.12.2"
futures-core = "0.3"

[build-dependencies]
tonic-build = "0.12.2"
13 changes: 6 additions & 7 deletions README.md
@@ -79,15 +79,14 @@ We have a minimal DDP train loop that highlights all of the key components in to

See [train_ddp.py](./train_ddp.py) for more info.

### Advanced Examples

### DiLoCo

LocalSGD and DiLoCo are currently experimental.

See
[the diloco_train_loop/local_sgd_train_loop tests](./torchft/local_sgd_integ_test.py)
for an example on how to integrate these algorithms into your training loop.
See the [examples/README.md](./examples/README.md) for advanced examples. Currently, the following examples are available:

- [DDP with proactive failure recovery](./examples/ddp_proactive/README.md): Demonstrates DDP with proactive failure recovery mode
- [DiLoCo](./examples/diloco/README.md): Demonstrates Distributed Low-Communication (DiLoCo) training
- [LocalSGD](./examples/localsgd/README.md): Demonstrates Local SGD with periodic synchronization
- [Live Checkpoint Recovery](./examples/live_checkpoint_recovery/README.md): Demonstrates live checkpoint recovery

## Design

37 changes: 37 additions & 0 deletions examples/README.md
@@ -0,0 +1,37 @@
# TorchFT Examples

This directory contains advanced examples demonstrating various fault tolerance features and training approaches in TorchFT beyond the basic `train_ddp.py` example in the [README](../README.md).

Each directory contains a README with more detailed instructions, as well as extensive documentation on the feature being showcased and how to interpret the outputs.

## List of Examples

- [DDP with proactive failure recovery](./ddp_proactive/README.md): Demonstrates DDP with proactive failure recovery mode
- [DiLoCo](./diloco/README.md): Demonstrates Distributed Low-Communication (DiLoCo) training
- [LocalSGD](./localsgd/README.md): Demonstrates Local SGD with periodic synchronization
- [Live Checkpoint Recovery](./live_checkpoint_recovery/README.md): Demonstrates live checkpoint recovery

## Running the examples

First, start the lighthouse server:

```sh
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
```

Then `cd` into the example directory:

```sh
cd examples/[example_directory]
```

and then launch the example with TorchX:

```sh
export QUICK_RUN=1
torchx run
```

The `QUICK_RUN` environment variable runs the examples for far fewer steps and uses a synthetic dataset rather than a downloaded one. It is useful for testing the examples quickly.

See the `.torchxconfig` file in each example directory for configuration details, and [torchx.py](../torchft/torchx.py) and the [TorchX documentation](https://pytorch.org/torchx/latest/) to understand how DDP is run.
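
As an illustration of the `QUICK_RUN` behavior described above, a training script could derive its settings from the environment roughly as follows. This is a sketch under stated assumptions, not the code in this PR: `RunSettings`, `settings_from_env`, the step counts, and the rule that any value other than empty or `0` enables quick mode are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RunSettings:
    max_steps: int            # number of training steps to run
    use_synthetic_data: bool  # True: generate data instead of downloading


def settings_from_env(env: Mapping[str, str] = os.environ) -> RunSettings:
    """Pick run settings based on QUICK_RUN.

    Assumption: QUICK_RUN counts as set when it is neither empty nor "0".
    The step counts below are placeholders, not values from the examples.
    """
    quick = env.get("QUICK_RUN", "") not in ("", "0")
    if quick:
        return RunSettings(max_steps=10, use_synthetic_data=True)
    return RunSettings(max_steps=1000, use_synthetic_data=False)
```

Passing the environment in as a mapping keeps the function easy to test without mutating `os.environ`.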
7 changes: 7 additions & 0 deletions examples/ddp_proactive/.torchxconfig
@@ -0,0 +1,7 @@
[cli:run]
component=../../torchft/torchx.py:hsdp
scheduler=local_cwd


[component:../../torchft/torchx.py:hsdp]
script=train_ddp_proactive.py
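
To see what a bare `torchx run` would pick up from this file, the config can be read with Python's standard `configparser`. The sketch below is only an illustration of the file's structure: `read_run_defaults` is a hypothetical helper, and TorchX's own loading logic may differ in detail.

```python
import configparser
from typing import Dict

# Contents of examples/ddp_proactive/.torchxconfig as shown in the diff.
TORCHXCONFIG = """\
[cli:run]
component=../../torchft/torchx.py:hsdp
scheduler=local_cwd

[component:../../torchft/torchx.py:hsdp]
script=train_ddp_proactive.py
"""


def read_run_defaults(text: str) -> Dict[str, str]:
    """Resolve the component, scheduler, and script named in the config.

    Only "=" is treated as a key/value delimiter, because the component
    path itself contains ":".
    """
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(text)
    component = parser["cli:run"]["component"]
    scheduler = parser["cli:run"]["scheduler"]
    # Component arguments live in a section named after the component.
    script = parser[f"component:{component}"]["script"]
    return {"component": component, "scheduler": scheduler, "script": script}


defaults = read_run_defaults(TORCHXCONFIG)
```

Read this way, the file selects the `hsdp` component from `torchft/torchx.py`, the `local_cwd` scheduler, and `train_ddp_proactive.py` as the script to launch.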