pytorch · WarrenZhu050413 · May 19, 2025 · May 22, 2025
diff --git a/.github/workflows/examples.yaml b/.github/workflows/examples.yaml
@@ -0,0 +1,58 @@
+name: Examples
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+
+jobs:
+  unittest:
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - runs-on: "linux.2xlarge"
+            gpu-arch-type: "cpu"
+            gpu-arch-version: ""
+            torch-version: "stable"
+          - runs-on: "linux.g5.12xlarge.nvidia.gpu"
+            gpu-arch-type: "cuda"
+            gpu-arch-version: "12.4"
+            torch-version: "stable"
+          - runs-on: "linux.g5.12xlarge.nvidia.gpu"
+            gpu-arch-type: "cuda"
+            gpu-arch-version: "12.4"
+            torch-version: "nightly"
+
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
+    with:
+      timeout: 120
+      runner: ${{ matrix.runs-on }}
+      gpu-arch-type: ${{ matrix.gpu-arch-type }}
+      gpu-arch-version: ${{ matrix.gpu-arch-version }}
+      script: |
+        set -ex
+
+        # install python and protobuf
+        conda create -n venv python=3.12 libprotobuf -y
+        conda activate venv
+        python -m pip install --upgrade pip
+
+        # install recent version of Rust via rustup
+        curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain=stable --profile=default -y
+        . "$HOME/.cargo/env"
+
+        # Optionally install a torch nightly or test build; otherwise the dependency install below pulls torch (with its default CUDA build) from pip
+        if [ "${{ matrix.torch-version }}" = "nightly" ]; then
+          pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
+        fi
+        if [ "${{ matrix.torch-version }}" = "test" ]; then
+          pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu128
+        fi
+
+        # Install dependencies
+        pip install -e .[dev] -v
+
+        # Run tests
+        pytest examples/test_examples.py
diff --git a/Cargo.toml b/Cargo.toml
@@ -21,7 +21,9 @@ slog-stdlog = "4.1.1"
 stderrlog = "0.6.0"
 structopt = "0.3.26"
 tokio = {version = "1.40.0", features = ["full", "test-util", "tracing", "macros", "rt-multi-thread"] }
+tokio-stream = {version = "0.1.14", features = ["sync"]}
 tonic = "0.12.2"
+futures-core = "0.3"
 
 [build-dependencies]
 tonic-build = "0.12.2"

diff --git a/README.md b/README.md
@@ -79,15 +79,14 @@ We have a minimal DDP train loop that highlights all of the key components in to
 
 See [train_ddp.py](./train_ddp.py) for more info.
 
+### Advanced Examples
 
-### DiLoCo
-
-LocalSGD and DiLoCo are currently experimental.
-
-See
-[the diloco_train_loop/local_sgd_train_loop tests](./torchft/local_sgd_integ_test.py)
-for an example on how to integrate these algorithms into your training loop.
+See [examples/README.md](./examples/README.md) for advanced examples. The following examples are currently available:
 
+- [DDP with proactive failure recovery](./examples/ddp_proactive/README.md): Demonstrates DDP with proactive failure recovery mode
+- [DiLoCo](./examples/diloco/README.md): Demonstrates Distributed Low-Communication (DiLoCo) training
+- [LocalSGD](./examples/localsgd/README.md): Demonstrates Local SGD with periodic synchronization
+- [Live Checkpoint Recovery](./examples/live_checkpoint_recovery/README.md): Demonstrates live checkpoint recovery
 
 ## Design
 

diff --git a/examples/README.md b/examples/README.md
@@ -0,0 +1,37 @@
+# TorchFT Examples
+
+This directory contains advanced examples demonstrating various fault tolerance features and training approaches in TorchFT beyond the basic `train_ddp.py` example in the [README](../README.md).
+
+Each directory contains a README with more detailed instructions, as well as extensive documentation on the feature being showcased and how to interpret the outputs.
+
+## List of Examples
+
+- [DDP with proactive failure recovery](./ddp_proactive/README.md): Demonstrates DDP with proactive failure recovery mode
+- [DiLoCo](./diloco/README.md): Demonstrates Distributed Low-Communication (DiLoCo) training
+- [LocalSGD](./localsgd/README.md): Demonstrates Local SGD with periodic synchronization
+- [Live Checkpoint Recovery](./live_checkpoint_recovery/README.md): Demonstrates live checkpoint recovery
+
+## Running the examples
+
+First, start the lighthouse server:
+
+```sh
+RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
+```
+
+You can `cd` into the example directory:
+
+```sh
+cd examples/[example_directory]
+```
+
+and then launch the example with TorchX:
+
+```sh
+export QUICK_RUN=1
+torchx run
+```
+
+The `QUICK_RUN` environment variable runs the examples for far fewer steps and uses a synthetic dataset rather than a downloaded one. It is useful for testing the examples quickly.
+
+See the `.torchxconfig` file in each example directory for configuration details, and see [torchx.py](../torchft/torchx.py) and the [TorchX documentation](https://pytorch.org/torchx/latest/) to understand how DDP is run.
diff --git a/examples/ddp_proactive/.torchxconfig b/examples/ddp_proactive/.torchxconfig
@@ -0,0 +1,7 @@
+[cli:run]
+component=../../torchft/torchx.py:hsdp
+scheduler=local_cwd
+
+
+[component:../../torchft/torchx.py:hsdp]
+script=train_ddp_proactive.py