
Commit 6b542e9 (parent: 1dfa40c)

KEP-2170: Add PyTorch DDP MNIST training example

Signed-off-by: Antonin Stefanutti <[email protected]>

File tree: 2 files changed, +415 -0 lines changed

Lines changed: 88 additions & 0 deletions

# PyTorch DDP MNIST Training Example

This example demonstrates how to train a deep learning model to classify images
of handwritten digits on the [MNIST](https://yann.lecun.com/exdb/mnist/) dataset
using [PyTorch DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
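
Under the hood, the script relies on the usual DDP training pattern: each worker joins a
process group, wraps the model in `DistributedDataParallel`, and shards the dataset with a
`DistributedSampler`. A minimal sketch of that pattern (illustrative only; the names and
hyperparameters below are not the actual `mnist.py` code):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def train(model: torch.nn.Module, dataset, epochs: int):
    # The launcher (e.g. torchrun) sets RANK, WORLD_SIZE and LOCAL_RANK.
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    device = torch.device(f"cuda:{os.environ['LOCAL_RANK']}" if use_cuda else "cpu")

    # DistributedSampler gives each rank a distinct shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=100, sampler=sampler)

    # DDP averages gradients across all workers on every backward pass.
    ddp_model = DistributedDataParallel(model.to(device))
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for data, target in loader:
            optimizer.zero_grad()
            output = ddp_model(data.to(device))
            loss = torch.nn.functional.nll_loss(output, target.to(device))
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()
```
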
## Setup

Install the Kubeflow training v2 control plane on your Kubernetes cluster,
if it's not already deployed:

```console
kubectl apply --server-side -k "https://github.com/kubeflow/training-operator.git/manifests/v2/overlays/standalone?ref=master"
```
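
You can check that the control plane is up before continuing. This assumes the standalone
overlay installs the operator into the `kubeflow-system` namespace; adjust if yours differs:

```console
kubectl get pods -n kubeflow-system
```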

Set up the Python environment on your local machine or client:

```console
python -m venv .venv
source .venv/bin/activate
pip install git+https://github.com/kubeflow/training-operator.git@master#subdirectory=sdk_v2
pip install torch
```
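
A quick way to verify the environment is to import the SDK client and PyTorch
(this assumes the v2 SDK keeps the `kubeflow.training` package name):

```console
python -c "from kubeflow.training import TrainingClient; import torch; print(torch.__version__)"
```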

You can refer to the [training operator documentation](https://www.kubeflow.org/docs/components/training/installation/)
for more information.

## Usage

```console
python mnist.py --help
usage: mnist.py [-h] [--batch-size N] [--test-batch-size N] [--epochs N] [--lr LR] [--lr-gamma G] [--lr-period P] [--seed S] [--log-interval N] [--save-model]
                [--backend {gloo,nccl}] [--num-workers N] [--worker-resources RESOURCE QUANTITY] [--runtime NAME]

PyTorch DDP MNIST Training Example

options:
  -h, --help            show this help message and exit
  --batch-size N        input batch size for training [100]
  --test-batch-size N   input batch size for testing [100]
  --epochs N            number of epochs to train [10]
  --lr LR               learning rate [1e-1]
  --lr-gamma G          learning rate decay factor [0.5]
  --lr-period P         learning rate decay period in step size [20]
  --seed S              random seed [0]
  --log-interval N      how many batches to wait before logging training metrics [10]
  --save-model          saving the trained model [False]
  --backend {gloo,nccl}
                        Distributed backend [NCCL]
  --num-workers N       Number of workers [1]
  --worker-resources RESOURCE QUANTITY
                        Resources per worker [cpu: 1, memory: 2Gi, nvidia.com/gpu: 1]
  --runtime NAME        the training runtime [torch-distributed]
```
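
Note that `--worker-resources` takes a resource name and a quantity, and can be repeated
once per resource. One way to parse such repeated pairs with argparse is sketched below
(hypothetical, not the script's actual code):

```python
import argparse

parser = argparse.ArgumentParser(description="PyTorch DDP MNIST Training Example")
# Each occurrence of --worker-resources contributes one (RESOURCE, QUANTITY) pair.
parser.add_argument(
    "--worker-resources",
    nargs=2,
    action="append",
    metavar=("RESOURCE", "QUANTITY"),
    help="Resources per worker",
)
args = parser.parse_args()

# Fall back to the documented defaults, then collapse the pairs into the
# mapping expected for Kubernetes resource requests.
pairs = args.worker_resources or [["cpu", "1"], ["memory", "2Gi"], ["nvidia.com/gpu", "1"]]
resources = dict(pairs)  # e.g. {"cpu": "1", "memory": "2Gi", "nvidia.com/gpu": "1"}
```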

## Example

Train the model on 8 worker nodes using 1 NVIDIA GPU each:

```console
python mnist.py \
    --num-workers 8 \
    --worker-resources "nvidia.com/gpu" 1 \
    --worker-resources cpu 1 \
    --worker-resources memory 4Gi \
    --epochs 50 \
    --lr-period 20 \
    --lr-gamma 0.8
```
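
The script submits the job through the training v2 SDK. Assuming the v2 control plane
registers the `TrainJob` custom resource introduced by KEP-2170, you can follow the run
with kubectl:

```console
kubectl get trainjobs
kubectl get pods
```

Each worker runs as its own pod, so `kubectl logs <pod-name>` shows that worker's local metrics.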

At the end of each epoch, local metrics are printed in each worker's logs, and the global
metrics are gathered and printed in the rank 0 worker's logs.
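
Aggregating metrics across ranks is typically done with a `torch.distributed` collective
such as `all_reduce`. A sketch of how per-worker sums can be combined into the global
figures (illustrative, not the actual `mnist.py` code):

```python
import torch
import torch.distributed as dist


def print_global_metrics(loss_sum: float, correct: int, count: int, device):
    # Pack the local partial sums into one tensor and sum it across all ranks.
    totals = torch.tensor([loss_sum, correct, count], dtype=torch.float64, device=device)
    dist.all_reduce(totals, op=dist.ReduceOp.SUM)
    loss_sum, correct, count = totals.tolist()

    # Every rank now holds the global sums; only rank 0 prints them.
    if dist.get_rank() == 0:
        print("Global metrics:")
        print(f"- Loss: {loss_sum / count:.6f}")
        print(f"- Accuracy: {int(correct)}/{int(count)} ({100 * correct / count:.2f}%)")
```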

When the training completes, you should see the following at the end of the rank 0 worker's logs:

```text
--------------- Epoch 50 Evaluation ---------------

Local rank 0:
- Loss: 0.0003
- Accuracy: 1242/1250 (99%)

Global metrics:
- Loss: 0.000279
- Accuracy: 9918/10000 (99.18%)

---------------------------------------------------
```
