
Commit c953498

KEP-2170: Add PyTorch DDP MNIST training example
Signed-off-by: Antonin Stefanutti <[email protected]>

2 files changed: +414, -0 lines changed

Lines changed: 87 additions & 0 deletions

# PyTorch DDP MNIST Training Example

This example demonstrates how to train a deep learning model to classify images
of handwritten digits on the [MNIST](https://yann.lecun.com/exdb/mnist/) dataset
using [PyTorch DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).

## Setup

Install the training operator v2 on your Kubernetes cluster:

```console
kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/v2/overlays/standalone?ref=master"
```
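
Before moving on, you can check that the operator deployment is running and that its CRDs are registered. The `kubeflow-system` namespace below is an assumption and may differ depending on the manifests you applied:

```console
kubectl get pods -n kubeflow-system
kubectl get crd | grep kubeflow.org
```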

Set up the Python environment on your local machine or client:

```console
python -m venv .venv
source .venv/bin/activate
pip install git+https://github.com/kubeflow/training-operator.git@master#subdirectory=sdk_v2
pip install torch
```
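
As a quick sanity check that both packages were installed into the virtual environment (the exact distribution name of the SDK may vary):

```console
python -c "import torch; print(torch.__version__)"
pip list | grep -i -e torch -e kubeflow
```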

You can refer to the [training operator documentation](https://www.kubeflow.org/docs/components/training/installation/)
for more information.

## Usage

```console
python mnist.py --help
usage: mnist.py [-h] [--batch-size N] [--test-batch-size N] [--epochs N] [--lr LR] [--lr-gamma G] [--lr-period P] [--seed S] [--log-interval N] [--save-model]
                [--backend {gloo,nccl}] [--num-workers N] [--worker-resources RESOURCE QUANTITY] [--runtime NAME]

PyTorch DDP MNIST Training Example

options:
  -h, --help            show this help message and exit
  --batch-size N        input batch size for training [100]
  --test-batch-size N   input batch size for testing [100]
  --epochs N            number of epochs to train [10]
  --lr LR               learning rate [1e-1]
  --lr-gamma G          learning rate decay factor [0.5]
  --lr-period P         learning rate decay period in step size [20]
  --seed S              random seed [0]
  --log-interval N      how many batches to wait before logging training metrics [10]
  --save-model          saving the trained model [False]
  --backend {gloo,nccl}
                        Distributed backend [NCCL]
  --num-workers N       Number of workers [1]
  --worker-resources RESOURCE QUANTITY
                        Resources per worker [cpu: 1, memory: 2Gi, nvidia.com/gpu: 1]
  --runtime NAME        the training runtime [torch-distributed]
```
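
The `--backend`, `--lr`, `--lr-gamma` and `--lr-period` options map onto standard PyTorch DDP and scheduler APIs. The following is a minimal, hypothetical sketch of that wiring, not the actual `mnist.py` (it also assumes `torchvision` is available on the workers for the MNIST dataset):

```python
# Hypothetical sketch only: illustrates the DDP plumbing behind the CLI
# options above; the real mnist.py may be structured differently.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms


def train(backend="nccl", lr=0.1, lr_gamma=0.5, lr_period=20, batch_size=100, epochs=10):
    # torchrun (used by the torch-distributed runtime) sets RANK, WORLD_SIZE
    # and LOCAL_RANK in every worker's environment.
    dist.init_process_group(backend=backend)
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}" if backend == "nccl" else "cpu")

    # DistributedSampler gives each rank a distinct shard of the training set.
    dataset = datasets.MNIST("./data", train=True, download=True,
                             transform=transforms.ToTensor())
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    model = torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(28 * 28, 128),
        torch.nn.ReLU(),
        torch.nn.Linear(128, 10),
    ).to(device)
    # DDP synchronizes gradients across all workers on every backward pass.
    model = DDP(model, device_ids=[local_rank] if backend == "nccl" else None)

    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    # StepLR decays the learning rate by --lr-gamma every --lr-period epochs.
    scheduler = StepLR(optimizer, step_size=lr_period, gamma=lr_gamma)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for data, target in loader:
            optimizer.zero_grad()
            output = model(data.to(device))
            loss = torch.nn.functional.cross_entropy(output, target.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()

    dist.destroy_process_group()
```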

## Example

Train the model on 8 worker nodes using 1 NVIDIA GPU each:

```console
python mnist.py \
    --num-workers 8 \
    --worker-resources "nvidia.com/gpu" 1 \
    --worker-resources cpu 1 \
    --worker-resources memory 4Gi \
    --epochs 50 \
    --lr-period 20 \
    --lr-gamma 0.8
```
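
The script presumably submits a TrainJob (the v2 API introduced by KEP-2170) to the cluster. Assuming the `trainjobs` resource is registered, you can watch the job and tail the rank 0 worker's logs; the pod name depends on the generated job name, so list the pods first:

```console
kubectl get trainjobs
kubectl get pods
kubectl logs -f <rank-0-worker-pod>
```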

At the end of each epoch, local metrics are printed in each worker's logs, and the global metrics
are gathered and printed in the rank 0 worker's logs.
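
Below is a hypothetical sketch of how such global metrics can be gathered, using `torch.distributed.all_reduce` to sum the per-worker counters; the actual aggregation in `mnist.py` may differ:

```python
# Hypothetical sketch only: aggregates per-worker test metrics into the
# global metrics reported by rank 0; mnist.py may do this differently.
import torch
import torch.distributed as dist


def evaluate(model, test_loader, device):
    model.eval()
    loss_sum = torch.zeros(1, device=device)
    correct = torch.zeros(1, device=device)
    total = torch.zeros(1, device=device)
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss_sum += torch.nn.functional.cross_entropy(output, target, reduction="sum")
            correct += (output.argmax(dim=1) == target).sum()
            total += target.size(0)

    # Local metrics: each worker only evaluated its own shard of the test set.
    rank = dist.get_rank()
    print(f"Local rank {rank}: loss {(loss_sum / total).item():.4f}, "
          f"accuracy {int(correct)}/{int(total)}")

    # Sum the counters across all workers, then report on rank 0 only.
    for counter in (loss_sum, correct, total):
        dist.all_reduce(counter, op=dist.ReduceOp.SUM)
    if rank == 0:
        print(f"Global: loss {(loss_sum / total).item():.6f}, "
              f"accuracy {int(correct)}/{int(total)} "
              f"({100.0 * correct.item() / total.item():.2f}%)")
```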

When the training completes, you should see the following at the end of the rank 0 worker's logs:

```text
--------------- Epoch 50 Evaluation ---------------

Local rank 0:
- Loss: 0.0003
- Accuracy: 1242/1250 (99%)

Global metrics:
- Loss: 0.000279
- Accuracy: 9918/10000 (99.18%)

---------------------------------------------------
```
