Commit b018de1

Update domino example (#976)
* remove files
* Update domino example
* apply review suggestions

Signed-off-by: Hongwei Chen <[email protected]>
1 parent 207c93c commit b018de1

File tree

170 files changed, +1290 -35968 lines changed


.gitmodules

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+[submodule "training/DeepSpeed-Domino/Megatron-LM"]
+	path = training/DeepSpeed-Domino/Megatron-LM
+	url = [email protected]:NVIDIA/Megatron-LM.git
Submodule Megatron-LM added at 375395c
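Because this commit adds Megatron-LM as a git submodule, a fresh clone of the repository needs the submodule fetched before the Domino scripts can find it. A minimal sketch using standard git commands (nothing here is specific to this repository):

```shell
# After cloning the examples repository, fetch the newly added
# Megatron-LM submodule (and any nested submodules it declares).
git submodule update --init --recursive
```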

training/DeepSpeed-Domino/README.md

Lines changed: 52 additions & 47 deletions
@@ -1,81 +1,86 @@
-# Domino Example
+# Running Tensor Parallel Training with Domino
 
-## Install Dependency Libraries
+This example demonstrates how to use Domino for tensor parallel training of large language models such as GPT-3. The setup has been validated on:
+
+- NVIDIA H200 GPUs, using the Docker image `nvcr.io/nvidia/pytorch:24.12-py3`
+
+- AMD MI300 GPUs, using the Docker image `rocm/pytorch:rocm6.3.4_ubuntu22.04_py3.10_pytorch_release_2.4.0`
+
+You can pull these Docker images with the following commands:
+
+```
+docker pull nvcr.io/nvidia/pytorch:24.12-py3
+
+docker pull rocm/pytorch:rocm6.3.4_ubuntu22.04_py3.10_pytorch_release_2.4.0
+```
+
+## Install Dependencies
 ```
 pip install -r requirements.txt
 ```
 
 ## Prepare the Dataset
 Follow the instructions from [Megatron-DeepSpeed](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing#download-and-pre-process-training-dataset) to prepare the training dataset.
 
-## Execute Domino Training
+## Launch Training with Domino
 
-To start training, adjust the following parameters in the script as needed:
+Adjust the following parameters in the script as needed:
 
 - **GPUS_PER_NODE**: Number of GPUs per node.
-- **CHECKPOINT_PATH**: Path to the checkpoint, if applicable.
 - **VOCAB_FILE**, **MERGE_FILE**, **DATA_PATH**: Paths to the dataset files.
 - **--micro-batch-size**: Batch size per GPU.
 
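As an illustration, the tunables listed in the diff above might be set near the top of a launch script as follows. This is a hypothetical sketch: every path below is a placeholder for your own files, not something shipped with the example.

```shell
# Hypothetical configuration block; replace each path with your own files.
GPUS_PER_NODE=8                          # GPUs per node
VOCAB_FILE=/data/gpt2-vocab.json         # vocabulary from dataset preparation
MERGE_FILE=/data/gpt2-merges.txt         # BPE merge rules
DATA_PATH=/data/my-gpt2_text_document    # preprocessed training corpus
MICRO_BATCH_SIZE=2                       # value passed to --micro-batch-size
echo "gpus=$GPUS_PER_NODE micro_batch=$MICRO_BATCH_SIZE"
```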
-### Available Models and Scripts
+### Supported Models and Scripts
 
 | Model      | Script                  |
 |------------|-------------------------|
-| GPT-3 2.7B | `pretrain_gpt3_2.7b.sh` |
 | GPT-3 6.7B | `pretrain_gpt3_6.7b.sh` |
-| LLaMA 7B   | `pretrain_llama_7b.sh`  |
-| LLaMA 13B  | `pretrain_llama_13b.sh` |
+| GPT-3 13B  | `pretrain_gpt3_13b.sh`  |
 
 ### Example
 
-To train the GPT-3 2.7B model, run the following command:
+To train the GPT-3 13B model, run the following command:
 
 ```bash
-bash pretrain_gpt3_2.7b.sh
+bash pretrain_gpt3_13b.sh
 ```
 
-The output should look like this:
+Sample output during training:
 
 ```
-training ...
-iteration: 1 | loss: 11.318 | iteration time (ms): 2174.0469932556152
-iteration: 2 | loss: 11.307 | iteration time (ms): 1414.4024848937988
-iteration: 3 | loss: 11.323 | iteration time (ms): 1385.9455585479736
-iteration: 4 | loss: 11.310 | iteration time (ms): 1475.5175113677979
-iteration: 5 | loss: 11.306 | iteration time (ms): 1395.7207202911377
-iteration: 6 | loss: 11.315 | iteration time (ms): 1392.2104835510254
-iteration: 7 | loss: 11.314 | iteration time (ms): 1402.6703834533691
-iteration: 8 | loss: 11.309 | iteration time (ms): 1450.613260269165
-iteration: 9 | loss: 11.305 | iteration time (ms): 1473.1688499450684
-iteration: 10 | loss: 11.320 | iteration time (ms): 1398.4534740447998
-[2024-11-04 15:32:30,918] [INFO] [launch.py:351:main] Process 73015 exits successfully.
-[2024-11-04 15:32:30,918] [INFO] [launch.py:351:main] Process 73017 exits successfully.
-[2024-11-04 15:32:30,919] [INFO] [launch.py:351:main] Process 73014 exits successfully.
-[2024-11-04 15:32:30,919] [INFO] [launch.py:351:main] Process 73016 exits successfully.
+...
+iteration: 30 | loss: 10.120 | iteration time (ms): 528.60
+iteration: 31 | loss: 9.984 | iteration time (ms): 527.02
+iteration: 32 | loss: 9.751 | iteration time (ms): 521.55
+iteration: 33 | loss: 9.496 | iteration time (ms): 525.22
+iteration: 34 | loss: 9.510 | iteration time (ms): 523.22
+iteration: 35 | loss: 9.551 | iteration time (ms): 527.20
+iteration: 36 | loss: 9.549 | iteration time (ms): 525.23
+iteration: 37 | loss: 9.204 | iteration time (ms): 527.17
+iteration: 38 | loss: 9.215 | iteration time (ms): 524.86
+iteration: 39 | loss: 9.091 | iteration time (ms): 525.64
+iteration: 40 | loss: 8.950 | iteration time (ms): 523.91
+iteration: 41 | loss: 8.773 | iteration time (ms): 527.28
+iteration: 42 | loss: 8.867 | iteration time (ms): 523.56
+iteration: 43 | loss: 8.705 | iteration time (ms): 524.88
+iteration: 44 | loss: 8.815 | iteration time (ms): 523.07
+iteration: 45 | loss: 8.655 | iteration time (ms): 525.73
+iteration: 46 | loss: 8.740 | iteration time (ms): 525.80
+iteration: 47 | loss: 8.821 | iteration time (ms): 523.97
+iteration: 48 | loss: 8.625 | iteration time (ms): 524.56
+iteration: 49 | loss: 8.520 | iteration time (ms): 524.56
+iteration: 50 | loss: 8.488 | iteration time (ms): 521.91
+...
 ```
+### Running on AMD GPUs
+
+To run on AMD hardware, comment out lines 144–162 of `initialize.py` in the Megatron submodule. These lines attempt to locate the `nvcc` compiler, which is not available in AMD environments. This change does not affect performance, since fused kernels are not loaded from this location in current implementations.
 
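The AMD-specific edit above can be scripted rather than done by hand. A hedged sketch using `sed`: the file path assumes the default submodule layout, and the line range should be verified against your checkout before running, since line numbers shift between Megatron-LM revisions.

```shell
# Prefix lines 144-162 of initialize.py with '# ' to comment them out.
# Both the path and the line range are assumptions; confirm them first.
sed -i '144,162s/^/# /' Megatron-LM/megatron/initialize.py
```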
-## Advanced Usage
-You can compile Pytorch and Apex from source for better performance.
 
-### Compile PyTorch from Source
-Compile PyTorch from source could enable JIT script.
-```
-git clone -b v2.1.0 https://github.com/pytorch/pytorch.git
-git submodule sync
-git submodule update --init --recursive
-conda install cmake ninja
-pip install -r requirements.txt
-conda install intel::mkl-static intel::mkl-include
-conda install -c pytorch magma-cuda121 # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo
-export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
-python setup.py develop
-
-# Build torchvision
-git clone https://github.com/pytorch/vision.git
-python setup.py develop
-```
 
-## Build Apex
+## Build Apex from Source
 ```
 git clone https://github.com/NVIDIA/apex
 cd apex
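The diff view is truncated after `cd apex`. For reference only: the upstream NVIDIA/apex README documents a pip-based source build along these lines; check the README of the Apex revision you actually clone, since the flags have changed across versions.

```shell
# Source build with C++ and CUDA extensions, as documented upstream.
# Run inside the cloned apex directory; requires a CUDA toolchain.
pip install -v --disable-pip-version-check --no-cache-dir \
    --no-build-isolation \
    --config-settings "--build-option=--cpp_ext" \
    --config-settings "--build-option=--cuda_ext" ./
```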

training/DeepSpeed-Domino/domino/__init__.py

Whitespace-only changes.
