# Running Tensor Parallel Training with Domino

This example demonstrates how to use Domino for tensor parallel training with large language models such as GPT-3. The setup has been validated on:

- NVIDIA H200 GPUs using the Docker image: `nvcr.io/nvidia/pytorch:24.12-py3`
- AMD MI300 GPUs using the Docker image: `rocm/pytorch:rocm6.3.4_ubuntu22.04_py3.10_pytorch_release_2.4.0`

You can pull these Docker images with the following commands:

```
docker pull nvcr.io/nvidia/pytorch:24.12-py3
docker pull rocm/pytorch:rocm6.3.4_ubuntu22.04_py3.10_pytorch_release_2.4.0
```
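
If you are not already working inside one of these containers, a typical way to launch one is sketched below. This is only an example: `--gpus all` assumes an NVIDIA host with the NVIDIA Container Toolkit installed, the AMD variant exposes the ROCm devices instead, and the mounted path is arbitrary.

```
# NVIDIA: expose all GPUs and mount the current directory (example path) into the container
docker run --gpus all -it --rm -v "$PWD":/workspace nvcr.io/nvidia/pytorch:24.12-py3

# AMD: expose the ROCm devices instead of using --gpus
docker run -it --rm --device=/dev/kfd --device=/dev/dri -v "$PWD":/workspace \
    rocm/pytorch:rocm6.3.4_ubuntu22.04_py3.10_pytorch_release_2.4.0
```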

## Install Dependencies

```
pip install -r requirements.txt
```

## Prepare the Dataset
Follow the instructions from [Megatron-DeepSpeed](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing#download-and-pre-process-training-dataset) to prepare the training dataset.
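
For orientation, data preparation in that guide amounts to downloading a raw text corpus and tokenizing it with Megatron's `tools/preprocess_data.py`. The sketch below is illustrative only: the corpus file, output prefix, and vocab/merge paths are placeholders, and the authoritative flags and download URLs are the ones in the linked instructions.

```
# Illustrative sketch only -- follow the linked Megatron-DeepSpeed guide for the exact commands.
# Tokenizes a JSONL corpus into the binary .bin/.idx format the training scripts expect.
python tools/preprocess_data.py \
    --input my_corpus.jsonl \
    --output-prefix my-gpt2 \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod \
    --workers 8
```

The resulting prefix (for example `my-gpt2_text_document`) is what you later point **DATA_PATH** at in the training script.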

## Launch Training with Domino

Adjust the following parameters in the script as needed (illustrative values are sketched after the list below):

- **GPUS_PER_NODE**: Number of GPUs per node.
- **VOCAB_FILE**, **MERGE_FILE**, **DATA_PATH**: Paths to the dataset files.
- **--micro-batch-size**: Batch size per GPU.
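
For reference, these settings live near the top of each pretraining script as ordinary shell variables and launcher arguments. The values below are placeholders, not recommendations; substitute the paths produced during dataset preparation and sizes that fit your hardware.

```
# Illustrative values only -- edit the actual pretrain_*.sh script to match your setup.
GPUS_PER_NODE=8                          # GPUs per node
VOCAB_FILE=/data/gpt2-vocab.json         # tokenizer vocabulary
MERGE_FILE=/data/gpt2-merges.txt         # BPE merges
DATA_PATH=/data/my-gpt2_text_document    # prefix produced by preprocess_data.py

# and, among the arguments passed to the training launcher:
#   --micro-batch-size 4
```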

### Supported Models and Scripts

| Model      | Script                  |
|------------|-------------------------|
| GPT-3 6.7B | `pretrain_gpt3_6.7b.sh` |
| GPT-3 13B  | `pretrain_gpt3_13b.sh`  |

### Example

To train the GPT-3 13B model, run the following command:

```bash
bash pretrain_gpt3_13b.sh
```

Sample output during training:

```
...
iteration: 30 | loss: 10.120 | iteration time (ms): 528.60
iteration: 31 | loss: 9.984 | iteration time (ms): 527.02
iteration: 32 | loss: 9.751 | iteration time (ms): 521.55
iteration: 33 | loss: 9.496 | iteration time (ms): 525.22
iteration: 34 | loss: 9.510 | iteration time (ms): 523.22
iteration: 35 | loss: 9.551 | iteration time (ms): 527.20
iteration: 36 | loss: 9.549 | iteration time (ms): 525.23
iteration: 37 | loss: 9.204 | iteration time (ms): 527.17
iteration: 38 | loss: 9.215 | iteration time (ms): 524.86
iteration: 39 | loss: 9.091 | iteration time (ms): 525.64
iteration: 40 | loss: 8.950 | iteration time (ms): 523.91
iteration: 41 | loss: 8.773 | iteration time (ms): 527.28
iteration: 42 | loss: 8.867 | iteration time (ms): 523.56
iteration: 43 | loss: 8.705 | iteration time (ms): 524.88
iteration: 44 | loss: 8.815 | iteration time (ms): 523.07
iteration: 45 | loss: 8.655 | iteration time (ms): 525.73
iteration: 46 | loss: 8.740 | iteration time (ms): 525.80
iteration: 47 | loss: 8.821 | iteration time (ms): 523.97
iteration: 48 | loss: 8.625 | iteration time (ms): 524.56
iteration: 49 | loss: 8.520 | iteration time (ms): 524.56
iteration: 50 | loss: 8.488 | iteration time (ms): 521.91
...
```

### Running on AMD GPUs

To run on AMD hardware, you must comment out lines 144–162 in the `initialize.py` file within the Megatron submodule. These lines attempt to locate the `nvcc` compiler, which is not available in AMD environments. This change does not impact performance, as fused kernels are not loaded from this location in current implementations.
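
One convenient way to apply this is with `sed`, which simply prefixes each of those lines with a comment marker. The file path below is an assumption about where the Megatron submodule keeps `initialize.py` in your checkout, and the 144-162 range should be verified against the version you actually have.

```
# Comments out lines 144-162 and keeps a backup as initialize.py.bak.
# The path is an assumption -- adjust it to wherever initialize.py sits in your Megatron submodule.
sed -i.bak '144,162 s/^/# /' megatron/initialize.py
```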
## Build Apex from Source

```
git clone https://github.com/NVIDIA/apex
cd apex