Generative language model pretrained on Inspur's Yuan dataset; codebase for the ASC22 supercomputing competition.
To simplify experiments on different distributed training frameworks, we decoupled the training code into `config`, `data`, `model` and `trainer` modules.
The idea of this decoupling is inspired by pytorch-lightning; however, we decoupled the code even further to make it more flexible when integrating with other frameworks.
We put all hyperparameters and configurations into the `config` module for better tracing and logging.
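As a rough illustration, such a config can be a single dataclass that gathers every hyperparameter in one place; the field names below are placeholders rather than the repository's actual schema.

```python
# Illustrative config sketch; field names are assumptions, not the repo's actual schema.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # model hyperparameters
    vocab_size: int = 50257
    hidden_size: int = 1024
    num_layers: int = 24
    # optimization hyperparameters
    lr: float = 1e-4
    batch_size: int = 8
    max_steps: int = 100_000
```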
We directly use pytorch-lightning's `LightningDataModule`, since its interface is well-designed and easy to use.
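A minimal sketch of such a data module is shown below; the class name, the random placeholder data and the batch/sequence sizes are illustrative only.

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

class ToyLMDataModule(pl.LightningDataModule):
    """Minimal LightningDataModule sketch; real code would load and tokenize the Yuan corpus."""

    def __init__(self, batch_size=8, seq_len=128, vocab_size=50257):
        super().__init__()
        self.batch_size = batch_size
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def setup(self, stage=None):
        # random token ids stand in for the real pretraining data
        tokens = torch.randint(0, self.vocab_size, (1024, self.seq_len))
        self.train_set = TensorDataset(tokens)
        self.val_set = TensorDataset(tokens[:128])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size)
```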
Since most distributed training frameworks need to wrap the model before or after model initialization, and pytorch-lightning's `LightningModule` has already exposed some problems when integrating multiple frameworks simultaneously, we decided to further decouple this module into a `BaseModel` class.
`BaseModel` directly inherits from `nn.Module`, which is compatible with most distributed training frameworks. All implementations of the language model are derived from `BaseModel` and maintain only the model config, the model structure, the forward method, the loss function and the optimizer.
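The sketch below shows what such a base class and a derived native-PyTorch model could look like, assuming the config fields from the earlier sketch; the method names `loss` and `configure_optimizer` are assumptions, not the repository's actual API.

```python
import torch
from torch import nn

class BaseModel(nn.Module):
    """Sketch: keeps only the config, the model structure, forward, the loss and the optimizer."""

    def __init__(self, config):
        super().__init__()
        self.config = config

    def forward(self, input_ids):
        raise NotImplementedError

    def loss(self, logits, labels):
        # standard next-token cross entropy
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
        )

    def configure_optimizer(self):
        return torch.optim.AdamW(self.parameters(), lr=self.config.lr)


class NativeGPT(BaseModel):
    """Hypothetical derived model written in native pytorch."""

    def __init__(self, config):
        super().__init__(config)
        self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
        layer = nn.TransformerEncoderLayer(config.hidden_size, nhead=16, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=config.num_layers)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, input_ids):
        return self.lm_head(self.blocks(self.embed(input_ids)))
```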
Currently, implemented models include:
- native model: written in native PyTorch
- huggingface model: written with HuggingFace's `transformers`
Everything else, such as model initialization, training, validation and testing, goes into the `trainer` module. All training preparation and iterations are done here.
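In spirit, a trainer ties the other three modules together. The following is only a schematic, framework-free sketch using the hypothetical names introduced above, not one of the actual trainer implementations.

```python
class NaiveTrainer:
    """Schematic trainer sketch; real trainers wrap a specific distributed framework."""

    def __init__(self, config, model, data_module):
        self.config = config
        self.model = model
        self.data = data_module

    def setup(self):
        # framework-specific preparation would go here (process groups, model wrapping, ...)
        self.optimizer = self.model.configure_optimizer()
        self.data.setup()

    def fit(self):
        self.setup()
        self.model.train()
        for (batch,) in self.data.train_dataloader():
            logits = self.model(batch[:, :-1])
            loss = self.model.loss(logits, batch[:, 1:])
            loss.backward()
            self.optimizer.step()
            self.optimizer.zero_grad()
```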
Currently, implemented trainers include:
- PyTorch Lightning trainer: distributed training with pytorch-lightning, with DeepSpeed integration provided by the Lightning team
- PatrickStar trainer
Below are examples of how to launch the training job on different distributed frameworks.
`num_nodes` must be set to the number of GPUs across all nodes; otherwise it will use the number of GPUs on the master node.
```bash
# DDP (pytorch-lightning)
torchrun --nnodes=2 --nproc_per_node=2 --master_addr GPU04 --master_port 9001 --node_rank 1 train.ddp_pl.py
```
```bash
# DeepSpeed (pytorch-lightning)
OMP_NUM_THREADS=32 torchrun --nnodes=2 --nproc_per_node=2 --master_addr GPU04 --master_port 9001 --node_rank 1 train.ds_pl.py
```
Note that setting `OMP_NUM_THREADS` is a must when offload is used, since the optimizer then runs on the CPU.
```bash
# Horovod (pytorch-lightning)
horovodrun -np 2 python train.hvd_pl.py
```
We still prefer to use `torchrun`.
```bash
# PatrickStar
torchrun --nnodes=1 --nproc_per_node=2 train.pstar.py
```
```bash
# Colossal-AI
GLOO_SOCKET_IFNAME=ibs5 OMP_NUM_THREADS=32 torchrun --master_addr="172.25.2.105" --master_port=29500 --nnodes=2 --node_rank=1 --nproc_per_node=2 train.col_ai.py --config=trainer/colossal_ai/strategy.py
```
```bash
# Profile the DeepSpeed run with Nsight Systems
OMP_NUM_THREADS=32 nsys profile -o cpu_adam torchrun --nnodes=2 --nproc_per_node=2 --master_addr GPU04 --master_port 9001 --node_rank 0 train.ds_pl.py

# Profile the Colossal-AI run with GPU, NIC and CUDA memory metrics
OMP_NUM_THREADS=32 nsys profile --gpu-metrics-device=all --gpuctxsw=true --nic-metrics=true --cuda-memory-usage=true --cudabacktrace=all torchrun --nnodes=2 --nproc_per_node=2 train.col_ai.py --config=trainer/colossal_ai/strategy.py
```
```bash
# Development container with GPU and InfiniBand access
docker run -it --name pytorch --gpus all --privileged --cap-add=SYS_ADMIN --ipc=host --network=host --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/infiniband -v $(pwd):/workspace registry.cn-hangzhou.aliyuncs.com/ncj/pytorch bash
```
Check the `Dockerfile` for details.