
Large-batch Training of Language Models

The original README of Megatron-LM is README_old.md.

Setup

Dataset

EC2

Download from S3 (check the M*EKS Tutorial for the setup).

# Wikipedia preprocessed for Megatron-LM. model: 4-layer BERT, T5
aws --profile gluonnlp s3 cp s3://mstar-eks-dev-us-east-2/annnxu/my-bert_text_sentence.bin ./
aws --profile gluonnlp s3 cp s3://mstar-eks-dev-us-east-2/annnxu/my-bert_text_sentence.idx ./
# Wikipedia + BookCorpus preprocessed for Megatron-LM. model: BERT large
aws --profile gluonnlp s3 cp s3://mstar-eks-dev-us-east-2/annnxu/bert_text_sentence.bin ./
aws --profile gluonnlp s3 cp s3://mstar-eks-dev-us-east-2/annnxu/bert_text_sentence.idx ./
# jsonl of BookCorpus before preprocess
aws --profile gluonnlp s3 cp s3://mstar-eks-dev-us-east-2/annnxu/bookcorpus.jsonl ./
# logs, plus read.py and plot.py for plotting figures. ${id} is one of [2,3,4,5,6,78,9,10,11,12];
# '78' combines weeks 7 and 8 (I was at ICML those two weeks, so their logs are stored together).
aws --profile gluonnlp s3 cp s3://mstar-eks-dev-us-east-2/annnxu/logs_week${id} ./logs_week${id}/ --recursive

The Wikipedia dataset is downloaded and preprocessed following the Megatron-LM README_old.md. BookCorpus is downloaded from the web, concatenated with Wikipedia, and then preprocessed with Megatron-LM in the same way.
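For reference, the preprocessing step in README_old.md is roughly the following sketch; the input file name is a placeholder, and the output prefix "my-bert" is what produces the my-bert_text_sentence.{bin,idx} pair above, so check README_old.md for the exact flags.

# Rough sketch of the Megatron-LM preprocessing command from README_old.md.
# "wikipedia.jsonl" is a placeholder input; verify the flags against README_old.md.
python tools/preprocess_data.py \
       --input wikipedia.jsonl \
       --output-prefix my-bert \
       --vocab bert-large-uncased-vocab.txt \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences \
       --workers 16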

Download vocabulary.

wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt

Remember to move all of the files downloaded above into ~/data on each EC2 instance.
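For example (the file names are the ones downloaded above; adjust if you skipped any):

# Move the downloaded data and vocabulary files into ~/data.
mkdir -p ~/data
mv my-bert_text_sentence.{bin,idx} bert_text_sentence.{bin,idx} bookcorpus.jsonl \
   bert-large-uncased-vocab.txt bert-large-cased-vocab.txt ~/data/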

EKS

For EKS, specify the data path in the yaml file.

Environment

EC2

I followed the MIST Intern Onboarding Guide to create the EC2 instances.

On each EC2 instance, create a conda environment named "p37".

conda create -n p37 python=3.7 -y
conda activate p37
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y
conda install regex ninja nltk pybind11 -y

Install Apex.

cd ~
git clone https://github.com/anxuthu/apex.git
cd ~/apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
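The original setup does not include a verification step, but a quick sanity check that PyTorch sees the GPUs and that the Apex CUDA extensions built, including FusedLAMB (used by the "lamb" scripts below), is:

# Quick sanity check: prints the PyTorch version and CUDA availability, and fails if FusedLAMB cannot be imported.
python -c "import torch; from apex.optimizers import FusedLAMB; print(torch.__version__, torch.cuda.is_available())"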

Download the code from my Weekly Progress, unzip it, and move it to ~/.

EKS

Check the M*EKS Tutorial for the setup; Slack @zhenghuj with any questions about EKS.

I have already uploaded the Docker image (for BERT large), so it can be specified directly as 747303060528.dkr.ecr.us-east-2.amazonaws.com/mstar-eks:annnxu in the yaml file submitted to the EKS cluster.

To create a new Docker image, I use an EC2 instance to run the following commands after downloading the code (check the M*EKS Tutorial, "Build with your customized docker image", for the prerequisite steps).

cd ~/megatron
sudo chmod 666 /var/run/docker.sock
DOCKER_BUILDKIT=1 docker build --no-cache -t mstar-eks -f Dockerfile .
docker tag mstar-eks:latest 747303060528.dkr.ecr.us-east-2.amazonaws.com/mstar-eks:annnxu # replace "annnxu"
docker push 747303060528.dkr.ecr.us-east-2.amazonaws.com/mstar-eks:annnxu # upload; replace "annnxu"
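If the push is rejected with an authentication error, the standard ECR login (most likely already covered by the M*EKS Tutorial's prerequisite steps) is:

# Authenticate Docker against the ECR registry used above.
aws --profile gluonnlp ecr get-login-password --region us-east-2 | \
    docker login --username AWS --password-stdin 747303060528.dkr.ecr.us-east-2.amazonaws.com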

Node Configuration

For 4-layer BERT and T5, I use EC2 g4dn.12xlarge instances, each with 4 GPUs. The Amazon Machine Image (AMI) is "Deep Learning AMI (Ubuntu 18.04) Version 60.4".

For 4-layer BERT and T5 with tensor parallelism = 8 (larger than the 4 GPUs per g4dn.12xlarge), I use EC2 g4dn.metal instances, each with 8 GPUs.

Note: for distributed training with EC2 instances,

  • first run the scripts locally with NNODES=1 and NODE_RANK=0 to create the index map, then set NNODES and NODE_RANK on each instance according to the distributed setting (see the launch sketch below).
  • make sure micro_batch_size x #GPUs <= global_batch_size.
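The scripts are expected to follow the standard Megatron-LM distributed launch pattern; below is a minimal sketch of what each instance runs. The real arguments live in the scripts themselves, so treat every value here as illustrative only.

# Illustrative launch sketch; the actual arguments are set inside ./bert4_scripts*/xxxx.sh.
GPUS_PER_NODE=4          # 4 on g4dn.12xlarge, 8 on g4dn.metal
NNODES=4                 # total number of instances
NODE_RANK=0              # unique per instance, 0..NNODES-1
MASTER_ADDR=10.0.0.1     # private IP of the rank-0 instance (placeholder)
MASTER_PORT=6000

python -m torch.distributed.launch \
       --nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK \
       --master_addr $MASTER_ADDR --master_port $MASTER_PORT \
       pretrain_bert.py \
       --micro-batch-size 32 --global-batch-size 512 ...   # 32 x 16 GPUs = 512 <= 512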

For BERT large (24 layers) pre-training, I use the EKS cluster.

4-layer BERT

Shorter Training Steps

Check ./bert4_scripts, where "lamb" denotes FusedLAMB from Apex, "mylamb" denotes my PyTorch implementation of LAMB, and "mylamb2" denotes our first proposed method (layer-wise noise). I use one g4dn.12xlarge per experiment, which should take 3-4 hours. Just run

./bert4_scripts/xxxx.sh

Longer Training Steps

Check ./bert4_scripts2, where "mylamb3" denotes our method that increases the learning rate for the embedding weights. I use 4 g4dn.12xlarge nodes per experiment, which should take about 2 hours. Run

#lr=0.01 for B=512, 1k, 2k; lr=0.01 * (2 ** 0.5) for B=4k; lr=0.02 for B=8k, 16k.
#for mylamb3, set "--alpha 1.0"
./bert4_scripts2/xxxx.sh $MASTER_ADDR $NNODES $NODE_RANK $lr
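The batch-size-to-learning-rate mapping in the comment above can also be scripted; here is a small helper sketch (not part of the repo, it only encodes the values listed in the comment):

# Hypothetical helper: pick the learning rate for a given global batch size B, then launch.
case $B in
  512|1024|2048)  lr=0.01 ;;
  4096)           lr=$(python -c "print(0.01 * 2 ** 0.5)") ;;
  8192|16384)     lr=0.02 ;;
  *)              echo "unknown batch size: $B" >&2; exit 1 ;;
esac
./bert4_scripts2/xxxx.sh $MASTER_ADDR $NNODES $NODE_RANK $lr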

For the tensor parallel = 2 and 4 experiments, the training time scales inversely with the degree of data parallelism, so it takes about 4 and 8 hours respectively with 4 g4dn.12xlarge nodes. Run

./bert4_scripts2/xxxx_tp.sh $MASTER_ADDR $NNODES $NODE_RANK $TENSOR_PARALLELISM

For the tensor parallel = 8 experiments, remember to set "GPUS_PER_NODE=8" instead. I use 8 g4dn.metal nodes, and it takes about 3-4 hours. Run the same script as above:

./bert4_scripts2/xxxx_tp.sh $MASTER_ADDR $NNODES $NODE_RANK $TENSOR_PARALLELISM

BERT large (24 layers)

Check ./bert24_yaml. First, set up the cluster:

mstarx --profile gluonnlp config --cluster mstar-eks --region us-east-2 # cluster us-east-2

Cluster usage can be found in CloudWatch -> Dashboards -> mstar-eks, and job DAGs can be found in Airflow. The output is written to /mnt_out/annnxu/.

Submit the job to EKS via

mstarx --profile gluonnlp submit -f bert24_yaml/xxxx.yaml

Each experiment should take about 2 days with 8 p4 nodes. Remember to set node_num in the yaml file.

For the tensor parallelism experiments, add the "--tensor-model-parallel-size" argument with a value of 1, 2, 4, or 8 after "pretrain_bert.py" in the yaml file. Tensor parallelism = 4 should take about 8 days.
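For concreteness, the command in the yaml would then start roughly like this (only the flag comes from the text above; the remaining arguments are elided):

# Illustrative fragment of the command in the yaml file; other arguments elided.
python pretrain_bert.py --tensor-model-parallel-size 4 ...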

For 1-B BERT (85 layers), check ./bert85_yaml. Each experiment should take about 12 hours with 4 p4 nodes.

T5 small (6 layers)

Check ./t5_scripts. I use 4 g4dn.12xlarge nodes for each experiment, which should take about 12 hours. Run

./t5_scripts/xxxx.sh $MASTER_ADDR $NNODES $NODE_RANK

For the tensor parallel = 2 and 4 experiments, the training time is about 12 and 24 hours respectively with 8 g4dn.12xlarge nodes. Run

./t5_scripts/xxxx_tp.sh $MASTER_ADDR $NNODES $NODE_RANK $TENSOR_PARALLELISM

For the tensor parallel = 8 experiments, remember to set "GPUS_PER_NODE=8" instead. I use 16 g4dn.metal nodes for T5, and it takes about 11 hours. Run the same script as above:

./t5_scripts/xxxx_tp.sh $MASTER_ADDR $NNODES $NODE_RANK $TENSOR_PARALLELISM

1-B BERT (85 layers)

Check ./bert85_yaml.

Submit the job to EKS via

mstarx --profile gluonnlp submit -f bert85_yaml/xxxx.yaml

Each experiment should take about 12 hours with 4 p4 nodes. Remember to set node_num in the yaml file.
