Diff with nvidia main #84

Draft pull request: wants to merge 177 commits into base nvidia_main
Commits (177)
04a72ed
add iteration argument to load_checkpoint
RaymondLi0 Mar 1, 2022
22005a1
fix load_checkpoint when starting from scratch
RaymondLi0 Mar 10, 2022
e51c3c5
reset consumed_train_samples when finetune=True
RaymondLi0 Apr 19, 2022
884e8b8
add wandb reporting
RaymondLi0 Apr 22, 2022
ad1f261
add validation loss wandb reporting
RaymondLi0 Apr 25, 2022
cc54e57
skip wandb if not provided
RaymondLi0 Apr 26, 2022
97954ac
when finetuning, load optimizer state
RaymondLi0 May 12, 2022
81c72ab
do not load lr scheduler in finetuning
RaymondLi0 May 13, 2022
03c1aa0
add torchrun support
RaymondLi0 May 17, 2022
00cfaad
reload weights into optimizer after loading model weights
RaymondLi0 May 17, 2022
ea9ec20
add --finetune-from argument, so there is no need to modify argument…
RaymondLi0 May 18, 2022
fdc697f
make wandb not a requirement
RaymondLi0 Jun 1, 2022
e52d91c
Merge branch 'main' of github.com:NVIDIA/Megatron-LM into NVIDIA-main
RaymondLi0 Jul 5, 2022
f45d856
Merge branch 'NVIDIA-main' into load-iter
RaymondLi0 Jul 5, 2022
2f751a5
Merge branch 'load-iter' of https://github.com/ElementAI/Megatron-LM …
RaymondLi0 Jul 5, 2022
8036392
add tokens_per_epoch print
RaymondLi0 Jul 13, 2022
652625b
add: initial port of alibi from bigscience
bigximik Jul 22, 2022
9734afd
chg: change back import for this version of Megatron
bigximik Jul 26, 2022
f49506b
chg: direct import of MixedFusedLayerNorm
bigximik Jul 26, 2022
d3ce018
chg: enums moved to megatron root
bigximik Jul 26, 2022
73c7130
fix: commented logging functionality which is not implemented yet
bigximik Jul 26, 2022
b09a8d1
chg: refactor for moved enums
bigximik Jul 26, 2022
4e07321
add: port support for positional embedding param from bigscience
bigximik Jul 27, 2022
4ccf237
add: port from bigscience for pos embedding and glu activations args
bigximik Jul 27, 2022
4064969
add multi-query attention logic in attention module
RaymondLi0 Aug 8, 2022
190e328
add kv weight gradient reduction in tensor-parallel group
RaymondLi0 Aug 9, 2022
6fd0c29
more efficient multiquery attention
RaymondLi0 Aug 10, 2022
254ff4b
raise if trying to use multi-query cross-attention
RaymondLi0 Sep 2, 2022
eaf6174
remove expand_key_value parameter since CoreAttention for multi-query…
RaymondLi0 Sep 2, 2022
1513137
remove most timers
RaymondLi0 Sep 2, 2022
b4d6017
chg: move enums back to model
bigximik Sep 7, 2022
691226a
fix: breaking circular import
bigximik Sep 7, 2022
d4ba492
allow to load old checkpoints
RaymondLi0 Sep 7, 2022
5101e5a
Merge pull request #5 from bigcode-project/alibi
RaymondLi0 Sep 7, 2022
d63c4b6
Merge branch 'load-iter' into multi-query-attention
RaymondLi0 Sep 7, 2022
5045d6f
resolve conflict
RaymondLi0 Sep 7, 2022
2117058
implement alibi in multiquery core-attention
RaymondLi0 Sep 7, 2022
16b2b1a
Merge branch 'multi-query-attention' into main
RaymondLi0 Oct 13, 2022
f8547ff
Merge pull request #1 from bigcode-project/main
RaymondLi0 Oct 13, 2022
e8d47a9
add necessary fixes for toolkit-infiniband-example
RaymondLi0 Oct 17, 2022
a360666
add FIM code from EleutherAI, some comments and todo
RaymondLi0 Nov 2, 2022
4390812
add a tokenizer-type for FIM
RaymondLi0 Nov 3, 2022
5ab8702
add spm+psm variants
RaymondLi0 Nov 3, 2022
1f85184
also permute the segment after last eod token, fix permute boundaries
RaymondLi0 Nov 4, 2022
1290e49
fix data type in permutation
RaymondLi0 Nov 5, 2022
a5161d7
truncate or pad after all segments are joined back
RaymondLi0 Nov 5, 2022
641af1d
some cleanup
RaymondLi0 Nov 5, 2022
66e61e7
add preprocessing of HF datasets directly
RaymondLi0 Nov 7, 2022
a79988a
modify max seq-length from 2048 to 8192
RaymondLi0 Nov 8, 2022
db3809b
add missing cases in fused kernels
RaymondLi0 Nov 14, 2022
acda627
add longer sequence lengths in fused kernels test
RaymondLi0 Nov 14, 2022
d59c85b
larger MAX_TOKENS_TO_OOM
RaymondLi0 Nov 14, 2022
7b0cee2
use custom barrier with device_ids
Nov 18, 2022
93cb6a0
add HF tokenizer
Nov 22, 2022
9f2c442
add special tokens in HF tokenizer
RaymondLi0 Nov 22, 2022
8169dec
Merge pull request #9 from bigcode-project/fim
RaymondLi0 Nov 22, 2022
9fe3bcb
fix vocab_size in _HFTokenizer
RaymondLi0 Nov 22, 2022
6982c4e
fix: initialize tokenizer with TokenizerFromFile
Nov 22, 2022
0348b3a
Merge branch 'preprocess-hf' of github.com:bigcode-project/Megatron-L…
Nov 22, 2022
4f060a2
fix: add special_tokens dict for FIM
Nov 22, 2022
332e8db
load attention-head-type from checkpoint
Nov 23, 2022
0717dab
attention-head-type defaults to None instead
Nov 23, 2022
96daa55
use detokenize method in text_generation
Nov 24, 2022
2d36c14
add mqa conversion to huggingface
RaymondLi0 Dec 2, 2022
760eed9
remove config and tokenizer save
RaymondLi0 Dec 2, 2022
baa7b3b
add Readme
RaymondLi0 Dec 2, 2022
2ceaf70
add some documentation
RaymondLi0 Dec 2, 2022
66beabe
add push to hub logic
Dec 2, 2022
de83476
add docs
Dec 2, 2022
1b7c96f
convert_checkpoint as function, push starting from last pushed iteration
RaymondLi0 Dec 2, 2022
5cb878f
add iter_interval argument
RaymondLi0 Dec 2, 2022
ab1c4cc
use relative imports in modeling file
RaymondLi0 Dec 8, 2022
93461dc
Fixes for MQA (#12)
jlamypoirier Dec 9, 2022
63c6fbc
Run with toolkit
jlamypoirier Dec 10, 2022
92e2ca2
Redirect output to logs
jlamypoirier Dec 13, 2022
e2c2c2b
Remove store
jlamypoirier Dec 13, 2022
60fbd1d
update readme
RaymondLi0 Dec 15, 2022
6c4bf90
Merge pull request #13 from bigcode-project/run_with_toolkit
RaymondLi0 Jan 24, 2023
732396a
remove debug prints
Jan 24, 2023
9d80f8a
more precise error for attention_type/head_type values
Jan 24, 2023
7457e32
attention-head-type defaults to multihead again to avoid breaking pre…
RaymondLi0 Jan 24, 2023
cdbcfc9
documentation on the --tokenizer-file argument
RaymondLi0 Jan 30, 2023
94306d1
add missing newlines
RaymondLi0 Jan 30, 2023
506fbd4
revert barrier() to torch.distributed.barrier()
RaymondLi0 Jan 30, 2023
58884e0
Merge pull request #10 from bigcode-project/preprocess-hf
RaymondLi0 Feb 7, 2023
d47f623
Remove hf transformers tools
jlamypoirier Feb 8, 2023
c446836
add santacoder example script
RaymondLi0 Mar 6, 2023
afd079a
update arguments in example script
RaymondLi0 Mar 7, 2023
a16826e
add multi-validation for gpt training
RaymondLi0 Mar 9, 2023
c73ff5c
add subset argument to preprocessing
Mar 9, 2023
1d7768a
add valid-num-workers argument
RaymondLi0 Mar 9, 2023
3032821
change fim special tokens to use underscore
RaymondLi0 Mar 10, 2023
4978321
log tflops
RaymondLi0 Mar 10, 2023
3a6286b
make assert less strict for very small datasets (typically when one e…
RaymondLi0 Mar 10, 2023
a950409
fix fim for new tokenizer
RaymondLi0 Mar 10, 2023
294ef35
fix fim
RaymondLi0 Mar 10, 2023
9ce9611
more explicit error when trying to create empty splits
RaymondLi0 Mar 10, 2023
654d0d8
add multi-validation for gpt training (#32)
RaymondLi0 Mar 21, 2023
042a091
Take MQA into account in flops formula, fix glu-activation factor
RaymondLi0 Mar 21, 2023
e280d3a
Merge branch 'multi-query-attention' into log-tflops
RaymondLi0 Mar 21, 2023
b18ecf6
adjust formula in comments
RaymondLi0 Mar 21, 2023
b0f3cfb
Merge branch 'multi-query-attention' into remove_hf_transformers
jlamypoirier Mar 21, 2023
e969456
Merge pull request #26 from bigcode-project/remove_hf_transformers
RaymondLi0 Mar 21, 2023
659295a
Kv grad allreduce v2 (#39)
jlamypoirier Mar 21, 2023
bd12802
support mqa in checkpoint-merging tools
RaymondLi0 Mar 22, 2023
8b38744
Merge pull request #33 from bigcode-project/log-tflops
RaymondLi0 Mar 22, 2023
7d5154f
add flash-attn
RaymondLi0 Mar 22, 2023
118f0a8
flash-attn: assert that alibi is not used
RaymondLi0 Mar 23, 2023
d50a89b
fix import
RaymondLi0 Mar 23, 2023
61fe86d
update readme
RaymondLi0 Mar 23, 2023
f5019c8
raise if using flash-attn with selective recomputation, swap if/else
RaymondLi0 Mar 24, 2023
0ff5746
change back to warning
RaymondLi0 Mar 24, 2023
e0b644b
Merge pull request #41 from bigcode-project/flash-attention
RaymondLi0 Mar 24, 2023
36d0435
add token/s/gpu to wandb
Mar 29, 2023
b691302
fix distributed optimizer
Mar 29, 2023
c41f2b1
Merge pull request #43 from bigcode-project/tokens-per-second-gpu
lvwerra Mar 29, 2023
86ba4c0
Merge pull request #44 from bigcode-project/fix-dist-opt
lvwerra Mar 29, 2023
a8e64f6
support checkpoints with distrib optimizer in checkpoint-util
RaymondLi0 Apr 3, 2023
57f21b7
don't load optimizer instead of arbitrarily loading dp-rank 0
RaymondLi0 Apr 3, 2023
22b8611
add bigcode model slurm script
Apr 14, 2023
1a7d54b
Merge pull request #40 from bigcode-project/mqa-checkpoint-utils
RaymondLi0 May 8, 2023
c988cf2
Update slurm script
loubnabnl May 12, 2023
0048491
Finetune StarCoder Megatron
lvwerra May 16, 2023
893aaa5
assert Flash Attention doesn't get arbitrary mask
mayank31398 May 22, 2023
d06e737
fix dtypes for new numpy versions
mayank31398 May 22, 2023
0e2415a
fused layer norm
mayank31398 May 22, 2023
041b733
move cuda kernels
mayank31398 May 22, 2023
28780a7
add rocm
mayank31398 May 22, 2023
beaf2f2
Merge pull request #53 from mayank31398/error-reset
RaymondLi0 May 23, 2023
5432115
Merge branch 'multi-query-attention' into ontocord
mayank31398 May 26, 2023
22de429
Add tokens-per-second-per-gpu to the printed logs instead of just wan…
loubnabnl May 26, 2023
6a77fd0
fix
mayank31398 May 26, 2023
9008fbe
fused
mayank31398 May 26, 2023
23cf759
add missing get_batch_per_block
mayank31398 May 31, 2023
f20d10a
increase sequence length to 8k
mayank31398 May 31, 2023
cc965d9
don't use Apex kernels
mayank31398 Jun 2, 2023
e9a7e7e
8192 upper
mayank31398 Jun 2, 2023
250ab29
8192 upper
mayank31398 Jun 2, 2023
e8c74d5
drop useless script
mayank31398 Jun 2, 2023
648f32f
Merge branch 'main' of github.com:NVIDIA/Megatron-LM into NVIDIA-main
RaymondLi0 Jun 3, 2023
4a33f29
fused kernel import
mayank31398 Jun 3, 2023
21045b5
drop use_kernels_from_apex
mayank31398 Jun 3, 2023
b4efd14
Merge pull request #55 from mayank31398/ontocord
RaymondLi0 Jun 5, 2023
8354f89
Merge branch 'multi-query-attention' into NVIDIA-main
RaymondLi0 Jun 5, 2023
972f301
remove unused kernels
RaymondLi0 Jun 5, 2023
b291323
Create finetune_starcoderplus.slurm
loubnabnl Jun 12, 2023
203b071
try with LayerNorm import from megatron.model
Jun 12, 2023
48c8046
fix the merge
Jun 12, 2023
ac497ce
move setting of TORCH_CUDA_ARCH_LIST
Jun 13, 2023
04031a8
fix call to blendable dataset
Jun 14, 2023
17217f8
fix blended dataset size in dataset groups
RaymondLi0 Jun 14, 2023
39a75ee
Merge pull request #52 from bigcode-project/finetune-starcoder
RaymondLi0 Jun 16, 2023
0229a69
find_checkpoint_rank_0 returns a single value
RaymondLi0 Jun 19, 2023
37353b1
fix checkpoint merge tools
RaymondLi0 Jun 19, 2023
3dbd929
remove --finetune-from argument to make checkpoint loading logic simpler
RaymondLi0 Jun 19, 2023
3e22c9f
Merge pull request #58 from bigcode-project/NVIDIA-main
RaymondLi0 Jun 19, 2023
1397ac0
Merge branch 'multi-query-attention' into loubnabnl-patch-1
RaymondLi0 Jun 19, 2023
8c1889e
Merge pull request #54 from bigcode-project/loubnabnl-patch-1
RaymondLi0 Jun 19, 2023
8196de1
Skip unnecessary compilation
jlamypoirier Jun 21, 2023
2223891
Merge pull request #65 from bigcode-project/skip_compile
RaymondLi0 Jun 21, 2023
5b06c12
Create pretrain_bigcode_7b.slurm
loubnabnl Jun 28, 2023
513d00d
Create pretrain_bigcode_1b.slurm
loubnabnl Jul 3, 2023
5a9c239
Create pretrain_bigcode_3b.slurm
loubnabnl Jul 3, 2023
a993f05
outputs not matching non-flash case in MQA
mayank31398 Jul 18, 2023
1809fc1
Merge branch 'multi-query-attention' into mqa
mayank31398 Jul 18, 2023
c82a5a1
Merge pull request #71 from mayank31398/mqa
RaymondLi0 Jul 19, 2023
ebea9f2
convert reshape to view (#73)
mayank31398 Jul 21, 2023
462980b
Support flash attn 2 (#72)
jlamypoirier Jul 21, 2023
ebd38e9
Fix train-iters typo & format script (#74)
huybery Jul 24, 2023
bd0aaba
Merge pull request #70 from bigcode-project/script_7b-starcoder
loubnabnl Aug 14, 2023
f598110
add file level FIM and sanity check
loubnabnl Nov 10, 2023
fd6d705
use default None for sanity check interval
loubnabnl Nov 10, 2023
4f8a0e4
remove extra prints
loubnabnl Nov 10, 2023
25f3c89
Merge pull request #81 from bigcode-project/file-level-fim
loubnabnl Nov 10, 2023
01e9ce6
add theta rope as arg
loubnabnl Nov 10, 2023
c8372cb
add humaneval generations using a server
loubnabnl Nov 14, 2023
7c325cd
Revert "add humaneval generations using a server"
loubnabnl Nov 14, 2023
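
The bulk of this branch implements multi-query attention (MQA): every query head attends against a single shared key/value head, which shrinks the KV cache and the key/value projections by a factor of the query-head count. As a reading aid, here is a minimal PyTorch sketch of the idea only; the shapes, names, and omission of masking and dropout are illustrative assumptions, not the Megatron-LM code in this diff.

import torch

def multi_query_attention(q, k, v):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, 1, seq, head_dim)
    scale = q.size(-1) ** -0.5
    # broadcasting over the head dimension stands in for the per-head
    # key/value tensors of standard multi-head attention
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale   # (b, h, s, s)
    return torch.matmul(torch.softmax(scores, dim=-1), v)   # (b, h, s, d)

b, h, s, d = 2, 48, 16, 128
out = multi_query_attention(torch.randn(b, h, s, d),
                            torch.randn(b, 1, s, d),
                            torch.randn(b, 1, s, d))
assert out.shape == (b, h, s, d)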
144 changes: 144 additions & 0 deletions examples/finetune_bigcode_model.slurm
@@ -0,0 +1,144 @@
#!/bin/bash
#SBATCH --job-name=starcoderpy
#SBATCH --nodes=64
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gres=gpu:8
#SBATCH --partition=production-cluster
#SBATCH --output=/fsx/leandro/logs/starcoderpy/bcs-%x-%j.out

set -x -e
source /admin/home/leandro/.bashrc

conda activate megatron

echo "START TIME: $(date)"

# File Path setup
SCRIPT_REPO=/fsx/leandro/git/Megatron-LM-BC
pushd $SCRIPT_REPO

LOG_PATH=$SCRIPT_REPO/main_log.txt

# Training setup
GPUS_PER_NODE=8
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
NNODES=$SLURM_NNODES
NODE_RANK=$SLURM_PROCID
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
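# 64 nodes x 8 GPUs/node resolves to WORLD_SIZE=512 ranks for this job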

# File path setup
STARCODER_PATH=/fsx/boomcode/starcoder/
CHECKPOINT_PATH=/fsx/boomcode/starcoderpy/$SLURM_JOB_ID
TOKENIZER_FILE=/fsx/boomcode/tokenizer-starcoder/tokenizer.json
WEIGHTS_TRAIN=/fsx/boomcode/datamix_python/train_data_paths.txt.tmp
WEIGHTS_VALID=/fsx/boomcode/datamix_python/valid_data_paths.txt.tmp
DATA_PATH=/fsx/boomcode/tokenized/python/
mkdir -p $CHECKPOINT_PATH/tensorboard

GPT_ARGS="\
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 4 \
--sequence-parallel \
--num-layers 40 \
--hidden-size 6144 \
--num-attention-heads 48 \
--attention-head-type multiquery \
--init-method-std 0.01275 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--attention-dropout 0.1 \
--hidden-dropout 0.1 \
--micro-batch-size 1 \
--global-batch-size 512 \
--lr 0.00005 \
--min-lr 0.000005 \
--train-iters 258500 \
--lr-decay-iters 8500 \
--lr-decay-style cosine \
--lr-warmup-iters 500 \
--weight-decay .1 \
--adam-beta2 .95 \
--clip-grad 1.0 \
--bf16 \
--use-flash-attn \
--fim-rate 0.5 \
--log-interval 10 \
--save-interval 2500 \
--eval-interval 100 \
--eval-iters 10 \
--valid-num-workers 0 \
--override-opt_param-scheduler \
--no-load-optim \
--no-load-rng \
--finetune \
"

# --dataloader-type cyclic\
TENSORBOARD_ARGS="--tensorboard-dir ${CHECKPOINT_PATH}/tensorboard"

CMD=" \
$SCRIPT_REPO/pretrain_gpt.py \
$GPT_ARGS \
--tokenizer-type TokenizerFromFile \
--tokenizer-file $TOKENIZER_FILE \
--save $CHECKPOINT_PATH \
--load $STARCODER_PATH \
--train-weighted-split-paths-path $WEIGHTS_TRAIN \
--valid-weighted-split-paths-path $WEIGHTS_VALID \
--structured-logs \
--structured-logs-dir $CHECKPOINT_PATH/logs \
$TENSORBOARD_ARGS \
--wandb-entity-name lvwerra \
--wandb-project-name starcoder-py \
"

# --data-path $DATA_PATH\gpt2-preprocessed_content_document

export LAUNCHER="python -u -m torch.distributed.run \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
--rdzv_backend c10d \
--max_restarts 0 \
--tee 3 \
"

echo $CMD

# hide duplicated errors using this hack - will be properly fixed in pt-1.12
# export TORCHELASTIC_ERROR_FILE=/tmp/torch-elastic-error.json

# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=COLL
# export NCCL_SOCKET_NTHREADS=1
# export NCCL_NSOCKS_PERTHREAD=1
# export CUDA_LAUNCH_BLOCKING=1

# AWS specific
export NCCL_PROTO=simple
export RDMAV_FORK_SAFE=1
export FI_EFA_FORK_SAFE=1
export FI_EFA_USE_DEVICE_RDMA=1
export FI_PROVIDER=efa
export FI_LOG_LEVEL=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=ens

export CUDA_HOME=/usr/local/cuda-11.6

# srun error handling:
# --wait=60: wait 60 sec after the first task terminates before terminating all remaining tasks
# --kill-on-bad-exit=1: terminate a step if any task exits with a non-zero exit code
SRUN_ARGS=" \
--wait=60 \
--kill-on-bad-exit=1 \
"

# py-spy top -s -i -n -- $LAUNCHER --node_rank $SLURM_PROCID --role $SLURMD_NODENAME: $CMD
clear; srun $SRUN_ARGS --jobid $SLURM_JOB_ID bash -c "$LAUNCHER --node_rank \$SLURM_PROCID --role \$SLURMD_NODENAME: $CMD" 2>&1 | tee $LOG_PATH

echo "END TIME: $(date)"
141 changes: 141 additions & 0 deletions examples/finetune_starcoderplus.slurm
@@ -0,0 +1,141 @@
#!/bin/bash
#SBATCH --job-name=starcoderplus
#SBATCH --nodes=64
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gres=gpu:8
#SBATCH --partition=production-cluster
#SBATCH --output=/fsx/leandro/logs/starcoderplus/bcs-%x-%j.out

set -x -e
source /admin/home/leandro/.bashrc

conda activate megatron

echo "START TIME: $(date)"

# File Path setup
SCRIPT_REPO=/fsx/leandro/git/Megatron-LM-BC
pushd $SCRIPT_REPO

LOG_PATH=$SCRIPT_REPO/main_log.txt

# Training setup
GPUS_PER_NODE=8
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
NNODES=$SLURM_NNODES
NODE_RANK=$SLURM_PROCID
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

# File path setup
STARCODER_PATH=/fsx/boomcode/starcoder/
CHECKPOINT_PATH=/fsx/boomcode/starcoderplus/$SLURM_JOB_ID
TOKENIZER_FILE=/fsx/boomcode/tokenizer-starcoder/tokenizer.json
WEIGHTS_TRAIN=/fsx/boomcode/datamix/train_data_paths.txt.tmp
WEIGHTS_VALID=/fsx/boomcode/datamix/valid_data_paths.txt.tmp

mkdir -p $CHECKPOINT_PATH/tensorboard

GPT_ARGS="\
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 4 \
--sequence-parallel \
--num-layers 40 \
--hidden-size 6144 \
--num-attention-heads 48 \
--attention-head-type multiquery \
--init-method-std 0.01275 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--attention-dropout 0.1 \
--hidden-dropout 0.1 \
--micro-batch-size 1 \
--global-batch-size 512 \
--lr 0.0001 \
--min-lr 0.00001 \
--train-iters 400000 \
--lr-decay-iters 150000 \
--lr-decay-style cosine \
--lr-warmup-iters 1000 \
--weight-decay .1 \
--adam-beta2 .95 \
--clip-grad 1.0 \
--bf16 \
--use-flash-attn \
--fim-rate 0.5 \
--log-interval 10 \
--save-interval 2500 \
--eval-interval 2500 \
--eval-iters 2 \
--valid-num-workers 0 \
--override-opt_param-scheduler \
--no-load-optim \
--no-load-rng \
--finetune \
"

TENSORBOARD_ARGS="--tensorboard-dir ${CHECKPOINT_PATH}/tensorboard"

CMD=" \
$SCRIPT_REPO/pretrain_gpt.py \
$GPT_ARGS \
--tokenizer-type TokenizerFromFile \
--tokenizer-file $TOKENIZER_FILE \
--save $CHECKPOINT_PATH \
--load $STARCODER_PATH \
--train-weighted-split-paths-path $WEIGHTS_TRAIN \
--valid-weighted-split-paths-path $WEIGHTS_VALID \
--structured-logs \
--structured-logs-dir $CHECKPOINT_PATH/logs \
$TENSORBOARD_ARGS \
--wandb-entity-name lvwerra \
--wandb-project-name starcoder-plus \
"

export LAUNCHER="python -u -m torch.distributed.run \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
--rdzv_backend c10d \
--max_restarts 0 \
--tee 3 \
"

echo $CMD

# hide duplicated errors using this hack - will be properly fixed in pt-1.12
# export TORCHELASTIC_ERROR_FILE=/tmp/torch-elastic-error.json

# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=COLL
# export NCCL_SOCKET_NTHREADS=1
# export NCCL_NSOCKS_PERTHREAD=1
# export CUDA_LAUNCH_BLOCKING=1

# AWS specific
export NCCL_PROTO=simple
export RDMAV_FORK_SAFE=1
export FI_EFA_FORK_SAFE=1
export FI_EFA_USE_DEVICE_RDMA=1
export FI_PROVIDER=efa
export FI_LOG_LEVEL=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=ens

export CUDA_HOME=/usr/local/cuda-11.6

# srun error handling:
# --wait=60: wait 60 sec after the first task terminates before terminating all remaining tasks
# --kill-on-bad-exit=1: terminate a step if any task exits with a non-zero exit code
SRUN_ARGS=" \
--wait=60 \
--kill-on-bad-exit=1 \
"

# py-spy top -s -i -n -- $LAUNCHER --node_rank $SLURM_PROCID --role $SLURMD_NODENAME: $CMD
clear; srun $SRUN_ARGS --jobid $SLURM_JOB_ID bash -c "$LAUNCHER --node_rank \$SLURM_PROCID --role \$SLURMD_NODENAME: $CMD" 2>&1 | tee $LOG_PATH

echo "END TIME: $(date)"