Commit e897fd7

fix: adds missing support for mcore dist opt and adds test for moe
Signed-off-by: Terry Kong <[email protected]>

moe test is all2all

Signed-off-by: Terry Kong <[email protected]>

other params

Signed-off-by: Terry Kong <[email protected]>

fix peft mixtral

Signed-off-by: Terry Kong <[email protected]>

dockerfile bump to be on dev

Signed-off-by: Terry Kong <[email protected]>

just take dockerfile on dev

Signed-off-by: Terry Kong <[email protected]>
1 parent b0dd4d5 commit e897fd7

21 files changed: 185 additions, 51 deletions

.github/workflows/cicd-main.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -95,6 +95,8 @@ jobs:
 - kd-llama3
 - sft-llama3
 - rm-llama3
+- dpo-mixtral-ep
+- dpo-mixtral-peft-tp-sp
 with:
 RUNNER: self-hosted-azure
 # Fairly aggresive timeout that all functional tests should try to adhere to
```
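
The two new entries register MoE (Mixtral) DPO functional tests: one exercising expert parallelism (`dpo-mixtral-ep`, which the commit message notes uses the all-to-all token dispatcher) and one exercising PEFT with tensor and sequence parallelism (`dpo-mixtral-peft-tp-sp`). Assuming they map to scripts under `tests/functional/test_cases/` like the existing entries (a hypothetical layout, not shown in this diff), a local run might look like:

```bash
# Hypothetical local invocation; the test_cases/ paths are an assumption based
# on how the other matrix entries (e.g. dpo-llama3) are typically laid out.
cd NeMo-Aligner
bash tests/functional/test_cases/dpo-mixtral-ep           # MoE DPO, expert parallelism
bash tests/functional/test_cases/dpo-mixtral-peft-tp-sp   # MoE DPO, PEFT + TP + SP
```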

Dockerfile

Lines changed: 10 additions & 10 deletions
```diff
@@ -13,8 +13,8 @@ ARG MAX_JOBS=8
 # Git refs for dependencies
 ARG TE_TAG=7d576ed25266a17a7b651f2c12e8498f67e0baea
 ARG PYTRITON_VERSION=0.5.10
-ARG NEMO_TAG=19668e5320a2e2af0199b6d5e0b841993be3a634 # On: main
-ARG MLM_TAG=25059d3bbf68be0751800f3644731df12a88f3f3 # On: main
+ARG NEMO_TAG=06eae2895c0fea09f8dd7c34feff0163e55c419a # On: main
+ARG MLM_TAG=844119f5c856a3037ec7c7f6d6ef7b3518ceee6b # On: main
 ARG ALIGNER_COMMIT=main
 ARG TRTLLM_VERSION=v0.13.0
 ARG PROTOBUF_VERSION=4.24.4
@@ -123,19 +123,19 @@ RUN cd /opt/NeMo-Aligner && \
 
 RUN cd TensorRT-LLM && patch -p1 < ../NeMo-Aligner/setup/trtllm.patch
 
-# TODO(terryk): This layer should be deleted ASAP after NeMo is bumped to include all of these PRs
+# NOTE: Comment this layer out if it is not needed
+# NOTE: This section exists to allow cherry-picking PRs in cases where
+# we do not wish to simply update to the top-of-tree. Sometimes PRs
+# cannot be cherry-picked cleanly if rebased a few times to top-of-tree
+# so this logic also requires you to select a SHA (can be dangling) from
+# the PR.
 RUN <<"EOF" bash -exu
 cd NeMo
 # Ensures we don't cherry-pick "future" origin/main commits
 git fetch -a
-# 0c92fe17df4642ffc33d5d8c0c83fda729e3910c: [fix] Ensures disabling exp_manager with exp_manager=null does not error NeMo#10651
-# 60e677423667c029dd05875da72bf0719774f844: [feat] Update get_model_parallel_src_rank to support tp-pp-dp ordering NeMo#10652
-# 0deaf6716cb4f20766c995ce25d129795f1ae200: fix[export]: update API for disabling device reassignment in TRTLLM for Aligner NeMo#10863
-# (superceded by 10863) 148543d6e9c66ff1f8562e84484448202249811d: feat: Migrate GPTSession refit path in Nemo export to ModelRunner for Aligner NeMo#10654
+# d27dd28b4186f6ecd9f46f1c5679a5eef9bad14e: fix: export weight name mapping if model is nemo model#11497
 for pr_and_commit in \
-"10651 0c92fe17df4642ffc33d5d8c0c83fda729e3910c" \
-"10652 60e677423667c029dd05875da72bf0719774f844" \
-"10863 0deaf6716cb4f20766c995ce25d129795f1ae200" \
+"11497 d27dd28b4186f6ecd9f46f1c5679a5eef9bad14e" \
 ; do
 pr=$(cut -f1 -d' ' <<<"$pr_and_commit")
 head_pr_commit=$(cut -f2 -d' ' <<<"$pr_and_commit")
```
examples/nlp/gpt/conf/gpt_dpo.yaml

Lines changed: 2 additions & 0 deletions
```diff
@@ -6,6 +6,7 @@ trainer:
 devices: 8
 accelerator: gpu
 precision: bf16
+gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
 # dpo specific args
 dpo:
@@ -17,6 +18,7 @@ trainer:
 
 # how many GBS we loop over
 limit_val_batches: 1.0
+# TODO: delete once Megatron Core optimizer becomes default
 gradient_clip_val: 1.0
 
 # do not change these
```
examples/nlp/gpt/conf/gpt_kto.yaml

Lines changed: 2 additions & 0 deletions
```diff
@@ -6,6 +6,7 @@ trainer:
 devices: 8
 accelerator: gpu
 precision: bf16
+gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
 # kto specific args
 kto:
@@ -17,6 +18,7 @@ trainer:
 
 # how many GBS we loop over
 limit_val_batches: 1.0
+# TODO: delete once Megatron Core optimizer becomes default
 gradient_clip_val: 1.0
 
 # do not change these
```

examples/nlp/gpt/conf/gpt_ppo_actor.yaml

Lines changed: 2 additions & 0 deletions
```diff
@@ -7,6 +7,7 @@ trainer:
 devices: 8
 accelerator: gpu
 precision: bf16
+gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
 ppo:
 # How many steps we train warmup the critic for (without training the policy)
@@ -21,6 +22,7 @@ trainer:
 max_steps: -1 # max PPO steps (-1 to go through the whole train set)
 val_check_interval: 10
 save_interval: ${.val_check_interval}
+# TODO: delete once Megatron Core optimizer becomes default
 gradient_clip_val: 1.0
 
 # PPO args to generate the data for training
```

examples/nlp/gpt/conf/gpt_ppo_critic.yaml

Lines changed: 2 additions & 0 deletions
```diff
@@ -6,6 +6,7 @@ trainer:
 devices: 8
 accelerator: gpu
 precision: bf16
+gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
 ppo:
 port: 5556
@@ -15,6 +16,7 @@ trainer:
 
 # used to set the learning rate scheduler
 max_steps: 10000
+# TODO: delete once Megatron Core optimizer becomes default
 gradient_clip_val: 1.0
 
 # a PyTriton parameter to specify
```

examples/nlp/gpt/conf/gpt_rs_actor.yaml

Lines changed: 3 additions & 1 deletion
```diff
@@ -7,12 +7,14 @@ trainer:
 devices: 8
 accelerator: gpu
 precision: bf16
+gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
 rs:
 max_epochs: 1
 max_steps: -1 # max rs steps (-1 to go through the whole train set)
 val_check_interval: 10
 save_interval: ${.val_check_interval}
+# TODO: delete once Megatron Core optimizer becomes default
 gradient_clip_val: 1.0
 
 # pick up from the model
@@ -177,4 +179,4 @@ model:
 # define fields from the base model's config that should be ignored when merging with this config.
 overwrite_base_config:
 data:
-data_prefix: True
+data_prefix: True
```

examples/nlp/gpt/conf/gpt_sft.yaml

Lines changed: 2 additions & 0 deletions
```diff
@@ -5,6 +5,7 @@ trainer:
 devices: 1
 accelerator: gpu
 precision: bf16
+gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
 sft:
 max_epochs: 1
@@ -15,6 +16,7 @@ trainer:
 limit_train_batches: 1.0
 
 limit_val_batches: 1.0
+# TODO: delete once Megatron Core optimizer becomes default
 gradient_clip_val: 1.0
 
 # can be used to register any custom metrics that require token-by-token generation
```

examples/nlp/gpt/conf/gpt_spin.yaml

Lines changed: 2 additions & 0 deletions
```diff
@@ -6,6 +6,7 @@ trainer:
 devices: 8
 accelerator: gpu
 precision: bf16-mixed
+gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
 # spin specific args
 spin:
@@ -18,6 +19,7 @@ trainer:
 
 # how many GBS we loop over
 limit_val_batches: 1.0
+# TODO: delete once Megatron Core optimizer becomes default
 gradient_clip_val: 1.0
 
 # do not change these
```

examples/nlp/gpt/conf/training_rm.yaml

Lines changed: 2 additions & 0 deletions
```diff
@@ -6,6 +6,7 @@ trainer:
 devices: 8
 accelerator: gpu
 precision: bf16
+gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
 # rm specific args
 rm:
@@ -20,6 +21,7 @@ trainer:
 # set to float for a percentage
 # of the validation dataset
 limit_val_batches: 1.0
+# TODO: delete once Megatron Core optimizer becomes default
 gradient_clip_val: 1.0
 
 # do not change these
```
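
The same two-line change is repeated across all nine trainer configs because the trainer-level clip value is needed regardless of algorithm once the Megatron Core distributed optimizer is in use. Selecting that optimizer is not part of this diff; in NeMo-style configs it normally happens under `model.optim`, so a MoE test run could plausibly combine the pieces as below. The optimizer name and the all-to-all dispatcher flag are assumptions based on common NeMo/Megatron Core naming and the commit message, and may differ by version:

```bash
# Assumed invocation: model.optim.name and the MoE token-dispatcher flag are
# not taken from this commit and may vary across NeMo versions.
python examples/nlp/gpt/train_gpt_dpo.py \
    model.optim.name=mcore_distributed_optim \
    model.moe_token_dispatcher_type=alltoall \
    trainer.gradient_clip_val=0.0
```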
