
Commit eab96f2

fix: adds missing support for mcore dist opt and adds test for moe
Signed-off-by: Terry Kong <[email protected]>

moe test is all2all
Signed-off-by: Terry Kong <[email protected]>

other params
Signed-off-by: Terry Kong <[email protected]>

fix peft mixtral
Signed-off-by: Terry Kong <[email protected]>

dockerfile bump to be on dev
Signed-off-by: Terry Kong <[email protected]>

just take dockerfile on dev
Signed-off-by: Terry Kong <[email protected]>
1 parent 3604fc4 commit eab96f2

File tree

20 files changed: +175, -41 lines changed


.github/workflows/cicd-main.yml

Lines changed: 2 additions & 0 deletions
@@ -96,6 +96,8 @@ jobs:
           - sft-llama3
           - sft-llama3-cp
           - rm-llama3
+          - dpo-mixtral-ep
+          - dpo-mixtral-peft-tp-sp
     with:
       RUNNER: self-hosted-azure
       # Fairly aggresive timeout that all functional tests should try to adhere to

examples/nlp/gpt/conf/gpt_dpo.yaml

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,7 @@ trainer:
   devices: 8
   accelerator: gpu
   precision: bf16
+  gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
   # dpo specific args
   dpo:
@@ -17,6 +18,7 @@ trainer:
 
     # how many GBS we loop over
     limit_val_batches: 1.0
+    # TODO: delete once Megatron Core optimizer becomes default
     gradient_clip_val: 1.0
 
   # do not change these
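The same two-value pattern repeats in every config touched below: a new trainer-level gradient_clip_val: 0.0 that the Megatron Core distributed optimizer consumes, and the existing algorithm-level gradient_clip_val: 1.0 kept (with a TODO) until that optimizer becomes the default. A minimal sketch of how a launcher might pick between them, using a hypothetical helper and flag that are not part of this repo:

# Minimal sketch, not NeMo-Aligner's actual wiring; effective_grad_clip and
# use_mcore_dist_opt are hypothetical and only illustrate how the two values coexist.
from omegaconf import DictConfig, OmegaConf

def effective_grad_clip(cfg: DictConfig, use_mcore_dist_opt: bool) -> float:
    """Return the gradient-clip value that applies for the chosen optimizer path."""
    if use_mcore_dist_opt:
        # Megatron Core distributed optimizer path: the new trainer-level value.
        return cfg.trainer.gradient_clip_val
    # Legacy clipping path: the algorithm section keeps its own value until the
    # MCore optimizer becomes the default (see the TODO in the hunk above).
    return cfg.trainer.dpo.gradient_clip_val

# The two settings exactly as the gpt_dpo.yaml hunks above add/keep them.
cfg = OmegaConf.create({"trainer": {"gradient_clip_val": 0.0, "dpo": {"gradient_clip_val": 1.0}}})
print(effective_grad_clip(cfg, use_mcore_dist_opt=True))   # -> 0.0
print(effective_grad_clip(cfg, use_mcore_dist_opt=False))  # -> 1.0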

examples/nlp/gpt/conf/gpt_kto.yaml

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,7 @@ trainer:
   devices: 8
   accelerator: gpu
   precision: bf16
+  gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
   # kto specific args
   kto:
@@ -17,6 +18,7 @@ trainer:
 
     # how many GBS we loop over
     limit_val_batches: 1.0
+    # TODO: delete once Megatron Core optimizer becomes default
     gradient_clip_val: 1.0
 
   # do not change these

examples/nlp/gpt/conf/gpt_ppo_actor.yaml

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@ trainer:
   devices: 8
   accelerator: gpu
   precision: bf16
+  gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
   ppo:
     # How many steps we train warmup the critic for (without training the policy)
@@ -21,6 +22,7 @@ trainer:
     max_steps: -1 # max PPO steps (-1 to go through the whole train set)
     val_check_interval: 10
     save_interval: ${.val_check_interval}
+    # TODO: delete once Megatron Core optimizer becomes default
     gradient_clip_val: 1.0
 
     # PPO args to generate the data for training

examples/nlp/gpt/conf/gpt_ppo_critic.yaml

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,7 @@ trainer:
   devices: 8
   accelerator: gpu
   precision: bf16
+  gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
   ppo:
     port: 5556
@@ -15,6 +16,7 @@ trainer:
 
     # used to set the learning rate scheduler
     max_steps: 10000
+    # TODO: delete once Megatron Core optimizer becomes default
     gradient_clip_val: 1.0
 
     # a PyTriton parameter to specify

examples/nlp/gpt/conf/gpt_rs_actor.yaml

Lines changed: 3 additions & 1 deletion
@@ -7,12 +7,14 @@ trainer:
   devices: 8
   accelerator: gpu
   precision: bf16
+  gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
   rs:
     max_epochs: 1
     max_steps: -1 # max rs steps (-1 to go through the whole train set)
     val_check_interval: 10
     save_interval: ${.val_check_interval}
+    # TODO: delete once Megatron Core optimizer becomes default
     gradient_clip_val: 1.0
 
   # pick up from the model
@@ -178,4 +180,4 @@ model:
   # define fields from the base model's config that should be ignored when merging with this config.
   overwrite_base_config:
     data:
-      data_prefix: True
+      data_prefix: True

examples/nlp/gpt/conf/gpt_sft.yaml

Lines changed: 2 additions & 0 deletions
@@ -5,6 +5,7 @@ trainer:
   devices: 1
   accelerator: gpu
   precision: bf16
+  gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
   sft:
     max_epochs: 1
@@ -15,6 +16,7 @@ trainer:
     limit_train_batches: 1.0
 
     limit_val_batches: 1.0
+    # TODO: delete once Megatron Core optimizer becomes default
    gradient_clip_val: 1.0
 
    # can be used to register any custom metrics that require token-by-token generation

examples/nlp/gpt/conf/gpt_spin.yaml

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,7 @@ trainer:
   devices: 8
   accelerator: gpu
   precision: bf16-mixed
+  gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
   # spin specific args
   spin:
@@ -18,6 +19,7 @@ trainer:
 
     # how many GBS we loop over
     limit_val_batches: 1.0
+    # TODO: delete once Megatron Core optimizer becomes default
     gradient_clip_val: 1.0
 
   # do not change these

examples/nlp/gpt/conf/training_rm.yaml

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,7 @@ trainer:
   devices: 8
   accelerator: gpu
   precision: bf16
+  gradient_clip_val: 0.0 # No need to change. Megatron Core optimizer uses this value
 
   # rm specific args
   rm:
@@ -20,6 +21,7 @@ trainer:
     # set to float for a percentage
     # of the validation dataset
     limit_val_batches: 1.0
+    # TODO: delete once Megatron Core optimizer becomes default
     gradient_clip_val: 1.0
 
   # do not change these

nemo_aligner/algorithms/critic_server_trainer.py

Lines changed: 1 addition & 1 deletion
@@ -322,7 +322,7 @@ def run_training(self, tokens=None, returns=None, prev_values=None, mask=None):
         grad_norm = grad_norm.item() if torch.is_tensor(grad_norm) else grad_norm
         lr = self.optimizer.param_groups[0]["lr"]
 
-        self.optimizer.step()
+        self.optimizer.step(closure=None)
         self.scheduler.step()
 
         if grad_norm is not None:
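The one-line change above hints that the wrapped Megatron Core distributed optimizer exposes a torch-style step(closure) signature rather than the zero-argument call the critic trainer used before; passing closure=None keeps the call valid for both plain torch optimizers and such a wrapper. A rough sketch of that situation, with a hypothetical wrapper class that is not part of NeMo:

# Hypothetical sketch (DistOptWrapper is not a real NeMo class): a wrapper whose
# step() takes an explicit closure argument, mirroring the torch.optim convention.
import torch

class DistOptWrapper:
    def __init__(self, inner: torch.optim.Optimizer):
        self.inner = inner

    def step(self, closure):  # no default: a bare .step() call would raise TypeError
        assert closure is None, "closures are not supported in this sketch"
        self.inner.step()

param = torch.nn.Parameter(torch.zeros(1))
wrapped = DistOptWrapper(torch.optim.SGD([param], lr=0.1))

wrapped.step(closure=None)                            # accepted by the wrapper...
torch.optim.SGD([param], lr=0.1).step(closure=None)   # ...and by plain torch optimizers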
