Commit

Merge branch 'ko3n1g/ci/repeat-mrs' into 'main'
tests: Repeat MRs 5 times

See merge request ADLR/megatron-lm!2004
ko3n1g committed Sep 12, 2024
2 parents 9ec2337 + e5fb1fa commit 028b777
Showing 128 changed files with 235 additions and 148 deletions.
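The config changes in this commit set `N_REPEATS` under `ENV_VARS`, and the commit title says MR tests are repeated 5 times. How the test harness consumes that variable is not shown here, so the following is only a minimal sketch under the assumption that `N_REPEATS` is read from the environment and used as a loop count. The helper names `resolve_n_repeats` and `run_repeated` are hypothetical and do not come from the repository.

```python
import os


def resolve_n_repeats(env=None, default=1):
    """Read the repeat count from an N_REPEATS variable (assumed convention)."""
    env = os.environ if env is None else env
    n = int(env.get("N_REPEATS", default))
    if n < 1:
        raise ValueError(f"N_REPEATS must be >= 1, got {n}")
    return n


def run_repeated(run_once, n_repeats):
    """Call run_once(attempt) n_repeats times and collect the results."""
    return [run_once(attempt) for attempt in range(n_repeats)]


if __name__ == "__main__":
    # Hypothetical stand-in for a single test run.
    n = resolve_n_repeats({"N_REPEATS": "5"})
    results = run_repeated(lambda attempt: f"run {attempt} ok", n)
    print(len(results))
```

Under this reading, a larger `time_limit` in the recipes would leave room for the repeated runs, though the units and the exact relationship are not stated in the commit.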
2 changes: 1 addition & 1 deletion tests/functional_tests/jet_recipes/bert.yaml
@@ -29,7 +29,7 @@ spec:
 products:
 - scope: [mr]
-time_limit: [1200]
+time_limit: [12000]
 test_case:
 - bert_mr_mcore_tp2_pp2_dgx_a100_1N8G
 - bert_mr_mcore_tp2_pp2_local_spec_dgx_a100_1N8G
2 changes: 1 addition & 1 deletion tests/functional_tests/jet_recipes/gpt-nemo.yaml
@@ -9,7 +9,7 @@ spec:
 nodes: 1
 gpus: 8
 platforms: dgx_a100
-time_limit: 1200
+time_limit: 12000
 scope: null
 script: |-
 ls
2 changes: 1 addition & 1 deletion tests/functional_tests/jet_recipes/gpt.yaml
@@ -29,7 +29,7 @@ spec:
 products:
 - scope: [mr]
 platforms: [dgx_a100]
-time_limit: [1200]
+time_limit: [12000]
 test_case:
 - gpt3_mr_mcore_te_tp1_pp1_dist_optimizer_no_mmap_bin_files_dgx_a100_1N8G
 - gpt3_mr_mcore_te_tp1_pp1_resume_torch_dist_dist_optimizer_dgx_a100_1N8G
2 changes: 1 addition & 1 deletion tests/functional_tests/jet_recipes/multimodal-llava.yaml
@@ -9,7 +9,7 @@ spec:
 nodes: 1
 gpus: 8
 platforms: dgx_a100
-time_limit: 1200
+time_limit: 12000
 scope: null
 script: |-
 ls
2 changes: 1 addition & 1 deletion tests/functional_tests/jet_recipes/t5.yaml
@@ -29,7 +29,7 @@ spec:
 products:
 - scope: [mr]
-time_limit: [1200]
+time_limit: [12000]
 test_case:
 - t5_220m_mr_mcore_tp2_pp2_dgx_a100_1N8G
 - t5_220m_mr_mcore_tp2_pp2_resume_torch_dgx_a100_1N8G
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -4,7 +4,7 @@ ENV_VARS:
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
 NVTE_APPLY_QK_LAYER_SCALING: 1
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -4,7 +4,7 @@ ENV_VARS:
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
 NVTE_APPLY_QK_LAYER_SCALING: 1
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -3,6 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -40,4 +41,4 @@ MODEL_ARGS:
 --data-cache-path: ${DATA_CACHE_PATH}
 --bf16: true
 --ckpt-format: torch
-TEST_TYPE: regular
+TEST_TYPE: regular
@@ -3,6 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -35,10 +36,10 @@ MODEL_ARGS:
 --eval-iters: 10
 --tensor-model-parallel-size: 2
 --pipeline-model-parallel-size: 2
---spec: local
---deterministic-mode: true
+--spec: local
+--deterministic-mode: true
 --no-gradient-accumulation-fusion: true
 --data-cache-path: ${DATA_CACHE_PATH}
 --bf16: true
 --ckpt-format: torch
-TEST_TYPE: regular
+TEST_TYPE: regular
@@ -3,6 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -35,11 +36,11 @@ MODEL_ARGS:
 --eval-iters: 10
 --tensor-model-parallel-size: 2
 --pipeline-model-parallel-size: 2
---deterministic-mode: true
---use-checkpoint-args: true
+--deterministic-mode: true
+--use-checkpoint-args: true
 --use-checkpoint-opt_param-scheduler: true
 --no-gradient-accumulation-fusion: true
 --data-cache-path: ${DATA_CACHE_PATH}
 --bf16: true
 --ckpt-format: torch
-TEST_TYPE: ckpt-resume
+TEST_TYPE: ckpt-resume
@@ -3,6 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -35,12 +36,12 @@ MODEL_ARGS:
 --eval-iters: 10
 --tensor-model-parallel-size: 2
 --pipeline-model-parallel-size: 2
---spec: local
---deterministic-mode: true
---use-checkpoint-args: true
+--spec: local
+--deterministic-mode: true
+--use-checkpoint-args: true
 --use-checkpoint-opt_param-scheduler: true
 --no-gradient-accumulation-fusion: true
 --data-cache-path: ${DATA_CACHE_PATH}
 --bf16: true
 --ckpt-format: torch
-TEST_TYPE: ckpt-resume
+TEST_TYPE: ckpt-resume
@@ -3,6 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -44,4 +45,4 @@ MODEL_ARGS:
 --fp16: true
 --apply-query-key-layer-scaling: true
 --ckpt-format: torch
-TEST_TYPE: regular
+TEST_TYPE: regular
@@ -3,6 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -37,13 +38,13 @@ MODEL_ARGS:
 --pipeline-model-parallel-size: 4
 --num-layers-per-virtual-pipeline-stage: 2
 --use-legacy-models: true
---transformer-impl: local
---deterministic-mode: true
---use-checkpoint-args: true
+--transformer-impl: local
+--deterministic-mode: true
+--use-checkpoint-args: true
 --use-checkpoint-opt_param-scheduler: true
 --no-gradient-accumulation-fusion: true
 --data-cache-path: ${DATA_CACHE_PATH}
---fp16: true
+--fp16: true
 --apply-query-key-layer-scaling: true
 --ckpt-format: torch
-TEST_TYPE: ckpt-resume
+TEST_TYPE: ckpt-resume
@@ -3,6 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -36,11 +37,11 @@ MODEL_ARGS:
 --tensor-model-parallel-size: 2
 --pipeline-model-parallel-size: 2
 --use-legacy-models: true
---transformer-impl: local
+--transformer-impl: local
 --deterministic-mode: true
 --no-gradient-accumulation-fusion: true
 --data-cache-path: ${DATA_CACHE_PATH}
---fp16: true
+--fp16: true
 --apply-query-key-layer-scaling: true
 --ckpt-format: torch
-TEST_TYPE: regular
+TEST_TYPE: regular
@@ -3,6 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 24
 --hidden-size: 1024
@@ -36,13 +37,13 @@ MODEL_ARGS:
 --tensor-model-parallel-size: 2
 --pipeline-model-parallel-size: 2
 --use-legacy-models: true
---transformer-impl: local
---deterministic-mode: true
---use-checkpoint-args: true
+--transformer-impl: local
+--deterministic-mode: true
+--use-checkpoint-args: true
 --use-checkpoint-opt_param-scheduler: true
 --no-gradient-accumulation-fusion: true
 --data-cache-path: ${DATA_CACHE_PATH}
 --fp16: true
 --apply-query-key-layer-scaling: true
 --ckpt-format: torch
-TEST_TYPE: ckpt-resume
+TEST_TYPE: ckpt-resume
@@ -1,6 +1,7 @@
 ENV_VARS:
 CUDA_DEVICE_MAX_CONNECTIONS: 1
 SKIP_PYTEST: 1
+N_REPEATS: 1
 MODEL_ARGS:
 trainer.num_nodes: 1
 trainer.devices: 8
@@ -32,4 +33,4 @@ MODEL_ARGS:
 model.sequence_parallel: 'True'
 model.overlap_p2p_comm: 'True'
 model.batch_p2p_comm: 'False'
-TEST_TYPE: regular
+TEST_TYPE: regular
@@ -1,6 +1,7 @@
 ENV_VARS:
 CUDA_DEVICE_MAX_CONNECTIONS: 1
 SKIP_PYTEST: 1
+N_REPEATS: 1
 MODEL_ARGS:
 trainer.num_nodes: 1
 trainer.devices: 8
@@ -29,4 +30,4 @@ MODEL_ARGS:
 model.optim.name: distributed_fused_adam
 model.optim.weight_decay: 0.1
 exp_manager.create_checkpoint_callback: 'False'
-TEST_TYPE: regular
+TEST_TYPE: regular
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 12
 --hidden-size: 512
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 12
 --hidden-size: 512
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 12
 --hidden-size: 512
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 12
 --hidden-size: 512
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 12
 --hidden-size: 512
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 12
 --hidden-size: 512
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 12
 --hidden-size: 512
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 12
 --hidden-size: 512
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 12
 --hidden-size: 512
@@ -3,7 +3,7 @@ ENV_VARS:
 NVTE_ALLOW_NONDETERMINISTIC_ALGO: 0
 NCCL_ALGO: Tree
 CUBLAS_WORKSPACE_CONFIG: :4096:8
-N_REPEATS: 10
+N_REPEATS: 5
 MODEL_ARGS:
 --num-layers: 12
 --hidden-size: 512