Job hangs or IndexError when training reward model with PP > 1 #251
Comments
Would you be able to share your |
I use the default config: |
I was wondering about the |
Below is the content of the config:
mcore_gpt: true
micro_batch_size: 4
global_batch_size: 8
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
virtual_pipeline_model_parallel_size: null
encoder_seq_length: 4096
max_position_embeddings: 4096
num_layers: 32
hidden_size: 4096
ffn_hidden_size: 11008
num_attention_heads: 32
init_method_std: 0.02
use_scaled_init_method: true
hidden_dropout: 0.0
attention_dropout: 0.0
ffn_dropout: 0.0
kv_channels: null
apply_query_key_layer_scaling: true
normalization: rmsnorm
layernorm_epsilon: 1.0e-06
do_layer_norm_weight_decay: false
make_vocab_size_divisible_by: 128
pre_process: true
post_process: true
persist_layer_norm: true
bias: false
activation: fast-swiglu
headscale: false
transformer_block_type: pre_ln
openai_gelu: false
normalize_attention_scores: true
position_embedding_type: rope
rotary_percentage: 1.0
attention_type: multihead
share_embeddings_and_output_weights: false
overlap_p2p_comm: false
batch_p2p_comm: true
num_query_groups: 4
tokenizer:
  library: sentencepiece
  type: null
  model: nemo:4f7bc9bb269d4abd9680ac15dcec4b16_tokenizer.model
  vocab_file: null
  merge_file: null
  delimiter: null
  sentencepiece_legacy: false
native_amp_init_scale: 4294967296
native_amp_growth_interval: 1000
hysteresis: 2
fp32_residual_connection: false
fp16_lm_cross_entropy: false
megatron_amp_O2: false
grad_allreduce_chunk_size_mb: 125
grad_div_ar_fusion: true
gradient_accumulation_fusion: false
bias_activation_fusion: false
bias_dropout_add_fusion: false
masked_softmax_fusion: true
get_attention_mask_from_fusion: true
apply_rope_fusion: false
seed: 1234
resume_from_checkpoint: null
use_cpu_initialization: false
onnx_safe: false
apex_transformer_log_level: 30
gradient_as_bucket_view: true
sync_batch_comm: false
activations_checkpoint_granularity: null
activations_checkpoint_method: null
activations_checkpoint_num_layers: null
num_micro_batches_with_partial_activation_checkpoints: null
activations_checkpoint_layers_per_pipeline: null
sequence_parallel: false
transformer_engine: true
fp8: false
fp8_e4m3: false
fp8_hybrid: true
fp8_margin: 0
fp8_interval: 1
fp8_amax_history_len: 1024
fp8_amax_compute_algo: max
reduce_amax: true
use_emha: false
data:
  index_mapping_dir: null
  data_impl: mmap
  splits_string: 900,50,50
  seq_length: 4096
  skip_warmup: true
  num_workers: 2
  dataloader_type: single
  reset_position_ids: false
  reset_attention_mask: false
  eod_mask_loss: false
  validation_drop_last: true
  no_seqlen_plus_one_input_tokens: false
  pad_samples_to_global_batch_size: false
  shuffle_documents: true
nsys_profile:
  enabled: false
  start_step: 10
  end_step: 10
  ranks:
  - 0
  gen_shape: false
optim:
  name: fused_adam
  lr: 0.0002
  weight_decay: 0.01
  betas:
  - 0.9
  - 0.98
  sched:
    name: CosineAnnealing
    warmup_steps: 500
    constant_steps: 50000
    min_lr: 2.0e-05
rotary_base: 5000000.0
precision: 16
target: nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel
nemo_version: 1.23.0
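
When changing `pipeline_model_parallel_size` on a config like the one above, it can help to check the batch and layer arithmetic first. The sketch below is not taken from NeMo; `parallel_layout` is a hypothetical helper that applies the usual Megatron-style relations (data-parallel size = world size / (TP × PP), microbatches per step = global batch / (micro batch × DP)). The world size of 2 GPUs is an assumption for illustration only, since the issue does not state it, and passing these checks does not imply the hang or IndexError is resolved.

```python
def parallel_layout(world_size, tp, pp, micro_batch_size, global_batch_size, num_layers):
    """Derive data-parallel size, microbatches per step, and layers per pipeline stage.

    Hypothetical helper: raises ValueError when the values cannot be split evenly.
    """
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    dp = world_size // (tp * pp)

    if global_batch_size % (micro_batch_size * dp) != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * dp")
    num_microbatches = global_batch_size // (micro_batch_size * dp)

    if num_layers % pp != 0:
        raise ValueError("num_layers must be divisible by pp")

    return {
        "data_parallel_size": dp,
        "num_microbatches": num_microbatches,
        "layers_per_stage": num_layers // pp,
    }


# Values from the config above; world_size=2 is an assumed example, not from the issue.
baseline = parallel_layout(world_size=2, tp=1, pp=1,
                           micro_batch_size=4, global_batch_size=8, num_layers=32)
with_pp2 = parallel_layout(world_size=2, tp=1, pp=2,
                           micro_batch_size=4, global_batch_size=8, num_layers=32)
print(baseline)
print(with_pp2)
```

Under these assumptions, moving from PP = 1 to PP = 2 on the same two GPUs halves the data-parallel size and doubles the number of microbatches per step, which is the kind of change worth noting when comparing a working run against a failing one.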
|
Some additional information:
|
Hi, could you try setting |
Describe the bug
I attempted to train reward models of different sizes (3B/6B/30B) and found that when PP > 1, two types of issues arise:
3B/6B:
30B:
parameters configuration:
Environment details