Description
save_sharded_modelopt_state crashes during checkpoint saving when hierarchical_context_parallel_sizes is configured (required for cp_comm_type: a2a+p2p). Training runs correctly — only checkpoint saving fails.
Error
File "modelopt/torch/opt/plugins/mcore_dist_checkpointing.py", line 151, in _parse_transformer_config
config[k] = str(v)
File "megatron/core/process_groups_config.py", line 150, in __repr__
active_pgs.append(f"{field_info.name}({pg.size()})")
AttributeError: 'list' object has no attribute 'size'
Root Cause
_parse_transformer_config calls str(v) on every field in TransformerConfig.__dict__. When hierarchical CP is enabled, one of the fields in ProcessGroupCollection holds a list of ProcessGroup objects instead of a single ProcessGroup. Calling str() triggers ProcessGroupCollection.__repr__, which assumes every field has a .size() method — but list does not.
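The failure mode can be reproduced in isolation. The sketch below is illustrative only (the class and field names are invented, not the real Megatron-LM `ProcessGroupCollection`): a `__repr__` that assumes every non-None field exposes `.size()` raises exactly this `AttributeError` once one field holds a list.

```python
from dataclasses import dataclass, fields

class FakeProcessGroup:
    """Stand-in for a torch.distributed ProcessGroup."""
    def size(self):
        return 8

@dataclass
class FakePGCollection:
    # Normally a single ProcessGroup; with a2a+p2p the hierarchical
    # field holds a *list* of groups instead.
    cp: object = None
    hierarchical_cp: object = None

    def __repr__(self):
        active = []
        for f in fields(self):
            pg = getattr(self, f.name)
            if pg is not None:
                # Assumes pg has .size() -- a list does not.
                active.append(f"{f.name}({pg.size()})")
        return f"FakePGCollection({', '.join(active)})"

pgs = FakePGCollection(
    cp=FakeProcessGroup(),
    hierarchical_cp=[FakeProcessGroup(), FakeProcessGroup()],
)
try:
    str(pgs)  # str() dispatches to __repr__, which crashes
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'size'
```

This mirrors the call chain in the traceback: `_parse_transformer_config` calls `str(v)` on every config field, which invokes the collection's `__repr__`.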
Reproduction
model:
  cp_comm_type: a2a+p2p
  hierarchical_context_parallel_sizes: [8, 2]
  context_parallel_size: 16

4 nodes x 8 GPUs, nvcr.io/nvidia/nemo:26.02 container. Training runs fine for N iterations, then crashes at the first checkpoint save.
Suggested Fix
In _parse_transformer_config, wrap the str(v) call to handle objects whose __repr__ may fail:
try:
    config[k] = str(v)
except (AttributeError, TypeError):
    config[k] = repr(type(v))

A secondary fix in Megatron-LM's ProcessGroupCollection.__repr__ should also handle list-typed fields.
Environment
- Container: nvcr.io/nvidia/nemo:26.02
- modelopt: bundled version
- Megatron-LM: core_r0.16.0