
save_sharded_modelopt_state crashes with hierarchical context parallel groups #981

@shanecmoran

Description

save_sharded_modelopt_state crashes during checkpoint saving when hierarchical_context_parallel_sizes is configured (required for cp_comm_type: a2a+p2p). Training itself runs correctly; only checkpoint saving fails.

Error

File "modelopt/torch/opt/plugins/mcore_dist_checkpointing.py", line 151, in _parse_transformer_config
    config[k] = str(v)
File "megatron/core/process_groups_config.py", line 150, in __repr__
    active_pgs.append(f"{field_info.name}({pg.size()})")
AttributeError: 'list' object has no attribute 'size'

Root Cause

_parse_transformer_config calls str(v) on every field in TransformerConfig.__dict__. When hierarchical CP is enabled, one of the fields in the config's ProcessGroupCollection holds a list of ProcessGroup objects rather than a single ProcessGroup. Calling str() on the collection invokes ProcessGroupCollection.__repr__, which assumes every field has a .size() method; a plain list does not.
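The failure mode can be reproduced in isolation with stand-in classes (hypothetical names, not the real Megatron-LM API): a __repr__ that calls .size() on every field breaks as soon as one field is a list.

```python
# Minimal illustration of the crash. FakeProcessGroup and
# FakeProcessGroupCollection are stand-ins, not the real classes.

class FakeProcessGroup:
    def size(self):
        return 8

class FakeProcessGroupCollection:
    def __init__(self):
        self.tp = FakeProcessGroup()                      # normal field
        self.hierarchical_cp = [FakeProcessGroup()] * 2   # list-typed field (hierarchical CP)

    def __repr__(self):
        # Mirrors the pattern in ProcessGroupCollection.__repr__:
        # assumes every field supports .size().
        parts = [f"{name}({pg.size()})" for name, pg in vars(self).items()]
        return f"Collection({', '.join(parts)})"

try:
    str(FakeProcessGroupCollection())
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'size'
```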

Reproduction

model:
  cp_comm_type: a2a+p2p
  hierarchical_context_parallel_sizes: [8, 2]
  context_parallel_size: 16

4 nodes x 8 GPUs, nvcr.io/nvidia/nemo:26.02 container. Training runs fine for N iterations, then crashes at the first checkpoint save.

Suggested Fix

In _parse_transformer_config, wrap the str(v) call to handle objects whose __repr__ may fail:

try:
    config[k] = str(v)
except (AttributeError, TypeError):
    config[k] = repr(type(v))

A secondary fix in Megatron-LM's ProcessGroupCollection.__repr__ should also handle list-typed fields.

Environment

  • Container: nvcr.io/nvidia/nemo:26.02
  • modelopt: bundled version
  • Megatron-LM: core_r0.16.0

Labels

bug (Something isn't working)