
save_sharded_modelopt_state crashes with hierarchical context parallel groups #981

@shanecmoran

Description

save_sharded_modelopt_state crashes during checkpoint saving when hierarchical_context_parallel_sizes is configured (required for cp_comm_type: a2a+p2p). Training itself runs correctly; only checkpoint saving fails.

Error

File "modelopt/torch/opt/plugins/mcore_dist_checkpointing.py", line 151, in _parse_transformer_config
    config[k] = str(v)
File "megatron/core/process_groups_config.py", line 150, in __repr__
    active_pgs.append(f"{field_info.name}({pg.size()})")
AttributeError: 'list' object has no attribute 'size'

Root Cause

_parse_transformer_config calls str(v) on every field in TransformerConfig.__dict__. When hierarchical CP is enabled, one of the fields in the config's ProcessGroupCollection holds a list of ProcessGroup objects rather than a single ProcessGroup. Calling str() on the collection invokes ProcessGroupCollection.__repr__, which assumes every field has a .size() method; a plain list does not.
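The failure mode can be reproduced in isolation with stand-in classes (hypothetical names, not the real Megatron-LM API): a __repr__ that calls .size() on every field breaks as soon as one field is a list.

```python
# Minimal illustration of the crash. FakeProcessGroup and
# FakeProcessGroupCollection are stand-ins, not the real classes.

class FakeProcessGroup:
    def size(self):
        return 8

class FakeProcessGroupCollection:
    def __init__(self):
        self.tp = FakeProcessGroup()                      # normal field
        self.hierarchical_cp = [FakeProcessGroup()] * 2   # list-typed field (hierarchical CP)

    def __repr__(self):
        # Mirrors the pattern in ProcessGroupCollection.__repr__:
        # assumes every field supports .size().
        parts = [f"{name}({pg.size()})" for name, pg in vars(self).items()]
        return f"Collection({', '.join(parts)})"

try:
    str(FakeProcessGroupCollection())
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'size'
```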

Reproduction

model:
  cp_comm_type: a2a+p2p
  hierarchical_context_parallel_sizes: [8, 2]
  context_parallel_size: 16

4 nodes x 8 GPUs, nvcr.io/nvidia/nemo:26.02 container. Training runs fine for N iterations, then crashes at the first checkpoint save.

Suggested Fix

In _parse_transformer_config, wrap the str(v) call to handle objects whose __repr__ may fail:

try:
    config[k] = str(v)
except (AttributeError, TypeError):
    config[k] = repr(type(v))

A secondary fix in Megatron-LM's ProcessGroupCollection.__repr__ should also handle list-typed fields.

Environment

  • Container: nvcr.io/nvidia/nemo:26.02
  • modelopt: bundled version
  • Megatron-LM: core_r0.16.0

Labels

bug (Something isn't working)