
AttributeError: 'NemotronHForCausalLM' object has no attribute 'model'. Bug in Nemotron 3 model support; parallelism fails. #1149

@DzmitryPihulski

Describe the bug

The nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 model is out and I tried to finetune it on 4 nodes (4 GPUs each, 94 GB per GPU) with an 8k sequence length. Here is the config:

step_scheduler:
  global_batch_size: 32
  local_batch_size: 1
  max_steps: 1000
  ckpt_every_steps: 250
  val_every_steps: 250
  num_epochs: 1

rng:
  _target_: nemo_automodel.components.training.rng.StatefulRNG
  seed: 1111
  ranked: true

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  # pretrained_model_name_or_path: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
  trust_remote_code: true
  pretrained_model_name_or_path: ./model_weights
  torch_dtype: bfloat16


checkpoint:
    model_save_format: safetensors
    save_consolidated: true 

loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

dataset:
  _target_: ./translate_dataset.py:build_opus_dataset
  split: train
  seq_length: 8192

packed_sequence:
  packed_sequence_size: 8192

validation_dataset:
  _target_: ./translate_dataset.py:build_opus_dataset
  split: validation
  seq_length: 8192

dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: nemo_automodel.components.datasets.utils.default_collater
  shuffle: True

validation_dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: nemo_automodel.components.datasets.utils.default_collater

optimizer:
  _target_: torch.optim.Adam
  betas: [0.9, 0.999]
  eps: 1e-8
  lr: 1.0e-5
  weight_decay: 0

lr_scheduler:
  lr_decay_style: cosine
  min_lr: 1.0e-6


parallelizer:
  _target_: nemo_automodel.components.moe.parallelizer.parallelize_model
  activation_checkpointing: false

distributed:
  _target_: nemo_automodel.components.distributed.fsdp2.FSDP2Manager
  dp_size: none
  tp_size: 1
  cp_size: 1
  ep_size: 8
  dp_replicate_size: 1
  sequence_parallel: false
  activation_checkpointing: true
  use_hf_tp_plan: true

compile:
  enabled: false
  mode: "default"
  fullgraph: false
  dynamic: true
  backend: null

I get the following error on all ranks:

[rank11]: Traceback (most recent call last):
[rank11]:   File "/lustre/tmp/slurm/4671202/mount/finetune.py", line 24, in <module>
[rank11]:     main()
[rank11]:   File "/lustre/tmp/slurm/4671202/mount/finetune.py", line 18, in main
[rank11]:     recipe.setup()
[rank11]:   File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 976, in setup
[rank11]:     model, model_state_dict_keys, self.optimizer, self.loss_fn, self.param_info = build_model_and_optimizer(
[rank11]:                                                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]:   File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 275, in build_model_and_optimizer
[rank11]:     parallelize_fn(
[rank11]:   File "/opt/Automodel/nemo_automodel/components/config/loader.py", line 378, in instantiate
[rank11]:     return func(*args, **config_kwargs)
[rank11]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]:   File "/opt/Automodel/nemo_automodel/components/moe/parallelizer.py", line 266, in parallelize_model
[rank11]:     assert model.model.moe_config.n_routed_experts % moe_mesh[ep_axis_name].size() == 0, (
[rank11]:            ^^^^^^^^^^^
[rank11]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1964, in __getattr__
[rank11]:     raise AttributeError(
[rank11]: AttributeError: 'NemotronHForCausalLM' object has no attribute 'model'

So the NemotronHForCausalLM class has no "model" attribute, yet every function in the nemo_automodel.components.moe.parallelizer module expects one.
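
For reference, here is a minimal sketch of how this can be checked without loading the 30B weights (the meta-device instantiation and the local path are my own choices, not part of the recipe, and may need adjusting for the remote code):

# Diagnostic sketch: instantiate the class without materializing weights and
# list its top-level submodules, i.e. what the parallelizer's `model.model`
# access would have to point at.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("./model_weights", trust_remote_code=True)

with torch.device("meta"):  # no weights are allocated
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

for name, _ in model.named_children():
    print(name)                     # the actual top-level attribute name(s)

print(hasattr(model, "model"))      # False, consistent with the traceback above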

It is also interesting because your examples/llm_finetune/nemotron/nemotron_nano_v3_hellaswag.yaml recipe uses expert parallelism and this same parallelizer function. How can that work if the class doesn't have the required submodule? Has anyone tested this recipe?

I am using package version 0.2.0, which predates Nemotron 3 support, but the main branch still has the same parallelizer module, so the issue persists there as well.

Do I understand correctly that I can currently finetune this model only without any parallelization? Since even FSDP will then not shard the model across GPUs, and the model is too large to fit on a single GPU, does that mean it can't be finetuned at all right now?
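
For completeness, the only workaround I could think of is an alias shim like the sketch below, applied before the parallelizer runs. This is untested; the candidate attribute names are my guesses about the remote-code class, and the parallelizer also expects a moe_config on that submodule, so this alone may not be enough:

# Untested sketch: expose the decoder stack under the `.model` name that
# nemo_automodel.components.moe.parallelizer expects. The candidate names
# below are assumptions; use whatever the diagnostic above actually prints.
import torch.nn as nn

def alias_decoder_as_model(model: nn.Module) -> nn.Module:
    if hasattr(model, "model"):
        return model  # already has the attribute the parallelizer needs
    for candidate in ("backbone", "transformer", "decoder"):
        inner = getattr(model, candidate, None)
        if isinstance(inner, nn.Module):
            # nn.Module.__setattr__ registers `inner` a second time under the
            # name "model", so model.model and model.<candidate> are the same object.
            model.model = inner
            return model
    raise AttributeError("no decoder submodule found to alias as .model")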

Versions
nemo_automodel: 0.2.0rc0 (/opt/Automodel/nemo_automodel/__init__.py)
transformers: 4.57.1 (/opt/venv/lib/python3.12/site-packages/transformers/__init__.py)
torch: 2.9.0a0+50eac811a6.nv25.09 CUDA 13.0
