
AttributeError: 'NemotronHForCausalLM' object has no attribute 'model'. Bug in Nemotron 3 model support; parallelism fails. #1149

@DzmitryPihulski

Describe the bug

The nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 model is out and I tried to finetune it on 4 nodes (4 GPUs each, 94 GB per GPU) with an 8k sequence length. Here is the config:

step_scheduler:
  global_batch_size: 32
  local_batch_size: 1
  max_steps: 1000
  ckpt_every_steps: 250
  val_every_steps: 250
  num_epochs: 1

rng:
  _target_: nemo_automodel.components.training.rng.StatefulRNG
  seed: 1111
  ranked: true

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  # pretrained_model_name_or_path: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
  trust_remote_code: true
  pretrained_model_name_or_path: ./model_weights
  torch_dtype: bfloat16


checkpoint:
    model_save_format: safetensors
    save_consolidated: true 

loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

dataset:
  _target_: ./translate_dataset.py:build_opus_dataset
  split: train
  seq_length: 8192

packed_sequence:
  packed_sequence_size: 8192

validation_dataset:
  _target_: ./translate_dataset.py:build_opus_dataset
  split: validation
  seq_length: 8192

dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: nemo_automodel.components.datasets.utils.default_collater
  shuffle: True

validation_dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: nemo_automodel.components.datasets.utils.default_collater

optimizer:
  _target_: torch.optim.Adam
  betas: [0.9, 0.999]
  eps: 1e-8
  lr: 1.0e-5
  weight_decay: 0

lr_scheduler:
  lr_decay_style: cosine
  min_lr: 1.0e-6


parallelizer:
  _target_: nemo_automodel.components.moe.parallelizer.parallelize_model
  activation_checkpointing: false

distributed:
  _target_: nemo_automodel.components.distributed.fsdp2.FSDP2Manager
  dp_size: none
  tp_size: 1
  cp_size: 1
  ep_size: 8
  dp_replicate_size: 1
  sequence_parallel: false
  activation_checkpointing: true
  use_hf_tp_plan: true

compile:
  enabled: false
  mode: "default"
  fullgraph: false
  dynamic: true
  backend: null

I get the following error on all ranks:

[rank11]: Traceback (most recent call last):
[rank11]:   File "/lustre/tmp/slurm/4671202/mount/finetune.py", line 24, in <module>
[rank11]:     main()
[rank11]:   File "/lustre/tmp/slurm/4671202/mount/finetune.py", line 18, in main
[rank11]:     recipe.setup()
[rank11]:   File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 976, in setup
[rank11]:     model, model_state_dict_keys, self.optimizer, self.loss_fn, self.param_info = build_model_and_optimizer(
[rank11]:                                                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]:   File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 275, in build_model_and_optimizer
[rank11]:     parallelize_fn(
[rank11]:   File "/opt/Automodel/nemo_automodel/components/config/loader.py", line 378, in instantiate
[rank11]:     return func(*args, **config_kwargs)
[rank11]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]:   File "/opt/Automodel/nemo_automodel/components/moe/parallelizer.py", line 266, in parallelize_model
[rank11]:     assert model.model.moe_config.n_routed_experts % moe_mesh[ep_axis_name].size() == 0, (
[rank11]:            ^^^^^^^^^^^
[rank11]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1964, in __getattr__
[rank11]:     raise AttributeError(
[rank11]: AttributeError: 'NemotronHForCausalLM' object has no attribute 'model'

So the NemotronHForCausalLM class has no "model" attribute, yet every function in the nemo_automodel.components.moe.parallelizer module expects one.
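
For reference, here is a minimal sketch of how this can be checked without loading the 30B weights (the meta-device instantiation and the local path are my own choices, not part of the recipe, and may need adjusting for the remote code):

# Diagnostic sketch: instantiate the class without materializing weights and
# list its top-level submodules, i.e. what the parallelizer's `model.model`
# access would have to point at.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("./model_weights", trust_remote_code=True)

with torch.device("meta"):  # no weights are allocated
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

for name, _ in model.named_children():
    print(name)                     # the actual top-level attribute name(s)

print(hasattr(model, "model"))      # False, consistent with the traceback above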

It is also interesting because your examples/llm_finetune/nemotron/nemotron_nano_v3_hellaswag.yaml recipe uses expert parallelism and this same parallelizer function. How can that work if the class doesn't have the required submodule? Has anyone tested this recipe?

I am using package version 0.2.0, which predates Nemotron 3 support, but the main branch still has the same parallelizer module, so the issue persists there as well.

Do I understand correctly that I can currently finetune this model only without any parallelization? Since even FSDP will then not shard the model across GPUs, and the model is too large to fit on a single GPU, does that mean it can't be finetuned at all right now?
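
For completeness, the only workaround I could think of is an alias shim like the sketch below, applied before the parallelizer runs. This is untested; the candidate attribute names are my guesses about the remote-code class, and the parallelizer also expects a moe_config on that submodule, so this alone may not be enough:

# Untested sketch: expose the decoder stack under the `.model` name that
# nemo_automodel.components.moe.parallelizer expects. The candidate names
# below are assumptions; use whatever the diagnostic above actually prints.
import torch.nn as nn

def alias_decoder_as_model(model: nn.Module) -> nn.Module:
    if hasattr(model, "model"):
        return model  # already has the attribute the parallelizer needs
    for candidate in ("backbone", "transformer", "decoder"):
        inner = getattr(model, candidate, None)
        if isinstance(inner, nn.Module):
            # nn.Module.__setattr__ registers `inner` a second time under the
            # name "model", so model.model and model.<candidate> are the same object.
            model.model = inner
            return model
    raise AttributeError("no decoder submodule found to alias as .model")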

Versions
nemo_automodel: 0.2.0rc0 (/opt/Automodel/nemo_automodel/__init__.py)
transformers: 4.57.1 (/opt/venv/lib/python3.12/site-packages/transformers/__init__.py)
torch: 2.9.0a0+50eac811a6.nv25.09 CUDA 13.0
