Describe the bug
The nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 model is out, and I tried to fine-tune it on 4 nodes (4 GPUs with 94 GB each) with an 8k sequence length. Here is the config:
```yaml
step_scheduler:
  global_batch_size: 32
  local_batch_size: 1
  max_steps: 1000
  ckpt_every_steps: 250
  val_every_steps: 250
  num_epochs: 1

rng:
  _target_: nemo_automodel.components.training.rng.StatefulRNG
  seed: 1111
  ranked: true

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  # pretrained_model_name_or_path: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
  trust_remote_code: true
  pretrained_model_name_or_path: ./model_weights
  torch_dtype: bfloat16

checkpoint:
  model_save_format: safetensors
  save_consolidated: true

loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

dataset:
  _target_: ./translate_dataset.py:build_opus_dataset
  split: train
  seq_length: 8192

packed_sequence:
  packed_sequence_size: 8192

validation_dataset:
  _target_: ./translate_dataset.py:build_opus_dataset
  split: validation
  seq_length: 8192

dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: nemo_automodel.components.datasets.utils.default_collater
  shuffle: True

validation_dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: nemo_automodel.components.datasets.utils.default_collater

optimizer:
  _target_: torch.optim.Adam
  betas: [0.9, 0.999]
  eps: 1e-8
  lr: 1.0e-5
  weight_decay: 0

lr_scheduler:
  lr_decay_style: cosine
  min_lr: 1.0e-6

parallelizer:
  _target_: nemo_automodel.components.moe.parallelizer.parallelize_model
  activation_checkpointing: false

distributed:
  _target_: nemo_automodel.components.distributed.fsdp2.FSDP2Manager
  dp_size: none
  tp_size: 1
  cp_size: 1
  ep_size: 8
  dp_replicate_size: 1
  sequence_parallel: false
  activation_checkpointing: true
  use_hf_tp_plan: true

compile:
  enabled: false
  mode: "default"
  fullgraph: false
  dynamic: true
  backend: null
```
I got the following error on all ranks:
```
[rank11]: Traceback (most recent call last):
[rank11]:   File "/lustre/tmp/slurm/4671202/mount/finetune.py", line 24, in <module>
[rank11]:     main()
[rank11]:   File "/lustre/tmp/slurm/4671202/mount/finetune.py", line 18, in main
[rank11]:     recipe.setup()
[rank11]:   File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 976, in setup
[rank11]:     model, model_state_dict_keys, self.optimizer, self.loss_fn, self.param_info = build_model_and_optimizer(
[rank11]:                                                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]:   File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 275, in build_model_and_optimizer
[rank11]:     parallelize_fn(
[rank11]:   File "/opt/Automodel/nemo_automodel/components/config/loader.py", line 378, in instantiate
[rank11]:     return func(*args, **config_kwargs)
[rank11]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]:   File "/opt/Automodel/nemo_automodel/components/moe/parallelizer.py", line 266, in parallelize_model
[rank11]:     assert model.model.moe_config.n_routed_experts % moe_mesh[ep_axis_name].size() == 0, (
[rank11]:            ^^^^^^^^^^^
[rank11]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1964, in __getattr__
[rank11]:     raise AttributeError(
[rank11]: AttributeError: 'NemotronHForCausalLM' object has no attribute 'model'
```
So the NemotronHForCausalLM class doesn't have a model attribute, yet every function in the nemo_automodel.components.moe.parallelizer module expects one.
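For illustration, something like the sketch below would avoid the hard-coded model.model access. This is not the actual parallelizer code: the helper name get_inner_model and the backbone attribute are my assumptions about how NemotronH-style models might expose their decoder, since only model is what parallelize_model currently dereferences.

```python
import torch.nn as nn


def get_inner_model(model: nn.Module) -> nn.Module:
    """Illustrative helper: return the wrapped decoder module regardless of
    its attribute name. 'backbone' is an assumption for NemotronH-style
    models; 'model' is what parallelize_model currently hard-codes."""
    for attr in ("model", "backbone"):
        inner = getattr(model, attr, None)
        if inner is not None:
            return inner
    raise AttributeError(
        f"{type(model).__name__} exposes neither .model nor .backbone"
    )


# Hypothetical usage inside parallelize_model, mirroring the failing assert:
# inner = get_inner_model(model)
# assert inner.moe_config.n_routed_experts % moe_mesh[ep_axis_name].size() == 0
```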
It is also very interesting because your examples/llm_finetune/nemotron/nemotron_nano_v3_hellaswag.yaml recipe uses expert parallelism and this same parallelizer function. How can that work if the class doesn't have the required attribute? Has anyone tested this recipe?
I am using package version 0.2.0, which predates the Nemotron 3 additions, but the main branch still contains the same parallelizer module, so the issue persists there as well.
As I understand it, I can currently fine-tune this model only without any sort of parallelization; even FSDP will not shard the model across GPUs, and since the model is too large to fit on one GPU, it can't be fine-tuned at all?
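For reference, this is the kind of workaround I was considering: dropping the parallelizer section from my config above and relying on FSDP2 sharding alone. This is only a sketch; I don't know whether the recipe actually applies plain FSDP2 sharding to NemotronHForCausalLM when no parallelizer is configured.

```yaml
# Hypothetical workaround sketch: no MoE parallelizer, FSDP2 sharding only.
# Whether train_ft.py shards NemotronHForCausalLM with plain FSDP2 when the
# `parallelizer` section is omitted is an open question.
distributed:
  _target_: nemo_automodel.components.distributed.fsdp2.FSDP2Manager
  dp_size: none          # infer data-parallel size from the world size
  tp_size: 1
  cp_size: 1
  ep_size: 1             # no expert parallelism, so parallelize_model is not needed
  dp_replicate_size: 1
  sequence_parallel: false
  activation_checkpointing: true
```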
Versions
nemo_automodel: 0.2.0rc0 (/opt/Automodel/nemo_automodel/__init__.py)
transformers: 4.57.1 (/opt/venv/lib/python3.12/site-packages/transformers/__init__.py)
torch: 2.9.0a0+50eac811a6.nv25.09 (CUDA 13.0)