
NVILA-15B Fine Tuning #171

Open
rahuljoshi078 opened this issue Dec 27, 2024 · 0 comments
We are trying to fine-tune the latest NVILA-15B model. We are using only the COCO dataset (M3IT/data/captioning/coco) as a reference to create our custom dataset, and we preprocessed it with the provided script (python preprocess_m3it.py), which produces a .pkl file.
We have downloaded the model from https://huggingface.co/Efficient-Large-Model/NVILA-15B.
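For reference, here is a minimal sketch of how the weights could be fetched so that the large .safetensors shards are fully materialized on disk rather than left as git-lfs pointer files (assuming huggingface_hub is installed; the local path is just the one used below):

```python
from huggingface_hub import snapshot_download

# Download the full NVILA-15B snapshot. Unlike a plain `git clone` without
# git-lfs installed, this materializes the multi-GB .safetensors shards.
snapshot_download(
    repo_id="Efficient-Large-Model/NVILA-15B",
    local_dir="runs/train/NVILA-15B",
)
```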

We are running the following bash command for supervised fine-tuning; it requires a model path and a data path, which we assume should be as follows:
model path> runs/train/NVILA-15B (the repository cloned from the link above)
data path> /home/sample_ft/M3IT/data/captioning/coco/captioning_coco_train.pkl (path to the .pkl file)

We are running the command below on a single-GPU A100 80 GB instance.

```
bash scripts/NVILA-Lite/sft.sh runs/train/NVILA-15B /home/sample_ft/M3IT/data/captioning/coco/captioning_coco_train.pkl
```

Error:
```
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
[2024-12-27 06:07:13,511] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 14.77B parameters
Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/user/VILA/llava/train/train_mem.py", line 49, in <module>
[rank0]:     train()
[rank0]:   File "/home/user/VILA/llava/train/train.py", line 512, in train
[rank0]:     model = model_cls(
[rank0]:   File "/home/user/VILA/llava/model/language_model/llava_llama.py", line 49, in __init__
[rank0]:     self.init_vlm(config=config, *args, **kwargs)
[rank0]:   File "/home/user/VILA/llava/model/llava_arch.py", line 75, in init_vlm
[rank0]:     self.llm, self.tokenizer = build_llm_and_tokenizer(llm_cfg, config, *args, **kwargs)
[rank0]:   File "/home/user/VILA/llava/model/language_model/builder.py", line 183, in build_llm_and_tokenizer
[rank0]:     llm = AutoModelForCausalLM.from_pretrained(
[rank0]:   File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
[rank0]:     return model_class.from_pretrained(
[rank0]:   File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4224, in from_pretrained
[rank0]:     ) = cls._load_pretrained_model(
[rank0]:   File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4705, in _load_pretrained_model
[rank0]:     state_dict = load_state_dict(
[rank0]:   File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 555, in load_state_dict
[rank0]:     with safe_open(checkpoint_file, framework="pt") as f:
[rank0]: safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
E1227 06:07:17.403000 139169851979584 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 68458) of binary: /root/anaconda3/envs/vila_adv/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/vila_adv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

llava/train/train_mem.py FAILED
```
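A HeaderTooLarge error from safetensors usually means a .safetensors shard on disk is not a valid safetensors file, for example a git-lfs pointer file left behind by a clone without git-lfs installed. A minimal diagnostic sketch (the model_dir below is just the path from this post; a valid safetensors file starts with an 8-byte little-endian header length that must fit inside the file):

```python
import struct
from pathlib import Path

model_dir = Path("runs/train/NVILA-15B")  # path used in the command above
for shard in sorted(model_dir.rglob("*.safetensors")):
    size = shard.stat().st_size
    with open(shard, "rb") as f:
        # A valid safetensors file begins with the JSON header length
        # encoded as an unsigned 64-bit little-endian integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
    status = "ok" if header_len < size else "suspicious (corrupt or git-lfs pointer?)"
    print(f"{shard.name}: {size} bytes, header_len={header_len} -> {status}")
```

If a shard reports a header length far larger than the file itself, re-downloading the weights should resolve this particular failure.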

Please help or suggest the right way to fine-tune this model on our custom dataset.
