
NVILA-15B Fine Tuning #171

Open
rahuljoshi078 opened this issue Dec 27, 2024 · 0 comments
We are trying to fine-tune the latest NVILA-15B model. We are using only the COCO dataset (M3IT/data/captioning/coco) as a reference to create our custom dataset, and we preprocessed it with the provided script (python preprocess_m3it.py), which produces a .pkl file.
We have downloaded the model from https://huggingface.co/Efficient-Large-Model/NVILA-15B.
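For reference, here is a minimal sketch of how the weights could be fetched so that the large .safetensors shards are fully materialized on disk rather than left as git-lfs pointer files (assuming huggingface_hub is installed; the local path is just the one used below):

```python
from huggingface_hub import snapshot_download

# Download the full NVILA-15B snapshot. Unlike a plain `git clone` without
# git-lfs installed, this materializes the multi-GB .safetensors shards.
snapshot_download(
    repo_id="Efficient-Large-Model/NVILA-15B",
    local_dir="runs/train/NVILA-15B",
)
```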

We are running the following bash command for supervised fine-tuning; it requires a model path and a data path, which we assume should be as follows:
model path> runs/train/NVILA-15B (the repository cloned from the link above)
data path> /home/sample_ft/M3IT/data/captioning/coco/captioning_coco_train.pkl (path to the .pkl file)

We are running the command below on a single-GPU A100 80 GB instance.

```
bash scripts/NVILA-Lite/sft.sh runs/train/NVILA-15B /home/sample_ft/M3IT/data/captioning/coco/captioning_coco_train.pkl
```

Error:
```
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
[2024-12-27 06:07:13,511] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 14.77B parameters
Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/user/VILA/llava/train/train_mem.py", line 49, in <module>
[rank0]:     train()
[rank0]:   File "/home/user/VILA/llava/train/train.py", line 512, in train
[rank0]:     model = model_cls(
[rank0]:   File "/home/user/VILA/llava/model/language_model/llava_llama.py", line 49, in __init__
[rank0]:     self.init_vlm(config=config, *args, **kwargs)
[rank0]:   File "/home/user/VILA/llava/model/llava_arch.py", line 75, in init_vlm
[rank0]:     self.llm, self.tokenizer = build_llm_and_tokenizer(llm_cfg, config, *args, **kwargs)
[rank0]:   File "/home/user/VILA/llava/model/language_model/builder.py", line 183, in build_llm_and_tokenizer
[rank0]:     llm = AutoModelForCausalLM.from_pretrained(
[rank0]:   File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
[rank0]:     return model_class.from_pretrained(
[rank0]:   File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4224, in from_pretrained
[rank0]:     ) = cls._load_pretrained_model(
[rank0]:   File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4705, in _load_pretrained_model
[rank0]:     state_dict = load_state_dict(
[rank0]:   File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 555, in load_state_dict
[rank0]:     with safe_open(checkpoint_file, framework="pt") as f:
[rank0]: safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
E1227 06:07:17.403000 139169851979584 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 68458) of binary: /root/anaconda3/envs/vila_adv/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/vila_adv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

llava/train/train_mem.py FAILED
```
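A HeaderTooLarge error from safetensors usually means a .safetensors shard on disk is not a valid safetensors file, for example a git-lfs pointer file left behind by a clone without git-lfs installed. A minimal diagnostic sketch (the model_dir below is just the path from this post; a valid safetensors file starts with an 8-byte little-endian header length that must fit inside the file):

```python
import struct
from pathlib import Path

model_dir = Path("runs/train/NVILA-15B")  # path used in the command above
for shard in sorted(model_dir.rglob("*.safetensors")):
    size = shard.stat().st_size
    with open(shard, "rb") as f:
        # A valid safetensors file begins with the JSON header length
        # encoded as an unsigned 64-bit little-endian integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
    status = "ok" if header_len < size else "suspicious (corrupt or git-lfs pointer?)"
    print(f"{shard.name}: {size} bytes, header_len={header_len} -> {status}")
```

If a shard reports a header length far larger than the file itself, re-downloading the weights should resolve this particular failure.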

Please help or suggest the right way to fine-tune this model on our custom dataset.
