We are trying to fine-tune the latest NVILA-15B model. We use only the COCO captioning dataset (M3IT/data/captioning/coco) as a reference for building our custom dataset, and we preprocessed it with the provided script (python preprocess_m3it.py), which produces a .pkl file.
We downloaded the model from https://huggingface.co/Efficient-Large-Model/NVILA-15B.
We are running the following bash command for supervised fine-tuning. It requires a model path and a data path, which we assume should be as follows:
model path> runs/train/NVILA-15B (the repository cloned from the link above)
data path> /home/sample_ft/M3IT/data/captioning/coco/captioning_coco_train.pkl (path to the .pkl file)
We ran the command below on a single-GPU A100 80GB instance.
bash scripts/NVILA-Lite/sft.sh runs/train/NVILA-15B /home/sample_ft/M3IT/data/captioning/coco/captioning_coco_train.pkl
Error:
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
[2024-12-27 06:07:13,511] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 14.77B parameters
Loading checkpoint shards: 0%| | 0/6 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/user/VILA/llava/train/train_mem.py", line 49, in <module>
[rank0]: train()
[rank0]: File "/home/user/VILA/llava/train/train.py", line 512, in train
[rank0]: model = model_cls(
[rank0]: File "/home/user/VILA/llava/model/language_model/llava_llama.py", line 49, in __init__
[rank0]: self.init_vlm(config=config, *args, **kwargs)
[rank0]: File "/home/user/VILA/llava/model/llava_arch.py", line 75, in init_vlm
[rank0]: self.llm, self.tokenizer = build_llm_and_tokenizer(llm_cfg, config, *args, **kwargs)
[rank0]: File "/home/user/VILA/llava/model/language_model/builder.py", line 183, in build_llm_and_tokenizer
[rank0]: llm = AutoModelForCausalLM.from_pretrained(
[rank0]: File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
[rank0]: return model_class.from_pretrained(
[rank0]: File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4224, in from_pretrained
[rank0]: ) = cls._load_pretrained_model(
[rank0]: File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4705, in _load_pretrained_model
[rank0]: state_dict = load_state_dict(
[rank0]: File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 555, in load_state_dict
[rank0]: with safe_open(checkpoint_file, framework="pt") as f:
[rank0]: safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
E1227 06:07:17.403000 139169851979584 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 68458) of binary: /root/anaconda3/envs/vila_adv/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/vila_adv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/vila_adv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
llava/train/train_mem.py FAILED
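One check that may narrow this down: HeaderTooLarge is raised while reading the first checkpoint shard, and a common cause (an assumption here, not something we have confirmed for this setup) is that the repository was cloned without Git LFS, so each .safetensors file is a small text pointer instead of real weights. A safetensors file starts with an 8-byte little-endian header length followed by a JSON header; a pointer file starts with the text "version https://git-lfs.github.com/spec/v1", which decodes to an enormous header length. The sketch below classifies every shard under the model directory. The helper names (inspect_shard, scan_model_dir) are hypothetical, and the 100 MB header limit is our understanding of the safetensors default.

```python
import json
import struct
from pathlib import Path

# First bytes of a Git LFS pointer file (per the Git LFS pointer spec).
LFS_MAGIC = b"version https://git-lfs.github.com/spec/v1"
# Assumed safetensors header size limit; treat the exact value as an assumption.
MAX_HEADER_BYTES = 100_000_000


def inspect_shard(path: Path) -> str:
    """Classify one .safetensors file as 'ok', 'lfs-pointer', or 'corrupt'."""
    size = path.stat().st_size
    with path.open("rb") as f:
        head = f.read(len(LFS_MAGIC))
        if head == LFS_MAGIC:
            return "lfs-pointer"
        if len(head) < 8:
            return "corrupt"
        # Bytes 0..7: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", head[:8])
        if header_len > MAX_HEADER_BYTES or header_len > size - 8:
            return "corrupt"
        f.seek(8)
        try:
            json.loads(f.read(header_len))
        except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
            return "corrupt"
    return "ok"


def scan_model_dir(model_dir: str) -> dict:
    """Map every .safetensors file under model_dir (recursively) to its status."""
    root = Path(model_dir)
    return {str(p): inspect_shard(p) for p in sorted(root.rglob("*.safetensors"))}
```

Calling scan_model_dir("runs/train/NVILA-15B") and printing the result would show whether any shard is reported as "lfs-pointer" or "corrupt". If so, re-downloading the weights (for example with git lfs pull inside the clone) would be the next thing to try.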
Please help us, or suggest the right way to fine-tune this model on our custom dataset.