
GPU memory size recommended for pruning the llama2-7b-chat-hf model #44

Open
rsong0606 opened this issue Apr 29, 2024 · 4 comments

@rsong0606

Great work, team!

Currently, I am pruning the llama2-7b-chat-hf model from Hugging Face.

python main.py \
    --model NousResearch/Llama-2-7b-chat-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type 2:4 \
    --save out/llama_7b-chat-hf/structured/wanda/

and got this error message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 11.69 MiB is free. Including non-PyTorch memory, this process has 21.98 GiB memory in use. Of the allocated memory 20.84 GiB is allocated by PyTorch, and 61.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

My GPU specs are below:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA L4 On | 00000000:00:03.0 Off | 0 |
| N/A 52C P8 17W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
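
The error message itself points to a fragmentation workaround: setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before the run. A minimal sketch of how to apply it (the variable has to be set before CUDA is initialised, so either export it in the shell or set it before the first import of torch):

import os

# Workaround named in the OOM message: let PyTorch's caching allocator use
# expandable segments to reduce fragmentation. Must be set before CUDA is
# initialised, i.e. before torch is imported in this process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402

print(torch.cuda.is_available())  # CUDA starts with the new allocator setting

Note, though, that this only helps when a lot of memory is reserved but unallocated; in the message above only ~61 MiB is in that state, so the card is probably just genuinely full.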

@Eric-mingjie
Collaborator

I think you need at least 14GB GPU memory to load the 7b model in fp16.
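
A rough back-of-the-envelope check of that figure (a sketch only; Llama-2-7B actually has closer to 6.7B parameters, so the real number is slightly lower):

# Approximate memory needed just to hold a 7B-parameter model in fp16.
n_params = 7e9            # ~7 billion parameters
bytes_per_param = 2       # fp16/bf16 stores each weight in 2 bytes
weights_gb = n_params * bytes_per_param / 1e9       # decimal gigabytes
weights_gib = n_params * bytes_per_param / 1024**3  # binary gibibytes
print(f"~{weights_gb:.0f} GB (~{weights_gib:.1f} GiB) for the weights alone")

On top of the weights come the CUDA context, the calibration activations, and any temporary buffers created during pruning, so the peak usage during pruning is noticeably higher than the load-time footprint.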

@rsong0606
Author

@Eric-mingjie Thanks Eric, mine has 24 GB of GPU memory. Given that at least 14 GB is needed to load the model, I should still have ~10 GB left on the NVIDIA L4. Are there any extra activities that take up more memory, and can they be avoided through the arguments?
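
One way to see where the remaining ~10 GB goes is to log PyTorch's allocator statistics after loading the model and again inside the pruning loop. A generic sketch (the helper below is not part of this repo):

import torch

def log_cuda_memory(tag: str, device: int = 0) -> None:
    # Report how much CUDA memory PyTorch has allocated and reserved so far.
    allocated = torch.cuda.memory_allocated(device) / 1024**3
    reserved = torch.cuda.memory_reserved(device) / 1024**3
    peak = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"[{tag}] allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")

# e.g. log_cuda_memory("after model load") and log_cuda_memory("layer 0 pruned")
# to see which step pushes usage past the ~22 GiB available on the L4.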

@kast424 commented May 8, 2024

Mine has 80 GB of GPU RAM (the NVIDIA A100 and H100 GPUs in Stanage have 80 GB of GPU RAM each), and I still got this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU

Complete error for reference:

torch 2.3.0
transformers 4.41.0.dev0
accelerate 0.31.0.dev0

# of gpus: 1

loading llm model mistralai/Mistral-7B-Instruct-v0.2
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:29<00:15, …]
use device cuda:0
pruning starts
loading calibdation data
dataset loading complete
Traceback (most recent call last):
  File "/mnt/parscratch/users/acq22stk/teamproject/wanda/main.py", line 110, in <module>
    main()
  File "/mnt/parscratch/users/acq22stk/teamproject/wanda/main.py", line 69, in main
    prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
  File "/mnt/parscratch/users/acq22stk/teamproject/wanda/lib/prune.py", line 160, in prune_wanda
    outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0]
  File "/users/acq22stk/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/users/acq22stk/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/users/acq22stk/.conda/envs/prune_llm/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py", line 754, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/users/acq22stk/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/users/acq22stk/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/users/acq22stk/.conda/envs/prune_llm/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py", line 85, in forward
    hidden_states = hidden_states.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU
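
The traceback shows the OOM hitting in the layer-by-layer calibration forward pass, at the point where Mistral's RMSNorm upcasts the hidden states to float32. A rough, self-contained estimate of what the calibration buffers cost (the sample count and sequence length below are assumed typical values, not numbers read from this run):

# Approximate size of one hidden-state buffer kept around during calibration.
# Assumptions: 128 calibration samples, sequence length 2048, hidden size 4096
# (the hidden size of Mistral-7B and Llama-2-7B); actual settings may differ.
n_samples = 128
seq_len = 2048
hidden_size = 4096
bytes_fp16 = 2

buffer_gib = n_samples * seq_len * hidden_size * bytes_fp16 / 1024**3
print(f"one fp16 hidden-state buffer: ~{buffer_gib:.1f} GiB")  # ~2.0 GiB

# The pruning loop in the traceback keeps both an input buffer (inps) and an
# output buffer (outs) of this shape, and the float32 upcast inside RMSNorm
# temporarily enlarges the per-sample activation, so peak usage during pruning
# is well above the weights-only footprint.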

@nehaprakriya

I have the same error with the Mixtral 8x7B model using 4 A6000 GPUs (48GiB memory per device).
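
For a multi-GPU box like this, one thing worth checking is whether the model is actually being sharded across all four cards rather than loaded onto cuda:0 alone. A minimal sketch of an accelerate-style sharded load (the checkpoint name is only an example, and whether the pruning loop then handles a sharded model correctly is a separate question):

import torch
from transformers import AutoModelForCausalLM

# Spread the fp16 weights across all visible GPUs instead of a single device.
# Requires the `accelerate` package to be installed.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",   # example checkpoint name
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
print(model.hf_device_map)  # shows which layers landed on which GPU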
