
Not able to run training/fsdp-qlora-distributed-llama3.ipynb #55

Closed
aasthavar opened this issue Jun 9, 2024 · 7 comments
aasthavar commented Jun 9, 2024

Hi @philschmid! Thank you for the blog, it's very helpful.

I am trying to reproduce the results as-is: I followed the blog and installed the libraries with the same versions.

I am running into the following issue:
ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32

Someone mentioned here that setting FSDP_CPU_RAM_EFFICIENT_LOADING=1 should solve it, but this is already set in the torchrun command as per the blog.

I'm pretty much clueless. Any suggestions would be really helpful.
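
For context, this dtype-uniformity error typically means FSDP is trying to flatten parameters that do not all share one dtype; in FSDP + QLoRA setups like the blog's, the 4-bit quant storage dtype is usually kept equal to the compute dtype. A minimal sketch, assuming transformers' BitsAndBytesConfig is used (the exact arguments in the blog's script may differ):

# Sketch, not the blog's exact code: keep the 4-bit storage dtype equal to the
# compute dtype (bfloat16) so FSDP does not see mixed torch.bfloat16 /
# torch.float32 tensors when flattening parameters.
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # storage dtype matches compute dtype
)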

philschmid (Owner) commented

Are you using the same versions? The same code? Or did you change something?

aasthavar (Author) commented Jun 9, 2024

Okay, I first had to install the latest flash-attn library to get rid of this error (when I ran the notebook as-is):
ImportError: /opt/conda/envs/pytorch/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14ExchangeDeviceEa

# Install PyTorch for FSDP and FA/SDPA
%pip install --quiet "torch==2.2.2" tensorboard

# Install Hugging Face libraries
%pip install --upgrade "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"

# I added
%pip install flash-attn --no-build-isolation
%pip install "torch==2.3.1"

No changes to the code.

Is there a specific flash-attn version I should be using?
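
For what it's worth, that undefined-symbol ImportError usually means the installed flash-attn wheel was compiled against a different torch version than the one in the environment. A quick sanity check (a generic sketch, not tied to this repo):

# Sketch: confirm torch and flash-attn import together after reinstalling.
# An "undefined symbol" ImportError on import usually means the flash-attn
# wheel was built against a different torch ABI than the installed torch.
import torch
print("torch:", torch.__version__, "cuda:", torch.version.cuda)

import flash_attn  # fails here if the wheel and torch do not match
print("flash-attn:", flash_attn.__version__)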

aasthavar (Author) commented

@philschmid No worries, I was able to make it work. I changed tf32's value from true to false and did a quick test with max_steps=10. The script ran to completion.

This is weird: usually the combination of bf16: true and tf32: true works, but here it didn't. I wonder why?
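
For reference, a minimal sketch of the precision flags in question, assuming the yaml config is parsed into transformers.TrainingArguments as in the blog's training script (the output path here is hypothetical):

# Sketch of the combination that worked in this case (assumption: the yaml
# values map onto transformers.TrainingArguments).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama-3-8b-fsdp-qlora",  # hypothetical output path
    bf16=True,     # bfloat16 mixed precision (needs Ampere or newer GPUs)
    tf32=False,    # disabling TF32 matmuls was the workaround here
    max_steps=10,  # quick smoke test, as mentioned above
)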

lakshya-B commented

@philschmid I encountered the same issue. However, when I changed bf16: true to bf16: false and tf32: true to tf16: true, it started working. I have another query: I am trying to fine-tune the Llama-3 8B model on GPUs with 15 GB of memory each, specifically 4 NVIDIA T4 GPUs. When I ran the same code you provided in the blog, the entire model was being loaded onto a single GPU, causing a GPU out-of-memory error. Do you have any suggestions?

philschmid (Owner) commented

T4 GPUs do not support BF16 or TF32, so that's expected.
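
As a quick sanity check before enabling these flags (a generic sketch, not from the blog), torch can report whether the current GPU supports bfloat16:

# Sketch: check bfloat16 support on the local GPU before setting bf16/tf32.
# T4s (compute capability 7.5) report False; A100/H100 (8.0/9.0) report True.
import torch

print("bf16 supported:", torch.cuda.is_bf16_supported())
print("compute capability:", torch.cuda.get_device_capability())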

lakshya-B commented

@philschmid Regarding training with 4 × 15 GB GPUs, what do you think? I am using a smaller model (8B).


Oliph commented Jul 4, 2024

I have the same error on 4 H100 GPUs. Setting tf32 to false does not solve anything, and neither does tf16: true as in #55 (comment).
