
Not able to run training/fsdp-qlora-distributed-llama3.ipynb #55

Closed
aasthavar opened this issue Jun 9, 2024 · 7 comments
aasthavar commented Jun 9, 2024

Hi @philschmid! Thank you for the blog, it's very helpful.

I am trying to reproduce the results as-is: I followed the blog and installed the libraries with the same versions.

I am running into the following issue:
ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32

Someone mentioned here that setting FSDP_CPU_RAM_EFFICIENT_LOADING=1 should solve it, but this is already set in the torchrun command as per the blog.

I'm pretty much clueless. Any suggestions would be really helpful.
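
For context, this dtype-uniformity error typically means FSDP is trying to flatten parameters that do not all share one dtype; in FSDP + QLoRA setups like the blog's, the 4-bit quant storage dtype is usually kept equal to the compute dtype. A minimal sketch, assuming transformers' BitsAndBytesConfig is used (the exact arguments in the blog's script may differ):

# Sketch, not the blog's exact code: keep the 4-bit storage dtype equal to the
# compute dtype (bfloat16) so FSDP does not see mixed torch.bfloat16 /
# torch.float32 tensors when flattening parameters.
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # storage dtype matches compute dtype
)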

philschmid (Owner) commented

Are you using the same versions? The same code? Or did you change something?

aasthavar (Author) commented Jun 9, 2024

Okay, I first had to install the latest flash-attn library to get rid of this error (when I ran the notebook as-is):
ImportError: /opt/conda/envs/pytorch/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14ExchangeDeviceEa

# Install PyTorch for FSDP and FA/SDPA
%pip install --quiet "torch==2.2.2" tensorboard

# Install Hugging Face libraries
%pip install --upgrade "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"

# I added
%pip install flash-attn --no-build-isolation
%pip install "torch==2.3.1"

No changes to the code.

Is there a specific flash-attn version I should be using?
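
For what it's worth, that undefined-symbol ImportError usually means the installed flash-attn wheel was compiled against a different torch version than the one in the environment. A quick sanity check (a generic sketch, not tied to this repo):

# Sketch: confirm torch and flash-attn import together after reinstalling.
# An "undefined symbol" ImportError on import usually means the flash-attn
# wheel was built against a different torch ABI than the installed torch.
import torch
print("torch:", torch.__version__, "cuda:", torch.version.cuda)

import flash_attn  # fails here if the wheel and torch do not match
print("flash-attn:", flash_attn.__version__)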

aasthavar (Author) commented

@philschmid No worries, I was able to make it work. I changed tf32's value from true to false and did a quick test with max_steps=10. The script ran to completion.

This is weird: usually the combination of bf16: true and tf32: true works, but here it didn't. I wonder why?
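
For reference, a minimal sketch of the precision flags in question, assuming the yaml config is parsed into transformers.TrainingArguments as in the blog's training script (the output path here is hypothetical):

# Sketch of the combination that worked in this case (assumption: the yaml
# values map onto transformers.TrainingArguments).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama-3-8b-fsdp-qlora",  # hypothetical output path
    bf16=True,     # bfloat16 mixed precision (needs Ampere or newer GPUs)
    tf32=False,    # disabling TF32 matmuls was the workaround here
    max_steps=10,  # quick smoke test, as mentioned above
)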

lakshya-B commented

@philschmid I encountered the same issue. However, when I changed bf16: true to bf16: false and tf32: true to tf16: true, it started working. I have another query: I am trying to fine-tune the Llama-3 8B model on GPUs with 15 GB of memory each, specifically 4 NVIDIA T4 GPUs. When I ran the same code you provided in the blog, the entire model was being loaded onto a single GPU, causing a GPU out-of-memory error. Do you have any suggestions?

philschmid (Owner) commented

T4 GPUs do not support BF16 or TF32, so that's expected.
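
As a quick sanity check before enabling these flags (a generic sketch, not from the blog), torch can report whether the current GPU supports bfloat16:

# Sketch: check bfloat16 support on the local GPU before setting bf16/tf32.
# T4s (compute capability 7.5) report False; A100/H100 (8.0/9.0) report True.
import torch

print("bf16 supported:", torch.cuda.is_bf16_supported())
print("compute capability:", torch.cuda.get_device_capability())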

lakshya-B commented

@philschmid Regarding training with 4 × 15 GB GPUs, what do you think? I am using a smaller model (8B).


Oliph commented Jul 4, 2024

I have the same error on 4 H100 GPUs. Setting tf32 to false does not solve anything, and neither does tf16: true as in #55 (comment).
