ValueError: FlatParameter requires uniform dtype but got torch.float16 and torch.float32 #1620
Comments
Please fill in the issue template for the specific bug in Accelerate or close this. There is no point opening an issue in different repos if it's all the same.
I linked this because someone else opened an issue in PEFT, but the problem appears to actually be in accelerate, so I wanted to make sure the right eyes see it.
Here's the full original issue in PEFT.
My own example/details

System Info / Relevant Package Versions
Using base container nvidia/cuda:11.8.0-devel-ubuntu22.04 in Docker on a Linux box with 2x A6000 GPUs running Ubuntu 22.04.

When does this occur?
When using custom scripts.

Tasks
My own custom task/dataset, although it fails preparing the model before training, so that's not relevant. I've stripped out the dataset code for the minimal example and just passed None for simplicity. It raises the same error in either case.

Reproduction
Condensed script
Full terminal output with script

Expected behavior
The model is loaded with FSDP across the 2 GPUs without crashing.
Hello, FSDP with PEFT isn't leading to any memory savings when compared to plain PyTorch; see pytorch/pytorch#91165 (comment). It also shows how to use FSDP with PEFT nonetheless.
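As a general illustration of what making FSDP and PEFT coexist involves (not the exact recipe from the linked comment): the dtype error above only goes away if every parameter inside a wrapped module shares one dtype, and use_orig_params=True is commonly needed so frozen base weights and trainable LoRA weights can share a FlatParameter. A rough sketch, assuming torch.distributed is already initialized and `model` is a PEFT-wrapped model:

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Cast everything to one dtype so each FlatParameter is uniform, then wrap.
# use_orig_params=True lets frozen base weights and trainable LoRA weights
# coexist inside the same flattened parameter group.
model = model.to(torch.bfloat16)
model = FSDP(model, use_orig_params=True)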
Thanks for the heads up.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I got the same error even after pre-casting all modules' parameters to the same dtype.
Hi, how did you fix that? Still stuck with the error.
Same issue here; it seems FSDP isn't playing nicely with PEFT.
The error message I see is slightly different:
But I think it's the same issue other folks here seem to be facing. This happens when I use the
Setting FSDP_CPU_RAM_EFFICIENT_LOADING=1 solved the problem for me.
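In case it helps others: that variable has to be visible to the process that launches training. A small sketch of one way to set it from Python (the script name train.py is just a placeholder):

import os
import subprocess

# Enable RAM-efficient loading so only rank 0 materializes the full weights;
# the other ranks receive them via sync_module_states.
os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = "1"

# The launched processes inherit the environment set above.
subprocess.run(["accelerate", "launch", "train.py"], check=True)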
I tried launching the script with

This is the blog I am following. My command:

These are the libraries:

%pip install --quiet \
  "torch==2.2.2" tensorboard
# Install Hugging Face libraries
%pip install --upgrade --quiet \
  "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"

Any suggestions on how to solve or further investigate the issue? Is there any specific library version I am missing?
Reopening as I came across this myself. Correct me if I'm wrong, have we enabled any
Hi @muellerzr
Not sure if it is the same issue. In my case, I used the sample code created by Schmid (https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fsdp-qlora-distributed-llama3.ipynb). When I used a newer transformers lib >= 4.41.0, I encountered the error. I looked at the changes between 4.40.2 and 4.41.0 and found this changeset: huggingface/transformers@f16caf4. Then I was able to make the code work again by adding "cpu_ram_efficient_loading" to the fsdp_config, i.e. fsdp_config:
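(The exact YAML block was cut off above; based on the full config shared later in this thread, the key is fsdp_cpu_ram_efficient_loading: true. If you set up FSDP programmatically instead of via accelerate config, a rough sketch, assuming your accelerate version exposes the cpu_ram_efficient_loading argument on the plugin:)

from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin

# cpu_ram_efficient_loading needs sync_module_states so that ranks other than 0
# receive the weights that only rank 0 loaded into CPU RAM.
fsdp_plugin = FullyShardedDataParallelPlugin(
    cpu_ram_efficient_loading=True,
    sync_module_states=True,
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)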
Hi @tle211212, I am using transformers==4.42.4 and torch==2.3.1 with that fsdp_config. However, I am getting another error: "output tensor size must be equal to world_size times input tensor size". Command:

Any solution/suggestion to fix this? Thanks.
Sad to find this bug has not been fixed after more than one year.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Not stale. The bug is still there. FSDP is a great tool for large and long-context training, please fix it. All libraries are installed at their latest versions.
** ERROR **
Thanks,
+1. Tested FSDP with QLoRA on Qwen 7B using the accelerate launcher.
@thusinh1969 Are you also using LoRA/QLoRA, or normal fine-tuning? @nivibilla Could you please show your training script, or at the very least how the base model and PEFT model are initialized?
@BenjaminBossan sure
Thanks @nivibilla. I assume you're on the latest versions of the relevant libraries (PEFT, accelerate, transformers)? With your setting, I'm not sure if we'll get

Another thing you could try is to coerce all LoRA modules to bfloat16. For this, after initializing the PEFT model, run:

import torch

# Cast only the LoRA adapter modules to bfloat16 so they match the base weights.
for name, module in model.named_modules():
    if "lora_" in name:
        module.to(torch.bfloat16)

Normally this shouldn't be necessary, but if it helps, we learn more about the source of the issue.
Thanks @BenjaminBossan for the

I'm using Databricks, so I prefer to use the notebook launcher if possible.
I have the same error trying to do QLoRA FSDP for

I tried the solution proposed by @BenjaminBossan, but it didn't resolve the issue. However, trying to coerce all modules to bf16 seems to bypass the issue:
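(The snippet itself was cut off above; a minimal sketch of what such a blanket coercion could look like, assuming the PEFT-wrapped model is named model. This is my reconstruction, not the exact code used:)

import torch

# Cast every parameter and buffer (base weights, LoRA weights, norms, embeddings)
# to bfloat16 so FSDP sees a single dtype when it builds its FlatParameters.
model = model.to(torch.bfloat16)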
Even though it doesn't trigger the error, something else seems to be broken, as the training stalls until it eventually times out. Specifically, it will print the start of the WandB logs but not the tqdm training progress bar. During this time, GPU memory consumption doesn't change according to
Strangely, I saw the second issue of a stalled training run when trying to run run_peft_qlora_fsdp.sh, which is referenced in Hugging Face's documentation page on QLoRA FSDP. Note that this issue seems to occur in this script with other models like Llama 2 7B/70B. However, the issue is resolved there if I use the minimum required package versions mentioned in the docs, i.e.

Thanks to a lot of PyPI version hopscotch, the offending change seems to be in

However, the first issue seems present even when I used the minimum required package versions. Take this with a grain of salt, as I was only able to run haphazard tests; my codebase had several incompatibilities with older HF package versions.

In summary, this seems to suggest that there are two issues here which might not be related:

Any insights into either of these issues? Please let me know if I need to file issues in other repos as well. Thanks!
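For context, the QLoRA + FSDP recipe that run_peft_qlora_fsdp.sh follows relies on the 4-bit weights being stored in the same dtype as the rest of the model; otherwise FSDP ends up with mixed dtypes when flattening. A rough sketch of that part of the setup (the model id and dtype are illustrative, not taken from this thread):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Storage dtype of the packed 4-bit weights; keeping it equal to the
    # torch_dtype below is what lets FSDP flatten everything as bfloat16.
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)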
@nivibilla: Yes, I think it should be possible like that.

@wizeng23 Thanks for your detailed report. Based on that, I ran my own experiments. What I found: when using transformers 4.44.2, I can train a Llama model (tested

(Note that the tokenizers version also needs to be changed, but that's probably not the cause.)

When checking where the float32 params come from, those are indeed the LoRA weights, but only on rank 1, while rank 0 is all bfloat16. When going back to 4.44.2, the dtype is bfloat16 on all ranks. This explains why your coercion code fixes the issue. Normally,

When I check this variable:

This should not happen; it needs to always be

Regarding the issue with Llama 3.2 3B, I didn't have time to look into that yet, but let's first try to resolve this fundamental issue.

Edit: I also tried the latest transformers version (

I have hopes that this will be resolved with the same fix, so I'd say we can ignore it for now 🤞
Thanks for the analysis @BenjaminBossan! I'll just use the temporary dtype-coercion fix for now while waiting for the root fix. If only the LoRA weights are float32, then your coercion code should also work, right? Since that didn't work for me, I'm wondering if something else in the model is also float32.
Also, in my codebase, reverting to
If you downgrade to

My configuration settings are as follows; you can follow this as a guide.

# FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
fsdp_forward_prefetch: false
fsdp_offload_params: true
fsdp_sharding_strategy: FULL_SHARD
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: false
fsdp_offload_optimizer: true
fsdp_activation_offload: true
machine_rank: 0
main_training_function: train
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The most useful method for now is to change the version of the transformers library. Thanks for the great insight @BenjaminBossan
Just a small update: The problem is that, after this transformers PR, the model weights are loaded on the meta device for memory reasons. This broke a check in trl, but there is already a fix for it: huggingface/trl#2089. It's just not released yet; installing trl from source should fix the error. I could verify locally that the fix from that PR is enough to make FSDP QLoRA training work again using transformers 4.45.0, 4.45.1, and 4.45.2. There is still an error when using transformers installed from
Now it's released: https://pypi.org/project/trl/0.11.4/
Also an update on this issue for anyone trying to use transformers > 4.45 (i.e. installed from source, or if you're from the future): when using FSDP + QLoRA, transformers will automatically set
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@BenjaminBossan Yes, regarding what you mentioned here, we addressed it ourselves in our repo: foundation-model-stack/fms-acceleration#96. Generally we just put the embeddings back on the CPU.
System Info

Information

Tasks
- no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)

Reproduction
See huggingface/peft#484

Expected behavior
The training code is able to handle the FP16 weights selected via accelerate config.
Apologies for linking everything, but it's all been provided already by another OP and I am up too late already debugging.