paligemma2-3B-mix does not use GPU in version 4.49.0 and is broken in 4.50.0.dev #36575
Comments
Graph breaks will be fixed by #36543. We had two clashing PRs merged and didn't catch the graph break.
Will #36543 also fix the problem of paligemma2 not using the GPU? The PR is from two days ago and I have installed the latest code of transformers. Is this PR still not merged? When can I test the new version of transformers?
Do you mean the model is not moved to GPU, even after ...? The cache fix will be merged for the next release; I am not sure exactly when it will be ready. Feel free to track progress under the PR.
It is moved to GPU, but it does not work. The GPU utilization is always 0 in 4.49.0 and 4.50.0.dev, but it is OK in 4.47.1 and 4.48.3.
@hanggun can you give more details on how exactly it doesn't work? Is it throwing errors or just slow? If it is just slow, that doesn't mean the model wasn't moved to GPU (if you made sure beforehand that the device is correct). I can look into why it slowed down a bit later; probably something changed in core model loading.
Thank you very much. It does not throw errors; maybe it is just slow. The time used in 4.49.0 is 20x longer than in 4.47.1, so I guessed the GPU is not utilised. If you have any information, please tell me~
Found the issue: after the last release we started auto-compiling model generation whenever a static cache is used (including the Gemma2 hybrid cache). Compilation is usually very slow on the first call and needs warm-up with a few iterations. @gante @ArthurZucker though it seems that generating several times compiles the forward from scratch every time, so we don't see much speed-up even when generating after 10 random warmups.
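A rough way to check whether the slowdown is compilation warm-up rather than CPU execution is to time several consecutive generate() calls. This is only a sketch and assumes the model and model_inputs from the reproduction script below:

import time
import torch

# If auto-compilation is the culprit, the first call(s) should be much slower
# than the later ones; if every call is equally slow, the cause is elsewhere.
with torch.inference_mode():
    for i in range(5):
        start = time.perf_counter()
        model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
        torch.cuda.synchronize()
        print(f"call {i}: {time.perf_counter() - start:.2f}s")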
Is this the correct behavior? Every start takes a very long time to begin; it is not convenient.
@hanggun I'd recommend using an earlier version of transformers in the meantime, if generation time is slowing down your work.
But new model classes are only available in newer versions; for example, Qwen2.5 needs version 4.49.0. So may I ask whether this will be a permanent feature in the future? Or can I at least see the compilation progress?
@hanggun Gemma2 is a special case: it has a special cache class. This class happens to be compileable and, in some cases, when ... The vast majority of models don't trigger compilation by default. In any case, I'm inspecting to see what's going on, and I will revert the automatic compilation on Gemma2 if I don't find another cause for the slowdowns 🤗
Ah, you're using quantization! I think quantization and compilation can't happen together :) Maybe that's the root cause, I will double-check.
Should be fixed in #36519 (the root cause was indeed quantization, cc @zucchini-nlp)
Hi, thank you for your help! However, I think the problem exists both with and without quantization. I started with bfloat16 and it triggered compilation. Then I looked at the code Google used in Colab, where they used quantization for gemma-27b, and tried it to see whether it would solve the problem, and it did not. You can see from my code that I only create a config, but I don't use it in the from_pretrained method.
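For reference, a minimal sketch of what actually passing the quantization config would look like (the reproduction script below only constructs the config and never hands it to from_pretrained; the hub id here is an assumption, the report uses a local path):

import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

# Sketch only: here the BitsAndBytesConfig is actually passed to from_pretrained,
# unlike in the reproduction script below where it is created but left unused.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-mix-448",  # assumed hub id
    quantization_config=quantization_config,
    device_map="auto",
).eval()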
System Info
- transformers version: 4.50.0.dev0
- distributed_type: NO
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
from transformers import (
PaliGemmaProcessor,
PaliGemmaForConditionalGeneration,
)
from PIL import Image
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
model_id = "/home/ps/data/pretrained_model/google/paligemma2-3b-mix-448/"
image = Image.open('40a07304-411a-41f9-afd0-30e25a145399.png')
quantization_config = BitsAndBytesConfig(load_in_8bit=True)  # built for testing, intentionally not passed to from_pretrained below
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16,
device_map="auto").eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)
prompt = "
describe en"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]
print(model_inputs)
with torch.inference_mode():
generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
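To rule out the model silently staying on CPU, a quick check like this can be appended to the script above (a sketch; it only inspects device placement and GPU memory):

# Confirm where the weights actually live and whether the GPU holds any memory.
print(next(model.parameters()).device)  # expected: cuda:0
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")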
Expected behavior
Generate a caption within 2 seconds. I tested versions 4.47.1 and 4.48.3, and they are as fast as possible. However, in 4.49.0 it does not use the GPU (maybe it is compiled on CPU) and it is very slow. In the 4.50.0.dev version, there is some graph-break information.
Please help me check and fix this error.