
paligemma2-3B-mix in version 4.49.0 does not use the GPU, and 4.50.0.dev is broken #36575

Open

hanggun opened this issue Mar 6, 2025 · 14 comments

Comments

@hanggun

hanggun commented Mar 6, 2025

System Info

  • transformers version: 4.50.0.dev0
  • Platform: Linux-5.15.0-86-generic-x86_64-with-glibc2.35
  • Python version: 3.12.3
  • Huggingface_hub version: 0.29.2
  • Safetensors version: 0.5.3
  • Accelerate version: 1.2.1
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: 0
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.5.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-PCIE-40GB

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import (
    PaliGemmaProcessor,
    PaliGemmaForConditionalGeneration,
)
from PIL import Image
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

model_id = "/home/ps/data/pretrained_model/google/paligemma2-3b-mix-448/"

image = Image.open('40a07304-411a-41f9-afd0-30e25a145399.png')

# Note: this config is defined but never passed to from_pretrained (see discussion below)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

prompt = "describe en"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]
print(model_inputs)

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)

Expected behavior

The caption should be generated within about 2 seconds. I tested versions 4.47.1 and 4.48.3 and generation is fast there. However, in 4.49.0 the GPU does not seem to be used (perhaps everything runs on the CPU) and it is very slow. In 4.50.0.dev I also get the following graph-break output:

skipping cudagraphs due to mutated inputs (52 instances). Found from :
   File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/models/paligemma/modeling_paligemma.py", line 532, in forward
    outputs = self.language_model(
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 886, in forward
    outputs = self.model(
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 667, in forward
    layer_outputs = decoder_layer(
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 321, in forward
    hidden_states, self_attn_weights = self.self_attn(
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 231, in forward
    key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/cache_utils.py", line 1732, in update
    return update_fn(
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/cache_utils.py", line 1696, in _sliding_update
    self.key_cache[layer_idx] += k_out

...and the same block is repeated many more times.

Please help me check and fix this error.

@hanggun hanggun added the bug label Mar 6, 2025
@zucchini-nlp
Member

Graph breaks will be fixed by #36543. We had two clashing PRs merged and didn't catch the graph break

@hanggun
Author

hanggun commented Mar 6, 2025

Graph breaks will be fixed by #36543. We had two clashing PRs merged and didn't catch the graph break

Will #36543 also fix the problem of paligemma2 not using the GPU? The PR is from two days ago and I have installed the latest transformers code. Has this PR still not been merged? When can I test the new version of transformers?

@zucchini-nlp
Member

zucchini-nlp commented Mar 6, 2025

Do you mean the model is not moved to gpu, even after model.to('cuda')? AFAIR it should work

The cache fix will be merged for the next release, I am not sure when exactly it will be ready. Feel free to track progress under the PR
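
A quick way to check where the weights and inputs actually ended up (a minimal sketch, not from this thread, reusing model and model_inputs from the reproduction above; hf_device_map is only set when device_map was passed to from_pretrained):

import torch

print(next(model.parameters()).device)            # expected: cuda:0
print(getattr(model, "hf_device_map", None))      # per-module placement from device_map="auto"
print(model_inputs["input_ids"].device)           # inputs must be on the same device as the model
print(f"{torch.cuda.memory_allocated() / 1e9:.1f} GB allocated on GPU")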

@hanggun
Author

hanggun commented Mar 6, 2025

Do you mean the model is not moved to gpu, even after model.to('cuda')? AFAIR it should work

It is moved to the GPU, but it does not work: GPU utilization stays at 0 in 4.49.0 and 4.50.0.dev, while it is fine in 4.47.1 and 4.48.3

@zucchini-nlp
Member

@hanggun can you give more details, how exactly it doesn't work? Throwing errors or just being slow? If it is just slow, it doesn't mean the model wasn't moved to GPU (if you made sure beforehand the device is correct)

I can look into why it slowed down a bit later, prob smth changed in core model loading

@hanggun
Author

hanggun commented Mar 6, 2025

@hanggun can you give more details, how exactly it doesn't work? Throwing errors or just being slow? If it is just slow, it doesn't mean the model wasn't moved to GPU (if you made sure beforehand the device is correct)

I can look into why it slowed down a bit later, prob smth changed in core model loading

Thank you very much. It does not throw errors; maybe it is just slow. The time taken in 4.49.0 is 20x longer than in 4.47.1, so I guessed the GPU was not being utilised. If you have any information, please tell me~

@zucchini-nlp
Member

Found the issue: after the last release we started auto-compiling model generation whenever a static cache is used (including Gemma2's hybrid cache). Compilation is usually very slow on the first call and needs a warm-up of a few iterations

@gante @ArthurZucker though it seems that generating several times compiles the forward from scratch every time, so we don't see much speed-up even when generating after 10 warm-up runs
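
To see whether the cost is torch.compile warm-up rather than the GPU sitting idle, one can time consecutive calls (a minimal sketch reusing model and model_inputs from the reproduction above; if compilation is the cause, only the first call(s) should be slow):

import time
import torch

for i in range(5):
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():
        model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    torch.cuda.synchronize()
    print(f"generate call {i}: {time.perf_counter() - start:.2f} s")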

@hanggun
Author

hanggun commented Mar 6, 2025

Found the issue: after the last release we started auto-compiling model generation whenever a static cache is used (including Gemma2's hybrid cache). Compilation is usually very slow on the first call and needs a warm-up of a few iterations

@gante @ArthurZucker though it seems that generating several times compiles the forward from scratch every time, so we don't see much speed-up even when generating after 10 warm-up runs

Is this the correct behavior? Every start takes a very long time before generation begins, which is not convenient

@zucchini-nlp
Member

@hanggun I'd recommend using an earlier version of transformers in the meantime, if generation time is slowing down your work

@hanggun
Author

hanggun commented Mar 7, 2025

@hanggun I'd recommend using an earlier version of transformers in the meantime, if generation time is slowing down your work

But new model classes only ship in new versions; for example, Qwen2.5 requires version 4.49.0. So may I ask whether this compilation behaviour will remain the default in the future? And is there a way to see the compilation progress?

@gante
Member

gante commented Mar 7, 2025

@hanggun Gemma2 is a special case, where it has a special cache class. This class happens to be compileable and, in some cases, when generate sees a compileable cache, it attempts to compile the forward pass. This means that Gemma2-related models trigger compilation by default.

The vast majority of models don't trigger compilation by default.

In any case, I'm inspecting to see what's going on, and I will revert the automatic compilation on Gemma2 if I don't find another cause for the slowdowns 🤗
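
As a stop-gap, one way to sidestep the compile path is to hand generate() a plain DynamicCache instead of letting it build the hybrid static cache (a hedged sketch: whether this actually avoids compilation in 4.49/4.50.dev is an assumption, and a dynamic cache may use more memory for Gemma2's sliding-window layers):

from transformers import DynamicCache
import torch

with torch.inference_mode():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=100,
        do_sample=False,
        past_key_values=DynamicCache(),  # assumption: bypasses the compileable hybrid cache
    )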

@gante
Member

gante commented Mar 7, 2025

Ah, you're using quantization! I think quantization and compilation can't happen together :) Maybe that's the root cause, I will double-check

@gante
Member

gante commented Mar 7, 2025

Should be fixed in #36519

(the root cause was indeed quantization, cc @zucchini-nlp )

@hanggun
Author

hanggun commented Mar 8, 2025

Should be fixed in #36519

(the root cause was indeed quantization, cc @zucchini-nlp )

Hi, thank you for your help! However, I think the problem occurs both with and without quantization. I started with plain bfloat16 and it already triggered compilation. I then looked at the code Google used in their Colab, where they quantize gemma-27b, and tried that to see whether it would solve the problem, but it did not. As you can see in my code, I only define a quantization config but never pass it to the from_pretrained method
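
For reference, if 8-bit quantization had actually been intended, the config would have to be passed to from_pretrained; the reproduction above only defines it. A hypothetical variant (not the code that was run):

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
).eval()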
