
paligemma2-3B-mix in version 4.49.0 does not use the GPU, and 4.50.0.dev is broken #36575

Open

hanggun opened this issue Mar 6, 2025 · 14 comments

Comments

@hanggun

hanggun commented Mar 6, 2025

System Info

  • transformers version: 4.50.0.dev0
  • Platform: Linux-5.15.0-86-generic-x86_64-with-glibc2.35
  • Python version: 3.12.3
  • Huggingface_hub version: 0.29.2
  • Safetensors version: 0.5.3
  • Accelerate version: 1.2.1
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: 0
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.5.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-PCIE-40GB

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import (
    PaliGemmaProcessor,
    PaliGemmaForConditionalGeneration,
)
from PIL import Image
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

model_id = "/home/ps/data/pretrained_model/google/paligemma2-3b-mix-448/"

image = Image.open('40a07304-411a-41f9-afd0-30e25a145399.png')

# Note: this config is defined but never passed to from_pretrained (see discussion below)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

prompt = "describe en"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]
print(model_inputs)

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)

Expected behavior

The caption should be generated within about 2 seconds. I tested versions 4.47.1 and 4.48.3 and generation is fast there. However, in 4.49.0 the GPU does not seem to be used (perhaps everything runs on the CPU) and it is very slow. In 4.50.0.dev I also get the following graph-break output:

skipping cudagraphs due to mutated inputs (52 instances). Found from :
   File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/models/paligemma/modeling_paligemma.py", line 532, in forward
    outputs = self.language_model(
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 886, in forward
    outputs = self.model(
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 667, in forward
    layer_outputs = decoder_layer(
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 321, in forward
    hidden_states, self_attn_weights = self.self_attn(
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 231, in forward
    key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/cache_utils.py", line 1732, in update
    return update_fn(
  File "/home/ps/jupyter/venv/lib/python3.12/site-packages/transformers/cache_utils.py", line 1696, in _sliding_update
    self.key_cache[layer_idx] += k_out

...and the same block is repeated many more times.

Please help me check and fix this error.

@hanggun hanggun added the bug label Mar 6, 2025
@zucchini-nlp
Member

Graph breaks will be fixed by #36543. We had two clashing PRs merged and didn't catch the graph break

@hanggun
Author

hanggun commented Mar 6, 2025

Graph breaks will be fixed by #36543. We had two clashing PRs merged and didn't catch the graph break

Will #36543 also fix the problem of paligemma2 not using the GPU? The PR is from two days ago and I have installed the latest transformers code. Has this PR still not been merged? When can I test the new version of transformers?

@zucchini-nlp
Member

zucchini-nlp commented Mar 6, 2025

Do you mean the model is not moved to gpu, even after model.to('cuda')? AFAIR it should work

The cache fix will be merged for the next release, I am not sure when exactly it will be ready. Feel free to track progress under the PR
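
A quick way to check where the weights and inputs actually ended up (a minimal sketch, not from this thread, reusing model and model_inputs from the reproduction above; hf_device_map is only set when device_map was passed to from_pretrained):

import torch

print(next(model.parameters()).device)            # expected: cuda:0
print(getattr(model, "hf_device_map", None))      # per-module placement from device_map="auto"
print(model_inputs["input_ids"].device)           # inputs must be on the same device as the model
print(f"{torch.cuda.memory_allocated() / 1e9:.1f} GB allocated on GPU")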

@hanggun
Author

hanggun commented Mar 6, 2025

Do you mean the model is not moved to gpu, even after model.to('cuda')? AFAIR it should work

It is moved to the GPU, but it does not work: GPU utilization stays at 0 in 4.49.0 and 4.50.0.dev, while it is fine in 4.47.1 and 4.48.3

@zucchini-nlp
Member

@hanggun can you give more details, how exactly it doesn't work? Throwing errors or just being slow? If it is just slow, it doesn't mean the model wasn't moved to GPU (if you made sure beforehand the device is correct)

I can look into why it slowed down a bit later, prob smth changed in core model loading

@hanggun
Author

hanggun commented Mar 6, 2025

@hanggun can you give more details, how exactly it doesn't work? Throwing errors or just being slow? If it is just slow, it doesn't mean the model wasn't moved to GPU (if you made sure beforehand the device is correct)

I can look into why it slowed down a bit later, prob smth changed in core model loading

Thank you very much. It does not throw errors; maybe it is just slow. The time taken in 4.49.0 is 20x longer than in 4.47.1, so I guessed the GPU was not being utilised. If you have any information, please tell me~

@zucchini-nlp
Member

Found the issue: after the last release we started auto-compiling model generation whenever a static cache is used (including Gemma2's hybrid cache). Compilation is usually very slow on the first call and needs a warm-up of a few iterations

@gante @ArthurZucker though it seems that generating several times compiles the forward from scratch every time, so we don't see much speed-up even when generating after 10 warm-up runs
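
To see whether the cost is torch.compile warm-up rather than the GPU sitting idle, one can time consecutive calls (a minimal sketch reusing model and model_inputs from the reproduction above; if compilation is the cause, only the first call(s) should be slow):

import time
import torch

for i in range(5):
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():
        model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    torch.cuda.synchronize()
    print(f"generate call {i}: {time.perf_counter() - start:.2f} s")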

@hanggun
Author

hanggun commented Mar 6, 2025

Found the issue: after the last release we started auto-compiling model generation whenever a static cache is used (including Gemma2's hybrid cache). Compilation is usually very slow on the first call and needs a warm-up of a few iterations

@gante @ArthurZucker though it seems that generating several times compiles the forward from scratch every time, so we don't see much speed-up even when generating after 10 warm-up runs

Is this the correct behavior? Every start takes a very long time before generation begins, which is not convenient

@zucchini-nlp
Member

@hanggun I'd recommend using an earlier version of transformers in the meantime, if generation time is slowing down your work

@hanggun
Author

hanggun commented Mar 7, 2025

@hanggun I'd recommend using an earlier version of transformers in the meantime, if generation time is slowing down your work

But new model classes only ship in new versions; for example, Qwen2.5 requires version 4.49.0. So may I ask whether this compilation behaviour will remain the default in the future? And is there a way to see the compilation progress?

@gante
Member

gante commented Mar 7, 2025

@hanggun Gemma2 is a special case, where it has a special cache class. This class happens to be compileable and, in some cases, when generate sees a compileable cache, it attempts to compile the forward pass. This means that Gemma2-related models trigger compilation by default.

The vast majority of models don't trigger compilation by default.

In any case, I'm inspecting to see what's going on, and I will revert the automatic compilation on Gemma2 if I don't find another cause for the slowdowns 🤗
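
As a stop-gap, one way to sidestep the compile path is to hand generate() a plain DynamicCache instead of letting it build the hybrid static cache (a hedged sketch: whether this actually avoids compilation in 4.49/4.50.dev is an assumption, and a dynamic cache may use more memory for Gemma2's sliding-window layers):

from transformers import DynamicCache
import torch

with torch.inference_mode():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=100,
        do_sample=False,
        past_key_values=DynamicCache(),  # assumption: bypasses the compileable hybrid cache
    )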

@gante
Member

gante commented Mar 7, 2025

Ah, you're using quantization! I think quantization and compilation can't happen together :) Maybe that's the root cause, I will double-check

@gante
Member

gante commented Mar 7, 2025

Should be fixed in #36519

(the root cause was indeed quantization, cc @zucchini-nlp )

@hanggun
Author

hanggun commented Mar 8, 2025

Should be fixed in #36519

(the root cause was indeed quantization, cc @zucchini-nlp )

Hi, thank you for your help! However, I think the problem occurs both with and without quantization. I started with plain bfloat16 and it already triggered compilation. I then looked at the code Google used in their Colab, where they quantize gemma-27b, and tried that to see whether it would solve the problem, but it did not. As you can see in my code, I only define a quantization config but never pass it to the from_pretrained method
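
For reference, if 8-bit quantization had actually been intended, the config would have to be passed to from_pretrained; the reproduction above only defines it. A hypothetical variant (not the code that was run):

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
).eval()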
