System Info
- transformers version: 4.49.0
- Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.39
- Python version: 3.10.16
- Huggingface_hub version: 0.29.1
- Safetensors version: 0.5.3
- Accelerate version: 1.4.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA GeForce RTX 4090
Who can help?
@gante, @SunMarc, @ArthurZucker
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The following snippet should disable `torch.compile`; note the use of `disable_compile` as a kwarg. According to the documentation, it should override the corresponding value in `generation_config`:
```python
import os

# Enable dynamo logging so any tracing/compilation activity is visible
os.environ["TORCH_LOGS"] = "+dynamo"

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2-2b-it"
device = "cuda:0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

prompt = "<start_of_turn>user\nWrite a poem about the Kraken.<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)

# disable_compile=True should prevent generate() from compiling the decoding step
outputs = model.generate(inputs, max_length=50, disable_compile=True)
text = tokenizer.decode(outputs[0])
```
However, dynamo tracing calls still show up in the logs when running it. The reason appears to be this line, which uses `self.generation_config` instead of the local `generation_config`.
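Until a fix lands, a minimal workaround sketch consistent with that diagnosis (assuming the flag is only read from `self.generation_config`, as described above) is to set `disable_compile` on the model's own generation config instead of passing it as a kwarg:

```python
# Workaround sketch: generate() appears to read this flag from
# self.generation_config, so setting it there (rather than via the kwarg)
# should take effect. Assumes the diagnosis above; not a confirmed fix.
model.generation_config.disable_compile = True
outputs = model.generate(inputs, max_length=50)
text = tokenizer.decode(outputs[0])
```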
Note that this behaviour will be fixed by #36519 once it is merged. Alternatively, we could fix this issue first if that PR takes a long time to be approved.
Expected behavior
As discussed above.