
Llama model, torch.compile output for custom device does not match with eager/cpu when generation_config.use_cache set to True #35343

Open
vpandya-quic opened this issue Dec 19, 2024 · 0 comments

System Info

  • transformers version: 4.43.2
  • Platform: Linux-5.15.0-126-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.26.3
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0a0+gitee1b680 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

For a custom device, I am working on adding torch.compile() support with the CPP inductor backend.
I am trying to run "TinyLlama/TinyLlama-1.1B-Chat-v1.0", and its output differs from eager/CPU when the KV cache is used during generation.
If I use the following config, the output of compiled mode matches eager mode:
compiled_model.generation_config.use_cache = False

And for generation with large context lengths, I see output similar to that reported in #30347.
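
For reference, here is a minimal sketch of my setup (a sketch only: greedy decoding, a placeholder prompt and max_new_tokens, and the .to() call for the custom device is commented out since the device string is specific to my backend):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()
# model = model.to("<custom_device>")  # custom-device placement (placeholder)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# Eager reference output (greedy decoding so runs are deterministic)
with torch.no_grad():
    eager_out = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Compiled model with the CPP inductor backend
compiled_model = torch.compile(model, backend="inductor")
compiled_model.generation_config.use_cache = True    # mismatch on the custom device
# compiled_model.generation_config.use_cache = False # with this, outputs match eager
with torch.no_grad():
    compiled_out = compiled_model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Prints False with use_cache=True on the custom device; True with use_cache=False
print(torch.equal(eager_out, compiled_out))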

Expected behavior

Please help me debug this further so that my backend generates correct output in compile mode even with the KV cache enabled.
