Llama model, torch.compile output for custom device does not match with eager/cpu when generation_config.use_cache set to True
System Info
transformers version: 4.43.2

Who can help?
@ArthurZucker @gante
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction
For a custom device, I am working on adding torch.compile() support with the C++ inductor backend. I am trying to run "TinyLlama/TinyLlama-1.1B-Chat-v1.0", and its output differs between compiled and eager mode when the KV cache is used during generation. If I use the following config, the output of compiled mode matches eager mode:

compiled_model.generation_config.use_cache = False

For large context-length generation I see output similar to #30347.
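For reference, here is a minimal sketch of the comparison I am doing. It runs on plain CPU here since the custom device is out of tree, and the prompt and max_new_tokens are placeholders, not my exact settings:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# Eager reference; greedy decoding so the two runs are comparable.
with torch.no_grad():
    eager_out = model.generate(**inputs, max_new_tokens=32, do_sample=False)

compiled_model = torch.compile(model, backend="inductor")

# Workaround: uncommenting the next line makes the outputs match on my backend.
# compiled_model.generation_config.use_cache = False

with torch.no_grad():
    compiled_out = compiled_model.generate(**inputs, max_new_tokens=32, do_sample=False)

# True on CPU; False on my custom device while use_cache is True.
print(torch.equal(eager_out, compiled_out))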
Expected behavior
Please help me debug this further so that my backend generates correct output in compile mode even with the KV cache enabled.
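In case it helps the triage, below is a sketch of how I plan to localize the divergence: a single forward pass through eager vs. compiled with use_cache=True, comparing logits per position. The prompt is again a placeholder:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()
compiled_model = torch.compile(model, backend="inductor")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    eager_logits = model(**inputs, use_cache=True).logits
    compiled_logits = compiled_model(**inputs, use_cache=True).logits

# Largest per-position logit gap; a spike shows where the outputs diverge.
print((eager_logits - compiled_logits).abs().amax(dim=-1))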