Replies: 3 comments
- Might be possible to do this using swap.
- This is currently enabled with the KV cache reuse flag.
- If you have a large prompt that you reuse across generation tasks, it's inefficient to re-compute the KV cache for the entire prompt every time. Llama.cpp already has a feature where you can store the KV cache to disk, diff it against the new prompt on each run, and only re-compute the KV cache for the tail of the prompt that changed. Adding such a feature to TensorRT-LLM could significantly reduce latency in scenarios where only the end of the prompt changes between runs. The simplest version would cache in memory; I think that would mainly require changes to handle_per_step. A more extensive version could also cache to disk, like Llama.cpp does. I'm willing to help contribute this feature; first I just need to sort out my other issues with getting a quantized model that fits on my rig 😄. A rough sketch of the prefix-reuse logic follows below.
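For reference, here is a minimal, backend-agnostic sketch of the prefix-reuse idea described above. It is not TensorRT-LLM or llama.cpp code: `CachedPrompt`, `prefill`, and `extend` are hypothetical placeholders for whatever the backend actually exposes. The point is only the bookkeeping: compare the new prompt's token ids against the cached ones, keep the shared prefix, and run prefill for just the changed tail.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CachedPrompt:
    token_ids: List[int]  # tokens already represented in the KV cache
    kv_handle: object     # backend-specific KV-cache state (opaque here)


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reuse_or_prefill(
    cache: Optional[CachedPrompt],
    new_tokens: List[int],
    prefill: Callable[[List[int]], object],            # hypothetical: build KV cache from scratch
    extend: Callable[[object, int, List[int]], object],  # hypothetical: truncate to `keep` tokens, append tail
) -> CachedPrompt:
    """Recompute only the suffix of the prompt that actually changed."""
    if cache is None:
        # Cold start: no cache yet, prefill the whole prompt once.
        return CachedPrompt(new_tokens, prefill(new_tokens))

    keep = common_prefix_len(cache.token_ids, new_tokens)
    tail = new_tokens[keep:]
    if not tail:
        # New prompt is identical to (or a prefix of) the cached one: full reuse.
        return CachedPrompt(new_tokens, cache.kv_handle)

    # Drop any cached entries past the shared prefix, then prefill just the tail.
    kv = extend(cache.kv_handle, keep, tail)
    return CachedPrompt(new_tokens, kv)
```

The in-memory variant would keep `CachedPrompt` alive between requests keyed by session; a disk-backed variant, like llama.cpp's, would serialize `token_ids` plus the KV tensors and reload them before running `common_prefix_len`.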