Replies: 3 comments
- Might be possible to do this using swap.
- This is currently enabled with the KV cache reuse flag.
- If you have a large prompt that you reuse across generation tasks, it's inefficient to re-compute the KV cache for the entire prompt every time. Llama.cpp already has a feature where you can store the KV cache to disk, diff it against the new prompt on each run, and only re-compute the KV cache for the tail of the prompt that changed. Adding such a feature to TensorRT-LLM could significantly reduce latency in scenarios where only the end of the prompt changes between runs. The simplest version would cache in memory; I think that would mainly require changes to handle_per_step. A more extensive version could also cache to disk, like Llama.cpp does. I'm willing to help contribute this feature; first I just need to sort out my other issues with getting a quantized model that fits on my rig 😄. A rough sketch of the prefix-reuse logic follows below.
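For reference, here is a minimal, backend-agnostic sketch of the prefix-reuse idea described above. It is not TensorRT-LLM or llama.cpp code: `CachedPrompt`, `prefill`, and `extend` are hypothetical placeholders for whatever the backend actually exposes. The point is only the bookkeeping: compare the new prompt's token ids against the cached ones, keep the shared prefix, and run prefill for just the changed tail.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CachedPrompt:
    token_ids: List[int]  # tokens already represented in the KV cache
    kv_handle: object     # backend-specific KV-cache state (opaque here)


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reuse_or_prefill(
    cache: Optional[CachedPrompt],
    new_tokens: List[int],
    prefill: Callable[[List[int]], object],            # hypothetical: build KV cache from scratch
    extend: Callable[[object, int, List[int]], object],  # hypothetical: truncate to `keep` tokens, append tail
) -> CachedPrompt:
    """Recompute only the suffix of the prompt that actually changed."""
    if cache is None:
        # Cold start: no cache yet, prefill the whole prompt once.
        return CachedPrompt(new_tokens, prefill(new_tokens))

    keep = common_prefix_len(cache.token_ids, new_tokens)
    tail = new_tokens[keep:]
    if not tail:
        # New prompt is identical to (or a prefix of) the cached one: full reuse.
        return CachedPrompt(new_tokens, cache.kv_handle)

    # Drop any cached entries past the shared prefix, then prefill just the tail.
    kv = extend(cache.kv_handle, keep, tail)
    return CachedPrompt(new_tokens, kv)
```

The in-memory variant would keep `CachedPrompt` alive between requests keyed by session; a disk-backed variant, like llama.cpp's, would serialize `token_ids` plus the KV tensors and reload them before running `common_prefix_len`.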