Replies: 3 comments
-
Similar to what was suggested above, Explosion's curated-transformers takes a manual approach to caching: the cache can be returned at each step and passed on to the next sequence generation.
-
@phoebeklett You may have some input on this as well, since you've hacked around with HF's internal cache more than anybody :-)
-
Right now we do indeed implement memorizing transformers by passing another
-
KV caching is a common optimization trick to speed up inference with the transformer architecture. Indeed, a given token only attends to previous tokens in the attention layer, so we can cache the inputs of the attention blocks (keys and values) for all previous tokens and pass them directly when running inference on a new token. It is important to get this right, as it can lead to a substantial performance increase. See for instance this comment.
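To make the mechanics concrete, here is a toy single-head attention decoding step (purely illustrative, not Outlines or Hugging Face code): only the new token's query is computed, while the keys and values of all previous tokens are read from the cache.

```python
# Toy single-head attention decode step with a KV cache (illustrative only).
import torch


def decode_step(x_new, w_q, w_k, w_v, cache):
    """Attend from the newest token over all previous tokens using cached K/V.

    x_new: (1, d_model) embedding of the new token.
    cache: dict holding "k" and "v" tensors of shape (t, d_head); empty at step 0.
    """
    q = x_new @ w_q                                   # only the new token needs a query
    k_prev = cache.get("k", torch.empty(0, w_k.shape[1]))
    v_prev = cache.get("v", torch.empty(0, w_v.shape[1]))
    k = torch.cat([k_prev, x_new @ w_k])              # reuse cached keys, append the new one
    v = torch.cat([v_prev, x_new @ w_v])
    cache["k"], cache["v"] = k, v                     # persist for the next decoding step

    scores = (q @ k.T) / k.shape[1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, cache                         # (1, d_head) attention output


d_model, d_head = 16, 8
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
cache = {}
for _ in range(5):                                    # feed tokens one at a time, reusing the cache
    out, cache = decode_step(torch.randn(1, d_model), w_q, w_k, w_v, cache)
```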
There are multiple aspects to integrating a KV cache into a library like Outlines:
First we need to build and persist this cache when sampling a new sequence. A linear cache works fine when generating a single sequence, but we can build something more efficient when sampling several sequences. I was originally thinking of building a trie that we query each time the model is called, but we should also take a close look at PagedAttention.
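A rough sketch of that trie idea, with hypothetical names; a real implementation would store per-layer key/value tensors rather than the opaque placeholder used here:

```python
# Illustrative prefix trie for sharing KV caches across sampled sequences.
# Names are hypothetical; real entries would be per-layer (key, value) tensors.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class TrieNode:
    kv: Any = None                                  # cached K/V for the token ending at this node
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class KVCacheTrie:
    """Key caches by token prefix so sequences sharing a prefix reuse the work."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, token_ids: List[int], kv_per_token: List[Any]) -> None:
        node = self.root
        for token_id, kv in zip(token_ids, kv_per_token):
            node = node.children.setdefault(token_id, TrieNode())
            node.kv = kv

    def longest_cached_prefix(self, token_ids: List[int]) -> Tuple[int, List[Any]]:
        """Return how many leading tokens are already cached, and their cached K/V."""
        node, cached = self.root, []
        for token_id in token_ids:
            if token_id not in node.children:
                break
            node = node.children[token_id]
            cached.append(node.kv)
        return len(cached), cached
```

When several sampled sequences share a prompt prefix, only the uncached suffix would need fresh key/value computations, which is broadly the same intuition as PagedAttention's block-level sharing.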
We also want to persist the cache between generation sequences, especially for infilling workflows where we use the previous completion as a prompt. A first approach would be to cache all the text a model has ever been prompted with, but this may quickly fill up memory.
@thomasahle suggested letting users handle the caching. To do so we could make `Sequence` instances return a state that contains more than the completion, or a tuple `(completion, extra)` by default. The API could look like:
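As a sketch only (`Sequence`, `past_kv`, and `model.generate` are hypothetical names, not Outlines' actual interface), the tuple-returning variant could be:

```python
# Hypothetical sketch, not Outlines' actual API: the caller owns the KV cache.
from typing import Any, Optional, Tuple


class Sequence:
    def __init__(self, model):
        self.model = model

    def __call__(
        self, prompt: str, past_kv: Optional[Any] = None
    ) -> Tuple[str, Any]:
        """Return the completion plus the KV cache built while generating it."""
        # `model.generate` is a placeholder for the underlying decoding loop.
        completion, kv_cache = self.model.generate(prompt, past_key_values=past_kv)
        return completion, kv_cache


# Usage: the user threads the cache through successive calls explicitly.
# completion, kv = sequence("def fibonacci(n):")
# completion, kv = sequence(completion + "\n# now memoize it", past_kv=kv)
```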
An alternative is to pass a `state` to `Sequence` instances which contains both the completion, the KV cache and potentially other information. This abstracts away KV cache management for infilling workflows:
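Again a sketch only, with invented names (`GenerationState`, the `state` keyword); the state bundles the completion and the KV cache so an infilling loop only threads one object through:

```python
# Hypothetical sketch, not Outlines' actual API: the state hides the KV cache.
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class GenerationState:
    completion: str
    kv_cache: Any  # e.g. per-layer (key, value) tensors
    # ...could also carry token ids, sampling state, etc.


class Sequence:
    def __init__(self, model):
        self.model = model

    def __call__(self, prompt: str, state: Optional[GenerationState] = None) -> GenerationState:
        """Generate from `prompt`, transparently reusing the cache carried by `state`."""
        past = state.kv_cache if state is not None else None
        # `model.generate` is a placeholder for the underlying decoding loop.
        completion, kv_cache = self.model.generate(prompt, past_key_values=past)
        return GenerationState(completion=completion, kv_cache=kv_cache)


# Infilling: feed the previous completion back as a prompt; the cache travels with the state.
# state = sequence("Write a docstring for `merge_sort`:")
# state = sequence(state.completion + "\nNow write the implementation:", state=state)
```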
According to @thomasahle, allowing users to pass the KV cache manually would also make it possible to:
TODO
References