Replies: 3 comments
-
Similar to what was suggested above, Explosion's curated-transformers takes a manual approach to caching: the cache can be returned at each step and passed on to the next sequence generation.
-
@phoebeklett You may have some input on this as well, since you've hacked around with HF's internal cache more than anybody :-)
-
Right now we do indeed implement memorizing transformers by passing another
-
KV caching is a common optimization trick to speed up inference with the transformer architecture. Indeed, a given token only attends to previous tokens in the attention layer, so we can cache the inputs of the attention blocks (keys and values) for all previous tokens and pass them directly when running inference on a new token. It is important to get this right, as it can lead to a substantial performance increase. See for instance this comment.
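To make the mechanics concrete, here is a toy single-head attention decoding step (purely illustrative, not Outlines or Hugging Face code): only the new token's query is computed, while the keys and values of all previous tokens are read from the cache.

```python
# Toy single-head attention decode step with a KV cache (illustrative only).
import torch


def decode_step(x_new, w_q, w_k, w_v, cache):
    """Attend from the newest token over all previous tokens using cached K/V.

    x_new: (1, d_model) embedding of the new token.
    cache: dict holding "k" and "v" tensors of shape (t, d_head); empty at step 0.
    """
    q = x_new @ w_q                                   # only the new token needs a query
    k_prev = cache.get("k", torch.empty(0, w_k.shape[1]))
    v_prev = cache.get("v", torch.empty(0, w_v.shape[1]))
    k = torch.cat([k_prev, x_new @ w_k])              # reuse cached keys, append the new one
    v = torch.cat([v_prev, x_new @ w_v])
    cache["k"], cache["v"] = k, v                     # persist for the next decoding step

    scores = (q @ k.T) / k.shape[1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, cache                         # (1, d_head) attention output


d_model, d_head = 16, 8
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
cache = {}
for _ in range(5):                                    # feed tokens one at a time, reusing the cache
    out, cache = decode_step(torch.randn(1, d_model), w_q, w_k, w_v, cache)
```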
There are multiple aspects to integrating a KV cache into a library like Outlines:
First we need to build and persist this cache when sampling a new sequence. A linear cache works fine when generating a single sequence, but we can build something more efficient when sampling several sequences. I was originally thinking of building a trie that we query each time the model is called, but we should also take a close look at PagedAttention.
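A rough sketch of that trie idea, with hypothetical names; a real implementation would store per-layer key/value tensors rather than the opaque placeholder used here:

```python
# Illustrative prefix trie for sharing KV caches across sampled sequences.
# Names are hypothetical; real entries would be per-layer (key, value) tensors.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class TrieNode:
    kv: Any = None                                  # cached K/V for the token ending at this node
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class KVCacheTrie:
    """Key caches by token prefix so sequences sharing a prefix reuse the work."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, token_ids: List[int], kv_per_token: List[Any]) -> None:
        node = self.root
        for token_id, kv in zip(token_ids, kv_per_token):
            node = node.children.setdefault(token_id, TrieNode())
            node.kv = kv

    def longest_cached_prefix(self, token_ids: List[int]) -> Tuple[int, List[Any]]:
        """Return how many leading tokens are already cached, and their cached K/V."""
        node, cached = self.root, []
        for token_id in token_ids:
            if token_id not in node.children:
                break
            node = node.children[token_id]
            cached.append(node.kv)
        return len(cached), cached
```

When several sampled sequences share a prompt prefix, only the uncached suffix would need fresh key/value computations, which is broadly the same intuition as PagedAttention's block-level sharing.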
We also want to persist the cache between generation sequences, especially for infilling workflows where we use the previous completion as a prompt. A first approach would be to cache all the text a model has ever been prompted with, but this may quickly fill up memory.
@thomasahle suggested letting users handle the caching. To do so we could make `Sequence` instances return a state that contains more than the completion, or a tuple `(completion, extra)` by default. The API could look like:
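As a sketch only (`Sequence`, `past_kv`, and `model.generate` are hypothetical names, not Outlines' actual interface), the tuple-returning variant could be:

```python
# Hypothetical sketch, not Outlines' actual API: the caller owns the KV cache.
from typing import Any, Optional, Tuple


class Sequence:
    def __init__(self, model):
        self.model = model

    def __call__(
        self, prompt: str, past_kv: Optional[Any] = None
    ) -> Tuple[str, Any]:
        """Return the completion plus the KV cache built while generating it."""
        # `model.generate` is a placeholder for the underlying decoding loop.
        completion, kv_cache = self.model.generate(prompt, past_key_values=past_kv)
        return completion, kv_cache


# Usage: the user threads the cache through successive calls explicitly.
# completion, kv = sequence("def fibonacci(n):")
# completion, kv = sequence(completion + "\n# now memoize it", past_kv=kv)
```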
An alternative is to pass a `state` to `Sequence` instances which contains both the completion, the KV cache and potentially other information. This abstracts away KV cache management for infilling workflows:
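Again a sketch only, with invented names (`GenerationState`, the `state` keyword); the state bundles the completion and the KV cache so an infilling loop only threads one object through:

```python
# Hypothetical sketch, not Outlines' actual API: the state hides the KV cache.
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class GenerationState:
    completion: str
    kv_cache: Any  # e.g. per-layer (key, value) tensors
    # ...could also carry token ids, sampling state, etc.


class Sequence:
    def __init__(self, model):
        self.model = model

    def __call__(self, prompt: str, state: Optional[GenerationState] = None) -> GenerationState:
        """Generate from `prompt`, transparently reusing the cache carried by `state`."""
        past = state.kv_cache if state is not None else None
        # `model.generate` is a placeholder for the underlying decoding loop.
        completion, kv_cache = self.model.generate(prompt, past_key_values=past)
        return GenerationState(completion=completion, kv_cache=kv_cache)


# Infilling: feed the previous completion back as a prompt; the cache travels with the state.
# state = sequence("Write a docstring for `merge_sort`:")
# state = sequence(state.completion + "\nNow write the implementation:", state=state)
```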
According to @thomasahle, allowing users to pass the KV cache manually would also make it possible to:
TODO
References