Feature request

Hello!

It is a bit like #26553, which implemented SinkCache. I would love to see some method of KV cache sparsity like H2O implemented, as proposed in http://arxiv.org/abs/2405.04434. The authors have released the code here: https://github.com/FMInference/H2O.

People can use it like:
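Something along these lines, assuming a hypothetical H2OCache that follows the existing SinkCache-style cache API (the class name and its num_heavy_hitters / num_recent arguments are illustrative and do not exist in transformers yet):

```python
# Illustrative usage sketch: H2OCache is the proposed class and does not exist in
# transformers yet. Its constructor arguments are assumptions modeled on the
# existing SinkCache(window_length=..., num_sink_tokens=...) API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical cache that keeps a fixed budget of heavy-hitter and recent tokens
# and evicts everything else, following the H2O eviction policy.
from transformers import H2OCache  # proposed, not yet available
past_key_values = H2OCache(num_heavy_hitters=128, num_recent=128)

inputs = tokenizer("The key benefit of a sparse KV cache is", return_tensors="pt")
outputs = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```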
Motivation

From the paper: "Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores." H2O is proposed as "a KV cache eviction policy that dynamically retains a balance of recent and H2 tokens".
Your contribution
I would love to help implement this in transformers. It is not only a matter of implementing an H2OCache in src/transformers/cache_utils.py, but also of changing the order of some code in the LlamaAttention#forward function, so that Cache#update can receive the attention scores, which other KV cache sparsity methods such as SnapKV (and future work) also need.
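For concreteness, here is a minimal sketch of the H2O-style selection step, assuming the cache has somehow received per-token accumulated attention scores; the class, method, and argument names are made up for illustration and are not a proposal for the final API:

```python
# Illustrative sketch of the eviction logic only, not a drop-in transformers Cache.
# A real implementation would have to follow the Cache API in
# src/transformers/cache_utils.py and handle per-layer state, masking, etc.
import torch


class H2OEvictionSketch:
    """Keeps the `num_recent` newest tokens plus the `num_heavy_hitters` older
    tokens with the highest accumulated attention score (the "heavy hitters")."""

    def __init__(self, num_heavy_hitters: int, num_recent: int):
        self.num_heavy_hitters = num_heavy_hitters
        self.num_recent = num_recent

    def evict(self, keys, values, accumulated_scores):
        # keys/values: [batch, heads, seq_len, head_dim]
        # accumulated_scores: [batch, heads, seq_len], total attention each cached
        # token has received so far (this is why the cache needs the scores).
        seq_len = keys.shape[-2]
        budget = self.num_heavy_hitters + self.num_recent
        if seq_len <= budget:
            return keys, values, accumulated_scores

        # Always keep the most recent tokens...
        recent_idx = torch.arange(seq_len - self.num_recent, seq_len, device=keys.device)
        recent_idx = recent_idx.expand(*accumulated_scores.shape[:-1], -1)
        # ...and, among the older tokens, the ones with the largest accumulated score.
        older_scores = accumulated_scores[..., : seq_len - self.num_recent]
        heavy_idx = older_scores.topk(self.num_heavy_hitters, dim=-1).indices

        # Gather the kept entries, preserving their original order.
        keep_idx = torch.cat([heavy_idx, recent_idx], dim=-1).sort(dim=-1).values
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
        return (
            keys.gather(2, gather_idx),
            values.gather(2, gather_idx),
            accumulated_scores.gather(2, keep_idx),
        )
```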
Techniques that improve memory utilization with LLMs are always exciting! At first glance, it seems like a good candidate to be added to transformers with the API you showcased in your example. Two additional points for consideration:
1. Benchmarks need to be run before merging, to confirm the implementation works as expected;
2. You mentioned changes in LlamaAttention.forward to use the attention scores. We may need a new function for that, like Cache.post_process(); we can iterate on the design throughout the PR (a rough sketch of one possible shape is below).
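For example, the hook could simply hand the freshly computed attention weights back to the cache so an eviction policy can accumulate per-token importance. Everything below (the name post_process, its signature, the score bookkeeping) is a guess to be settled during review, not an existing transformers API:

```python
# Sketch of a possible Cache.post_process() hook, assuming LlamaAttention.forward
# would call it right after the attention weights are computed.
import torch


class EvictionAwareCacheSketch:
    """Toy stand-in for a Cache subclass that wants to see attention scores."""

    def __init__(self):
        # layer_idx -> accumulated attention mass per cached token: [batch, heads, kv_len]
        self.accumulated_scores = {}

    def post_process(self, attn_weights: torch.Tensor, layer_idx: int) -> None:
        # attn_weights: [batch, heads, q_len, kv_len]. Summing over the query
        # dimension gives how much attention each cached token received this step.
        step_scores = attn_weights.sum(dim=-2)
        prev = self.accumulated_scores.get(layer_idx)
        if prev is not None:
            # Right-pad the running total so it matches the grown KV length.
            prev = torch.nn.functional.pad(prev, (0, step_scores.shape[-1] - prev.shape[-1]))
            step_scores = step_scores + prev
        self.accumulated_scores[layer_idx] = step_scores
        # An H2O- or SnapKV-style cache would evict low-scoring, non-recent entries here.
```

Whether this lives in a separate hook like this or is passed into Cache.update() via cache_kwargs is exactly the kind of design question to iterate on in the PR.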
If you're happy with these two points, we'd be happy to take your PR and guide you through the process 🤗
(P.S. your first link is to the DeepSeek-V2 paper, I'm assuming you meant the H2O paper :) )