Memory usage: new dynamic cache for models supporting sliding window attention #33619
@Cyrilvallez in a recent PR, extra I don't think I will be able to continue it next week -- would you like to finish that PR for me? :p (the test needs to be finished, a few cache classes are not yet working according to the test)
@Cyrilvallez suggestion: you can paste images directly to a PR header/comment, which will render the image directly here. It's more convenient for the reader than downloading a file 🤗 like this: screenshot -> drag image file into this text box
LGTM, thank you for adding this cache 💛
Have you confirmed that the slow tests for relevant models (like mistral) are passing? Or, at least, that they introduce no new failures, in case some tests are already failing on main?
It seems Qwen2 is not happy with these changes :)
Sure, I'll have a look into it next week 🤗 Slow tests with Mistral were passing, but indeed Qwen2 started to complain, I'm investigating.
ping me once ready for review! 🤗
What does this PR do?
This PR introduces `DynamicSlidingWindowCache`, a new kind of `DynamicCache` that stops growing once its size equals the sliding window. This gives models that use it a (dynamic) fixed-size cache, which is a big win for large inputs.

The idea is that it becomes the default for models with sliding window attention when no cache arguments are given in `generate`. Let me know what you think @gante @ArthurZucker

Here is a simple visual representation of the new cache for `mistralai/Mistral-7B-v0.1` (we stop growing after hitting the 4096 sliding window):

BTW: `SlidingWindowCache` (the static one) is completely broken atm, cannot even be instantiated. Will take a look.
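
For readers who want the growth behaviour in code form, here is a rough, framework-free sketch of the idea. It is not the implementation in this PR: `ToySlidingWindowCache` and `cache_sizes` are hypothetical names made up for illustration, tokens stand in for real key/value tensors, and only the storage size is modelled (no attention, no layers).

```python
from collections import deque


class ToySlidingWindowCache:
    """Toy cache that keeps at most `sliding_window` entries (hypothetical, illustration only)."""

    def __init__(self, sliding_window: int):
        if sliding_window <= 0:
            raise ValueError("sliding_window must be positive")
        self.sliding_window = sliding_window
        # deque(maxlen=...) drops the oldest entries once it is full,
        # so the stored size never exceeds the window.
        self.keys = deque(maxlen=sliding_window)
        self.values = deque(maxlen=sliding_window)
        self.seen_tokens = 0  # total tokens ever appended, evicted ones included

    def update(self, new_keys, new_values):
        """Append per-token key/value states; older states beyond the window are evicted."""
        if len(new_keys) != len(new_values):
            raise ValueError("keys and values must have the same length")
        self.keys.extend(new_keys)
        self.values.extend(new_values)
        self.seen_tokens += len(new_keys)

    def __len__(self):
        return len(self.keys)


def cache_sizes(prompt_len: int, new_tokens: int, sliding_window: int):
    """Return the cache size after prefill and after each generated token."""
    cache = ToySlidingWindowCache(sliding_window)
    prompt = list(range(prompt_len))
    cache.update(prompt, prompt)  # prefill with the whole prompt at once
    sizes = [len(cache)]
    for step in range(new_tokens):
        token = prompt_len + step
        cache.update([token], [token])  # one token per decoding step
        sizes.append(len(cache))
    return sizes


if __name__ == "__main__":
    # Size grows like a dynamic cache, then plateaus at the window.
    print(cache_sizes(prompt_len=3, new_tokens=4, sliding_window=5))
```

The point of the sketch is only the plateau: the size grows token by token like a regular dynamic cache and then stays at the window size, while `seen_tokens` keeps counting.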