Memory usage: new dynamic cache for models supporting sliding window attention #33619

Cyrilvallez · 2024-09-20T10:59:50Z

What does this PR do?

This PR introduces DynamicSlidingWindowCache, a new kind of DynamicCache that stops growing once its size reaches the sliding window. Models using it therefore get a cache that grows dynamically up to a fixed maximum size, which is a big memory win for long inputs.
The idea is for it to become the default for models with sliding window attention when no cache arguments are passed to generate. Let me know what you think @gante @ArthurZucker
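A minimal, framework-free sketch of the behavior described above, assuming one list entry per cached token as a stand-in for the key/value tensors sliced along the sequence dimension. The class `SlidingWindowCacheSketch` and its methods are hypothetical and are not the `DynamicSlidingWindowCache` API from this PR. The cache grows like a dynamic cache until it holds `window` tokens, then keeps only the most recent ones. Returning the full states for the current step is one way to handle a prefill longer than the window (the case the "Correct output if prefill larger than sliding window" commit deals with); that choice is an assumption here.

```python
from typing import List, Sequence, Tuple


class SlidingWindowCacheSketch:
    """Hypothetical per-layer KV cache capped at `window` tokens.

    Each cached "token" is a plain Python object standing in for the
    key/value tensor slice at that sequence position.
    """

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.keys: List[list] = []    # keys[layer_idx] -> cached keys, oldest first
        self.values: List[list] = []  # values[layer_idx] -> cached values, oldest first

    def update(self, new_keys: Sequence, new_values: Sequence, layer_idx: int) -> Tuple[list, list]:
        # Layers are visited in order, so a new layer index means "create storage".
        if layer_idx == len(self.keys):
            self.keys.append([])
            self.values.append([])
        # States handed to attention for this step: cached tokens + all new tokens.
        # During a long prefill this can exceed `window`; the attention mask is
        # still responsible for hiding positions outside the window.
        full_keys = self.keys[layer_idx] + list(new_keys)
        full_values = self.values[layer_idx] + list(new_values)
        # Only the most recent `window` tokens are kept for later steps,
        # so the stored cache stops growing once it reaches the window size.
        self.keys[layer_idx] = full_keys[-self.window:]
        self.values[layer_idx] = full_values[-self.window:]
        return full_keys, full_values

    def get_seq_length(self, layer_idx: int = 0) -> int:
        # Number of tokens currently stored (not the total number of tokens seen).
        return len(self.keys[layer_idx]) if layer_idx < len(self.keys) else 0


# Usage: a 6-token prefill followed by one decoding step, with a window of 4.
cache = SlidingWindowCacheSketch(window=4)
k, _ = cache.update(range(6), range(6), layer_idx=0)
print(len(k), cache.get_seq_length())  # 6 4
k, _ = cache.update([6], [6], layer_idx=0)
print(k, cache.keys[0])  # [2, 3, 4, 5, 6] [3, 4, 5, 6]
```

The stored length never exceeds the window, while each forward pass still sees every token it was given.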

Here is a simple visual representation of the new cache for mistralai/Mistral-7B-v0.1 (we stop growing after hitting the 4096 sliding window):
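To put the plateau at 4096 tokens in numbers, here is a back-of-the-envelope estimate of the per-sequence key/value cache size, using the published Mistral-7B-v0.1 configuration (32 layers, 8 key/value heads, head dimension 128, sliding window 4096) and assuming 2-byte (fp16/bf16) storage. The helper `kv_cache_bytes` is hypothetical, and the estimate ignores allocator overhead and batch size.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int = 32, num_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # Factor 2 accounts for storing both keys and values at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


SLIDING_WINDOW = 4096  # Mistral-7B-v0.1 sliding window
MiB = 1024 ** 2

for seq_len in (4096, 16384, 32768):
    growing = kv_cache_bytes(seq_len) / MiB                      # cache keeps growing
    capped = kv_cache_bytes(min(seq_len, SLIDING_WINDOW)) / MiB  # cache stops at the window
    print(f"{seq_len:>6} tokens: dynamic {growing:6.0f} MiB, sliding {capped:4.0f} MiB")
```

Under these assumptions each token costs 128 KiB, so a cache that keeps growing reaches 4096 MiB at 32768 tokens, while a cache capped at the window stays at 512 MiB.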

BTW: SlidingWindowCache (the static one) is completely broken atm, cannot even be instantiated. Will take a look.

HuggingFaceDocBuilderDev · 2024-09-20T11:25:57Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

gante · 2024-09-20T15:28:23Z

> BTW: SlidingWindowCache (the static one) is completely broken atm, cannot even be instantiated. Will take a look.

@Cyrilvallez in a recent PR, extra super().__init__() lines slipped in. I started working on it on #33297, where it is fixed and a test that checks all caches is added. That PR is actually about an often-requested feature, cache reuse.

I don't think I will be able to continue it next week -- would you like to finish that PR for me? :p (the test needs to be finished, a few cache classes are not yet working according to the test)

gante · 2024-09-20T15:31:52Z

@Cyrilvallez suggestion: you can paste images directly to a PR header/comment, which will render the image directly here. It's more convenient for the reader than downloading a file 🤗

like this: screenshot -> drag image file into this text box

gante

LGTM, thank you for adding this cache 💛

Have you confirmed that slow tests for relevant models (like Mistral) are passing? Or, at least, that they introduce no new failures, in case some tests are already failing on main?

gante · 2024-09-20T15:51:39Z

It seems Qwen2 is not happy with these changes :)

Cyrilvallez · 2024-09-20T15:54:13Z

> > BTW: SlidingWindowCache (the static one) is completely broken atm, cannot even be instantiated. Will take a look.
>
> @Cyrilvallez in a recent PR, extra super().__init__() lines slipped in. I started working on it on #33297, where it is fixed and a test that checks all caches is added. That PR is actually about an often-requested feature, cache reuse.
>
> I don't think I will be able to continue it next week -- would you like to finish that PR for me? :p (the test needs to be finished, a few cache classes are not yet working according to the test)

Sure, I'll have a look into it next week 🤗

Slow tests with Mistral were passing, but indeed Qwen2 started to complain; I'm investigating.

ArthurZucker · 2024-09-20T23:30:02Z

ping me once ready for review! 🤗

Cyrilvallez changed the title ~~New dynamic cache for models supporting sliding window~~ Memory usage: new dynamic cache for models supporting sliding window Sep 20, 2024

Cyrilvallez changed the title ~~Memory usage: new dynamic cache for models supporting sliding window~~ Memory usage: new dynamic cache for models supporting sliding window attention Sep 20, 2024

gante approved these changes Sep 20, 2024


Cyrilvallez added 15 commits September 20, 2024 18:25

- 6a33e28 Add new dynamic cache
- 257f10b Add cache by default in generate for models supporting it
- 80c2434 Add to __init__ and correct typo
- f8a5553 Correct output if prefill larger than sliding window + compatibility
- 693808f Add legacy format handling
- 220a543 Update utils.py
- 1f0687d Update utils.py
- 0f05dc4 style
- fc92887 add docs
- ce2b70d fix import
- 2ba750e Update dummy_pt_objects.py
- f4c3df9 Update test
- 89b56a4 style
- 617726c fix typo
- 4f2e6b2 update cache conversion in test
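The "Add legacy format handling" commit concerns the tuple-based cache format that `transformers` still accepted at the time, where `past_key_values` holds one `(key, value)` pair per layer. Below is a hypothetical sketch of round-tripping such a structure through a window-capped cache, again with plain lists standing in for tensors. The functions `from_legacy` and `to_legacy` are illustrative names, and trimming incoming states to the last `window` positions is an assumption about how legacy inputs longer than the window might be handled, not a description of this PR's code.

```python
from typing import List, Sequence, Tuple

LegacyCache = Tuple[Tuple[list, list], ...]  # one (keys, values) pair per layer


def from_legacy(legacy: Sequence[Tuple[Sequence, Sequence]],
                window: int) -> Tuple[List[list], List[list]]:
    """Hypothetical: load per-layer (keys, values) pairs, keeping only the
    most recent `window` positions of each layer (window must be positive)."""
    keys: List[list] = []
    values: List[list] = []
    for layer_keys, layer_values in legacy:
        keys.append(list(layer_keys)[-window:])
        values.append(list(layer_values)[-window:])
    return keys, values


def to_legacy(keys: List[list], values: List[list]) -> LegacyCache:
    """Hypothetical: export back to a tuple of per-layer (keys, values) pairs."""
    return tuple((list(k), list(v)) for k, v in zip(keys, values))


# Usage: layer 0 holds 6 positions (more than the window), layer 1 holds 2.
legacy = ((list(range(6)), list(range(6))), ([10, 11], [20, 21]))
keys, values = from_legacy(legacy, window=4)
print(to_legacy(keys, values))  # (([2, 3, 4, 5], [2, 3, 4, 5]), ([10, 11], [20, 21]))
```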

Cyrilvallez force-pushed the sliding-window branch from 946b604 to 4f2e6b2 September 20, 2024 16:25
