Memory usage: new dynamic cache for models supporting sliding window attention #33619

Cyrilvallez · 2024-09-20T10:59:50Z

What does this PR do?

This PR introduces DynamicSlidingWindowCache, a new kind of DynamicCache that stops growing once its size reaches the sliding window. Models using it get a cache that still grows dynamically but is capped at a fixed size, which is a big win for memory usage on large inputs.
The idea is for it to become the default for models with sliding window attention when no cache arguments are passed to generate. Let me know what you think @gante @ArthurZucker
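
A minimal sketch of the idea, not the code from this PR: the toy class below (`ToySlidingWindowCache`, a hypothetical name) stores plain Python lists instead of tensors and keeps at most `sliding_window` entries per layer. Returning the full, untruncated states from `update` is an assumption of this sketch, made so that an input longer than the window can still be attended to in the same forward pass.

```python
class ToySlidingWindowCache:
    """Hypothetical toy cache: per-layer lists capped at `sliding_window` entries."""

    def __init__(self, sliding_window):
        if sliding_window <= 0:
            raise ValueError("sliding_window must be a positive integer")
        self.sliding_window = sliding_window
        self.keys = []        # keys[layer_idx] -> list of cached key states
        self.values = []      # values[layer_idx] -> list of cached value states
        self.seen_tokens = 0  # tokens processed so far, including evicted ones

    def update(self, new_keys, new_values, layer_idx):
        # Create storage lazily the first time a layer is seen.
        while len(self.keys) <= layer_idx:
            self.keys.append([])
            self.values.append([])
        if layer_idx == 0:
            self.seen_tokens += len(new_keys)

        # States for this forward pass: everything cached plus the new tokens.
        full_keys = self.keys[layer_idx] + list(new_keys)
        full_values = self.values[layer_idx] + list(new_values)

        # Store only the most recent `sliding_window` entries, so the cache
        # stops growing once it reaches the window size.
        self.keys[layer_idx] = full_keys[-self.sliding_window:]
        self.values[layer_idx] = full_values[-self.sliding_window:]
        return full_keys, full_values


cache = ToySlidingWindowCache(sliding_window=4)
for token in range(10):
    cache.update([token], [token], layer_idx=0)
print(len(cache.keys[0]), cache.seen_tokens)  # 4 10
```

After ten single-token updates the stored length stays at the window size (4), while `seen_tokens` still records the true sequence length, which position-dependent logic would need.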

Here is a simple visual representation of the new cache for mistralai/Mistral-7B-v0.1 (we stop growing after hitting the 4096 sliding window):

BTW: SlidingWindowCache (the static one) is completely broken atm, cannot even be instantiated. Will take a look.
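
For a rough sense of what capping the cache at the 4096-token window saves, here is a back-of-the-envelope estimate. The configuration values (32 layers, 8 key/value heads, head dimension 128, 2 bytes per value for fp16) are assumptions based on the publicly released Mistral-7B-v0.1 config, not taken from this PR, and the helper `kv_cache_bytes` is hypothetical. It counts only the key/value tensors for a batch of 1, and the 32768-token input length is an arbitrary example.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8, head_dim=128,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size in bytes (assumed Mistral-7B-v0.1-like defaults)."""
    # Factor 2 accounts for storing both keys and values.
    return (2 * batch_size * num_layers * num_kv_heads * head_dim
            * num_tokens * bytes_per_value)


capped = kv_cache_bytes(4096)      # cache stops growing at the sliding window
uncapped = kv_cache_bytes(32768)   # cache keeps growing with the input
print(capped / 2**20, uncapped / 2**20)  # 512.0 4096.0 (MiB)
```

Under these assumptions, a 32768-token input would need about 4 GiB of KV cache without the cap and about 512 MiB with it.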

HuggingFaceDocBuilderDev · 2024-09-20T11:25:57Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

gante · 2024-09-20T15:28:23Z

> BTW: SlidingWindowCache (the static one) is completely broken atm, cannot even be instantiated. Will take a look.

@Cyrilvallez in a recent PR, extra super().__init__() lines slipped in. I started working on it on #33297, where it is fixed and a test that checks all caches is added. That PR is actually about an often-requested feature, cache reuse.

I don't think I will be able to continue it next week -- would you like to finish that PR for me? :p (the test needs to be finished, a few cache classes are not yet working according to the test)

gante · 2024-09-20T15:31:52Z

@Cyrilvallez suggestion: you can paste images directly to a PR header/comment, which will render the image directly here. It's more convenient for the reader than downloading a file 🤗

like this: screenshot -> drag image file into this text box
(old)

(new)

gante

LGTM, thank you for adding this cache 💛

Have you confirmed that slow tests for relevant models (like mistral) are passing? Or, at least, that they introduce no new failures, in case some tests are already failing on main?

gante · 2024-09-20T15:51:39Z

It seems Qwen2 is not happy with these changes :)

Cyrilvallez · 2024-09-20T15:54:13Z

> BTW: SlidingWindowCache (the static one) is completely broken atm, cannot even be instantiated. Will take a look.
>
> @Cyrilvallez in a recent PR, extra super().__init__() lines slipped in. I started working on it on #33297, where it is fixed and a test that checks all caches is added. That PR is actually about an often-requested feature, cache reuse.
>
> I don't think I will be able to continue it next week -- would you like to finish that PR for me? :p (the test needs to be finished, a few cache classes are not yet working according to the test)

Sure, I'll have a look into it next week 🤗

Slow tests with Mistral were passing, but Qwen2 has indeed started to complain. I'm investigating.

ArthurZucker · 2024-09-20T23:30:02Z

ping me once ready for review! 🤗

Cyrilvallez changed the title ~~New dynamic cache for models supporting sliding window~~ Memory usage: new dynamic cache for models supporting sliding window Sep 20, 2024

Cyrilvallez changed the title ~~Memory usage: new dynamic cache for models supporting sliding window~~ Memory usage: new dynamic cache for models supporting sliding window attention Sep 20, 2024

gante approved these changes Sep 20, 2024

Cyrilvallez force-pushed the sliding-window branch from 946b604 to 4f2e6b2 Compare September 20, 2024 16:25

Cyrilvallez force-pushed the sliding-window branch 3 times, most recently from 290389d to 3f09bea Compare October 8, 2024 15:53

Cyrilvallez mentioned this pull request Oct 9, 2024

Phi3: fix attn for sliding window #33586

Merged

Cyrilvallez force-pushed the sliding-window branch from afbb69a to 67b2abf Compare October 10, 2024 13:03

Cyrilvallez added 15 commits October 11, 2024 10:56

Add new dynamic cache

d894405

Add cache by default in generate for models supporting it

3b0984b

Add to __init__ and correct typo

345e695

Correct output if prefill larger than sliding window + compatibility

38e82b5

Add legacy format handling

c46a92a

style

02b8506

add docs

7a98aac

fix import

ebe6dc9

Update dummy_pt_objects.py

af95f2a

Update test

08d1a9f

style

b73655a

update cache conversion in test

ff16af0

style

5e3fef0

Allow the cache to support new states of more than 1 token, even after prefill stage

3d1bfd0

Update cache_utils.py

6a02bdc

Cyrilvallez added 25 commits October 11, 2024 10:56

Update test_utils.py

25cd9c0

Update test_utils.py

b2f7dee

Update test_utils.py

b549290

Update causal mask generation in case of DynamicSlidingCache (only Mistral)

f052bed

Improve tests

e091f4d

improve cache

9a30ad4

add exceptions

8202a19

Update utils.py

55a39a6

Update test_utils.py

9caf947

Update test_utils.py

1404cec

Update test_utils.py

4f3ba86

Update test_utils.py

44331f1

Update test_utils.py

b5ebae2

Update 4d mask creation in Mistral

7e78258

fix missed conflict

301f7f2

Apply to other models

be18801

Add required arg in prepare_inoput

734e3fe

Update test_utils.py

106c410

Update test_utils.py

0d8e9ac

Fix kv_seq_length and rotary_seq_length

8509053

up

2ae645f

up

8d539e6

up

e808fa5

up

8499f94

CIs

6879866

Cyrilvallez force-pushed the sliding-window branch from 47dba5d to 6879866 Compare October 11, 2024 08:56

improve sdpa is_causal escape

fe8a625

Cyrilvallez mentioned this pull request Oct 24, 2024

New dynamic cache for sliding window attention #34352

Open

Cyrilvallez closed this Oct 24, 2024

Cyrilvallez deleted the sliding-window branch October 24, 2024 18:55
