Cache: don't throw warnings on `gemma2` when instantiating a new cache #33595

gante · 2024-09-19T14:36:21Z

What does this PR do?

Related to #33541

The warning in question should only be thrown when we are converting from the legacy cache format, which will be deprecated soon. Gemma 2 doesn't support the legacy cache format, so no warning should ever be thrown :)

In the process, this PR also fixes a few related inconsistencies.

✅ slow gemma2 tests ran locally. There are a few failures (also present on main). Some failures were fixed in this PR.

gante · 2024-09-19T14:39:26Z

src/transformers/cache_utils.py

    def get_seq_length(self, layer_idx: Optional[int] = 0):
-        return None
+        # Occupied cache == any slot in the 3rd dim (sequence length) holds a non-zero value. To save on compute, let's
+        # limit the check to the first batch member and head dimension.
+        # TODO: deprecate this function in favor of `cache_position`
+        return (self.key_cache[layer_idx][0, 0].any(dim=-1)).sum()


HybridCache is a StaticCache with alternating sliding window layers. The method to retrieve the cache length is copied from StaticCache.

We will want to use another method in the future, but let's leave this as a copy of StaticCache for now. The updated Gemma 2 code needs this method.
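The occupancy check in the diff above can be illustrated without tensors. The following is a minimal plain-Python sketch, assuming one layer's key cache is laid out as nested lists shaped `[batch][head][seq][head_dim]`; `occupied_length` and `make_empty_cache` are hypothetical helper names and are not part of `transformers`.

```python
def make_empty_cache(batch, heads, max_len, head_dim):
    """Build a zero-filled stand-in for one layer's pre-allocated key cache."""
    return [
        [[[0.0] * head_dim for _ in range(max_len)] for _ in range(heads)]
        for _ in range(batch)
    ]


def occupied_length(layer_key_cache):
    """Count sequence slots that hold at least one non-zero value."""
    # Only the first batch member and the first head are inspected,
    # mirroring the `[0, 0]` indexing in the diff.
    first_head = layer_key_cache[0][0]
    return sum(1 for slot in first_head if any(value != 0 for value in slot))


cache = make_empty_cache(batch=2, heads=2, max_len=5, head_dim=3)

# Write three positions, as if three tokens had been processed.
for position in range(3):
    for b in range(2):
        for h in range(2):
            cache[b][h][position] = [0.1, -0.2, 0.3]

print(occupied_length(cache))  # 3 of the 5 pre-allocated slots are occupied
```

One consequence of this definition: a slot whose key vector is exactly all zeros is counted as empty.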

gante · 2024-09-19T14:42:42Z

src/transformers/models/gemma2/modeling_gemma2.py

-                raise ValueError("When `past_key_values` is passed, `cache_position` must be too")
-
-        # Probably a forward call with caching, so we set up cache for one call only
-        if use_cache and past_key_values is None and not self.training:


Two changes here, both to be consistent with other models:

1. `self.training` should not control whether we instantiate a cache.
2. If a user respects the types in the docs, `past_key_values` is either a `Cache` or we instantiate a new one for the user, without warnings.
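The two points above can be sketched as a small stand-alone function. This is an illustration under assumptions, not the code in `modeling_gemma2.py`: `ToyCache` and `prepare_cache` are hypothetical names, and the real model builds a model-specific cache object.

```python
import warnings


class ToyCache:
    """Hypothetical stand-in for a `Cache` subclass."""

    def __init__(self):
        self.entries = []


def prepare_cache(use_cache, past_key_values, training=False):
    """Return the cache to use for this forward call."""
    # `training` is accepted but deliberately ignored: it should not decide
    # whether a cache gets created.
    if use_cache and past_key_values is None:
        # A brand-new cache is not a legacy conversion, so nothing is warned.
        past_key_values = ToyCache()
    return past_key_values


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fresh = prepare_cache(use_cache=True, past_key_values=None, training=True)

existing = ToyCache()
reused = prepare_cache(use_cache=True, past_key_values=existing)
```

In this sketch a cache passed in by the caller is returned untouched, and a missing cache is created silently, even in training mode.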

gante · 2024-09-19T14:43:09Z

src/transformers/models/gemma2/modeling_gemma2.py

@@ -840,6 +822,11 @@ def forward(
                dtype=inputs_embeds.dtype,
            )

+        if cache_position is None:


copy/paste from llama (and other Cache-supporting models)

zucchini-nlp

Okay, this should always work, actually, since the seq length is taken at layer_idx=0. Just one question: isn't it a bit misleading if some layers hold get_seq_length() tokens while others hold no more than the sliding window length?

gante

@zucchini-nlp yes, if get_seq_length gets called on the wrong layer we will have problems! I'm going to add an exception if it gets called on layer_idx != 0 (I doubt we need it).

zucchini-nlp

Okay, sounds good, as long as the behavior of get_seq_length is transparent for users, to reduce the number of cache-related questions we get 😄
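The thread above covers two ideas: deriving a default `cache_position` from the tokens already cached, and refusing `layer_idx != 0`. Here is a hedged sketch of both, assuming the llama-style pattern in which the positions for the new tokens start at the current cache length. `ToyHybridCache` and `default_cache_position` are hypothetical names, and the error message is invented for illustration.

```python
class ToyHybridCache:
    """Hypothetical cache that only tracks how many tokens it has seen."""

    def __init__(self):
        self.seen_tokens = 0

    def update(self, num_new_tokens):
        self.seen_tokens += num_new_tokens

    def get_seq_length(self, layer_idx=0):
        # Layers may hold different numbers of tokens (full length vs. sliding
        # window), so only the default layer index is accepted here.
        if layer_idx != 0:
            raise ValueError("get_seq_length is only supported for layer_idx=0")
        return self.seen_tokens


def default_cache_position(past_key_values, num_new_tokens):
    """Positions of the new tokens, used when the caller passes none."""
    past_seen = past_key_values.get_seq_length() if past_key_values is not None else 0
    return list(range(past_seen, past_seen + num_new_tokens))


cache = ToyHybridCache()
prefill_positions = default_cache_position(cache, num_new_tokens=4)
cache.update(4)
decode_positions = default_cache_position(cache, num_new_tokens=1)

print(prefill_positions)  # [0, 1, 2, 3]
print(decode_positions)   # [4]
```

Raising on any other layer index makes the limitation explicit instead of silently returning a length that may not match that layer.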

gante · 2024-09-19T14:43:30Z

src/transformers/models/mimi/modeling_mimi.py

@@ -1000,8 +1000,16 @@ def forward(
            )
            use_cache = False

-        if use_cache and past_key_values is None and not self.training:
-            past_key_values = DynamicCache.from_legacy_cache(past_key_values)
+        if use_cache and not isinstance(past_key_values, Cache):


copy/paste from llama (and other Cache-supporting models)
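For models that still accept the legacy tuple format, the intent stated in the PR description is that a warning fires only on conversion, never on a fresh cache. A minimal sketch of that branching follows, with hypothetical names (`ToyDynamicCache`, `ensure_cache`) and an invented warning text. It is not the library's implementation.

```python
import warnings


class ToyDynamicCache:
    """Hypothetical stand-in for a dynamic cache class."""

    def __init__(self, layers=None):
        self.layers = list(layers) if layers is not None else []

    @classmethod
    def from_legacy_cache(cls, legacy):
        # Legacy format assumed here: a tuple of per-layer (key, value) pairs.
        return cls(layers=legacy)


def ensure_cache(use_cache, past_key_values):
    """Return a cache object, warning only when converting legacy tuples."""
    if not use_cache or isinstance(past_key_values, ToyDynamicCache):
        return past_key_values
    if past_key_values is None:
        return ToyDynamicCache()  # new cache: nothing to warn about
    warnings.warn("The legacy tuple cache format is deprecated.", FutureWarning)
    return ToyDynamicCache.from_legacy_cache(past_key_values)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fresh = ensure_cache(True, None)
    warnings_after_fresh = len(caught)
    converted = ensure_cache(True, (("k0", "v0"),))
    warnings_after_legacy = len(caught)
```

Here `warnings_after_fresh` stays at zero, and the count only increases once the legacy tuple is converted.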

gante · 2024-09-19T14:44:00Z

tests/models/gemma2/test_modeling_gemma2.py

@@ -86,10 +86,15 @@ def setUp(self):
    def test_model_outputs_equivalence(self, **kwargs):
        pass

+    @parameterized.expand([("float16",), ("bfloat16",), ("float32",)])


Without this `parameterized` decorator, the intended override was not happening.
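The reason is a naming mismatch, which can be shown with plain classes. The sketch below assumes that an expansion decorator generates one method per parameter, named `<name>_<index>_<param>`. The `expand` helper and `test_inference` are hypothetical names used only for illustration; the helper imitates the effect of `parameterized.expand` without using that package.

```python
DTYPES = ["float16", "bfloat16", "float32"]


def expand(cls, base_name, params, impl):
    """Attach one method per parameter, named `<base_name>_<index>_<param>`."""
    for index, param in enumerate(params):
        def method(self, _param=param):
            return impl(self, _param)
        setattr(cls, f"{base_name}_{index}_{param}", method)


class CommonTests:
    pass


expand(CommonTests, "test_inference", DTYPES, lambda self, dtype: f"common:{dtype}")


class NaiveOverride(CommonTests):
    # Defines a differently named method, so the generated ones are untouched.
    def test_inference(self):
        return "skipped"


class ExpandedOverride(CommonTests):
    pass


# Expanding with the same parameters reproduces the generated names,
# which is what actually replaces the inherited methods.
expand(ExpandedOverride, "test_inference", DTYPES, lambda self, dtype: f"override:{dtype}")

naive_result = NaiveOverride().test_inference_0_float16()
expanded_result = ExpandedOverride().test_inference_0_float16()
```

In this sketch `NaiveOverride` still runs the inherited `test_inference_0_float16`, while `ExpandedOverride` replaces it.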

HuggingFaceDocBuilderDev · 2024-09-19T15:00:11Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

LysandreJik

Thank you! Please merge once @zucchini-nlp has approved, as she knows this code better than I do.

cc @BenjaminBossan as well

zucchini-nlp

LGTM, thanks for cleaning up warnings! Left one question about HybridCache, since I was reluctant to add a sequence length for that cache type, where lengths are not consistent across layers.

zucchini-nlp · 2024-09-19T15:24:24Z

src/transformers/models/gemma2/modeling_gemma2.py

@@ -840,6 +822,11 @@ def forward(
                dtype=inputs_embeds.dtype,
            )

+        if cache_position is None:


Okay, this should always work, actually, since the seq length is taken at layer_idx=0. Just one question: isn't it a bit misleading if some layers hold get_seq_length() tokens while others hold no more than the sliding window length?

BenjaminBossan · 2024-09-19T15:45:51Z

I'm not qualified to review this but thanks for addressing this so quickly.

gante added 2 commits September 19, 2024 13:17

tmp commit (1baf0cc)

fix incorrect test inheritance (5bc6856)

gante requested review from zucchini-nlp and LysandreJik September 19, 2024 14:36

gante changed the title ~~Cache: don't throw warnings on gemma 2 when instantiating a new cache~~ Cache: don't throw warnings on gemma2 when instantiating a new cache Sep 19, 2024

gante commented Sep 19, 2024

LysandreJik approved these changes Sep 19, 2024

zucchini-nlp approved these changes Sep 19, 2024

PR comment (8f2a096)

gante merged commit 52920b5 into huggingface:main Sep 19, 2024
23 checks passed

gante deleted the gemma2_warning branch September 19, 2024 16:42

itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024

Cache: don't throw warnings on gemma2 when instantiating a new cache (huggingface#33595) (f629df5)

amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024

Cache: don't throw warnings on gemma2 when instantiating a new cache (huggingface#33595) (f853ca8)
