
[Core][VLM] Add support for prefix caching for multi-modal models #8348

Closed

Conversation

petersalas
Contributor

@petersalas petersalas commented Sep 10, 2024

This adds support for prefix caching with multi-modal models -- in particular it enables it for Ultravox which uses the precise placeholders added in #8346.

With this change, SelfAttnBlockSpaceManager et al. now pass a TokenIds type around instead of List[int] to represent token ids. This new type can also contain TokenRangeAnnotations which capture the contents that will ultimately replace the placeholder tokens. The Sequence calculates these by hashing multi-modal content that supports it (currently only implemented for audio).

FIX #9790
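For orientation, the rough shape of these two types is sketched below. The field names are pieced together from snippets quoted later in the review (content_hash, token_index, TokenIds.token_ids); token_count, the defaults, and the use of NamedTuple for TokenIds are assumptions for illustration, not the exact API.

```python
from typing import NamedTuple, Tuple


class TokenRangeAnnotation(NamedTuple):
    """Marks a span of placeholder tokens backed by multi-modal content."""
    content_hash: int    # hash of the multi-modal item (e.g. audio) behind the span
    content_offset: int  # where this span starts within that item
    token_index: int     # index of the first placeholder token it covers
    token_count: int     # number of placeholder tokens it covers (assumed name)


class TokenIds(NamedTuple):
    """Token ids plus optional annotations, replacing bare List[int]."""
    token_ids: Tuple[int, ...] = ()
    annotations: Tuple[TokenRangeAnnotation, ...] = ()
```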


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@ywang96 ywang96 self-assigned this Sep 11, 2024
@petersalas petersalas changed the title [WIP] [Core][VLM] Add support for placeholder token content hashes [Core][VLM] Add support for placeholder token content hashes Sep 12, 2024
@petersalas petersalas marked this pull request as ready for review September 12, 2024 22:51
from vllm.sequence import Sequence


class TokenRangeAnnotation(NamedTuple):
Contributor Author

Very open to suggestions on naming! This is a pretty bulky name.

@ywang96
Member

ywang96 commented Sep 16, 2024

Sorry for the delay - I was busy with the Pixtral release last week but will review this PR this week!

Comment on lines 126 to 155
token_annotations: NotRequired[Optional[List["TokenRangeAnnotation"]]]
"""
Optional token annotations to capture content that will replace portions
of the token IDs list.
"""

Contributor Author

Given how #8346 has evolved (placeholder ranges are now propagated instead of inlined into MM data) I'll likely remove this and instead compute the annotations downstream once that change lands. But the rest of the change should still be applicable :)

Contributor

Just want to clarify, what is the difference between the placeholder range data structures in #8346 and the TokenRangeAnnotation data structures in this PR?

Given that you are not creating any new example scripts in this PR, am I correct that the placeholder range data structures in #8346 are more "frontend-oriented" and serve to align placeholder tokens with multimodal input within the prompt (in a way that is model- and workload-specific), while the TokenRangeAnnotation data structures in this PR are more "backend-oriented" and serve to bridge multimodal data into core engine functionality such as the prefix cache, block management, etc.? With the idea being that the TokenRangeAnnotations will be computed from the placeholder token range data structures?

Contributor Author

Yup, that frontend/backend distinction was how I was thinking about it, but I could be convinced to combine them too.

Comment on lines 32 to 33
def adjusted(self, tokens_start: int,
tokens_end: int) -> Optional["TokenRangeAnnotation"]:
Collaborator

@Isotr0py Isotr0py Sep 29, 2024

I think clip would be a clearer name than adjusted here for a "range". WDYT?

Contributor

Also, "slice" could work (this is even the term which is used in the method comment.)

Comment on lines 216 to 221
        if key.start is None:
            start = 0
        elif key.start < 0:
            start = len(self) + key.start
        else:
            start = key.start
Collaborator

Suggested change
-        if key.start is None:
-            start = 0
-        elif key.start < 0:
-            start = len(self) + key.start
-        else:
-            start = key.start
+        start = key.start or 0
+        start += len(self) if start < 0 else 0

Contributor

@afeldman-nm afeldman-nm left a comment

Hi @petersalas, I had a few nits and some clarifying questions. Overall very excited for this - very cool how multimodal is integrated into prefix caching. Thanks for the PR!


def adjusted(self, tokens_start: int,
tokens_end: int) -> Optional["TokenRangeAnnotation"]:
"""
Contributor

Nit - overall having a little trouble understanding what this method does & why the formulae are as they are; might benefit from explanatory comments for each argument & a few-sentence example on how the token range & content offset get adjusted.

Contributor Author

Good suggestion! I added some examples in the docstring which hopefully clarify things a bit.
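For readers who hit the same confusion, here is a standalone sketch of what clipping an annotation to a token window could look like, reusing the TokenRangeAnnotation sketch from the PR description above. It is not the PR's actual adjusted implementation, and real code may measure the content offset in content units (e.g. audio frames) rather than in tokens.

```python
from typing import Optional


def clip_annotation(ann: TokenRangeAnnotation, tokens_start: int,
                    tokens_end: int) -> Optional[TokenRangeAnnotation]:
    """Clip `ann` to the token window [tokens_start, tokens_end).

    Example: an annotation covering tokens [4, 8) clipped to the window
    [6, 16) keeps only tokens [6, 8); its token_index is re-based to the
    window (6 - 6 = 0) and its content_offset advances by the two leading
    tokens that were cut off.
    """
    new_start = max(ann.token_index, tokens_start)
    new_end = min(ann.token_index + ann.token_count, tokens_end)
    if new_start >= new_end:
        return None  # the annotation lies entirely outside the window
    return TokenRangeAnnotation(
        content_hash=ann.content_hash,
        content_offset=ann.content_offset + (new_start - ann.token_index),
        token_index=new_start - tokens_start,
        token_count=new_end - new_start,
    )
```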

vllm/core/block/token_ids.py (outdated review comment, resolved)
key=lambda a: a.token_index)
return TokenIds(token_ids, sorted_annotations)

def chunks(self,
Contributor

Two thoughts:

  1. What about renaming this to_chunks or get_chunks, in order to make it a little clearer that this method performs a relatively involved process in order to extract chunks?

  2. It looks like chunks() is invoked at least twice within the engine code; I'm wondering whether it makes sense to cache the result?

Contributor Author

  1. Good suggestion -- renamed it to to_chunks.
  2. I think it'd be tricky to cache, but I did add a fast path in the slice operation since the structure gets sliced for each decoded token and those will always be after the last annotation (if there is any).
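To make the rename concrete, here is a sketch of what a to_chunks-style method could do, building on the clip_annotation sketch earlier in the thread (assumed names, not the PR's code). The fast path mentioned in point 2 would simply skip the annotation scan when the requested slice starts after the last annotation.

```python
from typing import Iterator, List


def to_chunks(tokens: TokenIds, chunk_size: int) -> Iterator[TokenIds]:
    """Split into block-sized chunks, clipping annotations to each chunk."""
    for start in range(0, len(tokens.token_ids), chunk_size):
        end = min(start + chunk_size, len(tokens.token_ids))
        clipped: List[TokenRangeAnnotation] = []
        for ann in tokens.annotations:
            piece = clip_annotation(ann, start, end)
            if piece is not None:
                clipped.append(piece)
        yield TokenIds(tokens.token_ids[start:end], tuple(clipped))
```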

@@ -852,7 +856,9 @@ def hash_block_tokens(is_first_block: bool, prev_block_hash: Optional[int],
- int: The computed hash value for the block.
"""
assert (prev_block_hash is None) == is_first_block
-    return hash((is_first_block, prev_block_hash, *cur_block_token_ids))
+    return hash(
Contributor

So just want to make sure I understand correctly. Regarding prefix caching -

  • It used to be that a prefix caching block hash was derived from is_first_block, the prev block hash, and the current block token ids

  • Now, the block hash is additionally derived from the annotations associated with the token ids.

One question I had when I started reviewing this PR was: how does prefix caching match a prefix that includes multimodal data, i.e. an image? Is it based on matching the hash of the raw image data?

Since the annotations include multimodal content hashes, it would appear that my guess is correct? So for an image (for example), the TokenRangeAnnotation content hashes might be computed from the raw tokens?

Contributor Author

You got it! (With one nit w.r.t. your last question: the hashes are specifically computed for anything that can't be mapped to tokens.)
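Put differently, the block hash now also folds in whatever annotations overlap the block, so two blocks with identical placeholder token ids but different underlying images or audio hash differently. A hedged reconstruction of the idea is below; the actual diff is truncated above and the real signature may differ.

```python
from typing import Optional, Sequence, Tuple


def hash_block_tokens(is_first_block: bool,
                      prev_block_hash: Optional[int],
                      cur_block_token_ids: Sequence[int],
                      cur_block_annotations: Tuple[TokenRangeAnnotation, ...] = (),
                      ) -> int:
    """Hash the previous block hash, this block's token ids, and the
    multi-modal annotations (and thus content hashes) covering the block."""
    assert (prev_block_hash is None) == is_first_block
    return hash((is_first_block, prev_block_hash,
                 *cur_block_token_ids, *cur_block_annotations))
```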

replace them.
"""

content_hash: int
Contributor

This might be a failure on my end, but where are these hashes actually computed? Are these hashes derived from the unprocessed multimodal data (i.e. raw image pixels for images)?

Will there need to be/is there already a way for the engine to automatically choose the appropriate hash function for a given modality?

Are all of these questions contingent on how #8346 gets integrated with this PR?

Contributor Author

Yup, they were -- originally #8346 took an approach of hiding more multi-modal logic away in each multi-modal model and I was going to do the same for hashing (i.e. delegate it to the model). But since I ended up propagating the placeholder ranges explicitly to the Sequence I updated this change to do the MM hashing there as well.
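As a rough illustration of what "do the MM hashing there" could look like for audio (a hypothetical helper, not the PR's code): the Sequence would hash the raw multi-modal item once and store the result as the annotation's content_hash.

```python
import hashlib

import numpy as np


def audio_content_hash(audio: np.ndarray, sampling_rate: int) -> int:
    """Hypothetical helper: derive a stable 64-bit hash from raw audio
    samples so identical clips produce identical prefix-cache keys."""
    digest = hashlib.sha256()
    digest.update(str(sampling_rate).encode())
    digest.update(np.ascontiguousarray(audio).tobytes())
    return int.from_bytes(digest.digest()[:8], "little", signed=True)
```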


mergify bot commented Oct 29, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @petersalas please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@petersalas petersalas changed the title [Core][VLM] Add support for placeholder token content hashes [Core][VLM] Add support for prefix caching for multi-modal models Nov 8, 2024
Comment on lines +30 to +39
supports_chunked_prefill: ClassVar[bool] = False
"""
A flag that indicates this model supports chunked prefill.
"""

supports_prefix_caching: ClassVar[bool] = False
"""
A flag that indicates this model supports prefix caching.
"""

Contributor Author

It's a little weird that these are on SupportsMultiModal but the alternative that came to mind was to require tagging every non-multi-modal model as well. Happy to do whatever reviewers think is best here :)
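For illustration, opting a model in might look roughly like this. This is a sketch of the flags quoted in the diff above, not vLLM's actual interface module, and the Ultravox-like class name is just an example.

```python
from typing import ClassVar


class SupportsMultiModal:
    # Sketch of the interface flags quoted in the diff above.
    supports_chunked_prefill: ClassVar[bool] = False
    supports_prefix_caching: ClassVar[bool] = False


class UltravoxLikeModel(SupportsMultiModal):
    # A model that can safely participate in chunked prefill and
    # prefix caching opts in by overriding the class-level flags.
    supports_chunked_prefill = True
    supports_prefix_caching = True
```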



class TokenIds:
token_ids: Tuple[int, ...]
Contributor Author

I don't particularly love TokenIds.token_ids. Maybe the type should just be Tokens? Maybe BlockTokens?

@petersalas petersalas force-pushed the psalas/annotated-token-ids branch from 342d3d0 to edf4a55 Compare November 8, 2024 23:36

mergify bot commented Nov 13, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @petersalas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@cooleel
Contributor

cooleel commented Dec 10, 2024

Hi, thanks for the great work! I was wondering if there’s any update on its status or an estimated timeline for its review/merge?

@ywang96
Member

ywang96 commented Dec 10, 2024

@cooleel We decided to work on adding prefix caching for multimodal models on V1 instead, since there are some fundamental changes in how the cache manager is designed. Stay tuned and feel free to check our multimodality roadmap at #4194!

@DarkLight1337
Member

Closing as superseded by #11187

Successfully merging this pull request may close these issues.

[Usage]: prefix caching support for multimodal models
6 participants