[V1] VLM preprocessor hashing #11020
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 545a40a to 3554439
/ready
Overall LGTM. Just some comments for style. Also please
- Revert unnecessary changes in the example code.
- Add some unit tests.
vllm/v1/engine/processor.py
Outdated
mm_hashes = self.mm_hasher.hash(decoder_inputs.multi_modal_data) \
    if self.mm_hasher is not None else None

mm_inputs, mm_hashes = self.mm_input_mapper_client.process_inputs(
    decoder_inputs.multi_modal_data, mm_hashes,
    decoder_inputs.mm_processor_kwargs)
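For readers following along: hashing here means computing a stable digest of the raw multi-modal data so that repeated images map to the same cache key. A minimal, hypothetical sketch of the idea (not the actual vLLM MMHasher implementation) might look like this:

```python
# Hypothetical sketch of hashing an image for cache lookup; the real hasher
# in this PR may use a different algorithm and serialization.
import hashlib

import numpy as np
from PIL import Image


def hash_image(image: Image.Image) -> str:
    # Convert to a canonical representation before hashing so that the same
    # pixels always produce the same digest.
    arr = np.asarray(image.convert("RGB"), dtype=np.uint8)
    digest = hashlib.sha256()
    digest.update(repr(arr.shape).encode())  # include shape to avoid collisions
    digest.update(arr.tobytes())
    return digest.hexdigest()
```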
Probably need some assertions to make sure mm_hasher and mm_input_mapper_client are not None.
@comaniac thanks for the quick review! I actually realized that I need to push the hash and the cache into _process_multimodal(..) due to the recent merging of the preprocessor for llava. It also caches the first preprocessor. Will send changes tomorrow.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 0e26569 to f4466ba
Force-pushed from 0c69c14 to 7304434
@comaniac the code is ready for re-review. In the end, I did not push the hasher/cache into _process_multimodal(..) since it is also used to compute the MM placeholders, and there is no simple way to skip it on a hash hit. We can look into this as a follow-up.
LGTM. Just some nits
Are these changes compatible with v0? If not, then we should differentiate them.
yeah it should not matter if it is v0 or v1, since it simply controls the images used.
vllm/v1/engine/core.py
Outdated
if request.mm_hashes is not None:
    request.mm_inputs = self.mm_input_mapper_server.process_inputs(
        request.mm_inputs, request.mm_hashes)
IIUC, if the hash is available we will either get the input from cache or process the input here; otherwise the input will be processed in the engine. Is that correct? If so it'd be better to add comments.
Added a comment with more explanation.
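To make the documented flow concrete, here is a hedged, self-contained sketch of the server-side (P1) resolution step; the function name and the None-placeholder convention are assumptions for illustration, not the exact code in this PR:

```python
from typing import Any, Dict, List, Optional


def resolve_mm_inputs(
    cache: Dict[str, Any],
    mm_inputs: List[Optional[Any]],
    mm_hashes: Optional[List[str]],
) -> List[Any]:
    """Hypothetical P1-side step: with hashing enabled, an entry the client
    already had cached arrives as None and is restored from the mirrored
    server cache; a newly seen entry is stored for future requests."""
    if mm_hashes is None:
        # Hashing disabled: inputs were fully processed on the client (P0).
        return mm_inputs
    resolved: List[Any] = []
    for mm_input, mm_hash in zip(mm_inputs, mm_hashes):
        if mm_input is None:
            resolved.append(cache[mm_hash])  # hit: reuse the server-side copy
        else:
            cache[mm_hash] = mm_input        # miss: remember it for next time
            resolved.append(mm_input)
    return resolved
```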
vllm/v1/engine/mm_input_mapper.py
Outdated
@@ -18,23 +42,133 @@ def __init__(
            model_config)
        self.mm_registry.init_mm_limits_per_prompt(model_config)

        self.mm_cache = LRUDictCache(MM_CACHE_SIZE)

        # Set to None to disable (TODO: Disable!)
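For context, LRUDictCache here is a small bounded dict with least-recently-used eviction. A minimal sketch of such a helper (assumed behavior, not necessarily the exact vLLM class) could be:

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUDictCache:
    """Minimal LRU dict sketch: get() refreshes recency, put() evicts the
    least-recently-used entry once the size limit is exceeded."""

    def __init__(self, size: int):
        self.size = size
        self.cache = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def put(self, key: Hashable, value: Any) -> None:
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.size:
            self.cache.popitem(last=False)  # drop the oldest entry
```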
Are you going to disable it before merging this PR? If not then maybe remove the TODO
Good catch, removed TODO and disabled
vllm/v1/engine/mm_input_mapper.py
Outdated
        self.mm_cache_misses = 0

    def cache_hit_ratio(self, steps) -> float:
        total_steps = self.mm_cache_hits + self.mm_cache_misses
Another option is having self.total instead of self.miss, because you actually don't need self.miss anyways.
Good idea, changed to use total
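A small sketch of the counter change being discussed, with hypothetical names: tracking hits and total lookups is enough, since misses are just the difference.

```python
class MMCacheStats:
    """Hypothetical hit/total bookkeeping for the MM input cache."""

    def __init__(self) -> None:
        self.hits = 0
        self.total = 0  # every lookup; misses are implied (total - hits)

    def record(self, hit: bool) -> None:
        self.total += 1
        if hit:
            self.hits += 1

    def hit_ratio(self) -> float:
        return self.hits / self.total if self.total else 0.0
```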
Force-pushed from 185655c to 6a7b37a
        decoder_inputs.multi_modal_data,
        decoder_inputs.mm_processor_kwargs,
    )
    # For merged preprocessor, mm_data is already mm_inputs
QQ: what's a merged preprocessor?
This PR: #10676
The MM model actually has two preprocessors, and that PR combines them into one (but only for llava).
Force-pushed from 9ec0c15 to 431f5d4
Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: Alexander Matveev <[email protected]>
Force-pushed from 431f5d4 to adda9d4
This PR adds MM preprocessor hashing and caching for V1.
The idea of MM preprocessor caching is based on having a client and a server, where the client executes in the frontend process (P0) and the server in the core process (P1).
The client and server caches are mirrored, which allows us to avoid serializing "mm_inputs" (such as pixel values) between the client (P0) and server (P1) processes.
Currently, the MM preprocessor caching is disabled by default, since the performance analysis is not finished yet. We still need to enable encoder caching and prefix caching end-to-end.
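A hedged sketch of the client (P0) side of this mirrored-cache idea, with hypothetical names: on a hit, the client sends only the hash and drops the heavy mm_input from the payload, and because the server (P1) cache is keyed the same way, P1 can restore the data locally (see the server-side sketch earlier in the thread).

```python
from typing import Any, Dict, List, Optional, Tuple


def client_prepare_payload(
    cache: Dict[str, Any],
    processed_inputs: List[Any],
    mm_hashes: List[str],
) -> Tuple[List[Optional[Any]], List[str]]:
    """Hypothetical P0-side step: replace already-seen inputs with None so
    large tensors (e.g. pixel values) are not serialized to P1 again."""
    payload: List[Optional[Any]] = []
    for mm_input, mm_hash in zip(processed_inputs, mm_hashes):
        if mm_hash in cache:
            payload.append(None)       # P1's mirrored cache already has this entry
        else:
            cache[mm_hash] = mm_input  # first sighting: remember it and send it
            payload.append(mm_input)
    return payload, mm_hashes
```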
Follow-up PRs: