[V1] VLM - Run the mm_mapper preprocessor in the frontend process #10640
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
@rickyyx thanks for the suggestion on trying this!
Do we need to have the --disable flag?
When would we not want to run this in P0?
Easy to understand; this is a great improvement. It would be nice to have a test that compares correctness between the disabled and enabled modes.
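A rough sketch of what such a correctness test could look like. This is not the test added in the PR: the mm_disable_frontend_processor flag name is taken from the diff below (and was later removed in this PR), and the model name, the image_prompts fixture, and the assumption that the flag can be passed as an LLM keyword argument are all illustrative.

```python
# Hypothetical correctness test, not code from this PR. Assumes the (later
# removed) mm_disable_frontend_processor flag is exposed as an LLM kwarg and
# that `image_prompts` is a fixture of multi-modal prompts.
from vllm import LLM, SamplingParams


def test_frontend_mm_mapper_correctness(image_prompts):
    # Greedy decoding so both runs are deterministic and comparable.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

    # Mapper runs in the frontend process (the new default in this PR).
    llm_frontend = LLM(model="llava-hf/llava-1.5-7b-hf")
    frontend_outputs = llm_frontend.generate(image_prompts, sampling_params)

    # Mapper runs in the engine process instead (hypothetical flag).
    llm_engine = LLM(model="llava-hf/llava-1.5-7b-hf",
                     mm_disable_frontend_processor=True)
    engine_outputs = llm_engine.generate(image_prompts, sampling_params)

    # Token ids should be identical regardless of where the mapper ran.
    for a, b in zip(frontend_outputs, engine_outputs):
        assert a.outputs[0].token_ids == b.outputs[0].token_ids
```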
Overall LGTM!
Regarding the benchmark, could you use the benchmark scripts to showcase the numbers, and maybe revert the example script if the change is not necessary?
vllm/config.py
Outdated
@@ -125,6 +125,8 @@ class ModelConfig:
        HuggingFace config.
        mm_processor_kwargs: Arguments to be forwarded to the model's processor
            for multi-modal data, e.g., image processor.
        mm_disable_frontend_processor: Disables multi-modal HF preprocessor/mapper
            execution in the frontend process (not recommended)
Suggested change:
-            execution in the frontend process (not recommended)
+            execution in the frontend process (may hurt performance)
vllm/v1/engine/processor.py
Outdated
@@ -96,6 +100,17 @@ def process_inputs(
        sampling_params.update_from_generation_config(
            self.generation_config_fields, eos_token_id)

        # Process multi-modal data via (huggingface) preprocessor
        # here in the frontend process (if enabled)
Suggested change:
-        # here in the frontend process (if enabled)
+        # here in the frontend process (if enabled); otherwise it will be processed in the engine.
Changed due to removal of the disable arg.
@robertgshaw2-neuralmagic I think you're right about P0 always running the mm_mapper, so I will remove the disable flag to simplify the code.
Does this mean we can then remove
Yeah, if P0 is always going to run the multimodal data processor (mm_mapper), then P1 should only need to receive
Overall LGTM! I left two comments so please take a look.
vllm/v1/engine/processor.py
Outdated
        # Preprocess multi-modal data
        mm_inputs = self.mm_input_mapper.process_inputs(
            decoder_inputs.multi_modal_data, decoder_inputs.mm_processor_kwargs
        ) if decoder_inputs.multi_modal_data is not None else None
Suggested change:
-        ) if decoder_inputs.multi_modal_data is not None else None
+        ) if decoder_inputs.multi_modal_data else None
I think this is why the entrypoint test is failing: decoder_inputs.multi_modal_data always returns a dictionary (an empty one for text-only requests), so the is-not-None check never skips the mapper.
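A minimal illustration of the pitfall (not code from this PR): because text-only requests produce an empty dict rather than None, an is-not-None check still routes them through the mapper, while a truthiness check correctly skips them.

```python
# Text-only request: multi_modal_data comes back as an empty dict, not None.
multi_modal_data = {}

# The original check: always true, so the mapper would run on empty data.
runs_mapper_old = multi_modal_data is not None   # True

# Truthiness check: only run the mapper when there is actual multi-modal data.
runs_mapper_new = bool(multi_modal_data)         # False

print(runs_mapper_old, runs_mapper_new)  # True False
```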
Thanks, good catch!
Will revert changes to offline_inference_vision_language.py and see if I can use the other script.
Confirmed that the test fix in 382fc0b resolved the issue on CI, so I'm going to auto-merge this.
This PR adds support for running the multi-modal mapper/preprocessor (from HuggingFace) in the frontend process. Executing 512 prompts with 64 output tokens results in a 1.7x improvement. Command used:
VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 python examples/offline_inference_vision_language.py -m llava --num-prompts 512 --modality image
Without frontend preprocessing, generate() time is 28.91 seconds.
With frontend preprocessing, generate() time is 16.84 seconds.
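For reference, a minimal timing harness in the spirit of the benchmark above (not the actual offline_inference_vision_language.py script); the model name, image path, and prompt template are placeholders rather than values taken from the PR.

```python
# Illustrative only: times generate() over a batch of image prompts, similar
# in spirit to the 512-prompt benchmark quoted above.
import time

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# Placeholder inputs: one local image repeated 512 times; the PR benchmark
# used the example script's own prompt/image set instead.
image = Image.open("example.jpg")
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = [{"prompt": prompt, "multi_modal_data": {"image": image}}] * 512

start = time.perf_counter()
llm.generate(inputs, sampling_params)
print(f"generate() time: {time.perf_counter() - start:.2f} seconds")
```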