
[V1] VLM - Run the mm_mapper preprocessor in the frontend process #10640

Merged (11 commits) on Dec 3, 2024

Conversation

@alexm-neuralmagic (Collaborator) commented Nov 25, 2024

This PR adds support for running the multi-modal mapper/preprocessor (from HuggingFace) in the frontend process. Executing 512 prompts with 64 output tokens each shows a 1.7X end-to-end improvement. Command used:

VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 python examples/offline_inference_vision_language.py -m llava --num-prompts 512 --modality image

Without frontend preprocessing, generate() time is 28.91 seconds.
With frontend preprocessing, generate() time is 16.84 seconds.
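For reference, the quoted 1.7X follows directly from the two timings above (a quick arithmetic check, not an additional measurement):

```python
# Speedup implied by the generate() timings reported above.
baseline_s = 28.91  # mm_mapper runs inside the engine process
frontend_s = 16.84  # mm_mapper runs in the frontend process (this PR)

speedup = baseline_s / frontend_s
print(f"speedup: {speedup:.2f}x")  # prints: speedup: 1.72x
```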


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

mergify bot commented Nov 25, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexm-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@alexm-neuralmagic (Collaborator, Author) commented Nov 25, 2024

@rickyyx thanks for the suggestion on trying this!

@robertgshaw2-neuralmagic (Collaborator) left a comment


Do we need to have the --disable flag?

When would we not want to run this in P0?

@mgoin (Collaborator) left a comment


Easy to understand; this is a great improvement. It would be nice to have a test that compares correctness between disabled and enabled.

Review thread on examples/offline_inference_vision_language.py (outdated, resolved)
@comaniac (Collaborator) left a comment


Overall LGTM!
Regarding the benchmark, could you use the benchmark scripts to showcase the numbers, and maybe revert the example script changes if they are not necessary?

vllm/config.py Outdated
@@ -125,6 +125,8 @@ class ModelConfig:
HuggingFace config.
mm_processor_kwargs: Arguments to be forwarded to the model's processor
for multi-modal data, e.g., image processor.
mm_disable_frontend_processor: Disables multi-modal HF preprocessor/mapper
execution in the frontend process (not recommended)

Suggested change:
-        execution in the frontend process (not recommended)
+        execution in the frontend process (may hurt performance)

@@ -96,6 +100,17 @@ def process_inputs(
sampling_params.update_from_generation_config(
self.generation_config_fields, eos_token_id)

# Process multi-modal data via (huggingface) preprocessor
# here in the frontend process (if enabled)

Suggested change:
-        # here in the frontend process (if enabled)
+        # here in the frontend process (if enabled); otherwise it will be processed in the engine.
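The split that comment describes can be sketched as follows (a simplified illustration with hypothetical names — `process_inputs`, `mm_mapper` — not vLLM's actual code):

```python
from typing import Any, Callable, Optional

def process_inputs(
    prompt_token_ids: list[int],
    multi_modal_data: Optional[dict[str, Any]],
    mm_mapper: Callable[[dict[str, Any]], list[Any]],
) -> dict[str, Any]:
    """Run the HF multi-modal mapper in the frontend process (P0).

    The engine (P1) then receives already-processed mm_inputs and never
    has to touch raw images. Illustrative sketch only.
    """
    mm_inputs = mm_mapper(multi_modal_data) if multi_modal_data else None
    return {"prompt_token_ids": prompt_token_ids, "mm_inputs": mm_inputs}
```

With the truthiness guard, text-only requests (empty or missing `multi_modal_data`) skip the mapper entirely and send `mm_inputs=None` to the engine.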

@alexm-neuralmagic (Collaborator, Author) replied:

Changed due to removal of the disable arg.

@alexm-neuralmagic (Collaborator, Author) commented:

@robertgshaw2-neuralmagic I think you're right about P0 always running the mm_mapper, so I will remove the disable flag to simplify the code.

@njhill (Member) commented Nov 26, 2024

> @robertgshaw2-neuralmagic I think you're right about P0 always running the mm_mapper, so I will remove the disable flag to simplify the code.

Does this mean we can then remove mm_data (and mm_processor_kwargs?) from EngineCoreRequest? :)

@ywang96 (Member) commented Nov 26, 2024

> @robertgshaw2-neuralmagic I think you're right about P0 always running the mm_mapper, so I will remove the disable flag to simplify the code.

> Does this mean we can then remove mm_data (and mm_processor_kwargs?) from EngineCoreRequest? :)

Yeah, if P0 is always going to run the multimodal data processor (mm_mapper), then P1 should only need to receive mm_inputs.
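The request-slimming being discussed can be sketched with two hypothetical dataclasses (field sets are simplified and illustrative — not vLLM's actual `EngineCoreRequest`):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class EngineCoreRequestBefore:
    """Before this PR: raw multi-modal payloads crossed the IPC boundary."""
    prompt_token_ids: list[int]
    mm_data: Optional[dict[str, Any]] = None              # raw images
    mm_processor_kwargs: Optional[dict[str, Any]] = None  # HF processor kwargs

@dataclass
class EngineCoreRequestAfter:
    """After: P0 runs the mapper, so only processed mm_inputs reach P1."""
    prompt_token_ids: list[int]
    mm_inputs: Optional[list[Any]] = None
```

Dropping `mm_data` and `mm_processor_kwargs` shrinks the serialized request and keeps the HF preprocessor dependency entirely on the frontend side.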

@ywang96 (Member) left a comment


Overall LGTM! I left two comments so please take a look.

Review thread on examples/offline_inference_vision_language.py (outdated, resolved)
Review thread on vllm/v1/engine/core.py (outdated, resolved)
@alexm-neuralmagic alexm-neuralmagic changed the title [V1] VLM - Support running the mm_mapper preprocessor in the frontend process [V1] VLM - Run the mm_mapper preprocessor in the frontend process Dec 2, 2024
@mergify mergify bot removed the needs-rebase label Dec 2, 2024
@alexm-neuralmagic (Collaborator, Author) commented:

@njhill @ywang96 removed mm_data and mm_processor_kwargs from EngineCoreRequest.

    # Preprocess multi-modal data
    mm_inputs = self.mm_input_mapper.process_inputs(
        decoder_inputs.multi_modal_data, decoder_inputs.mm_processor_kwargs
    ) if decoder_inputs.multi_modal_data is not None else None
@ywang96 (Member) commented:

Suggested change:
-    ) if decoder_inputs.multi_modal_data is not None else None
+    ) if decoder_inputs.multi_modal_data else None

I think this is why the entrypoints test is failing: decoder_inputs.multi_modal_data always returns a dictionary (possibly empty), never None.
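The failure mode described here is easy to reproduce in isolation: a text-only request carries an empty dict, which passes an `is not None` check but fails a truthiness check:

```python
multi_modal_data: dict = {}  # text-only request: empty dict, never None

# The original guard would still invoke the mapper on the empty dict:
runs_mapper_old = multi_modal_data is not None  # True  (mapper runs, wrongly)

# The suggested truthiness guard correctly skips it:
runs_mapper_new = bool(multi_modal_data)        # False (mapper skipped)

print(runs_mapper_old, runs_mapper_new)  # prints: True False
```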

@alexm-neuralmagic (Collaborator, Author) replied:

Thanks, good catch!

@alexm-neuralmagic (Collaborator, Author) commented:

Will revert changes to offline_inference_vision_language.py and see if I can use the other script.

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 3, 2024
@ywang96 (Member) commented Dec 3, 2024

Confirmed that the test fix in 382fc0b resolved the issue on CI, so I'm going to auto-merge this.

@ywang96 ywang96 enabled auto-merge (squash) December 3, 2024 10:21
@ywang96 ywang96 merged commit 3bc94ca into main Dec 3, 2024
50 of 51 checks passed
@ywang96 ywang96 deleted the v1_vlm_mapper branch December 3, 2024 10:33
Labels: frontend, ready (ONLY add when PR is ready to merge/full CI is needed)
6 participants