
[V1] VLM - Run the mm_mapper preprocessor in the frontend process #10640

Merged (11 commits) on Dec 3, 2024

Conversation

@alexm-neuralmagic (Collaborator) commented Nov 25, 2024

This PR adds support for running the multi-modal mapper/preprocessor (from HuggingFace) in the frontend process. Executing 512 prompts with 64 output tokens each shows a 1.7X end-to-end improvement. Command used:

VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 python examples/offline_inference_vision_language.py -m llava --num-prompts 512 --modality image

Without frontend preprocessing, generate() time is 28.91 seconds.
With frontend preprocessing, generate() time is 16.84 seconds.
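For reference, the quoted 1.7X follows directly from the two timings above (a quick arithmetic check, not an additional measurement):

```python
# Speedup implied by the generate() timings reported above.
baseline_s = 28.91  # mm_mapper runs inside the engine process
frontend_s = 16.84  # mm_mapper runs in the frontend process (this PR)

speedup = baseline_s / frontend_s
print(f"speedup: {speedup:.2f}x")  # prints: speedup: 1.72x
```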


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

mergify bot commented Nov 25, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexm-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@alexm-neuralmagic (Collaborator, Author) commented Nov 25, 2024

@rickyyx thanks for the suggestion on trying this!

@robertgshaw2-neuralmagic (Collaborator) left a comment


Do we need to have the --disable flag?

When would we not want to run this in P0?

@mgoin (Collaborator) left a comment


Easy to understand; this is a great improvement. It would be nice to have a test that compares correctness between disabled and enabled.

Review thread on examples/offline_inference_vision_language.py (outdated, resolved)
@comaniac (Collaborator) left a comment


Overall LGTM!
Regarding the benchmark, could you use the benchmark scripts to showcase the numbers, and maybe revert the example script changes if they are not necessary?

vllm/config.py Outdated
@@ -125,6 +125,8 @@ class ModelConfig:
HuggingFace config.
mm_processor_kwargs: Arguments to be forwarded to the model's processor
for multi-modal data, e.g., image processor.
mm_disable_frontend_processor: Disables multi-modal HF preprocessor/mapper
execution in the frontend process (not recommended)

Suggested change:
-        execution in the frontend process (not recommended)
+        execution in the frontend process (may hurt performance)

@@ -96,6 +100,17 @@ def process_inputs(
sampling_params.update_from_generation_config(
self.generation_config_fields, eos_token_id)

# Process multi-modal data via (huggingface) preprocessor
# here in the frontend process (if enabled)

Suggested change:
-        # here in the frontend process (if enabled)
+        # here in the frontend process (if enabled); otherwise it will be processed in the engine.
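The split that comment describes can be sketched as follows (a simplified illustration with hypothetical names — `process_inputs`, `mm_mapper` — not vLLM's actual code):

```python
from typing import Any, Callable, Optional

def process_inputs(
    prompt_token_ids: list[int],
    multi_modal_data: Optional[dict[str, Any]],
    mm_mapper: Callable[[dict[str, Any]], list[Any]],
) -> dict[str, Any]:
    """Run the HF multi-modal mapper in the frontend process (P0).

    The engine (P1) then receives already-processed mm_inputs and never
    has to touch raw images. Illustrative sketch only.
    """
    mm_inputs = mm_mapper(multi_modal_data) if multi_modal_data else None
    return {"prompt_token_ids": prompt_token_ids, "mm_inputs": mm_inputs}
```

With the truthiness guard, text-only requests (empty or missing `multi_modal_data`) skip the mapper entirely and send `mm_inputs=None` to the engine.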

@alexm-neuralmagic (Collaborator, Author) replied:

Changed due to removal of the disable arg.

@alexm-neuralmagic (Collaborator, Author) commented:

@robertgshaw2-neuralmagic I think you're right about P0 always running the mm_mapper, so I will remove the disable flag to simplify the code.

@njhill (Member) commented Nov 26, 2024

> @robertgshaw2-neuralmagic I think you're right about P0 always running the mm_mapper, so I will remove the disable flag to simplify the code.

Does this mean we can then remove mm_data (and mm_processor_kwargs?) from EngineCoreRequest? :)

@ywang96 (Member) commented Nov 26, 2024

> @robertgshaw2-neuralmagic I think you're right about P0 always running the mm_mapper, so I will remove the disable flag to simplify the code.

> Does this mean we can then remove mm_data (and mm_processor_kwargs?) from EngineCoreRequest? :)

Yeah, if P0 is always going to run the multimodal data processor (mm_mapper), then P1 should only need to receive mm_inputs.
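The request-slimming being discussed can be sketched with two hypothetical dataclasses (field sets are simplified and illustrative — not vLLM's actual `EngineCoreRequest`):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class EngineCoreRequestBefore:
    """Before this PR: raw multi-modal payloads crossed the IPC boundary."""
    prompt_token_ids: list[int]
    mm_data: Optional[dict[str, Any]] = None              # raw images
    mm_processor_kwargs: Optional[dict[str, Any]] = None  # HF processor kwargs

@dataclass
class EngineCoreRequestAfter:
    """After: P0 runs the mapper, so only processed mm_inputs reach P1."""
    prompt_token_ids: list[int]
    mm_inputs: Optional[list[Any]] = None
```

Dropping `mm_data` and `mm_processor_kwargs` shrinks the serialized request and keeps the HF preprocessor dependency entirely on the frontend side.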

@ywang96 (Member) left a comment


Overall LGTM! I left two comments so please take a look.

Review thread on examples/offline_inference_vision_language.py (outdated, resolved)
Review thread on vllm/v1/engine/core.py (outdated, resolved)
@alexm-neuralmagic alexm-neuralmagic changed the title [V1] VLM - Support running the mm_mapper preprocessor in the frontend process [V1] VLM - Run the mm_mapper preprocessor in the frontend process Dec 2, 2024
@mergify mergify bot removed the needs-rebase label Dec 2, 2024
@alexm-neuralmagic (Collaborator, Author) commented:

@njhill @ywang96 removed mm_data and mm_processor_kwargs from EngineCoreRequest.

    # Preprocess multi-modal data
    mm_inputs = self.mm_input_mapper.process_inputs(
        decoder_inputs.multi_modal_data, decoder_inputs.mm_processor_kwargs
    ) if decoder_inputs.multi_modal_data is not None else None
@ywang96 (Member) commented:

Suggested change:
-    ) if decoder_inputs.multi_modal_data is not None else None
+    ) if decoder_inputs.multi_modal_data else None

I think this is why the entrypoints test is failing: decoder_inputs.multi_modal_data always returns a dictionary (possibly empty), never None.
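The failure mode described here is easy to reproduce in isolation: a text-only request carries an empty dict, which passes an `is not None` check but fails a truthiness check:

```python
multi_modal_data: dict = {}  # text-only request: empty dict, never None

# The original guard would still invoke the mapper on the empty dict:
runs_mapper_old = multi_modal_data is not None  # True  (mapper runs, wrongly)

# The suggested truthiness guard correctly skips it:
runs_mapper_new = bool(multi_modal_data)        # False (mapper skipped)

print(runs_mapper_old, runs_mapper_new)  # prints: True False
```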

@alexm-neuralmagic (Collaborator, Author) replied:

Thanks, good catch!

@alexm-neuralmagic (Collaborator, Author) commented:

Will revert changes to offline_inference_vision_language.py and see if I can use the other script.

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 3, 2024
@ywang96 (Member) commented Dec 3, 2024

Confirmed that the test fix in 382fc0b resolved the issue on CI, so I'm going to auto-merge this.

@ywang96 ywang96 enabled auto-merge (squash) December 3, 2024 10:21
@ywang96 ywang96 merged commit 3bc94ca into main Dec 3, 2024
50 of 51 checks passed
@ywang96 ywang96 deleted the v1_vlm_mapper branch December 3, 2024 10:33
Labels: frontend, ready (ONLY add when PR is ready to merge/full CI is needed)
6 participants