[Integration] Expose length-aware batching in all ModelHandler subclasses#37945
Eliaaazzz wants to merge 3 commits into apache:master
Conversation
./gemini review
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Code Review
This pull request introduces length-aware batching capabilities to various ML inference model handlers across different frameworks (Gemini, HuggingFace, ONNX, PyTorch, Scikit-learn, TensorFlow, TensorRT, vLLM, XGBoost, and Vertex AI). This is achieved by adding batch_length_fn and batch_bucket_boundaries parameters to the constructors of these handlers. New tests have been added to base_test.py to validate this functionality and ensure the new batching parameters are correctly forwarded. A review comment suggests refactoring the newly added HandlerBucketingKwargsForwardingTest class to reduce code repetition and improve maintainability by parameterizing the test methods.
assert_that(results, equal_to(['a:2', 'bb:2', 'cccccc:7', 'ddddddd:7']))

class HandlerBucketingKwargsForwardingTest(unittest.TestCase):
The test methods in this class are very repetitive. To improve maintainability and reduce code duplication, consider parameterizing these tests. You could use a library like parameterized or unittest.TestCase.subTest with a loop over a list of handler configurations. Each configuration could specify the handler class, its specific __init__ arguments, and any necessary mocks or setup.
Good point. These cases all exercise the same forwarding assertion with different handler constructors, so I can consolidate them into a single data-driven test using subTest.
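As a rough illustration of that consolidation (using stand-in handler classes rather than the real Beam ones, and assuming the kwargs are forwarded under their constructor names), the data-driven test could take this shape:

```python
import unittest


class _StubHandlerA:
  """Stand-in for a concrete ModelHandler; kwarg names are assumptions."""
  def __init__(self, batch_length_fn=None, batch_bucket_boundaries=None):
    self._batching_kwargs = {}
    if batch_length_fn is not None:
      self._batching_kwargs['batch_length_fn'] = batch_length_fn
    if batch_bucket_boundaries is not None:
      self._batching_kwargs['batch_bucket_boundaries'] = batch_bucket_boundaries

  def batch_elements_kwargs(self):
    return self._batching_kwargs


class _StubHandlerB(_StubHandlerA):
  """A second stand-in, to show multiple handlers sharing one test body."""


class HandlerBucketingKwargsForwardingTest(unittest.TestCase):
  def test_forwards_bucketing_kwargs(self):
    length_fn = len
    boundaries = [8, 32]
    # One entry per handler: (name, handler class, extra ctor kwargs).
    cases = [
        ('stub_a', _StubHandlerA, {}),
        ('stub_b', _StubHandlerB, {}),
    ]
    for name, handler_cls, extra in cases:
      with self.subTest(handler=name):
        handler = handler_cls(
            batch_length_fn=length_fn,
            batch_bucket_boundaries=boundaries,
            **extra)
        kwargs = handler.batch_elements_kwargs()
        self.assertIs(kwargs['batch_length_fn'], length_fn)
        self.assertEqual(kwargs['batch_bucket_boundaries'], boundaries)
```

Each failing case is reported individually by `subTest`, so one misbehaving handler does not mask the others.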
Force-pushed 5267548 to 0f077f6
Assigning reviewers: R: @shunping for label python.

Note: If you would like to opt out of this review, comment with one of the available commands. The PR bot will only process comments in the main thread (not review comments).
Force-pushed d7f7981 to b2b8c70
Expose length-aware batching in all ModelHandler subclasses

Completes the smart bucketing feature (apache#37531) by exposing batch_length_fn and batch_bucket_boundaries parameters across all concrete ModelHandler implementations. This allows users to enable length-aware batching on supported inference backends by passing these parameters directly to the handler constructor.

- adds batch_length_fn / batch_bucket_boundaries to 16 handler classes
- wires Gemini and Vertex AI batching params into _batching_kwargs
- adds end-to-end RunInference coverage for length-aware batching
- adds per-handler forwarding regression tests and fixes them to be hermetic
Force-pushed b2b8c70 to af299ec
@shunping Hi, could you please have a look? All three failed checks show the same infrastructure error: "The self-hosted runner lost communication with the server." No tests actually failed before the runner disconnected. This appears to be a transient Apache Beam CI infrastructure issue with the self-hosted runner dropping offline during execution, and it is unrelated to the changes in this PR.

@damccorm, could you help with this?
Summary
Addresses #37531.
This PR completes the smart bucketing integration for Python `RunInference` by exposing `batch_length_fn` and `batch_bucket_boundaries` on all concrete `ModelHandler` implementations.

The underlying batching support already exists in the base layer. The missing piece was that many user-facing handlers did not surface these options, which made length-aware batching effectively unavailable for a large part of the inference API surface. With this change, users can enable smart bucketing directly from the handler constructor across supported backends.
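Conceptually, length-aware batching measures each element with a length function and assigns it to a bucket by boundary, so similarly sized elements end up batched together. A self-contained sketch of that grouping (the boundary semantics here are illustrative, not necessarily Beam's exact implementation):

```python
import bisect


def bucket_index(element, length_fn, boundaries):
  """Pick a bucket for an element; boundaries are sorted upper bounds.

  Elements longer than the last boundary fall into a final overflow
  bucket. Exact edge semantics in Beam may differ; this is a sketch.
  """
  return bisect.bisect_left(boundaries, length_fn(element))


boundaries = [2, 8]  # illustrative boundaries, not from the PR
elements = ['a', 'bb', 'cccccc', 'ddddddd']

buckets = {}
for e in elements:
  buckets.setdefault(bucket_index(e, len, boundaries), []).append(e)

# The short strings 'a' and 'bb' share one bucket, while the longer
# strings group into another, so no batch mixes very different lengths.
```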
What Changed
This change adds `batch_length_fn` and `batch_bucket_boundaries` to 16 concrete handlers across the following backends: Gemini, HuggingFace, ONNX, PyTorch, Scikit-learn, TensorFlow, TensorRT, vLLM, XGBoost, and Vertex AI.

Implementation details:

- Concrete subclasses of `ModelHandler` now pass the new parameters through to `super().__init__()`
- The remote handlers (`GeminiModelHandler` and `VertexAIModelHandlerJSON`) now wire the values into `_batching_kwargs`
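The forwarding pattern described above can be sketched with minimal stand-in classes (the real base class lives in `apache_beam.ml.inference.base`; the `_batching_kwargs` key names and the example backend handler below are assumptions for illustration):

```python
class BaseModelHandler:
  """Minimal stand-in for the ModelHandler batching plumbing."""
  def __init__(self, *, batch_length_fn=None, batch_bucket_boundaries=None):
    self._batching_kwargs = {}
    if batch_length_fn is not None:
      self._batching_kwargs['batch_length_fn'] = batch_length_fn
    if batch_bucket_boundaries is not None:
      self._batching_kwargs['batch_bucket_boundaries'] = batch_bucket_boundaries

  def batch_elements_kwargs(self):
    return self._batching_kwargs


class ExampleBackendModelHandler(BaseModelHandler):
  """Hypothetical concrete handler surfacing the new parameters."""
  def __init__(self, model_uri, *, batch_length_fn=None,
               batch_bucket_boundaries=None):
    # Forward the length-aware batching options to the base layer.
    super().__init__(
        batch_length_fn=batch_length_fn,
        batch_bucket_boundaries=batch_bucket_boundaries)
    self._model_uri = model_uri


handler = ExampleBackendModelHandler(
    'gs://bucket/model',  # hypothetical model path
    batch_length_fn=len,
    batch_bucket_boundaries=[16, 64])
```

The key point is that a subclass only has to accept the parameters and delegate to `super().__init__()`; the base layer then exposes them via `batch_elements_kwargs()`.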
Added test coverage in `base_test.py` for both behavior and wiring:

- `RunInferenceLengthAwareBatchingTest`, which verifies short and long string inputs are bucketed into separate batches under `FnApiRunner`
- `HandlerBucketingKwargsForwardingTest`, which checks each concrete handler forwards `batch_length_fn` and `batch_bucket_boundaries` into `batch_elements_kwargs()`

Context
This is the final integration piece for smart bucketing:
Together, these changes make length-aware batching usable through the public Python inference handlers rather than only at the base implementation layer.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.

See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md