⚡️ Speed up function demo_postprocess by 39%#49
Open
codeflash-ai[bot] wants to merge 1 commit intomainfrom
Open
⚡️ Speed up function demo_postprocess by 39%#49codeflash-ai[bot] wants to merge 1 commit intomainfrom
demo_postprocess by 39%#49codeflash-ai[bot] wants to merge 1 commit intomainfrom
Conversation
The optimized code achieves a **39% speedup** through two key optimizations: ## 1. Grid Caching (~9.5% of runtime) The original code rebuilt meshgrids from scratch on every call via `np.meshgrid(np.arange(wsize), np.arange(hsize))`, taking ~15.3% of runtime. The optimization introduces `_GRID_CACHE` to memoize these grids by `(hsize, wsize)` pairs. Since `demo_postprocess` is called repeatedly with the same input shape (1024, 768) in the inference pipeline (see `function_references`), subsequent calls retrieve cached grids in ~0.1-0.2μs vs ~93μs to rebuild, eliminating redundant computation. ## 2. Eliminating Intermediate Array Concatenation (~88% of runtime) The original code built full `grids` and `expanded_strides` arrays via: - Multiple `np.concatenate()` calls (1.8% runtime) - Broadcasting these large arrays across the entire outputs tensor in vectorized operations (74.4% runtime for the two broadcast multiplications) The optimization replaces this with **per-stride slice processing**: it iterates through each stride block, directly updating the corresponding slice of `outputs` in-place. This avoids: - Allocating temporary arrays (`grids`: 1×8400×2, `expanded_strides`: 1×8400×1 for 1024×768 images) - Broadcasting these arrays across the full tensor - Memory copies during concatenation Instead, each stride's computation uses only a small cached grid (e.g., 1×16384×2 for stride=8) that's added/multiplied with the relevant slice. This reduces peak memory usage and cache thrashing, particularly beneficial for large image sizes (test results show 42.9% speedup for 1280×1280 images). ## Performance Characteristics - **Best for**: Repeated calls with identical `img_size` (caching maximizes benefit) and large images/batches (avoids expensive concatenation overhead). Test cases show 81-329% speedups for such scenarios. - **Workload Impact**: Since `demo_postprocess` is called in the hot path of YOLOX inference (once per image in `image_processing`), this 39% reduction directly improves end-to-end detection throughput. The optimization is especially impactful for batch processing or video streams where the same input shape recurs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
📄 39% (0.39x) speedup for
demo_postprocessinunstructured_inference/models/yolox.py⏱️ Runtime :
76.3 milliseconds→54.8 milliseconds(best of5runs)📝 Explanation and details
The optimized code achieves a 39% speedup through two key optimizations:
1. Grid Caching (~9.5% of runtime)
The original code rebuilt meshgrids from scratch on every call via
np.meshgrid(np.arange(wsize), np.arange(hsize)), taking ~15.3% of runtime. The optimization introduces_GRID_CACHEto memoize these grids by(hsize, wsize)pairs. Sincedemo_postprocessis called repeatedly with the same input shape (1024, 768) in the inference pipeline (seefunction_references), subsequent calls retrieve cached grids in ~0.1-0.2μs vs ~93μs to rebuild, eliminating redundant computation.2. Eliminating Intermediate Array Concatenation (~88% of runtime)
The original code built full
gridsandexpanded_stridesarrays via:np.concatenate()calls (1.8% runtime)The optimization replaces this with per-stride slice processing: it iterates through each stride block, directly updating the corresponding slice of
outputsin-place. This avoids:grids: 1×8400×2,expanded_strides: 1×8400×1 for 1024×768 images)Instead, each stride's computation uses only a small cached grid (e.g., 1×16384×2 for stride=8) that's added/multiplied with the relevant slice. This reduces peak memory usage and cache thrashing, particularly beneficial for large image sizes (test results show 42.9% speedup for 1280×1280 images).
Performance Characteristics
img_size(caching maximizes benefit) and large images/batches (avoids expensive concatenation overhead). Test cases show 81-329% speedups for such scenarios.demo_postprocessis called in the hot path of YOLOX inference (once per image inimage_processing), this 39% reduction directly improves end-to-end detection throughput. The optimization is especially impactful for batch processing or video streams where the same input shape recurs.✅ Correctness verification report:
🌀 Click to see Generated Regression Tests
To edit these changes
git checkout codeflash/optimize-demo_postprocess-mkovoubsand push.