Description
Native streaming ASR pipelines (RNNT/CTC) currently process frames sequentially, one at a time rather than in chunks, which limits throughput and scaling efficiency in high-concurrency production environments. Inspired by the recent vLLM-style execution chunking introduced for SpeechLM/SALM (e.g., PRs #15520, #15648), I propose bringing the same execution-chunking and centralized state-caching approach to the core ASR streaming utilities.
Proposed Architecture
To eliminate memory fragmentation and minimize deep-copy overhead across the Python/C++ boundary, I plan to implement a centralized StateCacheManager that pre-allocates contiguous memory buffers for streaming states. This allows chunked frame execution (e.g., 16 or 32 frames per chunk) to leverage batched matrix multiplications during the forward pass.
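To make the pre-allocation idea concrete, here is a minimal, standard-library-only sketch. Only the StateCacheManager name comes from this proposal. The method names (acquire, state, release), the fixed per-stream slot layout, and the use of array/memoryview in place of real tensors are assumptions for illustration, not the intended final API.

```python
from array import array
from typing import Dict, List


class StateCacheManager:
    """Hypothetical sketch: per-stream state slots carved out of one pre-allocated buffer."""

    def __init__(self, num_slots: int, state_dim: int) -> None:
        self.state_dim = state_dim
        # One contiguous float32 buffer, allocated once up front.
        self._buffer = array("f", [0.0]) * (num_slots * state_dim)
        self._view = memoryview(self._buffer)
        # Stack of free slot indices; slot 0 is handed out first.
        self._free: List[int] = list(range(num_slots - 1, -1, -1))
        self._slot_of: Dict[str, int] = {}

    def acquire(self, stream_id: str) -> memoryview:
        """Reserve a zeroed slot for a new stream and return a view of it (no copy)."""
        if stream_id in self._slot_of:
            raise KeyError(f"stream {stream_id!r} already has a slot")
        if not self._free:
            raise RuntimeError("state cache is full")
        self._slot_of[stream_id] = self._free.pop()
        view = self.state(stream_id)
        for i in range(len(view)):  # clear state left over from a previous stream
            view[i] = 0.0
        return view

    def state(self, stream_id: str) -> memoryview:
        """Return a view into the shared buffer; writes through it update the cache."""
        start = self._slot_of[stream_id] * self.state_dim
        return self._view[start:start + self.state_dim]

    def release(self, stream_id: str) -> None:
        """Return a finished stream's slot to the free list for reuse."""
        self._free.append(self._slot_of.pop(stream_id))


cache = StateCacheManager(num_slots=4, state_dim=8)
state = cache.acquire("stream-a")
state[0] = 1.5  # written in place, no clone of the state
assert cache.state("stream-a")[0] == 1.5
cache.release("stream-a")
```

Because every slot is a slice of the same buffer, starting or finishing a stream never allocates, and readers see updates without copying. A real implementation would presumably hold tensors on the target device rather than a Python array.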
To ensure this remains easily reviewable and safe for production, I have broken this feature down into a 4-part modular PR stack:
- PR 1 (Interfaces): Introduce the base ChunkedInferenceWrapper and StateCacheManager interfaces. This will be strictly opt-in via a use_chunked_inference=True flag in the configuration, maintaining full backward compatibility.
- PR 2 (Core Logic): Integrate chunked execution within rnnt_models.py and the core forward/transcribe graphs using localized context managers.
- PR 3 (Performance): Implement a zero-copy buffer protocol to share memory pointers directly with the underlying decoders/C++ bindings, preventing costly deep tensor clones during streaming.
- PR 4 (Validation): Add exhaustive parameterized integration tests verifying numerical equivalence (1e-4 tolerance) against the legacy sequential pipeline to ensure zero Word Error Rate (WER) regression.
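The equivalence check planned for PR 4, together with the opt-in flag from PR 1, could look roughly like the sketch below. The toy recurrence (an exponential moving average) stands in for a real streaming encoder. Apart from the use_chunked_inference flag name and the 1e-4 tolerance, every name here is hypothetical.

```python
import math
from typing import List

DECAY = 0.9  # toy recurrence coefficient (illustrative only)


def run_sequential(frames: List[float]) -> List[float]:
    """Legacy-style path: update the carried state one frame at a time."""
    state, outputs = 0.0, []
    for frame in frames:
        state = DECAY * state + (1.0 - DECAY) * frame
        outputs.append(state)
    return outputs


def run_chunked(frames: List[float], chunk_size: int = 16) -> List[float]:
    """Chunked path: do the stateless work per chunk, carry state across chunks."""
    state, outputs = 0.0, []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        # Stateless part computed for the whole chunk at once
        # (a stand-in for a batched matrix multiplication).
        scaled = [(1.0 - DECAY) * frame for frame in chunk]
        # Recurrent part still consumes the chunk in order, reusing cached state.
        for value in scaled:
            state = DECAY * state + value
            outputs.append(state)
    return outputs


def run(frames: List[float], use_chunked_inference: bool = False, chunk_size: int = 16) -> List[float]:
    """Opt-in dispatch: the sequential path stays the default."""
    if use_chunked_inference:
        return run_chunked(frames, chunk_size)
    return run_sequential(frames)


def max_abs_diff(a: List[float], b: List[float]) -> float:
    """Largest element-wise absolute difference between two equal-length outputs."""
    return max(abs(x - y) for x, y in zip(a, b))


# 100 frames is not a multiple of 16 or 32, so the ragged final chunk is exercised.
frames = [math.sin(0.1 * i) for i in range(100)]
reference = run(frames)
for chunk_size in (16, 32):
    chunked = run(frames, use_chunked_inference=True, chunk_size=chunk_size)
    assert len(chunked) == len(reference)
    assert max_abs_diff(reference, chunked) <= 1e-4
```

In the actual test suite this loop would presumably become a parameterized test over chunk sizes and model configurations, comparing real model outputs instead of the toy recurrence.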
I would love to get your thoughts on this design direction. If the core team is open to this optimization, I can submit the first PR (Base Interface & State Management) immediately for review.