[RFC] Unified vLLM-style Execution Chunking & State Caching for ASR Streaming Pipelines #15740

### Description
Native streaming ASR pipelines (RNNT/CTC) currently process frames sequentially rather than in chunks, which limits throughput and scaling efficiency in high-concurrency production environments. Inspired by the vLLM-style execution chunking recently introduced for SpeechLM/SALM (e.g., PRs #15520, #15648), I propose bringing the same execution-chunking and centralized state-caching approach to the core ASR streaming utilities.

### Proposed Architecture
To avoid memory fragmentation and minimize deep-copy overhead across the Python/C++ boundary, I plan to implement a centralized `StateCacheManager` that pre-allocates contiguous memory buffers for streaming states. This allows chunked frame execution (e.g., 16 or 32 frames per chunk) to use batched matrix multiplications in the forward pass.
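To make the pre-allocation idea concrete, here is a minimal, standard-library-only sketch. It is not the proposed implementation: the method names (`acquire`, `state`, `release`), the fixed slot layout, and the use of `array`/`memoryview` in place of real tensors are all assumptions made for illustration. It only shows one contiguous buffer being carved into per-stream slots that are handed out as views rather than copies.

```python
from array import array
from typing import Dict, List


class StateCacheManager:
    """Toy state cache: one contiguous float buffer split into fixed-size slots.

    Hypothetical sketch; names and layout are assumptions, not an existing API.
    """

    def __init__(self, max_streams: int, state_size: int) -> None:
        self.state_size = state_size
        # Single up-front allocation; no per-stream allocations afterwards.
        self._buffer = array("f", [0.0]) * (max_streams * state_size)
        # A memoryview exposes the buffer without copying it.
        self._view = memoryview(self._buffer)
        self._free_slots: List[int] = list(range(max_streams - 1, -1, -1))
        self._slot_of: Dict[str, int] = {}

    def state(self, stream_id: str) -> memoryview:
        """Return a zero-copy view of the slot owned by `stream_id`."""
        start = self._slot_of[stream_id] * self.state_size
        return self._view[start:start + self.state_size]

    def acquire(self, stream_id: str) -> memoryview:
        """Assign a free slot to a new stream and reset it to zeros."""
        if stream_id in self._slot_of:
            return self.state(stream_id)
        if not self._free_slots:
            raise RuntimeError("state cache is full")
        self._slot_of[stream_id] = self._free_slots.pop()
        view = self.state(stream_id)
        for i in range(self.state_size):
            view[i] = 0.0
        return view

    def release(self, stream_id: str) -> None:
        """Return the stream's slot to the free list for reuse."""
        self._free_slots.append(self._slot_of.pop(stream_id))


cache = StateCacheManager(max_streams=4, state_size=8)
state_a = cache.acquire("stream-a")
state_a[0] = 1.5  # writes go straight into the shared buffer
assert cache.state("stream-a")[0] == 1.5
cache.release("stream-a")
```

Because every slot is a slice of the same buffer, writes through one view are visible through any other view of that slot, which is the property a zero-copy handoff to a decoder would rely on.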

To keep this easily reviewable and safe for production, I have split the feature into a four-part modular PR stack:

* **PR 1 (Interfaces):** Introduce the base `ChunkedInferenceWrapper` and `StateCacheManager` interfaces. This will be strictly opt-in via a `use_chunked_inference=True` flag in the configuration, maintaining full backward compatibility.
* **PR 2 (Core Logic):** Integrate chunked execution within `rnnt_models.py` and the core forward/transcribe graphs using localized context managers.
* **PR 3 (Performance):** Implement a zero-copy buffer protocol to share memory pointers directly with the underlying decoders/C++ bindings, preventing costly deep tensor clones during streaming.
* **PR 4 (Validation):** Add exhaustive parameterized integration tests that verify numerical equivalence (1e-4 tolerance) with the legacy sequential pipeline, so that Word Error Rate (WER) does not regress.
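As a rough illustration of how the opt-in flag from PR 1 and the equivalence check from PR 4 could fit together, here is a toy sketch. The "model" is an assumption: a simple exponential moving average stands in for a stateful streaming encoder, and the method names on `ChunkedInferenceWrapper` are hypothetical. The chunked path deliberately computes each output through a different arithmetic route (carried-state term plus chunk-local term), so comparing it against the sequential path within a tolerance is a meaningful check.

```python
import math
from typing import List, Sequence, Tuple


def step(frame: float, state: float) -> Tuple[float, float]:
    """Toy stateful step (exponential moving average): returns (output, new_state)."""
    new_state = 0.9 * state + 0.1 * frame
    return new_state, new_state


class ChunkedInferenceWrapper:
    """Hypothetical wrapper: sequential by default, chunked only when opted in."""

    def __init__(self, chunk_size: int = 16, use_chunked_inference: bool = False) -> None:
        if chunk_size < 1:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size
        self.use_chunked_inference = use_chunked_inference
        self.state = 0.0  # state carried across frames and chunks

    def _forward_sequential(self, frames: Sequence[float]) -> List[float]:
        outputs = []
        for frame in frames:
            out, self.state = step(frame, self.state)
            outputs.append(out)
        return outputs

    def _forward_chunk(self, chunk: Sequence[float]) -> List[float]:
        # Output k = 0.9^(k+1) * carried_state + contribution of the chunk's own frames.
        outputs, local, decay = [], 0.0, 1.0
        for frame in chunk:
            decay *= 0.9
            local = 0.9 * local + 0.1 * frame
            outputs.append(decay * self.state + local)
        if outputs:
            self.state = outputs[-1]
        return outputs

    def forward(self, frames: Sequence[float]) -> List[float]:
        if not self.use_chunked_inference:
            return self._forward_sequential(frames)
        outputs: List[float] = []
        for start in range(0, len(frames), self.chunk_size):
            outputs.extend(self._forward_chunk(frames[start:start + self.chunk_size]))
        return outputs


def check_equivalence(frames: Sequence[float], chunk_size: int, tol: float = 1e-4) -> bool:
    """Compare the chunked path against the legacy sequential path within `tol`."""
    legacy = ChunkedInferenceWrapper().forward(frames)
    chunked = ChunkedInferenceWrapper(chunk_size, use_chunked_inference=True).forward(frames)
    return len(legacy) == len(chunked) and all(
        math.isclose(a, b, abs_tol=tol) for a, b in zip(legacy, chunked)
    )


frames = [math.sin(0.1 * i) for i in range(100)]
for size in (16, 32):  # chunk sizes mentioned in the proposal
    assert check_equivalence(frames, size)
```

In a real test suite the same comparison would presumably be parameterized over models, chunk sizes, and audio lengths, including lengths that are not a multiple of the chunk size, as in the 100-frame input above.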

I would love to get your thoughts on this design direction. If the core team is open to this optimization, I can submit the first PR (Base Interface & State Management) immediately for review.
