[RFC] Unified vLLM-style Execution Chunking & State Caching for ASR Streaming Pipelines #15740

@shivansh023023

Description

Native streaming ASR pipelines (RNNT/CTC) currently process frames sequentially, without chunking, which limits throughput and scaling efficiency in high-concurrency production environments. Inspired by the vLLM-style execution chunking recently introduced for SpeechLM/SALM (e.g., PRs #15520, #15648), I propose bringing the same execution-chunking and centralized state-caching approach to the core ASR streaming utilities.

Proposed Architecture

To reduce memory fragmentation and minimize deep-copy overhead across the Python/C++ boundary, I plan to implement a centralized StateCacheManager that pre-allocates contiguous memory buffers for streaming states. With state held in fixed buffers, frames can be executed in chunks (e.g., 16 or 32 frames at a time), so the forward pass can use batched matrix multiplications instead of one small operation per frame.
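The pre-allocation idea can be sketched as follows. This is a minimal illustration using only the Python standard library, not the proposed implementation: the method names (`acquire`, `get`, `release`), the flat float32 layout, and the per-stream slot scheme are all assumptions made for the example. A real version would presumably hold framework tensors rather than an `array`.

```python
from array import array
from typing import Dict, List


class StateCacheManager:
    """Hypothetical sketch: one contiguous float32 pool, one fixed-size slot per stream."""

    def __init__(self, max_streams: int, state_size: int) -> None:
        self.state_size = state_size
        # Single up-front allocation; no per-step allocations afterwards.
        self._pool = array("f", [0.0]) * (max_streams * state_size)
        self._view = memoryview(self._pool)
        self._free: List[int] = list(range(max_streams - 1, -1, -1))
        self._slot_of: Dict[str, int] = {}

    def acquire(self, stream_id: str) -> memoryview:
        """Reserve a slot for a new stream and return a zero-copy view of it."""
        if stream_id in self._slot_of:
            raise KeyError(f"stream {stream_id!r} already has a slot")
        if not self._free:
            raise RuntimeError("state cache is full")
        self._slot_of[stream_id] = self._free.pop()
        view = self.get(stream_id)
        for i in range(len(view)):  # reset state left behind by a previous stream
            view[i] = 0.0
        return view

    def get(self, stream_id: str) -> memoryview:
        """Return a view into the shared pool; slicing a memoryview does not copy."""
        start = self._slot_of[stream_id] * self.state_size
        return self._view[start:start + self.state_size]

    def release(self, stream_id: str) -> None:
        """Return the stream's slot to the free list for reuse."""
        self._free.append(self._slot_of.pop(stream_id))
```

Because `get` returns a slice of a `memoryview`, writes made through one view are visible through any other view of the same slot, which is the property a zero-copy hand-off to a decoder would rely on.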

To keep this easy to review and safe for production, I have broken the feature down into a four-part modular PR stack:

  • PR 1 (Interfaces): Introduce the base ChunkedInferenceWrapper and StateCacheManager interfaces. This will be strictly opt-in via a use_chunked_inference=True flag in the configuration, maintaining full backward compatibility.
  • PR 2 (Core Logic): Integrate chunked execution within rnnt_models.py and the core forward/transcribe graphs using localized context managers.
  • PR 3 (Performance): Implement a zero-copy buffer protocol to share memory pointers directly with the underlying decoders/C++ bindings, preventing costly deep tensor clones during streaming.
  • PR 4 (Validation): Add exhaustive parameterized integration tests verifying numerical equivalence (1e-4 tolerance) against the legacy sequential pipeline to ensure zero Word Error Rate (WER) regression.
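The equivalence check planned for PR 4 could take roughly the following shape. This is a toy sketch under stated assumptions: an exponential moving average stands in for the stateful model, and `run_sequential`, `run_chunked`, and `max_abs_diff` are hypothetical helpers, not NeMo APIs. The chunked path computes each chunk in closed form (the scalar analogue of a batched matrix multiply), so its floating-point results can differ slightly from the frame-by-frame path, which is why a tolerance is needed at all.

```python
import math
from typing import List

ALPHA = 0.9  # smoothing factor of the toy stateful "model"


def run_sequential(frames: List[float]) -> List[float]:
    """Legacy-style path: update the state one frame at a time."""
    state, outputs = 0.0, []
    for frame in frames:
        state = ALPHA * state + (1.0 - ALPHA) * frame
        outputs.append(state)
    return outputs


def run_chunked(frames: List[float], chunk_size: int) -> List[float]:
    """Chunked path: compute every output of a chunk from the carried-in state."""
    state, outputs = 0.0, []
    powers = [ALPHA ** k for k in range(chunk_size + 1)]
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        chunk_out = []
        for k in range(len(chunk)):
            # y_k = alpha^(k+1) * state + (1 - alpha) * sum_j alpha^(k-j) * x_j
            acc = sum(powers[k - j] * chunk[j] for j in range(k + 1))
            chunk_out.append(powers[k + 1] * state + (1.0 - ALPHA) * acc)
        outputs.extend(chunk_out)
        state = chunk_out[-1]  # state carried across the chunk boundary
    return outputs


def max_abs_diff(a: List[float], b: List[float]) -> float:
    """Largest element-wise absolute difference between two equal-length outputs."""
    assert len(a) == len(b)
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


# 100 frames is deliberately not a multiple of 16 or 32, so the last chunk is partial.
frames = [math.sin(0.1 * i) for i in range(100)]
for chunk_size in (16, 32):
    diff = max_abs_diff(run_sequential(frames), run_chunked(frames, chunk_size))
    assert diff <= 1e-4, f"chunk_size={chunk_size}: diff={diff}"
```

In the real test suite the same comparison would presumably be run on encoder outputs or logits from actual models, parameterized over chunk sizes and utterance lengths.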

I would love to get your thoughts on this design direction. If the core team is open to this optimization, I can submit the first PR (Base Interface & State Management) immediately for review.
