
ASR: Use length recurrence for streaming pre-encode drop count #15689

Open

1fanwang wants to merge 1 commit into NVIDIA-NeMo:main from 1fanwang:fix/streaming-drop-extra-pre-encoded-recurrence

Conversation


@1fanwang 1fanwang commented May 12, 2026

Closes #15482.

What

The streaming encoder's `drop_extra_pre_encoded` count was computed as `1 + (cache_size - 1) // subsampling_factor`. For convolutional subsampling that formula is only accurate at the default `cache_size = subsampling_factor + 1`, because the actual `ConvSubsampling.forward` uses the convolutional length recurrence

L_next = floor((L + all_paddings - kernel_size) / stride) + 1

(or `ceil` under `_ceil_mode`), composed over `_sampling_num` layers.

For any other `pre_encode_cache_size`, the divisor approximation diverges from what `forward` produces, so streaming inference drops the wrong number of frames and the chunked output disagrees with a full pass — the mismatch reported in #15482.
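The divergence can be reproduced standalone. This is a minimal sketch assuming `kernel_size=3`, `stride=2`, `all_paddings=2` per layer and `log2(factor)` conv layers; those hyperparameters are typical of NeMo's striding subsamplers but are assumptions here, not values taken from the diff. Under them, the PR's `cache_size=11` case reproduces once ceil rounding enters the recurrence:

```python
import math

def legacy_drop(cache_size: int, factor: int) -> int:
    # Old divisor approximation used for drop_extra_pre_encoded.
    return 1 + (cache_size - 1) // factor

def recurrence_drop(cache_size: int, num_layers: int, kernel_size: int = 3,
                    stride: int = 2, all_paddings: int = 2,
                    ceil_mode: bool = False) -> int:
    # Compose the convolutional length recurrence over num_layers layers,
    # mirroring what ConvSubsampling.forward reports as the output length.
    length = cache_size
    for _ in range(num_layers):
        frac = (length + all_paddings - kernel_size) / stride
        length = int(math.ceil(frac) if ceil_mode else math.floor(frac)) + 1
    return length

# factor=8 -> 3 conv layers; at the default cache_size = 8 + 1 = 9 both agree:
assert legacy_drop(9, 8) == recurrence_drop(9, 3) == 2
# ...but at cache_size=11 the divisor approximation undercounts by one:
assert legacy_drop(11, 8) == 2
assert recurrence_drop(11, 3, ceil_mode=True) == 3
```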

This PR routes the drop count through a new `get_streaming_drop_size(cache_size)` method on each subsampler (a sketch of the dispatch follows this list):

  • `ConvSubsampling.get_streaming_drop_size` uses the same `calc_length` helper the encoder already uses for the forward pass, so the streaming drop count stays consistent with the encoder's own length bookkeeping.
  • `StackingSubsampling.get_streaming_drop_size` exposes the exact `cache_size // factor` relation.
  • `ConformerEncoder.setup_streaming_params` calls the new method when available; for custom `pre_encode` modules that predate it, it falls back to the legacy formula (which coincides with the new one only at the default `cache_size`).
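A minimal sketch of that dispatch, under the assumptions stated above; the free-function shape and the `hasattr` probe are illustrative, not the PR's exact diff:

```python
def compute_drop_extra_pre_encoded(pre_encode, cache_size: int,
                                   subsampling_factor: int) -> int:
    # Prefer the subsampler's own length bookkeeping when it exposes the
    # new hook (ConvSubsampling via calc_length, StackingSubsampling via
    # cache_size // factor)...
    if hasattr(pre_encode, "get_streaming_drop_size"):
        return pre_encode.get_streaming_drop_size(cache_size)
    # ...and fall back to the legacy divisor formula for custom pre_encode
    # modules that predate the method. The two agree only at the default
    # cache_size = subsampling_factor + 1.
    return 1 + (cache_size - 1) // subsampling_factor
```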

Tests

`tests/collections/asr/test_asr_subsampling.py::TestStreamingDropExtraPreEncoded` (a sketch of the main check follows this section):

  • `test_drop_size_matches_forward` — parametrized over 4 subsampler shapes (`striding`/`dw_striding` × `subsampling_factor=4/8`) and 7 cache sizes (1, 4, 8, 9, 11, 16, 32). Each case runs the actual `forward` on a `cache_size`-long input and asserts the returned `out_lengths[0]` equals `get_streaming_drop_size(cache_size)`.
  • `test_drop_size_legacy_formula_diverges_for_non_default_cache` — documents the bug: `subsampling_factor=8`, `cache_size=11` returns 2 under the old formula, but the convolutional recurrence (and the actual `forward`) returns 3.
  • `test_drop_size_zero_for_empty_cache` — `cache_size <= 0` → 0.
  • `test_stacking_drop_size` — exact `cache_size // factor` for `StackingSubsampling`.

The new tests fail on `main` and pass with this PR; the legacy-formula case demonstrates the divergence.
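The core consistency check could look roughly like the sketch below. The `ConvSubsampling` import path and constructor arguments follow NeMo's subsampling module, but the feature dimensions are arbitrary illustration values, and the body is a reconstruction from the description above, not the PR's test code:

```python
import pytest
import torch

from nemo.collections.asr.parts.submodules.subsampling import ConvSubsampling

@pytest.mark.parametrize("subsampling", ["striding", "dw_striding"])
@pytest.mark.parametrize("factor", [4, 8])
@pytest.mark.parametrize("cache_size", [1, 4, 8, 9, 11, 16, 32])
def test_drop_size_matches_forward(subsampling, factor, cache_size):
    layer = ConvSubsampling(
        subsampling=subsampling,
        subsampling_factor=factor,
        feat_in=80,        # arbitrary feature dims, chosen for the sketch
        feat_out=176,
        conv_channels=176,
    )
    # Run the real forward on a cache_size-long dummy input...
    x = torch.randn(1, cache_size, 80)
    _, out_lengths = layer(x, torch.tensor([cache_size]))
    # ...and require the streaming drop count to match its output length.
    assert int(out_lengths[0]) == layer.get_streaming_drop_size(cache_size)
```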

Commit message:

The streaming encoder's `drop_extra_pre_encoded` count was computed as
`1 + (cache_size - 1) // subsampling_factor`. For convolutional subsampling
that's only accurate at the default `cache_size = subsampling_factor + 1`
because the actual forward pass uses the convolutional length recurrence
`L_next = floor((L + paddings - kernel) / stride) + 1` (or `ceil` under
`_ceil_mode`) composed over `_sampling_num` layers.

For arbitrary `pre_encode_cache_size` the divisor approximation diverges
from what `forward` produces, so streaming inference drops the wrong
number of frames and the chunked output disagrees with a full pass — the
mismatch reported in NVIDIA-NeMo#15482.

Route the drop count through a new `get_streaming_drop_size(cache_size)`
on each subsampler. `ConvSubsampling` uses the same `calc_length` helper
the encoder already uses for the forward pass; `StackingSubsampling`
exposes the exact `cache_size // factor` relation. The encoder falls back
to the legacy formula only when `pre_encode` is a custom module that
predates this method; the two formulas agree only at the default cache size.

Tests parametrize 4 subsampler shapes × 7 cache sizes and assert
`get_streaming_drop_size` equals what `forward` actually returns. A
documented case (`subsampling_factor=8`, `cache_size=11`) shows the old
formula returning 2 while the recurrence returns 3.

Closes NVIDIA-NeMo#15482

Signed-off-by: 1fanwang <1fannnw@gmail.com>

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


Successfully merging this pull request may close these issues:

Problem with computing `drop_extra_pre_encoded` when varying `pre_encode_cache_size` for SubSampling and VGG frontends (#15482)

2 participants