ASR: Use length recurrence for streaming pre-encode drop count #15689
Open
1fanwang wants to merge 1 commit into
Conversation
The streaming encoder's `drop_extra_pre_encoded` count was computed as `1 + (cache_size - 1) // subsampling_factor`. For convolutional subsampling that is only accurate at the default `cache_size = subsampling_factor + 1`, because the actual forward pass uses the convolutional length recurrence `L_next = floor((L + paddings - kernel) / stride) + 1` (or `ceil` under `_ceil_mode`) composed over `_sampling_num` layers. For arbitrary `pre_encode_cache_size` the divisor approximation diverges from what `forward` produces, so streaming inference drops the wrong number of frames and the chunked output disagrees with a full pass; this is the mismatch reported in NVIDIA-NeMo#15482.

Route the drop count through a new `get_streaming_drop_size(cache_size)` on each subsampler. `ConvSubsampling` uses the same `calc_length` helper the encoder already uses for the forward pass; `StackingSubsampling` exposes the exact `cache_size // factor` relation. The encoder falls back to the legacy formula only when `pre_encode` is a custom module that predates this method, where it coincides with the current default.

Tests parametrize 4 subsampler shapes × 7 cache sizes and assert `get_streaming_drop_size` equals what `forward` actually returns. A documented case (`subsampling_factor=8`, `cache_size=11`) shows the old formula returning 2 while the recurrence returns 3.

Closes NVIDIA-NeMo#15482

Signed-off-by: 1fanwang <1fannnw@gmail.com>
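To make the recurrence concrete, here is a minimal standalone sketch of composing it over the subsampling layers (an illustration only, not NeMo's `calc_length` itself):

```python
import math

def subsampled_length(length: int, num_layers: int, kernel: int, stride: int,
                      paddings: int, ceil_mode: bool = False) -> int:
    # Apply L_next = floor((L + paddings - kernel) / stride) + 1 once per
    # subsampling layer (ceil instead of floor when ceil_mode is set),
    # mirroring what the forward pass does to the sequence length.
    rounding = math.ceil if ceil_mode else math.floor
    for _ in range(num_layers):
        length = rounding((length + paddings - kernel) / stride) + 1
    return int(length)
```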
Closes #15482.
What
The streaming encoder's `drop_extra_pre_encoded` count was computed as `1 + (cache_size - 1) // subsampling_factor`. For convolutional subsampling that formula is only accurate at the default `cache_size = subsampling_factor + 1`, because the actual `ConvSubsampling.forward` uses the convolutional length recurrence `L_next = floor((L + paddings - kernel) / stride) + 1` (or `ceil` under `_ceil_mode`), composed over `_sampling_num` layers.

For any other `pre_encode_cache_size`, the divisor approximation diverges from what `forward` produces, so streaming inference drops the wrong number of frames and the chunked output disagrees with a full pass; this is the mismatch reported in #15482.
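A runnable sketch of the divergence (not PR code; the kernel, stride, and causal padding values below are illustrative assumptions, since the real values come from the instantiated subsampler):

```python
def legacy_drop(cache_size: int, factor: int) -> int:
    # The old approximation.
    return 1 + (cache_size - 1) // factor

def recurrence_drop(cache_size: int, num_layers: int, kernel: int = 3,
                    stride: int = 2, paddings: int = 3) -> int:
    # L_next = floor((L + paddings - kernel) / stride) + 1 per layer.
    # paddings=3 assumes causal padding (left = kernel - 1, right = stride - 1),
    # as typical for cache-aware streaming convs; floor rounding assumed.
    length = cache_size
    for _ in range(num_layers):
        length = (length + paddings - kernel) // stride + 1
    return length

# subsampling_factor=8 means three stride-2 layers; cache_size=11:
print(legacy_drop(11, 8))      # -> 2
print(recurrence_drop(11, 3))  # -> 3 under the assumed padding
```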
This PR routes the drop count through a new `get_streaming_drop_size(cache_size)` on each subsampler (a sketch of the dispatch follows the list):

- `ConvSubsampling.get_streaming_drop_size` uses the same `calc_length` helper the encoder already uses for the forward pass, so the streaming drop count stays consistent with the encoder's own length bookkeeping.
- `StackingSubsampling.get_streaming_drop_size` exposes the exact `cache_size // factor` relation.
- `ConformerEncoder.setup_streaming_params` calls the new method when available; for custom `pre_encode` modules that predate it, it falls back to the legacy formula (which coincides with the new one only at the default `cache_size`).
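A minimal sketch of that dispatch, assuming a `hasattr` check for "when available" (the helper below is hypothetical, not the merged code):

```python
def streaming_drop_size(pre_encode, cache_size: int, subsampling_factor: int) -> int:
    # Prefer the subsampler's own length bookkeeping when it provides the method.
    if hasattr(pre_encode, "get_streaming_drop_size"):
        return pre_encode.get_streaming_drop_size(cache_size)
    # Legacy fallback for custom pre_encode modules that predate the method;
    # it matches the recurrence only at the default cache_size = factor + 1.
    return 1 + (cache_size - 1) // subsampling_factor
```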
Tests

`tests/collections/asr/test_asr_subsampling.py::TestStreamingDropExtraPreEncoded`:

- `test_drop_size_matches_forward`: parametrized over 4 subsampler shapes (`striding`/`dw_striding` × `subsampling_factor=4/8`) and 7 cache sizes (1, 4, 8, 9, 11, 16, 32). Each case runs the actual `forward` on a `cache_size`-long input and asserts the returned `out_lengths[0]` equals `get_streaming_drop_size(cache_size)`.
- `test_drop_size_legacy_formula_diverges_for_non_default_cache`: documents the bug. With `subsampling_factor=8` and `cache_size=11`, the old formula returns 2 but the convolutional recurrence (and the actual forward) returns 3.
- `test_drop_size_zero_for_empty_cache`: `cache_size <= 0` → 0.
- `test_stacking_drop_size`: exact `cache_size // factor` for `StackingSubsampling`.

The new tests fail on main and pass with this PR; the legacy formula case demonstrates the divergence.
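For orientation, the consistency check could look roughly like this (a sketch only; the import path and `ConvSubsampling` constructor arguments are assumptions about NeMo's layout, not the PR's exact test code):

```python
import pytest
import torch

from nemo.collections.asr.parts.submodules.subsampling import ConvSubsampling

@pytest.mark.parametrize("subsampling", ["striding", "dw_striding"])
@pytest.mark.parametrize("factor", [4, 8])
@pytest.mark.parametrize("cache_size", [1, 4, 8, 9, 11, 16, 32])
def test_drop_size_matches_forward(subsampling, factor, cache_size):
    layer = ConvSubsampling(
        subsampling=subsampling,
        subsampling_factor=factor,
        feat_in=80,
        feat_out=176,
        conv_channels=176,
    )
    x = torch.randn(1, cache_size, 80)  # a cache_size-long feature input
    _, out_lengths = layer(x, torch.tensor([cache_size]))
    # The streaming drop count must equal what forward actually produces.
    assert out_lengths[0].item() == layer.get_streaming_drop_size(cache_size)
```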