[ASR] fix streaming multitalker asr timestamp computation #15701
Signed-off-by: thanhtvt <trantrongthanhhp@gmail.com>
|
@pzelasko @nithinraok Friendly ping. It's surprising this hasn't popped up as a community issue before, but fixing it made a huge difference for my dataset (WER and cpWER stay relatively the same, while tcpWER improves dramatically). Let me know what you think. Thanks! |
|
/ok to test abe0edb |
|
/ok to test 99d0726 |
|
/ok to test 2b075cc |
|
I'll see if we can fast-track this |
ipmedenn
left a comment
Thanks for the contribution!
LGTM!
|
Hi @pzelasko @nithinraok, just a friendly check. This was approved but hasn't been merged yet. Is there anything else needed? Happy to address any feedback. Thanks! |
|
/ok to test 2b075cc |
@chtruong814, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
|
/ok to test f7a10cb |
|
Any updates on this? I keep seeing the four checks failing, but I can't find any issues in my code. Is there something I missed? |
|
Looks like all relevant tests passed, I'll just merge. Thanks for your contribution! |
What does this PR do ?
Fix timestamp computation in streaming multitalker ASR for Parakeet model. The `_compute_hypothesis_timestamps` function had three compounding bugs that caused incorrect segment boundaries, merging utterances across long pauses and producing inflated hypothesis durations.
Collection: ASR
Changelog
- `_prev_token_counts` (in `ASRState`) to track per-speaker progress across streaming chunks, initialized/reset in `__init__`, `_reset_speaker_wise_sentences`, and `reset`.
- `_prev_decoded_lengths` (in `ASRState`) to store the decoder's accumulated frame count per speaker for recovering from silent gaps.
- `_compute_hypothesis_timestamps` to use `prev_token_count` (first new token) instead of `timestamp[0]` (first token ever) for start_time.
- `_compute_hypothesis_timestamps` to undo the decoder's `decoded_lengths` shift using `decoded_length_before` before applying `offset`, fixing double-counting.
- `update_sessionwise_seglsts_for_parallel` to pass `prev_token_count` and `decoded_length_before` to `_compute_hypothesis_timestamps` and update `_prev_decoded_lengths` after each chunk.
- `_compute_hypothesis_timestamps`.
Usage
Follow the official guide on how to run Multitalker Parakeet Streaming 0.6B:
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
I gently tag @nithinraok for this PR, per Contributor guidelines
Additional Information
Root cause: The decoder shifts timestamp indices by `prev_batched_state.decoded_lengths` at each streaming chunk (global frame indices). The original code was unaware of this shift and compounded three issues:
- The start time was taken from `timestamp[0]` (the first token emitted since audio began) instead of the first new token from the current chunk, identified by `prev_token_count`.
- `offset` (chunk start time) was added on top of already-shifted global timestamps, causing all timestamps to drift forward with each chunk.
- `decoded_lengths` accumulates only while a speaker is active in the batch. When a speaker falls silent for multiple chunks, their `decoded_lengths` freezes. Resuming speakers produced timestamps that did not account for elapsed silence, causing `start_time ≈ last_active_time + small_delta`, always within `sent_break_sec` of the previous segment, forcing all utterances into one merged segment.
Fix: Track `_prev_decoded_lengths[spk_idx]` to undo the decoder shift, recovering local frame indices.
Behavior
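As a rough illustration of the root cause described above, here is a toy sketch of the corrected arithmetic for a single speaker. It is not the NeMo implementation: `SpeakerState`, `new_segment_ms`, `advance`, the 80 ms frame duration, the 10-frame chunk size, and the "last frame + 1" end convention are all assumptions made for this example. Only the idea of tracking a previous token count and a previous decoded length comes from this PR.

```python
from dataclasses import dataclass

FRAME_MS = 80  # assumed frame duration in milliseconds (illustrative only)


@dataclass
class SpeakerState:
    """Hypothetical per-speaker bookkeeping between streaming chunks."""
    prev_token_count: int = 0     # tokens already reported in earlier chunks
    prev_decoded_length: int = 0  # decoder frame count before the current chunk


def new_segment_ms(timestamps, state, offset_ms):
    """Return (start_ms, end_ms) of the tokens added in this chunk, or None.

    `timestamps` holds every token emitted so far, as decoder-shifted frame indices.
    """
    new = timestamps[state.prev_token_count:]
    if not new:
        return None
    # Undo the decoder shift to recover chunk-local frame indices.
    local = [t - state.prev_decoded_length for t in new]
    start_ms = offset_ms + local[0] * FRAME_MS
    end_ms = offset_ms + (local[-1] + 1) * FRAME_MS  # assumed end convention
    return start_ms, end_ms


def advance(state, timestamps, frames_decoded):
    """Update the bookkeeping after a chunk in which the speaker was active."""
    state.prev_token_count = len(timestamps)
    state.prev_decoded_length += frames_decoded


state = SpeakerState()

# Chunk 0 (starts at 0 ms): tokens at local frames 2 and 5, no shift yet.
timestamps = [2, 5]
first = new_segment_ms(timestamps, state, offset_ms=0)
advance(state, timestamps, frames_decoded=10)

# Chunks 1-4: the speaker is silent, so the decoder's frame count stays at 10.

# Chunk 5 (starts at 4000 ms): tokens at local frames 1 and 4, shifted by 10.
timestamps = [2, 5, 11, 14]
second = new_segment_ms(timestamps, state, offset_ms=4000)
advance(state, timestamps, frames_decoded=10)

print(first)   # (160, 480)
print(second)  # (4080, 4400)
```

In this toy setup, adding the offset directly to the shifted index would give 4000 + 11 * 80 = 4880 ms, double-counting the 800 ms the decoder had already accumulated. Ignoring the offset would give 880 ms, only 400 ms after the first segment ended, which is the kind of small gap that would cause two utterances separated by a long pause to be merged.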
For reproducibility, I used the NVIDIA multi-talker ASR video demo on HuggingFace, extracted the `.wav` audio, and ran the processing script:
Before Fix (Incorrect Durations)
After Fix (Corrected Durations)