[ASR] fix streaming multitalker asr timestamp computation by thanhtvt · Pull Request #15701 · NVIDIA-NeMo/NeMo

thanhtvt · 2026-05-14T18:14:38Z

What does this PR do ?

Fixes timestamp computation in streaming multitalker ASR for the Parakeet model. The _compute_hypothesis_timestamps function had three compounding bugs that produced incorrect segment boundaries: utterances were merged across long pauses, and hypothesis durations were inflated.

Collection: ASR

Changelog

Added _prev_token_counts (in ASRState) to track per-speaker progress across streaming chunks, initialized/reset in __init__, _reset_speaker_wise_sentences, and reset.
Added _prev_decoded_lengths (in ASRState) to store the decoder's accumulated frame count per speaker for recovering from silent gaps.
Fixed _compute_hypothesis_timestamps to compute start_time from timestamp[prev_token_count] (first new token) instead of timestamp[0] (first token ever).
Fixed _compute_hypothesis_timestamps to undo the decoder's decoded_lengths shift using decoded_length_before before applying offset, fixing double-counting.
Updated update_sessionwise_seglsts_for_parallel to pass prev_token_count and decoded_length_before to _compute_hypothesis_timestamps and update _prev_decoded_lengths after each chunk.
Updated the docstring of _compute_hypothesis_timestamps.

Usage

Follow the official guide on how to run Multitalker Parakeet Streaming 0.6B:

python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
          asr_model="/path/to/your/multitalker-parakeet-streaming-0.6b-v1.nemo" \
          diar_model="/path/to/your/nvidia/diar_streaming_sortformer_4spk-v2.nemo" \
          att_context_size="[70,13]" \
          generate_realtime_scripts=False \
          audio_file="/path/to/example.wav" \
          output_path="/path/to/example_output.json"

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests? → No need to write new tests
Did you add or update any necessary documentation? → I updated the docstring of the modified method.
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc) → No
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

I gently tag @nithinraok for this PR, per Contributor guidelines

Additional Information

Root cause: At each streaming chunk, the decoder shifts timestamp indices by prev_batched_state.decoded_lengths, so the emitted timestamps are global frame indices. The original code was unaware of this shift, and three issues compounded:

Wrong token index: Used timestamp[0] (the first token emitted since audio began) instead of the first new token from the current chunk, identified by prev_token_count.
Offset double-counting: Added offset (chunk start time) on top of already-shifted global timestamps, causing all timestamps to drift forward with each chunk.
Silent-gap underestimation: The decoder's decoded_lengths accumulates only while a speaker is active in the batch, so it freezes when a speaker falls silent for multiple chunks. When the speaker resumed, the timestamps did not account for the elapsed silence: start_time ≈ last_active_time + small_delta, which always fell within sent_break_sec of the previous segment and forced all of that speaker's utterances into one merged segment.
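To make the first two issues concrete, here is a minimal sketch of how the pre-fix arithmetic could inflate a segment. It is a simplified reading of the description above, not the original NeMo code: the function name pre_fix_times, the 0.08 s frame length, and the example frame indices are all assumptions chosen for illustration.

```python
FRAME_LEN_SEC = 0.08  # assumed frame duration; not stated in the PR


def pre_fix_times(timestamps, offset_sec, frame_len_sec=FRAME_LEN_SEC):
    """Simplified reading of the pre-fix arithmetic (hypothetical helper).

    `timestamps` holds global frame indices for every token emitted so far.
    """
    # Issue 1: timestamp[0] is the first token ever, not the first new token.
    start_time = offset_sec + timestamps[0] * frame_len_sec
    # Issue 2: the chunk offset is added on top of an already-global index.
    end_time = offset_sec + (timestamps[-1] + 1) * frame_len_sec
    return start_time, end_time


# Hypothetical speaker: two old tokens (global frames 13 and 20) plus two new
# tokens at global frames 205 and 212, in a chunk that starts at 16.0 s.
timestamps = [13, 20, 205, 212]
start_time, end_time = pre_fix_times(timestamps, offset_sec=16.0)

# Where the last new token actually ends, measured from the start of the audio.
true_end_time = (timestamps[-1] + 1) * FRAME_LEN_SEC

print(f"pre-fix end: {end_time:.2f} s, true end: {true_end_time:.2f} s")
```

In this made-up case the reported end time lands roughly one chunk offset later than the true end of the last token, which is the kind of forward drift the second issue describes.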

Fix: Track _prev_decoded_lengths[spk_idx] to undo the decoder shift, recovering local frame indices.

decoded_length_before = _prev_decoded_lengths[spk_idx]
start_local = timestamp[prev_token_count] - decoded_length_before
end_local = timestamp[-1] - decoded_length_before
start_time = offset + start_local * frame_len_sec
end_time = offset + (end_local + 1) * frame_len_sec
_prev_decoded_lengths[spk_idx] = hypothesis.dec_state.decoded_length.item()
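The snippet above can be wrapped into a small standalone function for experimentation. This is a sketch under stated assumptions, not the merged implementation: the function name compute_segment_times, its signature, the empty-chunk check, and the example values (including the 0.08 s frame length) are hypothetical; only the four arithmetic lines come from the PR.

```python
from typing import Sequence, Tuple


def compute_segment_times(
    timestamps: Sequence[int],
    prev_token_count: int,
    decoded_length_before: int,
    offset_sec: float,
    frame_len_sec: float,
) -> Tuple[float, float]:
    """Return (start_time, end_time) for the tokens added in the current chunk.

    Hypothetical helper mirroring the arithmetic in the PR description.
    """
    if prev_token_count >= len(timestamps):
        # Assumption: the caller skips chunks that produced no new tokens.
        raise ValueError("no new tokens in this chunk")

    # Undo the decoder's shift to recover chunk-local frame indices.
    start_local = timestamps[prev_token_count] - decoded_length_before
    end_local = timestamps[-1] - decoded_length_before

    # Apply the chunk offset exactly once.
    start_time = offset_sec + start_local * frame_len_sec
    end_time = offset_sec + (end_local + 1) * frame_len_sec
    return start_time, end_time


# Hypothetical values: two old tokens, two new tokens, and 100 frames decoded
# for this speaker before the current chunk, which starts at 20.0 s.
start_time, end_time = compute_segment_times(
    timestamps=[2, 5, 103, 110],
    prev_token_count=2,
    decoded_length_before=100,
    offset_sec=20.0,
    frame_len_sec=0.08,
)
print(f"{start_time:.2f} -> {end_time:.2f}")
```

Because offset_sec carries the position of the chunk in the audio, a silent gap should still show up in start_time even if the speaker's decoded length did not advance during it. That is my reading of the fix, not something the PR states in these terms.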

Behavior

For reproducibility, I used the NVIDIA multi-talker ASR video demo on HuggingFace, extracted the .wav audio, and ran the processing script:

Before Fix (Incorrect Durations)

[
    {
        "speaker": "speaker_0",
        "start_time": 1.04,
        "end_time": 38.96,
        "words": "The NVIDIA multitalker ASR system separates and transcribes multiple voices automatically. No enrollment or voice registration is needed. It simply listens, figures out who speaking when, who generates an individual transcript for each person in real time",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_1",
        "start_time": 16.8,
        "end_time": 39.84,
        "words": "It is built to handle overlapping speech naturally. When people fight over each other, the model runs one strain per voice, so each speaker's words stay clear, accurate, and well organized.",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_2",
        "start_time": 29.36,
        "end_time": 65.84,
        "words": "The system also works live. It processes audio as it's captured, delivering captions almost instantly. You can even tune the settings to balance latency and accuracy depending on your application's needs",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_3",
        "start_time": 44.16,
        "end_time": 76.08,
        "words": "And it all builds on the state of the art single speaker ASR Foundation from NVIDIA. We start from a model that already captures human speech with high precision, then extend it to understand many voices at once without sacrificing clarity or performance",
        "session_id": "nvidia-multitalker-asr-demo"
    }
]

After Fix (Corrected Durations)

[
    {
        "speaker": "speaker_0",
        "start_time": 1.04,
        "end_time": 19.92,
        "words": "The NVIDIA multitalker ASR system separates and transcribes multiple voices automatically. No enrollment or voice registration is needed. It simply listens, figures out who speaking when, who generates an individual transcript for each person in real time",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_1",
        "start_time": 16.8,
        "end_time": 27.52,
        "words": "It is built to handle overlapping speech naturally. When people fight over each other, the model runs one strain per voice, so each speaker's words stay clear, accurate, and well organized.",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_2",
        "start_time": 29.2,
        "end_time": 45.68,
        "words": "The system also works live. It processes audio as it's captured, delivering captions almost instantly. You can even tune the settings to balance latency and accuracy depending on your application's needs",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_3",
        "start_time": 44.16,
        "end_time": 59.28,
        "words": "And it all builds on the state of the art single speaker ASR Foundation from NVIDIA. We start from a model that already captures human speech with high precision, then extend it to understand many voices at once without sacrificing clarity or performance",
        "session_id": "nvidia-multitalker-asr-demo"
    }
]
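The effect on segment length is easier to see as durations. The short script below recomputes them from the start_time and end_time values in the two outputs above; the variable and function names are mine, and the numbers are copied from the JSON.

```python
# (start_time, end_time) per speaker, copied from the outputs above.
before = {
    "speaker_0": (1.04, 38.96),
    "speaker_1": (16.8, 39.84),
    "speaker_2": (29.36, 65.84),
    "speaker_3": (44.16, 76.08),
}
after = {
    "speaker_0": (1.04, 19.92),
    "speaker_1": (16.8, 27.52),
    "speaker_2": (29.2, 45.68),
    "speaker_3": (44.16, 59.28),
}


def durations(segments):
    """Map each speaker to its segment duration in seconds, rounded to 2 places."""
    return {spk: round(end - start, 2) for spk, (start, end) in segments.items()}


dur_before = durations(before)
dur_after = durations(after)

for spk in before:
    print(f"{spk}: {dur_before[spk]:.2f} s before, {dur_after[spk]:.2f} s after")
```

Every segment is roughly half as long after the fix, while the start times are almost unchanged.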

Signed-off-by: thanhtvt <trantrongthanhhp@gmail.com>

copy-pr-bot · 2026-05-14T18:14:42Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

thanhtvt · 2026-05-22T07:11:52Z

@pzelasko @nithinraok Friendly ping. It's surprising this hasn't popped up as a community issue before, but fixing it made a huge difference for my dataset (WER and cpWER stay relatively the same, while tcpWER improves dramatically). Let me know what you think. Thanks!

pzelasko · 2026-05-29T19:02:28Z

/ok to test abe0edb

pzelasko · 2026-05-30T12:30:14Z

/ok to test 99d0726

thanhtvt · 2026-06-09T04:12:11Z

@pzelasko @tango4j Hi, I just want to check in to see if there are any updates on this PR?

pzelasko · 2026-06-09T16:23:20Z

/ok to test 2b075cc

pzelasko · 2026-06-09T16:23:30Z

I'll see if we can fast-track this

ipmedenn

Thanks for the contribution!
LGTM!

thanhtvt · 2026-06-16T03:36:35Z

Hi @pzelasko @nithinraok, just a friendly check. This was approved but hasn't been merged yet. Is there anything else needed? Happy to address any feedback. Thanks!

chtruong814 · 2026-06-16T16:08:39Z

/ok to test 2b075cc

copy-pr-bot · 2026-06-16T16:08:42Z

/ok to test 2b075cc

@chtruong814, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

chtruong814 · 2026-06-16T16:09:01Z

/ok to test f7a10cb

chtruong814 · 2026-06-16T16:09:41Z

/ok to test f7a10cb

thanhtvt · 2026-06-23T04:11:47Z

Any updates on this? I keep seeing the four checks failing, but I can't find any issues in my code. Is there something I missed?

pzelasko · 2026-06-23T12:49:19Z

Looks like all relevant tests passed, I'll just merge. Thanks for your contribution!

fix: streaming multitalker asr timestamp computation

ca84fd9

Signed-off-by: thanhtvt <trantrongthanhhp@gmail.com>

github-actions Bot added ASR community-request labels May 14, 2026

Merge branch 'main' into main

61e90ad

svcnvidia-nemo-ci added the waiting-on-maintainers label May 16, 2026

Merge branch 'main' into main

abe0edb

svcnvidia-nemo-ci added waiting-on-maintainers and removed waiting-on-maintainers labels May 22, 2026

pzelasko requested a review from tango4j May 29, 2026 19:02

svcnvidia-nemo-ci removed the waiting-on-maintainers label May 29, 2026

Merge branch 'main' into main

99d0726

svcnvidia-nemo-ci added the waiting-on-maintainers label Jun 1, 2026

svcnvidia-nemo-ci removed the waiting-on-maintainers label Jun 9, 2026

Merge branch 'main' into main

2b075cc

ipmedenn self-requested a review June 9, 2026 17:04

ipmedenn approved these changes Jun 9, 2026

pzelasko enabled auto-merge (squash) June 9, 2026 21:46

svcnvidia-nemo-ci added the waiting-on-maintainers label Jun 11, 2026

Merge branch 'main' into main

f7a10cb

Merge branch 'main' into main

873d00f

pzelasko disabled auto-merge June 23, 2026 12:49

pzelasko merged commit 8ef5664 into NVIDIA-NeMo:main Jun 23, 2026
34 checks passed

svcnvidia-nemo-ci removed the waiting-on-maintainers label Jun 23, 2026

5 participants