
Conversation

patelnav

What does this PR do?

Fixes two TDT beam search timestamp issues:

  1. Crash when using beam search with timestamps=True (NoneType iteration error)
  2. ~160ms offset in timestamps between beam search and greedy decoding

Collection: ASR

Changelog

  • Store token_durations in BatchedBeamHyps for TDT models during beam search
  • Mark timestamp semantics with _timestamp_semantics flag on Hypothesis objects
  • Update _compute_offsets_tdt to handle both START and END timestamp semantics
  • Add backward compatibility for computing durations from timestamp differences
  • Fix docstring: corrected "char" field documentation from List[str] to List[int] (pre-existing bug: y_sequence contains integer token IDs, not strings)

Problem

Issue 1: Crash with beam search + timestamps

When using TDT beam search with timestamps=True, the code crashes with:

TypeError: 'NoneType' object is not iterable
  at nemo/collections/asr/parts/submodules/rnnt_decoding.py:1153 in _compute_offsets_tdt

Root cause: Beam search (BatchedBeamHyps) doesn't populate the token_duration field that _compute_offsets_tdt requires.
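
To make the failure concrete, the crash reduces to iterating a field that beam search never fills in. The snippet below is a simplified illustration, not the actual NeMo code:

# Simplified illustration of the failure mode: beam-search hypotheses reach
# _compute_offsets_tdt with token_duration left as None, and iterating it fails.
token_duration = None  # never populated by BatchedBeamHyps before this fix

for duration in token_duration:  # TypeError: 'NoneType' object is not iterable
    pass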

Issue 2: ~160ms timestamp offset

After fixing the crash (by computing durations from timestamp diffs), beam search timestamps are still ~160ms late compared to greedy. This occurs because (see the toy arithmetic after this list):

  1. Beam search stores END timestamps: timestamp = timesteps + duration
  2. Greedy stores START timestamps: timestamp = timesteps
  3. _compute_offsets_tdt assumed all timestamps were START times
  4. Result: beam search offsets included leading blank frames (~160ms = 4 frames @ 40ms/frame)
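
As a toy illustration of that arithmetic (numbers chosen for clarity; the 40ms frame stride and 4-frame shift come from the description above):

# Toy example: a token starting at encoder frame 10 with a predicted
# duration of 4 frames, at 40 ms per frame.
frame_stride_ms = 40
start_frame, duration_frames = 10, 4

greedy_timestamp = start_frame                   # START semantics -> 10
beam_timestamp = start_frame + duration_frames   # END semantics   -> 14

# Reading the END timestamp as if it were a START time shifts the word later by
# duration_frames * frame_stride_ms = 4 * 40 ms = 160 ms.
offset_error_ms = (beam_timestamp - greedy_timestamp) * frame_stride_ms
print(offset_error_ms)  # 160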

Solution

Three-part approach (a minimal sketch of the offset logic follows this list):

  1. Store token durations in BatchedBeamHyps (the durations are already received during beam search; now they are also stored)
  2. Mark timestamp semantics with _timestamp_semantics attribute on Hypothesis objects
    • Beam search: "end" (timestamps are END times)
    • Greedy: "start" (timestamps are START times)
  3. Compute correct offsets in _compute_offsets_tdt based on semantics:
    • END semantics: start_offset = timestamp - duration, end_offset = timestamp
    • START semantics: start_offset = timestamp, end_offset = timestamp + duration
    • Fallback heuristic when flag missing (backward compatibility)
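
A minimal sketch of that logic, with placeholder names (the real code lives in _compute_offsets_tdt and differs in detail):

# Placeholder sketch, not the actual NeMo implementation.
def compute_offsets(timestamps, durations=None, semantics="start"):
    if durations is None:
        # Backward-compatible fallback: approximate durations from
        # consecutive timestamp differences when they were never stored.
        prev = [0] + list(timestamps[:-1])
        durations = [t - p for p, t in zip(prev, timestamps)]
    offsets = []
    for t, d in zip(timestamps, durations):
        if semantics == "end":
            # beam search: the timestamp marks the END of the token
            offsets.append({"start_offset": t - d, "end_offset": t})
        else:
            # greedy: the timestamp marks the START of the token
            offsets.append({"start_offset": t, "end_offset": t + d})
    return offsets

# Same token (frame 10 start, 4-frame duration) under both conventions:
print(compute_offsets([14], [4], semantics="end"))    # [{'start_offset': 10, 'end_offset': 14}]
print(compute_offsets([10], [4], semantics="start"))  # [{'start_offset': 10, 'end_offset': 14}]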

Usage

No API changes. The fix is transparent to users:

import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

# Load TDT model
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")

# Configure beam search (e.g., for GPU phrase boosting)
decoding_cfg = OmegaConf.to_container(model.cfg.decoding, resolve=True)
decoding_cfg["strategy"] = "beam"
decoding_cfg["beam"]["beam_size"] = 4
decoding_cfg["beam"]["return_best_hypothesis"] = True
model.change_decoding_strategy(OmegaConf.create(decoding_cfg))

# Timestamps now work correctly with beam search (no crash, no 160ms offset)
result = model.transcribe(["audio.wav"], timestamps=True)
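
Word-level timestamps can then be read from the returned hypothesis; the keys below follow NeMo's timestamp output and may vary slightly between versions:

# Inspect word-level timestamps on the best hypothesis.
for word in result[0].timestamp["word"]:
    print(word["word"], word["start"], word["end"])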

Impact

  • Enables GPU phrase boosting with timestamps (Issue 1: previously crashed)
  • Eliminates ~160ms timestamp offset (Issue 2: beam search word-level timestamps now align with greedy)
  • Zero performance penalty (durations computed during beam search anyway)
  • Backward compatible (computes durations from diffs if missing)

GitHub Actions CI

Ready for CI. Please add "Run CICD" label.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests? (Can add if requested by reviewers)
  • Did you add or update any necessary documentation? (Updated docstrings)
  • Does the PR affect components that are optional to install? No

PR Type:

  • Bugfix

Who can review?

@andrusenkoau
Per contributor guidelines, requesting review from the ASR team:

* store token durations in BatchedBeamHyps for TDT models
* mark timestamp semantics with _timestamp_semantics flag
* update _compute_offsets_tdt to handle END timestamp semantics
* fix docstring: correct char field from List[str] to List[int]

Fixes beam search timestamp offset issue where timestamps were ~160ms
late compared to greedy decoding. Root cause: beam search stores END
timestamps (timesteps + duration) but offset computation expected
START timestamps. Solution stores actual token durations and correctly
interprets timestamps based on explicit semantics flag.

Signed-off-by: Nav Patel <[email protected]>
@github-actions github-actions bot added the ASR label Oct 10, 2025
@patelnav
Author

@andrusenkoau this might be of particular interest to you.
I tried the GPU-PB beam-search with timestamps turned on and it immediately crashed.

@artbataev
Collaborator

@patelnav Thank you for the PR!
We’re in the middle of broader work on refactoring and extending transducer beam search capabilities that overlaps with this change, so we’re unsure about merging it as-is. Let’s keep it open (or convert it to a Draft) while that work lands, and align on your use cases so we can either rebase/adapt this PR or integrate the ideas directly.
Appreciate your contribution!

@patelnav
Author

Sounds good. Best of luck with the refactor. Feel free to do as you wish with the PR 👍🏽
