Parakeet TDT decodes empty when short utterance has trailing silence treated as valid audio #15757

@andimarafioti

Describe the bug

nvidia/parakeet-tdt-0.6b-v3 can decode a short speech segment correctly, but the same segment with a short trailing silence tail appended can decode to an empty transcript.

This matters for live/VAD pipelines: final ASR buffers often include a few hundred milliseconds of low-confidence/silence audio used to confirm end-of-speech. In the repro below, the base 2.2s speech crop decodes to text, while the same crop plus 400ms of zeros decodes to ''.

The preprocessor comparison suggests the tail is being treated as valid audio and changes the normalized log-mel features for the already-spoken prefix. Passing the appended buffer with a shorter valid length, so the tail is treated as padding, keeps the prefix features effectively unchanged.
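
To illustrate why a valid-length tail can change features for the speech prefix, here is a toy sketch. It assumes the preprocessor applies per-feature mean/variance normalization computed over the valid frames only; that is an assumption about the model configuration, not something confirmed in this report. The helper `normalize_per_feature` and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def normalize_per_feature(feats, valid_frames):
    """Normalize each feature row with statistics from the first valid_frames frames.

    Hypothetical stand-in for a per-feature normalizer; not the NeMo implementation.
    """
    valid = feats[:, :valid_frames]
    mean = valid.mean(axis=1, keepdims=True)
    std = valid.std(axis=1, keepdims=True) + 1e-5
    out = (feats - mean) / std
    out[:, valid_frames:] = 0.0  # frames past the valid length are treated as padding
    return out

rng = np.random.default_rng(0)
n_mels, speech_frames, tail_frames = 4, 220, 40

# Toy "log-mel" values: speech frames around -5, silence at a much lower floor.
speech = rng.normal(loc=-5.0, scale=2.0, size=(n_mels, speech_frames))
silence = np.full((n_mels, tail_frames), -16.0)
appended = np.concatenate([speech, silence], axis=1)

base_norm = normalize_per_feature(speech, speech_frames)
tail_valid = normalize_per_feature(appended, speech_frames + tail_frames)
tail_padded = normalize_per_feature(appended, speech_frames)

# Compare only the speech prefix, as the repro script does.
diff_valid = float(np.abs(base_norm - tail_valid[:, :speech_frames]).max())
diff_padded = float(np.abs(base_norm - tail_padded[:, :speech_frames]).max())
print("tail as valid audio:", diff_valid)
print("tail as padding:", diff_padded)
```

In this toy setup the silence frames pull the per-feature mean down and inflate the variance, so every speech frame is rescaled when the tail counts as valid. When the tail is excluded from the statistics, the prefix is unchanged.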

Steps/Code to reproduce bug

Environment used:

  • nemo-toolkit==2.7.3
  • torch==2.10.0+cpu
  • CPU inference, so this does not appear CUDA-specific
  • Model: nvidia/parakeet-tdt-0.6b-v3

from pathlib import Path
from urllib.request import urlretrieve

import numpy as np
import soundfile as sf
import torch
from nemo.collections.asr.models import ASRModel

SAMPLE_RATE = 16000
url = "https://raw.githubusercontent.com/huggingface/speech-to-speech/main/src/speech_to_speech/TTS/ref_audio.wav"
source_path = Path("ref_audio.wav")
urlretrieve(url, source_path)

audio, sr = sf.read(source_path)
if audio.ndim > 1:
    audio = audio.mean(axis=1)
if sr != SAMPLE_RATE:
    from scipy import signal
    audio = signal.resample(audio, int(round(len(audio) * SAMPLE_RATE / sr)))
audio = np.ascontiguousarray(audio.astype(np.float32))

base = audio[int(1.0 * SAMPLE_RATE) : int(3.2 * SAMPLE_RATE)]
appended = np.concatenate([base, np.zeros(int(0.4 * SAMPLE_RATE), dtype=np.float32)])

sf.write("base.wav", base, SAMPLE_RATE)
sf.write("appended.wav", appended, SAMPLE_RATE)

model = ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v3")
model.eval()

base_text = model.transcribe(["base.wav"], batch_size=1)[0].text
appended_text = model.transcribe(["appended.wav"], batch_size=1)[0].text
print("base", len(base_text), repr(base_text))
print("appended", len(appended_text), repr(appended_text))

# Direct preprocessor check: compare the already-spoken prefix with the tail
# treated as valid audio vs. treated as padding by passing a shorter length.
def features(samples, valid_samples=None):
    device = next(model.parameters()).device
    signal_t = torch.from_numpy(samples).unsqueeze(0).to(device)
    length = torch.tensor([len(samples) if valid_samples is None else valid_samples], device=device)
    with torch.inference_mode():
        feats, feat_lens = model.preprocessor(input_signal=signal_t, length=length)
    return feats.cpu().float(), int(feat_lens[0].cpu())

base_feats, base_len = features(base)
appended_feats, appended_len = features(appended)
masked_feats, masked_len = features(appended, valid_samples=len(base))

def prefix_diff(a, b, frames):
    diff = (a[:, :, :frames] - b[:, :, :frames]).abs()
    return float(diff.max()), float(diff.mean())

print("feature lengths", {"base": base_len, "appended_valid_tail": appended_len, "appended_tail_as_padding": masked_len})
print("valid tail prefix diff", prefix_diff(base_feats, appended_feats, min(base_len, appended_len)))
print("padded tail prefix diff", prefix_diff(base_feats, masked_feats, min(base_len, masked_len)))

Actual output:

base 60 'Some people have super short timelines, yet at the same time'
appended 0 ''
feature lengths {'base': 220, 'appended_valid_tail': 260, 'appended_tail_as_padding': 220}
valid tail prefix diff (1.7800655364990234, 0.28990039229393005)
padded tail prefix diff (8.344650268554688e-07, 6.597878865477469e-08)
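
As a sanity check on the reported feature lengths, the arithmetic below assumes one feature frame per 160 samples (a 10 ms hop at 16 kHz). The hop size is an assumption that happens to match the lengths printed above; it is not stated anywhere in this report.

```python
SAMPLE_RATE = 16000
HOP_SAMPLES = 160  # assumed 10 ms hop; consistent with the reported lengths

base_samples = int(3.2 * SAMPLE_RATE) - int(1.0 * SAMPLE_RATE)  # 2.2 s speech crop
tail_samples = int(0.4 * SAMPLE_RATE)                           # 400 ms of zeros

base_frames = base_samples // HOP_SAMPLES
appended_frames = (base_samples + tail_samples) // HOP_SAMPLES
tail_share = round(100 * (appended_frames - base_frames) / appended_frames, 1)

print("base frames:", base_frames)
print("appended frames:", appended_frames)
print("tail share of valid frames (%):", tail_share)
```

Under that assumption the silence tail accounts for roughly 15% of the frames that feed the normalization statistics when it is treated as valid audio.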

Expected behavior

A short trailing silence tail should not cause the whole utterance to decode as empty. At minimum, there should be a recommended inference path for final VAD segments where trailing end-of-speech confirmation audio can be masked or trimmed so it does not change the normalized representation of the speech prefix.

Additional context

This was found while debugging a speech-to-speech pipeline using progressive/live transcription. The progressive buffer decoded correctly while the user was speaking, but the final VAD buffer sometimes included several hundred milliseconds of silence and decoded to an empty final transcript. The failure was reproducible in NeMo with CPU inference using the script above.

A practical application workaround is to trim/retry the final buffer when a Parakeet TDT decode returns empty, or to pass only the active speech length if the inference stack supports masking the VAD tail as padding.
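
A minimal sketch of the trim/retry workaround follows. It is application-side code, not a NeMo API: `trim_trailing_silence`, `transcribe_with_retry`, the RMS threshold, and the frame and margin sizes are all hypothetical choices, and `transcribe_fn` stands for whatever callable wraps the model in a given pipeline. The `fake_transcribe` stub only mimics the failure described in this issue so the sketch can run without the model.

```python
import numpy as np

def trim_trailing_silence(samples, sample_rate, frame_ms=20, threshold=1e-3, keep_ms=50):
    """Drop trailing frames whose RMS is below threshold, keeping keep_ms of margin."""
    frame = max(1, int(sample_rate * frame_ms / 1000))
    end = len(samples)
    while end > 0:
        start = max(0, end - frame)
        chunk = samples[start:end].astype(np.float64)
        if float(np.sqrt(np.mean(chunk ** 2))) >= threshold:
            break
        end = start
    margin = int(sample_rate * keep_ms / 1000)
    return samples[: min(len(samples), end + margin)]

def transcribe_with_retry(transcribe_fn, samples, sample_rate):
    """Retry once on a trimmed buffer if the first decode comes back empty."""
    text = transcribe_fn(samples)
    if text.strip():
        return text
    trimmed = trim_trailing_silence(samples, sample_rate)
    if len(trimmed) == len(samples):
        return text  # nothing to trim, so a retry would see the same input
    return transcribe_fn(trimmed)

def fake_transcribe(samples):
    # Hypothetical stub: returns '' when the buffer ends in 200 ms of exact zeros.
    tail = samples[-3200:]
    return "" if len(tail) == 3200 and not np.any(tail) else "hello"

sr = 16000
t = np.arange(int(2.2 * sr)) / sr
speech = (0.1 * np.sin(2 * np.pi * 220.0 * t)).astype(np.float32)
appended = np.concatenate([speech, np.zeros(int(0.4 * sr), dtype=np.float32)])

first_try = fake_transcribe(appended)
final_text = transcribe_with_retry(fake_transcribe, appended, sr)
print(repr(first_try), repr(final_text))
```

The retry only fires on an empty result, so the common path costs a single decode. Real VAD tails are rarely exact zeros, so the threshold would need tuning against the pipeline's noise floor.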
