Parakeet TDT decodes empty when short utterance has trailing silence treated as valid audio

## Describe the bug

`nvidia/parakeet-tdt-0.6b-v3` can decode a short speech segment correctly, but the same segment with a short trailing silence tail appended can decode to an empty transcript.

This matters for live/VAD pipelines: final ASR buffers often include a few hundred milliseconds of silence or low-confidence audio used to confirm end-of-speech. In the repro below, the base 2.2 s speech crop decodes to text, while the same crop plus 400 ms of zeros decodes to `''`.

The preprocessor comparison suggests the tail is being treated as valid audio, which changes the normalized log-mel features of the already-spoken prefix: over the first 220 frames, the features differ from the base crop by up to 1.78 (mean 0.29). Passing the appended buffer with a shorter valid length, so that the tail is treated as padding, keeps the prefix features effectively unchanged (max difference below 1e-6).
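
To illustrate the suspected mechanism, here is a toy sketch. It is not NeMo's code: it assumes each mel bin is normalized by the mean and standard deviation of the frames counted as valid, and the `normalize_valid` helper and the numbers are hypothetical. Under that assumption, counting quiet tail frames as valid shifts the statistics and therefore the prefix values, while excluding them via a shorter valid length leaves the prefix untouched.

```python
import statistics

def normalize_valid(frames, valid_len):
    """Normalize one feature bin using statistics from the first valid_len frames only.

    Hypothetical stand-in for a per-feature normalizer; not the NeMo implementation.
    """
    valid = frames[:valid_len]
    mean = statistics.fmean(valid)
    std = statistics.pstdev(valid) + 1e-5
    # Frames past valid_len are treated as padding: zeroed and excluded from statistics.
    return [(x - mean) / std if i < valid_len else 0.0 for i, x in enumerate(frames)]

def max_prefix_diff(a, b, n):
    """Largest absolute difference over the first n frames."""
    return max(abs(x - y) for x, y in zip(a[:n], b[:n]))

# Toy log-mel values for a single bin: four speech frames, then two very quiet tail frames.
speech = [-2.0, -1.0, -3.0, -2.0]
tail = [-16.0, -16.0]

base = normalize_valid(speech, len(speech))
tail_valid = normalize_valid(speech + tail, len(speech) + len(tail))
tail_padded = normalize_valid(speech + tail, len(speech))

print("tail as valid  :", max_prefix_diff(base, tail_valid, len(speech)))
print("tail as padding:", max_prefix_diff(base, tail_padded, len(speech)))
```

In this toy case the prefix difference is large when the tail counts as valid and exactly zero when it is treated as padding, which is the same pattern as the preprocessor comparison in the repro.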

## Steps/Code to reproduce bug

Environment used:

- `nemo-toolkit==2.7.3`
- `torch==2.10.0+cpu`
- CPU inference, so this does not appear CUDA-specific
- Model: `nvidia/parakeet-tdt-0.6b-v3`

```python
from pathlib import Path
from urllib.request import urlretrieve

import numpy as np
import soundfile as sf
import torch
from nemo.collections.asr.models import ASRModel

SAMPLE_RATE = 16000
url = "https://raw.githubusercontent.com/huggingface/speech-to-speech/main/src/speech_to_speech/TTS/ref_audio.wav"
source_path = Path("ref_audio.wav")
urlretrieve(url, source_path)

audio, sr = sf.read(source_path)
if audio.ndim > 1:
    audio = audio.mean(axis=1)
if sr != SAMPLE_RATE:
    from scipy import signal
    audio = signal.resample(audio, int(round(len(audio) * SAMPLE_RATE / sr)))
audio = np.ascontiguousarray(audio.astype(np.float32))

base = audio[int(1.0 * SAMPLE_RATE) : int(3.2 * SAMPLE_RATE)]
appended = np.concatenate([base, np.zeros(int(0.4 * SAMPLE_RATE), dtype=np.float32)])

sf.write("base.wav", base, SAMPLE_RATE)
sf.write("appended.wav", appended, SAMPLE_RATE)

model = ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v3")
model.eval()

base_text = model.transcribe(["base.wav"], batch_size=1)[0].text
appended_text = model.transcribe(["appended.wav"], batch_size=1)[0].text
print("base", len(base_text), repr(base_text))
print("appended", len(appended_text), repr(appended_text))

# Direct preprocessor check: compare the already-spoken prefix with the tail
# treated as valid audio vs. treated as padding by passing a shorter length.
def features(samples, valid_samples=None):
    device = next(model.parameters()).device
    signal_t = torch.from_numpy(samples).unsqueeze(0).to(device)
    length = torch.tensor([len(samples) if valid_samples is None else valid_samples], device=device)
    with torch.inference_mode():
        feats, feat_lens = model.preprocessor(input_signal=signal_t, length=length)
    return feats.cpu().float(), int(feat_lens[0].cpu())

base_feats, base_len = features(base)
appended_feats, appended_len = features(appended)
masked_feats, masked_len = features(appended, valid_samples=len(base))

def prefix_diff(a, b, frames):
    diff = (a[:, :, :frames] - b[:, :, :frames]).abs()
    return float(diff.max()), float(diff.mean())

print("feature lengths", {"base": base_len, "appended_valid_tail": appended_len, "appended_tail_as_padding": masked_len})
print("valid tail prefix diff", prefix_diff(base_feats, appended_feats, min(base_len, appended_len)))
print("padded tail prefix diff", prefix_diff(base_feats, masked_feats, min(base_len, masked_len)))
```

Actual output:

```text
base 60 'Some people have super short timelines, yet at the same time'
appended 0 ''
feature lengths {'base': 220, 'appended_valid_tail': 260, 'appended_tail_as_padding': 220}
valid tail prefix diff (1.7800655364990234, 0.28990039229393005)
padded tail prefix diff (8.344650268554688e-07, 6.597878865477469e-08)
```

## Expected behavior

A short trailing silence tail should not cause the whole utterance to decode as empty. At minimum, there should be a recommended inference path for final VAD segments where trailing end-of-speech confirmation audio can be masked or trimmed so it does not change the normalized representation of the speech prefix.

## Additional context

This was found while debugging a speech-to-speech pipeline using progressive/live transcription. The progressive buffer decoded correctly while the user was speaking, but the final VAD buffer sometimes included several hundred milliseconds of silence and decoded to an empty final transcript. The failure was reproducible in NeMo with CPU inference using the script above.

A practical application-level workaround is to trim the final buffer and retry when a Parakeet TDT decode returns empty, or to pass only the active speech length if the inference stack supports masking the VAD tail as padding.
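
The trim-and-retry workaround could be sketched roughly as follows. This is only an application-side illustration: `transcribe_with_tail_trim`, its `transcribe` callable, and the trim steps are hypothetical names and values, not part of NeMo. The callable is assumed to wrap whatever decode call the application already uses and to return plain text.

```python
from typing import Callable, Sequence

def transcribe_with_tail_trim(
    transcribe: Callable[[Sequence[float]], str],
    samples: Sequence[float],
    sample_rate: int = 16000,
    trim_steps_ms: Sequence[int] = (200, 400, 600),
) -> str:
    """Decode once, then retry with more of the tail removed while the result is empty."""
    text = transcribe(samples)
    for trim_ms in trim_steps_ms:
        if text.strip():
            break  # got a non-empty transcript, stop retrying
        drop = int(sample_rate * trim_ms / 1000)
        if drop >= len(samples):
            break  # nothing left to trim
        # Each step trims from the original buffer, so trim_ms is the total tail removed.
        text = transcribe(samples[: len(samples) - drop])
    return text

def fake_transcribe(buf):
    # Stand-in decoder for the demo: returns empty for buffers longer than 2.2 s at 16 kHz.
    return "" if len(buf) > 35200 else "hello"

buffer = [0.0] * 41600  # 2.6 s of samples: 2.2 s of "speech" plus a 400 ms tail
print(repr(transcribe_with_tail_trim(fake_transcribe, buffer)))
```

The retry only runs when the first decode is empty, so the common case costs a single decode. The trim steps would need tuning to the VAD's end-of-speech hangover.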

