Move standalone ASR audio preprocessing module #15692

Open
yqwangustc wants to merge 1 commit into NVIDIA-NeMo:main from yqwangustc:move-standalone-audio-preprocessing
@yqwangustc yqwangustc commented May 12, 2026

Implement a standalone module for LogMel feature extraction.

The standalone module keeps NeMo-compatible LogMel feature extraction available while depending only on the Python standard library and PyTorch. This is needed for consumers that want the NeMo LogMel behavior without pulling in the full NeMo ASR stack and its dependencies, such as Lightning, Hydra, librosa, and NeMo neural type imports.
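For readers unfamiliar with the technique, here is a minimal sketch of LogMel extraction using only the standard library and PyTorch. It is not the implementation in this PR: the function names, the HTK-style mel scale, and the defaults (16 kHz, 512-point FFT, 25 ms window, 10 ms hop, 80 mel bins, `2**-24` log guard) are all illustrative assumptions, and it makes no claim of numerical parity with NeMo.

```python
import math

import torch


def hz_to_mel(freq_hz: float) -> float:
    # HTK-style mel scale, chosen for brevity (an assumption, not NeMo's scale).
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_filterbank(sample_rate: int, n_fft: int, n_mels: int) -> torch.Tensor:
    """Build triangular mel filters of shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    fft_freqs = torch.linspace(0.0, sample_rate / 2, n_bins)
    # n_mels + 2 points, evenly spaced on the mel axis, then mapped back to Hz.
    mel_points = torch.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = 700.0 * (torch.pow(10.0, mel_points / 2595.0) - 1.0)
    lower = hz_points[:-2].unsqueeze(1)
    center = hz_points[1:-1].unsqueeze(1)
    upper = hz_points[2:].unsqueeze(1)
    rising = (fft_freqs - lower) / (center - lower)
    falling = (upper - fft_freqs) / (upper - center)
    # Each filter is the positive part of the smaller of the two slopes.
    return torch.clamp(torch.minimum(rising, falling), min=0.0)


def log_mel(
    waveform: torch.Tensor,  # shape (batch, samples)
    sample_rate: int = 16000,
    n_fft: int = 512,
    win_length: int = 400,
    hop_length: int = 160,
    n_mels: int = 80,
    log_guard: float = 2.0**-24,
) -> torch.Tensor:
    """Return log-mel features of shape (batch, n_mels, frames)."""
    window = torch.hann_window(win_length, periodic=False, device=waveform.device)
    spec = torch.stft(
        waveform,
        n_fft=n_fft,
        hop_length=hop_length,
        win_length=win_length,
        window=window,
        center=True,
        return_complex=True,
    )
    power = spec.abs() ** 2  # (batch, n_bins, frames)
    filters = mel_filterbank(sample_rate, n_fft, n_mels).to(power.device)
    mel = torch.matmul(filters, power)  # broadcasts over the batch dimension
    return torch.log(mel + log_guard)  # the guard avoids log(0) on silent frames
```

Calling `log_mel(torch.randn(2, 16000))` yields a tensor with 80 mel bins per frame. Features such as dither, pre-emphasis, and per-feature normalization are omitted from this sketch.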

Update the standalone tests to import from the new module path and keep coverage against nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor for output parity, filter-bank parity, dtype conversion, training dither, narrowband augmentation, CUDA execution, and longer input cases.

Benchmark results on NVIDIA H100 80GB HBM3:

  • 1-second 16 kHz inputs, batch 64, 10 warmup, 50 measured iterations: targeted run NeMo 0.514 ms/iter, standalone 0.400 ms/iter, 1.28x speedup; full-file run NeMo 0.475 ms/iter, standalone 0.370 ms/iter, 1.28x speedup.

  • 10-minute 16 kHz inputs, 1 warmup, 3 measured iterations: batch 1 NeMo 1.138 ms/iter, standalone 1.067 ms/iter, 1.07x speedup, 0.41 GiB peak for both.

  • 10-minute 16 kHz inputs, batch 4: NeMo 4.072 ms/iter, standalone 3.869 ms/iter, 1.05x speedup, 1.56 GiB peak for both.

  • 10-minute 16 kHz inputs, batch 8: NeMo 7.840 ms/iter, standalone 7.501 ms/iter, 1.05x speedup, 3.08 GiB peak for both.

  • 10-minute 16 kHz inputs, batch 16: NeMo 15.453 ms/iter, standalone 14.780 ms/iter, 1.05x speedup, 6.13 GiB peak for both.
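The warmup and measured-iteration procedure behind numbers like these can be sketched as follows. This is a hypothetical harness, not the script used for this PR; the names `time_per_iter_ms` and `speedup` and the optional `sync` hook are assumptions made for illustration.

```python
import time
from typing import Callable, Optional


def time_per_iter_ms(
    fn: Callable[[], object],
    warmup: int = 10,
    iters: int = 50,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Average wall-clock milliseconds per call of fn over `iters` runs."""
    # Warmup runs are excluded so one-time costs do not skew the average.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # drain pending asynchronous work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for the measured work to finish before stopping the clock
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / iters


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio of baseline time to candidate time; above 1.0 means faster."""
    return baseline_ms / candidate_ms
```

On a GPU, `torch.cuda.synchronize` would be a natural choice for `sync`, since CUDA kernels launch asynchronously and timing without synchronization measures launch overhead only.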

What does this PR do?

Implement a standalone module for LogMel feature extraction.

Collection: ASR

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc) -- No
    • Reviewer: Does the PR have correct import guards for all optional libraries? -- Yes

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Move the standalone PyTorch LogMel implementation from the temporary nemo/asr/audio_preprocessing.py location to nemo/collections/asr/modules/audio_preprocessing_standalone.py, next to the NeMo ASR audio_preprocessing module it mirrors.

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the ASR label May 12, 2026
Collaborator

@pzelasko pzelasko left a comment

Quickly skimmed for now, will review more thoroughly later.

Is this a copy-paste of AudioToMelSpectrogramPreprocessor into a separate file with fewer deps?

If yes, let's make this into a proper refactor so that this class is available in a dedicated file (drop "standalone" from name) and imported in its original sources.

@nithinraok does that sound good to you?
