Move standalone ASR audio preprocessing module #15692
Open
yqwangustc wants to merge 1 commit into
Move the standalone PyTorch LogMel implementation from the temporary nemo/asr/audio_preprocessing.py location to nemo/collections/asr/modules/audio_preprocessing_standalone.py, next to the NeMo ASR audio_preprocessing module it mirrors.

The standalone module keeps NeMo-compatible LogMel feature extraction available with only stdlib and PyTorch dependencies inside the implementation. This is needed for consumers that want the NeMo LogMel behavior without depending on the full NeMo ASR stack, such as Lightning, Hydra, librosa, and NeMo neural type imports.

Update the standalone tests to import the new modules path and keep coverage against nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor for output parity, filter-bank parity, dtype conversion, training dither, narrowband augmentation, CUDA execution, and longer input cases.

Benchmark results on NVIDIA H100 80GB HBM3:

- 1-second 16 kHz inputs, batch 64, 10 warmup, 50 measured iterations: targeted run NeMo 0.514 ms/iter, standalone 0.400 ms/iter, 1.28x speedup; full-file run NeMo 0.475 ms/iter, standalone 0.370 ms/iter, 1.28x speedup.
- 10-minute 16 kHz inputs, 1 warmup, 3 measured iterations: batch 1 NeMo 1.138 ms/iter, standalone 1.067 ms/iter, 1.07x speedup, 0.41 GiB peak for both.
- 10-minute 16 kHz inputs, batch 4: NeMo 4.072 ms/iter, standalone 3.869 ms/iter, 1.05x speedup, 1.56 GiB peak for both.
- 10-minute 16 kHz inputs, batch 8: NeMo 7.840 ms/iter, standalone 7.501 ms/iter, 1.05x speedup, 3.08 GiB peak for both.
- 10-minute 16 kHz inputs, batch 16: NeMo 15.453 ms/iter, standalone 14.780 ms/iter, 1.05x speedup, 6.13 GiB peak for both.
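For readers who want to reproduce numbers of this kind, the warmup / measured-iterations protocol can be sketched roughly as below. This is not the benchmark script used for this PR: `time_per_iter_ms`, `speedup`, and the `sync` hook are hypothetical names chosen for illustration. The sketch uses only the standard library; on a GPU one would presumably pass something like `torch.cuda.synchronize` as `sync` so that asynchronously queued kernels are included in the timing.

```python
import time
from typing import Callable, Optional


def time_per_iter_ms(
    fn: Callable[[], object],
    warmup: int = 10,
    iters: int = 50,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the mean wall-clock time of fn() in milliseconds per iteration.

    Hypothetical helper: warmup calls run first and are not timed.
    """
    if iters <= 0:
        raise ValueError("iters must be positive")

    # Warmup: lets caches, allocators, and lazy initialisation settle.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # drain pending asynchronous work before starting the clock

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # make sure all timed work has actually finished
    elapsed_s = time.perf_counter() - start

    return elapsed_s * 1000.0 / iters


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio of baseline to candidate time; values above 1.0 mean faster."""
    return baseline_ms / candidate_ms
```

Applied to the batch 1 figures above, `speedup(1.138, 1.067)` is roughly 1.07, which agrees with the reported value.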
pzelasko
reviewed
May 12, 2026
Collaborator
Quickly skimmed for now, will review more thoroughly later.
Is this a copy-paste of AudioToMelSpectrogramPreprocessor into a separate file with fewer deps?
If yes, let's make this into a proper refactor so that this class lives in a dedicated file (drop "standalone" from the name) and is imported from there in its original sources.
@nithinraok does that sound good to you?
What does this PR do?
Implement a standalone module for LogMel feature extraction.
Collection: ASR