Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Silero VAD support #888

Merged
merged 3 commits into from
Jan 11, 2025
Merged

Silero VAD support #888

merged 3 commits into from
Jan 11, 2025

Conversation

3manifold
Copy link
Contributor

@3manifold 3manifold commented Sep 26, 2024

Description

Implementation includes:

  • Extension of WhisperX to accept multiple VAD alternatives that do not have to necessarily emerge from pyannote-audio toolkit.
  • Silero VAD as an alternative VAD option.
  • Fix in whisperx\__init__.py imports.

The implementation aims to respect the current structure as well as keep the existing functionality intact. It is worth mentioning that the manually-assigned vad_model still works as expected (see load_model for details).

See relevant issue for further details. resolves #889

Tests

  • pyannote and silero cases both tested on CPU & GPU setups without an issue (current silero vad implementation utilizes only CPU)
  • Also tested using manually assigned vad_model (manually assigned vad_model has higher priority than vad_method, see load_model function for details)
  • Test were conducted using .wav files of various lengths (30s, 15min, 1hr)

Example command line (applies also for --vad_method pyannote):

  • GPU: python3 -m whisperx.transcribe audio.wav --language en --device cuda --diarize --hf_token xxx --vad_method silero
  • CPU: python3 -m whisperx.transcribe audio.wav --language en --device cpu --diarize --hf_token xxx --compute_type int8 --vad_method silero

Example Python script usage:

import whisperx
import gc

device = "cpu"
audio_file = "audio.wav"
batch_size = 16 # reduce if low on GPU mem
compute_type = "int8" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("small", device, vad_method="silero", compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model_a

# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token="xxx", device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

output:

click to expand
python3 whisperx/example.py 
torchvision is not available - cannot save figures
No language specified, language will be first be detected for each audio file (increases inference time).
>>Performing voice activity detection using Silero...
Using cache found in /home/xxx/.cache/torch/hub/snakers4_silero-vad_master
Detected language: en (0.99) in first 30s of audio...
[{'text': ' Birch canoes slid on the smooth planks. Glued the sheet to the dark blue background. It is easy to tell the depth of a well. These days a chicken leg is a rare dish. Rice is often served in round bowls. The juice of lemons makes fine punch. The box was thrown beside the parked truck. The hogs were fed chopped corn and garbage. Four hours of study work faced us.', 'start': 0.674, 'end': 28.83}, {'text': ' A large size in stockings is hard to sell.', 'start': 30.05, 'end': 32.254}]
[{'start': 0.694, 'end': 2.995, 'text': ' Birch canoes slid on the smooth planks.', 'words': [{'word': 'Birch', 'start': 0.694, 'end': 1.034, 'score': 0.854}, {'word': 'canoes', 'start': 1.114, 'end': 1.555, 'score': 0.763}, {'word': 'slid', 'start': 1.595, 'end': 1.915, 'score': 0.881}, {'word': 'on', 'start': 2.015, 'end': 2.095, 'score': 0.909}, {'word': 'the', 'start': 2.115, 'end': 2.195, 'score': 0.789}, {'word': 'smooth', 'start': 2.255, 'end': 2.615, 'score': 0.828}, {'word': 'planks.', 'start': 2.695, 'end': 2.995, 'score': 0.861}]}, {'start': 4.296, 'end': 6.357, 'text': 'Glued the sheet to the dark blue background.', 'words': [{'word': 'Glued', 'start': 4.296, 'end': 4.616, 'score': 0.474}, {'word': 'the', 'start': 4.676, 'end': 4.756, 'score': 0.968}, {'word': 'sheet', 'start': 4.796, 'end': 5.016, 'score': 0.933}, {'word': 'to', 'start': 5.056, 'end': 5.157, 'score': 0.776}, {'word': 'the', 'start': 5.177, 'end': 5.237, 'score': 0.952}, {'word': 'dark', 'start': 5.277, 'end': 5.517, 'score': 0.99}, {'word': 'blue', 'start': 5.577, 'end': 5.777, 'score': 0.844}, {'word': 'background.', 'start': 5.837, 'end': 6.357, 'score': 0.93}]}, {'start': 7.838, 'end': 9.659, 'text': 'It is easy to tell the depth of a well.', 'words': [{'word': 'It', 'start': 7.838, 'end': 7.918, 'score': 0.932}, {'word': 'is', 'start': 7.978, 'end': 8.058, 'score': 0.724}, {'word': 'easy', 'start': 8.118, 'end': 8.318, 'score': 0.958}, {'word': 'to', 'start': 8.358, 'end': 8.438, 'score': 0.88}, {'word': 'tell', 'start': 8.498, 'end': 8.699, 'score': 0.712}, {'word': 'the', 'start': 8.739, 'end': 8.819, 'score': 0.828}, {'word': 'depth', 'start': 8.859, 'end': 9.119, 'score': 0.859}, {'word': 'of', 'start': 9.179, 'end': 9.279, 'score': 0.796}, {'word': 'a', 'start': 9.319, 'end': 9.339, 'score': 0.767}, {'word': 'well.', 'start': 9.399, 'end': 9.659, 'score': 0.933}]}, {'start': 10.9, 'end': 12.841, 'text': 'These days a chicken leg is a rare dish.', 'words': [{'word': 'These', 'start': 10.9, 'end': 11.12, 'score': 0.856}, {'word': 'days', 'start': 11.16, 'end': 11.36, 'score': 0.87}, {'word': 'a', 'start': 11.4, 'end': 11.44, 'score': 0.515}, {'word': 'chicken', 'start': 11.48, 'end': 11.78, 'score': 0.932}, {'word': 'leg', 'start': 11.82, 'end': 12.0, 'score': 0.993}, {'word': 'is', 'start': 12.04, 'end': 12.121, 'score': 0.76}, {'word': 'a', 'start': 12.181, 'end': 12.221, 'score': 0.499}, {'word': 'rare', 'start': 12.281, 'end': 12.501, 'score': 0.776}, {'word': 'dish.', 'start': 12.581, 'end': 12.841, 'score': 0.878}]}, {'start': 14.282, 'end': 16.123, 'text': 'Rice is often served in round bowls.', 'words': [{'word': 'Rice', 'start': 14.282, 'end': 14.522, 'score': 0.867}, {'word': 'is', 'start': 14.582, 'end': 14.662, 'score': 0.638}, {'word': 'often', 'start': 14.722, 'end': 15.022, 'score': 0.922}, {'word': 'served', 'start': 15.082, 'end': 15.362, 'score': 0.848}, {'word': 'in', 'start': 15.422, 'end': 15.502, 'score': 0.85}, {'word': 'round', 'start': 15.562, 'end': 15.783, 'score': 0.912}, {'word': 'bowls.', 'start': 15.823, 'end': 16.123, 'score': 0.647}]}, {'start': 17.343, 'end': 19.265, 'text': 'The juice of lemons makes fine punch.', 'words': [{'word': 'The', 'start': 17.343, 'end': 17.464, 'score': 0.796}, {'word': 'juice', 'start': 17.504, 'end': 17.764, 'score': 0.976}, {'word': 'of', 'start': 17.804, 'end': 17.884, 'score': 0.83}, {'word': 'lemons', 'start': 17.944, 'end': 18.264, 'score': 0.914}, {'word': 'makes', 'start': 18.344, 'end': 18.564, 'score': 0.866}, {'word': 'fine', 'start': 18.644, 'end': 18.904, 'score': 0.914}, {'word': 'punch.', 'start': 18.964, 'end': 19.265, 'score': 0.888}]}, {'start': 20.445, 'end': 22.406, 'text': 'The box was thrown beside the parked truck.', 'words': [{'word': 'The', 'start': 20.445, 'end': 20.565, 'score': 0.89}, {'word': 'box', 'start': 20.605, 'end': 20.885, 'score': 0.956}, {'word': 'was', 'start': 20.926, 'end': 21.046, 'score': 0.907}, {'word': 'thrown', 'start': 21.106, 'end': 21.346, 'score': 0.621}, {'word': 'beside', 'start': 21.386, 'end': 21.706, 'score': 0.901}, {'word': 'the', 'start': 21.746, 'end': 21.806, 'score': 0.977}, {'word': 'parked', 'start': 21.866, 'end': 22.086, 'score': 0.65}, {'word': 'truck.', 'start': 22.126, 'end': 22.406, 'score': 0.859}]}, {'start': 23.767, 'end': 25.748, 'text': 'The hogs were fed chopped corn and garbage.', 'words': [{'word': 'The', 'start': 23.767, 'end': 23.867, 'score': 0.997}, {'word': 'hogs', 'start': 23.907, 'end': 24.147, 'score': 0.873}, {'word': 'were', 'start': 24.167, 'end': 24.287, 'score': 0.874}, {'word': 'fed', 'start': 24.347, 'end': 24.588, 'score': 0.763}, {'word': 'chopped', 'start': 24.628, 'end': 24.928, 'score': 0.671}, {'word': 'corn', 'start': 24.968, 'end': 25.208, 'score': 0.843}, {'word': 'and', 'start': 25.248, 'end': 25.328, 'score': 0.923}, {'word': 'garbage.', 'start': 25.348, 'end': 25.748, 'score': 0.902}]}, {'start': 27.129, 'end': 28.73, 'text': 'Four hours of study work faced us.', 'words': [{'word': 'Four', 'start': 27.129, 'end': 27.329, 'score': 0.819}, {'word': 'hours', 'start': 27.369, 'end': 27.629, 'score': 0.805}, {'word': 'of', 'start': 27.669, 'end': 27.709, 'score': 0.735}, {'word': 'study', 'start': 27.749, 'end': 28.01, 'score': 0.873}, {'word': 'work', 'start': 28.05, 'end': 28.25, 'score': 0.885}, {'word': 'faced', 'start': 28.29, 'end': 28.57, 'score': 0.97}, {'word': 'us.', 'start': 28.67, 'end': 28.73, 'score': 0.99}]}, {'start': 30.111, 'end': 32.092, 'text': ' A large size in stockings is hard to sell.', 'words': [{'word': 'A', 'start': 30.111, 'end': 30.171, 'score': 0.927}, {'word': 'large', 'start': 30.212, 'end': 30.454, 'score': 0.968}, {'word': 'size', 'start': 30.515, 'end': 30.758, 'score': 0.982}, {'word': 'in', 'start': 30.798, 'end': 30.879, 'score': 0.691}, {'word': 'stockings', 'start': 30.919, 'end': 31.344, 'score': 0.923}, {'word': 'is', 'start': 31.405, 'end': 31.486, 'score': 0.816}, {'word': 'hard', 'start': 31.526, 'end': 31.708, 'score': 0.834}, {'word': 'to', 'start': 31.748, 'end': 31.85, 'score': 0.938}, {'word': 'sell.', 'start': 31.89, 'end': 32.092, 'score': 0.954}]}]
                             segment label  ... intersection      union
0  [ 00:00:00.486 -->  00:00:03.000]     A  ...   -28.889031  31.605406
1  [ 00:00:04.266 -->  00:00:06.392]     B  ...   -25.497156  27.825406
2  [ 00:00:07.776 -->  00:00:09.683]     C  ...   -22.206531  24.315406
3  [ 00:00:10.847 -->  00:00:12.923]     D  ...   -18.966531  21.244156
4  [ 00:00:14.205 -->  00:00:16.163]     E  ...   -15.726531  17.886031
5  [ 00:00:17.294 -->  00:00:19.319]     F  ...   -12.570906  14.797906
6  [ 00:00:20.399 -->  00:00:22.390]     G  ...    -9.499656  11.692906
7  [ 00:00:23.723 -->  00:00:25.849]     H  ...    -6.040281   8.368531
8  [ 00:00:27.064 -->  00:00:28.769]     I  ...    -3.120906   5.027281
9  [ 00:00:30.017 -->  00:00:32.194]     J  ...     0.202000   2.176875

[10 rows x 7 columns]
[{'start': 0.694, 'end': 2.995, 'text': ' Birch canoes slid on the smooth planks.', 'words': [{'word': 'Birch', 'start': 0.694, 'end': 1.034, 'score': 0.854, 'speaker': 'SPEAKER_00'}, {'word': 'canoes', 'start': 1.114, 'end': 1.555, 'score': 0.763, 'speaker': 'SPEAKER_00'}, {'word': 'slid', 'start': 1.595, 'end': 1.915, 'score': 0.881, 'speaker': 'SPEAKER_00'}, {'word': 'on', 'start': 2.015, 'end': 2.095, 'score': 0.909, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 2.115, 'end': 2.195, 'score': 0.789, 'speaker': 'SPEAKER_00'}, {'word': 'smooth', 'start': 2.255, 'end': 2.615, 'score': 0.828, 'speaker': 'SPEAKER_00'}, {'word': 'planks.', 'start': 2.695, 'end': 2.995, 'score': 0.861, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 4.296, 'end': 6.357, 'text': 'Glued the sheet to the dark blue background.', 'words': [{'word': 'Glued', 'start': 4.296, 'end': 4.616, 'score': 0.474, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 4.676, 'end': 4.756, 'score': 0.968, 'speaker': 'SPEAKER_00'}, {'word': 'sheet', 'start': 4.796, 'end': 5.016, 'score': 0.933, 'speaker': 'SPEAKER_00'}, {'word': 'to', 'start': 5.056, 'end': 5.157, 'score': 0.776, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 5.177, 'end': 5.237, 'score': 0.952, 'speaker': 'SPEAKER_00'}, {'word': 'dark', 'start': 5.277, 'end': 5.517, 'score': 0.99, 'speaker': 'SPEAKER_00'}, {'word': 'blue', 'start': 5.577, 'end': 5.777, 'score': 0.844, 'speaker': 'SPEAKER_00'}, {'word': 'background.', 'start': 5.837, 'end': 6.357, 'score': 0.93, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 7.838, 'end': 9.659, 'text': 'It is easy to tell the depth of a well.', 'words': [{'word': 'It', 'start': 7.838, 'end': 7.918, 'score': 0.932, 'speaker': 'SPEAKER_00'}, {'word': 'is', 'start': 7.978, 'end': 8.058, 'score': 0.724, 'speaker': 'SPEAKER_00'}, {'word': 'easy', 'start': 8.118, 'end': 8.318, 'score': 0.958, 'speaker': 'SPEAKER_00'}, {'word': 'to', 'start': 8.358, 'end': 8.438, 'score': 0.88, 'speaker': 'SPEAKER_00'}, {'word': 'tell', 'start': 8.498, 'end': 8.699, 'score': 0.712, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 8.739, 'end': 8.819, 'score': 0.828, 'speaker': 'SPEAKER_00'}, {'word': 'depth', 'start': 8.859, 'end': 9.119, 'score': 0.859, 'speaker': 'SPEAKER_00'}, {'word': 'of', 'start': 9.179, 'end': 9.279, 'score': 0.796, 'speaker': 'SPEAKER_00'}, {'word': 'a', 'start': 9.319, 'end': 9.339, 'score': 0.767, 'speaker': 'SPEAKER_00'}, {'word': 'well.', 'start': 9.399, 'end': 9.659, 'score': 0.933, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 10.9, 'end': 12.841, 'text': 'These days a chicken leg is a rare dish.', 'words': [{'word': 'These', 'start': 10.9, 'end': 11.12, 'score': 0.856, 'speaker': 'SPEAKER_00'}, {'word': 'days', 'start': 11.16, 'end': 11.36, 'score': 0.87, 'speaker': 'SPEAKER_00'}, {'word': 'a', 'start': 11.4, 'end': 11.44, 'score': 0.515, 'speaker': 'SPEAKER_00'}, {'word': 'chicken', 'start': 11.48, 'end': 11.78, 'score': 0.932, 'speaker': 'SPEAKER_00'}, {'word': 'leg', 'start': 11.82, 'end': 12.0, 'score': 0.993, 'speaker': 'SPEAKER_00'}, {'word': 'is', 'start': 12.04, 'end': 12.121, 'score': 0.76, 'speaker': 'SPEAKER_00'}, {'word': 'a', 'start': 12.181, 'end': 12.221, 'score': 0.499, 'speaker': 'SPEAKER_00'}, {'word': 'rare', 'start': 12.281, 'end': 12.501, 'score': 0.776, 'speaker': 'SPEAKER_00'}, {'word': 'dish.', 'start': 12.581, 'end': 12.841, 'score': 0.878, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 14.282, 'end': 16.123, 'text': 'Rice is often served in round bowls.', 'words': [{'word': 'Rice', 'start': 14.282, 'end': 14.522, 'score': 0.867, 'speaker': 'SPEAKER_00'}, {'word': 'is', 'start': 14.582, 'end': 14.662, 'score': 0.638, 'speaker': 'SPEAKER_00'}, {'word': 'often', 'start': 14.722, 'end': 15.022, 'score': 0.922, 'speaker': 'SPEAKER_00'}, {'word': 'served', 'start': 15.082, 'end': 15.362, 'score': 0.848, 'speaker': 'SPEAKER_00'}, {'word': 'in', 'start': 15.422, 'end': 15.502, 'score': 0.85, 'speaker': 'SPEAKER_00'}, {'word': 'round', 'start': 15.562, 'end': 15.783, 'score': 0.912, 'speaker': 'SPEAKER_00'}, {'word': 'bowls.', 'start': 15.823, 'end': 16.123, 'score': 0.647, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 17.343, 'end': 19.265, 'text': 'The juice of lemons makes fine punch.', 'words': [{'word': 'The', 'start': 17.343, 'end': 17.464, 'score': 0.796, 'speaker': 'SPEAKER_00'}, {'word': 'juice', 'start': 17.504, 'end': 17.764, 'score': 0.976, 'speaker': 'SPEAKER_00'}, {'word': 'of', 'start': 17.804, 'end': 17.884, 'score': 0.83, 'speaker': 'SPEAKER_00'}, {'word': 'lemons', 'start': 17.944, 'end': 18.264, 'score': 0.914, 'speaker': 'SPEAKER_00'}, {'word': 'makes', 'start': 18.344, 'end': 18.564, 'score': 0.866, 'speaker': 'SPEAKER_00'}, {'word': 'fine', 'start': 18.644, 'end': 18.904, 'score': 0.914, 'speaker': 'SPEAKER_00'}, {'word': 'punch.', 'start': 18.964, 'end': 19.265, 'score': 0.888, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 20.445, 'end': 22.406, 'text': 'The box was thrown beside the parked truck.', 'words': [{'word': 'The', 'start': 20.445, 'end': 20.565, 'score': 0.89, 'speaker': 'SPEAKER_00'}, {'word': 'box', 'start': 20.605, 'end': 20.885, 'score': 0.956, 'speaker': 'SPEAKER_00'}, {'word': 'was', 'start': 20.926, 'end': 21.046, 'score': 0.907, 'speaker': 'SPEAKER_00'}, {'word': 'thrown', 'start': 21.106, 'end': 21.346, 'score': 0.621, 'speaker': 'SPEAKER_00'}, {'word': 'beside', 'start': 21.386, 'end': 21.706, 'score': 0.901, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 21.746, 'end': 21.806, 'score': 0.977, 'speaker': 'SPEAKER_00'}, {'word': 'parked', 'start': 21.866, 'end': 22.086, 'score': 0.65, 'speaker': 'SPEAKER_00'}, {'word': 'truck.', 'start': 22.126, 'end': 22.406, 'score': 0.859, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 23.767, 'end': 25.748, 'text': 'The hogs were fed chopped corn and garbage.', 'words': [{'word': 'The', 'start': 23.767, 'end': 23.867, 'score': 0.997, 'speaker': 'SPEAKER_00'}, {'word': 'hogs', 'start': 23.907, 'end': 24.147, 'score': 0.873, 'speaker': 'SPEAKER_00'}, {'word': 'were', 'start': 24.167, 'end': 24.287, 'score': 0.874, 'speaker': 'SPEAKER_00'}, {'word': 'fed', 'start': 24.347, 'end': 24.588, 'score': 0.763, 'speaker': 'SPEAKER_00'}, {'word': 'chopped', 'start': 24.628, 'end': 24.928, 'score': 0.671, 'speaker': 'SPEAKER_00'}, {'word': 'corn', 'start': 24.968, 'end': 25.208, 'score': 0.843, 'speaker': 'SPEAKER_00'}, {'word': 'and', 'start': 25.248, 'end': 25.328, 'score': 0.923, 'speaker': 'SPEAKER_00'}, {'word': 'garbage.', 'start': 25.348, 'end': 25.748, 'score': 0.902, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 27.129, 'end': 28.73, 'text': 'Four hours of study work faced us.', 'words': [{'word': 'Four', 'start': 27.129, 'end': 27.329, 'score': 0.819, 'speaker': 'SPEAKER_00'}, {'word': 'hours', 'start': 27.369, 'end': 27.629, 'score': 0.805, 'speaker': 'SPEAKER_00'}, {'word': 'of', 'start': 27.669, 'end': 27.709, 'score': 0.735, 'speaker': 'SPEAKER_00'}, {'word': 'study', 'start': 27.749, 'end': 28.01, 'score': 0.873, 'speaker': 'SPEAKER_00'}, {'word': 'work', 'start': 28.05, 'end': 28.25, 'score': 0.885, 'speaker': 'SPEAKER_00'}, {'word': 'faced', 'start': 28.29, 'end': 28.57, 'score': 0.97, 'speaker': 'SPEAKER_00'}, {'word': 'us.', 'start': 28.67, 'end': 28.73, 'score': 0.99, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 30.111, 'end': 32.092, 'text': ' A large size in stockings is hard to sell.', 'words': [{'word': 'A', 'start': 30.111, 'end': 30.171, 'score': 0.927, 'speaker': 'SPEAKER_00'}, {'word': 'large', 'start': 30.212, 'end': 30.454, 'score': 0.968, 'speaker': 'SPEAKER_00'}, {'word': 'size', 'start': 30.515, 'end': 30.758, 'score': 0.982, 'speaker': 'SPEAKER_00'}, {'word': 'in', 'start': 30.798, 'end': 30.879, 'score': 0.691, 'speaker': 'SPEAKER_00'}, {'word': 'stockings', 'start': 30.919, 'end': 31.344, 'score': 0.923, 'speaker': 'SPEAKER_00'}, {'word': 'is', 'start': 31.405, 'end': 31.486, 'score': 0.816, 'speaker': 'SPEAKER_00'}, {'word': 'hard', 'start': 31.526, 'end': 31.708, 'score': 0.834, 'speaker': 'SPEAKER_00'}, {'word': 'to', 'start': 31.748, 'end': 31.85, 'score': 0.938, 'speaker': 'SPEAKER_00'}, {'word': 'sell.', 'start': 31.89, 'end': 32.092, 'score': 0.954, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}]

Process finished with exit code 0

Future work

  • Silero ONNX model usage (silero-vad repo & faster-whisper for inspiration) to enable GPU usage and harvest possible benefits.
  • Expose additional VAD settings to the user. These settings may have common meaning among the various VAD methods. E.g.:
    • min_silence_duration_ms (silero) and min_duration_off (pyannote)
    • min_speech_duration_ms (silero) and min_duration_on (pyannote)

@3manifold 3manifold marked this pull request as ready for review September 26, 2024 08:35
onset: float = 0.5,
offset: Optional[float] = None,
):
assert chunk_size > 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep binarization separate from the parent class function merge_chunks (i.e. Vad.merge_chunks). This is because binarization of other VAD methods (e.g. silero) may happen in earlier stages making Vad.merge_chunks easier to reuse. Specifically, in the case of silero, binarization happens during model invocation.

@sulutian

This comment was marked as off-topic.

@3manifold

This comment was marked as off-topic.

@sulutian

This comment was marked as off-topic.

@3manifold

This comment was marked as off-topic.

@sulutian

This comment was marked as off-topic.

@3manifold

This comment was marked as off-topic.

@sulutian

This comment was marked as off-topic.

@sulutian

This comment was marked as off-topic.

@Barabazs
Copy link
Collaborator

Barabazs commented Jan 5, 2025

@m-bain Can you have a look at this PR please? It seems pretty solid to me, but you're the expert here.

@m-bain
Copy link
Owner

m-bain commented Jan 6, 2025

yup will have a look soon!

Copy link
Owner

@m-bain m-bain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great work thanks

@m-bain m-bain merged commit 7a98456 into m-bain:main Jan 11, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Silero VAD support
4 participants