README.md: 9 additions & 41 deletions
@@ -52,7 +52,8 @@ This repository refines the timestamps of openAI's Whisper model via forced alignment
<h2 align="left" id="highlights">New🚨</h2>
- - v2 released, code cleanup, imports whisper library, batched inference from paper not included (contact for licensing / batched model API). VAD filtering is now turned on by default, as in the paper.
+ - v3 pre-release [this branch](https://github.com/m-bain/whisperX/tree/v3) *70x speed-up open-sourced. Using batched whisper with faster-whisper backend*!
+ - v2 released, code cleanup, imports whisper library. VAD filtering is now turned on by default, as in the paper.
- Paper drop🎓👨‍🏫! Please see our [arXiv preprint](https://arxiv.org/abs/2303.00747) for benchmarking and details of WhisperX. We also introduce more efficient batch inference, resulting in large-v2 with *60-70x REAL TIME speed* (not provided in this repo).
- VAD filtering: Voice Activity Detection (VAD) from [Pyannote.audio](https://huggingface.co/pyannote/voice-activity-detection) is used as a preprocessing step to remove reliance on Whisper timestamps and only transcribe audio segments containing speech. Add the `--vad_filter True` flag; this increases timestamp accuracy and robustness, but requires more GPU memory due to the 30s inputs to wav2vec2 (a minimal CLI sketch follows this list).
- Character-level timestamps (see `*.char.ass` file output)
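For illustration, here is a minimal sketch of enabling VAD filtering from the command line. It is a sketch under assumptions, not part of the original README: it assumes the `whisperx` console entry point is installed and accepts an audio file positionally plus Whisper's usual `--model` option; the audio path and model name are placeholders, and only the `--vad_filter True` flag comes from the notes above.

```python
import subprocess

# Assumptions: the `whisperx` CLI is on PATH, takes an audio file positionally,
# and supports Whisper's --model option. "example_audio.wav" and "large-v2" are
# placeholders; --vad_filter True is the flag described in the list above.
subprocess.run(
    ["whisperx", "example_audio.wav", "--model", "large-v2", "--vad_filter", "True"],
    check=True,
)
```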
@@ -179,7 +180,7 @@ In addition to forced alignment, the following two modifications have been made
If you are multilingual, a major way you can contribute to this project is to find phoneme models on Hugging Face (or train your own) and test them on speech for the target language. If the results look good, send a merge request and some examples showing its success.
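As a starting point, here is a hedged sketch of how one might sanity-check a candidate phoneme model from Hugging Face before proposing it. The model ID is a made-up placeholder, and the check only confirms that the checkpoint loads as a wav2vec2 CTC model and exposes a character/phoneme-level vocabulary, which is what the forced-alignment stage relies on.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Placeholder ID -- swap in a real wav2vec2 CTC checkpoint for your target language.
model_id = "your-org/wav2vec2-large-xlsr-53-your-language"

processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# One second of silence at 16 kHz as a stand-in for real speech.
dummy_audio = torch.zeros(16000)
inputs = processor(dummy_audio.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # shape: (batch, time, vocab)

# Alignment needs character/phoneme tokens rather than word pieces, so inspect the vocab.
print(sorted(processor.tokenizer.get_vocab().keys()))
print(logits.shape)
```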
- The next major upgrade we are working on is whisper with speaker diarization, so if you have any experience on this please share.
+ Bug finding and pull requests are also highly appreciated to keep this project going, since it's already diverging from the original research scope.
- Contact [email protected] for queries and licensing / early access to a model API with batched inference (transcribe 1hr audio in under 1min).
+ Contact [email protected] for queries. WhisperX v4 development is underway *with significantly improved diarization*. To support v4 and get early access, get in touch.
<a href="https://www.buymeacoffee.com/maxhbain" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-orange.png" alt="Buy Me A Coffee" height="41" width="174"></a>
@@ -216,7 +215,9 @@ This work, and my PhD, is supported by the [VGG (Visual Geometry Group)](https:/
Of course, this builds on [openAI's whisper](https://github.com/openai/whisper).
- And borrows important alignment code from [PyTorch tutorial on forced alignment](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html)
+ Borrows important alignment code from [PyTorch tutorial on forced alignment](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html)
+ And uses the wonderful pyannote VAD / Diarization https://github.com/pyannote/pyannote-audio
<h2 align="left" id="cite">Citation</h2>
@@ -230,36 +231,3 @@ If you use this in your research, please cite the paper:
year={2023}
}
```
- as well the following works, used in each stage of the pipeline:
- ```bibtex
- @article{radford2022robust,
- title={Robust speech recognition via large-scale weak supervision},
- author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
- journal={arXiv preprint arXiv:2212.04356},
- year={2022}
- }
- ```
- ```bibtex
- @article{baevski2020wav2vec,
- title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
- author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
- journal={Advances in neural information processing systems},
- volume={33},
- pages={12449--12460},
- year={2020}
- }
- ```
- ```bibtex
- @inproceedings{bredin2020pyannote,
- title={Pyannote. audio: neural building blocks for speaker diarization},
- author={Bredin, Herv{\'e} and Yin, Ruiqing and Coria, Juan Manuel and Gelly, Gregory and Korshunov, Pavel and Lavechin, Marvin and Fustes, Diego and Titeux, Hadrien and Bouaziz, Wassim and Gill, Marie-Philippe},
- booktitle={ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},