
Commit b666523

add v3 pre-release comment, and v4 progress update

1 parent 69e038c · commit b666523
1 file changed: README.md (+9 additions, −41 deletions)

@@ -52,7 +52,8 @@ This repository refines the timestamps of openAI's Whisper model via forced alignment
 
 <h2 align="left" id="highlights">New🚨</h2>
 
-- v2 released, code cleanup, imports whisper library, batched inference from paper not included (contact for licensing / batched model API). VAD filtering is now turned on by default, as in the paper.
+- v3 pre-release [this branch](https://github.com/m-bain/whisperX/tree/v3): *70x speed-up open-sourced, using batched whisper with the faster-whisper backend!*
+- v2 released, code cleanup, imports whisper library. VAD filtering is now turned on by default, as in the paper.
 - Paper drop🎓👨‍🏫! Please see our [arXiv preprint](https://arxiv.org/abs/2303.00747) for benchmarking and details of WhisperX. We also introduce more efficient batch inference, resulting in large-v2 at *60-70x REAL TIME speed* (not provided in this repo).
 - VAD filtering: Voice Activity Detection (VAD) from [Pyannote.audio](https://huggingface.co/pyannote/voice-activity-detection) is used as a preprocessing step to remove reliance on whisper timestamps and transcribe only audio segments containing speech. Add the `--vad_filter True` flag; this increases timestamp accuracy and robustness (requires more GPU memory due to 30s inputs to wav2vec2). A minimal sketch of this flow follows the hunk below.
 - Character level timestamps (see `*.char.ass` file output)
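
To make the VAD bullet above concrete, here is a minimal, hypothetical sketch of the detect-then-transcribe flow, written against the public `openai-whisper` and `pyannote.audio` APIs. It is an illustration only, not WhisperX's actual implementation; the file name `audio.wav` is a placeholder, and pyannote may additionally require a HuggingFace access token.

```python
# Hedged sketch (not WhisperX internals): find speech regions with pyannote
# VAD, then transcribe only those regions with whisper. "audio.wav" is a
# placeholder input file.
import whisper
from pyannote.audio import Pipeline

vad = Pipeline.from_pretrained("pyannote/voice-activity-detection")
asr = whisper.load_model("large-v2")

audio = whisper.load_audio("audio.wav")  # mono float32, resampled to 16 kHz
SAMPLE_RATE = 16000                      # whisper's fixed input rate

speech = vad("audio.wav")                # pyannote Annotation of speech regions
for region in speech.get_timeline().support():  # merge overlapping regions
    chunk = audio[int(region.start * SAMPLE_RATE):int(region.end * SAMPLE_RATE)]
    result = asr.transcribe(chunk)
    print(f"[{region.start:7.2f} - {region.end:7.2f}] {result['text'].strip()}")
```

Because segment boundaries now come from the VAD rather than from whisper's own predicted timestamps, each printed span corresponds exactly to the audio that was transcribed, which is the robustness gain the bullet describes.
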
@@ -179,7 +180,7 @@ In addition to forced alignment, the following two modifications have been made
 
 If you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good, send a merge request with some examples showing its success.
 
-The next major upgrade we are working on is whisper with speaker diarization, so if you have any experience on this please share.
+Bug reports and pull requests are also highly appreciated to keep this project going, since it's already diverging from the original research scope.
 
 <h2 align="left" id="coming-soon">Coming Soon 🗓</h2>
 
@@ -195,16 +196,14 @@ The next major upgrade we are working on is whisper with speaker diarization, so
 
 * [x] Incorporating speaker diarization
 
-* [ ] Automatic .wav conversion to make VAD compatible
+* [x] Model flush, for low gpu mem resources
 
-* [ ] Model flush, for low gpu mem resources
-
-* [ ] Improve diarization (word level). *Harder than first thought...*
+* [ ] Improve diarization (word level). *Harder than first thought... see #below*
 
 
 <h2 align="left" id="contact">Contact/Support 📇</h2>
 
-Contact [email protected] for queries and licensing / early access to a model API with batched inference (transcribe 1hr audio in under 1min).
+Contact [email protected] for queries. WhisperX v4 development is underway, with *significantly improved diarization*. To support v4 and get early access, get in touch.
 
 <a href="https://www.buymeacoffee.com/maxhbain" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-orange.png" alt="Buy Me A Coffee" height="41" width="174"></a>
 
@@ -216,7 +215,9 @@ This work, and my PhD, is supported by the [VGG (Visual Geometry Group)](https:/
 
 
 Of course, this builds on [openAI's whisper](https://github.com/openai/whisper).
-And borrows important alignment code from [PyTorch tutorial on forced alignment](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html)
+Borrows important alignment code from the [PyTorch tutorial on forced alignment](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html) (a condensed sketch of that technique follows this hunk),
+and uses the wonderful pyannote VAD / diarization: https://github.com/pyannote/pyannote-audio.
+
 
 
 <h2 align="left" id="cite">Citation</h2>
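
Since the credit above points at the forced-alignment technique that the repository's timestamp refinement is built on, here is a condensed, hedged sketch of it: scoring a known transcript against frame-level wav2vec2 CTC emissions with a trellis, then backtracking to per-character times. It is adapted from the linked PyTorch tutorial, not taken from WhisperX's code; the audio path and transcript are placeholders in the tutorial's style, and error handling is omitted.

```python
# Condensed from the linked PyTorch forced-alignment tutorial (not WhisperX's
# own code): align a known transcript to audio via wav2vec2 CTC emissions.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()              # CTC vocabulary; labels[0] is blank

waveform, sr = torchaudio.load("audio.wav")  # placeholder mono recording
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
with torch.inference_mode():
    emissions, _ = model(waveform)
emission = torch.log_softmax(emissions[0], dim=-1)   # (frames, vocab)

transcript = "I|HAD|THAT|CURIOSITY|BESIDE|ME"        # '|' separates words
tokens = [labels.index(c) for c in transcript]

# trellis[t, j] = best log-prob of emitting the first j characters by frame t.
num_frames, num_tokens = emission.size(0), len(tokens)
trellis = torch.full((num_frames, num_tokens + 1), -float("inf"))
trellis[:, 0] = torch.cumsum(emission[:, 0], dim=0)  # emit blanks only
trellis[0, 1] = emission[0, tokens[0]]
for t in range(1, num_frames):
    trellis[t, 1:] = torch.maximum(
        trellis[t - 1, 1:] + emission[t, 0],          # stay (emit blank)
        trellis[t - 1, :-1] + emission[t, tokens],    # advance one character
    )

# Greedy backtrack: recover the frame at which each character is emitted.
char_frames, j = [], num_tokens
t = int(torch.argmax(trellis[:, j]))
while j > 0 and t > 0:
    stay = trellis[t - 1, j] + emission[t, 0]
    advance = trellis[t - 1, j - 1] + emission[t, tokens[j - 1]]
    if advance > stay:
        j -= 1
        char_frames.append((transcript[j], t))
    t -= 1

sec_per_frame = waveform.size(1) / num_frames / bundle.sample_rate
for char, frame in reversed(char_frames):
    print(f"{char}: {frame * sec_per_frame:.2f}s")
```

As the README describes, WhisperX applies this idea with whisper's own transcription standing in for the known transcript, which is what turns segment-level output into character-level timestamps (the `*.char.ass` files mentioned above).
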
@@ -230,36 +231,3 @@ If you use this in your research, please cite the paper:
   year={2023}
 }
 ```
-
-as well the following works, used in each stage of the pipeline:
-
-```bibtex
-@article{radford2022robust,
-  title={Robust speech recognition via large-scale weak supervision},
-  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
-  journal={arXiv preprint arXiv:2212.04356},
-  year={2022}
-}
-```
-
-```bibtex
-@article{baevski2020wav2vec,
-  title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
-  author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
-  journal={Advances in neural information processing systems},
-  volume={33},
-  pages={12449--12460},
-  year={2020}
-}
-```
-
-```bibtex
-@inproceedings{bredin2020pyannote,
-  title={Pyannote. audio: neural building blocks for speaker diarization},
-  author={Bredin, Herv{\'e} and Yin, Ruiqing and Coria, Juan Manuel and Gelly, Gregory and Korshunov, Pavel and Lavechin, Marvin and Fustes, Diego and Titeux, Hadrien and Bouaziz, Wassim and Gill, Marie-Philippe},
-  booktitle={ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
-  pages={7124--7128},
-  year={2020},
-  organization={IEEE}
-}
-```
