-
Hello, I am curious to know whether WER numbers have been reported for whisper.cpp versus the OpenAI Python code. I compared both on some broadcast data and found that the WER with whisper.cpp is 50% higher than with the OpenAI code (using the same model). Is there any way to tune whisper.cpp to get a WER comparable to the OpenAI code?
-
The default parameters of whisper.cpp and OpenAI differ. If you have not already, test OpenAI with beam_size=1, best_of=2, and temperature=[0.0, 0.4, 0.8], which are whisper.cpp's default parameters; see the sketch below. While the output is not identical, I get qualitatively similar accuracy with these settings.
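For reference, a minimal sketch of that test using the openai-whisper Python package; the model size and audio filename are placeholders, not anything prescribed in this thread:

```python
import whisper

# Hypothetical model and input file, for illustration only.
model = whisper.load_model("medium")

# Decode with settings close to whisper.cpp's defaults:
# greedy decoding (beam_size=1), best_of=2, and a temperature
# fallback schedule of 0.0 / 0.4 / 0.8. transcribe() applies
# beam_size at temperature 0 and best_of at temperatures > 0.
result = model.transcribe(
    "broadcast.wav",
    beam_size=1,
    best_of=2,
    temperature=(0.0, 0.4, 0.8),
)
print(result["text"])
```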
-
I found older messages about the same topic. I understand that whisper.cpp does not implement the same decoding strategy as the OpenAI code, meaning we should not expect identical accuracy. Still, a 30% relative difference in WER is a huge gap when comparing different decoders with the same model; decoding differences in speech recognizers usually do not affect the WER by more than a few percent relative. Are there any plans to reduce this large gap?
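To make the "relative difference" measurement concrete, here is a minimal sketch using the jiwer package; the transcript strings are invented placeholders, not data from this comparison:

```python
from jiwer import wer

# Placeholder reference and hypotheses, for illustration only.
reference = "the quick brown fox jumps over the lazy dog"
hyp_openai = "the quick brown fox jumps over a lazy dog"
hyp_cpp = "the quick brown fox jump over a lazy dogs"

wer_openai = wer(reference, hyp_openai)
wer_cpp = wer(reference, hyp_cpp)

# Relative WER difference: how much higher whisper.cpp's WER is
# than the OpenAI decoder's WER on the same reference.
relative_gap = (wer_cpp - wer_openai) / wer_openai
print(f"OpenAI WER: {wer_openai:.3f}, whisper.cpp WER: {wer_cpp:.3f}")
print(f"Relative WER gap: {relative_gap:.0%}")
```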
-
We've wrapped up our analysis comparing the log_mel_spectrogram generation between whisper.cpp and OpenAI's Whisper. To summarize the main issues we found in whisper.cpp:

- The Stage-1 padding (zero padding) is inadequate. Whi…
On top of these, whisper.cpp presents a couple of secondary concerns:
With these findings in hand, we're set to fix whisper.cpp.
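As an illustration of the Stage-1 padding point above, here is a minimal NumPy sketch of the difference between zero padding and the reflect padding that OpenAI's log_mel_spectrogram inherits from torch.stft with center=True; this is an assumed comparison for illustration, not whisper.cpp's actual code:

```python
import numpy as np

N_FFT = 400  # Whisper's FFT size (25 ms frames at 16 kHz)

# Dummy signal standing in for the start of an audio clip.
audio = np.linspace(-1.0, 1.0, 1600).astype(np.float32)

# OpenAI's log_mel_spectrogram uses torch.stft(center=True), which
# reflect-pads N_FFT // 2 samples on each side before framing.
reflected = np.pad(audio, N_FFT // 2, mode="reflect")

# Zero padding fills the same region with silence instead.
zeroed = np.pad(audio, N_FFT // 2, mode="constant")

# The first analysis frames therefore see different samples, which
# shifts the mel features near the clip boundaries.
print(np.abs(reflected[:N_FFT] - zeroed[:N_FFT]).max())
```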