Made in Vancouver, Canada by Picovoice
This repo is a minimalist and extensible framework for benchmarking different speech-to-text engines.
Word error rate (WER) is the ratio of the edit distance between the words in a reference transcript and the words in the speech-to-text engine's output to the total number of words in the reference transcript.
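As a sketch of how this can be computed, the function below measures a word-level Levenshtein (edit) distance and divides by the reference length; it is illustrative only, not the repo's implementation:

```python
# Minimal word-level WER sketch (illustrative; not this repo's implementation).
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the Levenshtein (edit) distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)


# One substitution and one deletion against a 6-word reference -> WER = 2/6.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```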
The Core-Hour metric is used to evaluate the computational efficiency of the speech-to-text engine, indicating the number of CPU hours required to process one hour of audio. An engine with a lower Core-Hour figure is more computationally efficient. We omit this metric for cloud-based engines.
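For instance, under this definition, an engine that processes ten hours of audio in 1.5 hours of wall-clock time while occupying 10 cores uses 15 CPU hours, i.e. 1.5 Core-Hours per hour of audio (the numbers here are hypothetical):

```python
# Hypothetical numbers, purely to illustrate the Core-Hour definition.
audio_hours = 10.0       # hours of audio processed
wall_clock_hours = 1.5   # elapsed processing time
num_cores = 10           # cores used simultaneously
core_hour = (wall_clock_hours * num_cores) / audio_hours
print(core_hour)  # 1.5 CPU hours per hour of audio
```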
Model size is the aggregate size of the engine's models (acoustic and language), in MB. We omit this metric for cloud-based engines.
- Amazon Transcribe
- Azure Speech-to-Text
- Google Speech-to-Text
- IBM Watson Speech-to-Text
- OpenAI Whisper
- Picovoice Cheetah
- Picovoice Leopard
This benchmark has been developed and tested on Ubuntu 22.04.
- Install FFmpeg (on Ubuntu 22.04, e.g. `sudo apt install ffmpeg`).
- Download the datasets (Common Voice, LibriSpeech test-clean, LibriSpeech test-other, and TED-LIUM).
- Install the Python requirements:
```console
pip3 install -r requirements.txt
```
In the following, we provide instructions for running the benchmark for each engine. The supported datasets are: `COMMON_VOICE`, `LIBRI_SPEECH_TEST_CLEAN`, `LIBRI_SPEECH_TEST_OTHER`, and `TED_LIUM`.
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and `${AWS_PROFILE}` with the name of the AWS profile you wish to use.
```console
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AMAZON_TRANSCRIBE \
--aws-profile ${AWS_PROFILE}
```
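For example, a concrete invocation on LibriSpeech test-clean might look like the following (the dataset path is hypothetical, and `default` is the standard AWS CLI profile name):

```console
python3 benchmark.py \
--dataset LIBRI_SPEECH_TEST_CLEAN \
--dataset-folder ~/datasets/LibriSpeech/test-clean \
--engine AMAZON_TRANSCRIBE \
--aws-profile default
```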
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and `${AZURE_SPEECH_KEY}` and `${AZURE_SPEECH_LOCATION}` with the key and location information from your Azure account.
```console
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AZURE_SPEECH_TO_TEXT \
--azure-speech-key ${AZURE_SPEECH_KEY} \
--azure-speech-location ${AZURE_SPEECH_LOCATION}
```
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and `${GOOGLE_APPLICATION_CREDENTIALS}` with the path to the credentials file downloaded from Google Cloud Platform.
```console
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine GOOGLE_SPEECH_TO_TEXT \
--google-application-credentials ${GOOGLE_APPLICATION_CREDENTIALS}
```
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and `${WATSON_SPEECH_TO_TEXT_API_KEY}` / `${WATSON_SPEECH_TO_TEXT_URL}` with the credentials from your IBM account.
```console
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine IBM_WATSON_SPEECH_TO_TEXT \
--watson-speech-to-text-api-key ${WATSON_SPEECH_TO_TEXT_API_KEY} \
--watson-speech-to-text-url ${WATSON_SPEECH_TO_TEXT_URL}
```
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and `${WHISPER_MODEL}` with the Whisper model type (`WHISPER_TINY`, `WHISPER_BASE`, `WHISPER_SMALL`, `WHISPER_MEDIUM`, or `WHISPER_LARGE`).
```console
python3 benchmark.py \
--engine ${WHISPER_MODEL} \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER}
```
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and `${PICOVOICE_ACCESS_KEY}` with your AccessKey obtained from Picovoice Console.
```console
python3 benchmark.py \
--engine PICOVOICE_CHEETAH \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}
```
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and `${PICOVOICE_ACCESS_KEY}` with your AccessKey obtained from Picovoice Console.
```console
python3 benchmark.py \
--engine PICOVOICE_LEOPARD \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}
```
The table below shows the word error rate of each engine on each dataset (lower is better):

Engine | LibriSpeech test-clean | LibriSpeech test-other | TED-LIUM | CommonVoice | Average |
---|---|---|---|---|---|
Amazon Transcribe | 2.6% | 5.6% | 3.8% | 8.7% | 5.2% |
Azure Speech-to-Text | 2.8% | 6.2% | 4.6% | 8.9% | 5.6% |
Google Speech-to-Text | 10.8% | 24.5% | 14.4% | 31.9% | 20.4% |
Google Speech-to-Text Enhanced | 6.2% | 13.0% | 6.1% | 18.2% | 10.9% |
IBM Watson Speech-to-Text | 10.9% | 26.2% | 11.7% | 39.4% | 22.0% |
Whisper Large (Multilingual) | 3.7% | 5.4% | 4.6% | 9.0% | 5.7% |
Whisper Medium | 3.3% | 6.2% | 4.6% | 10.2% | 6.1% |
Whisper Small | 3.3% | 7.2% | 4.8% | 12.7% | 7.0% |
Whisper Base | 4.3% | 10.4% | 5.4% | 17.9% | 9.5% |
Whisper Tiny | 5.9% | 13.8% | 6.5% | 24.4% | 12.7% |
Picovoice Cheetah | 5.6% | 12.1% | 7.7% | 17.5% | 10.7% |
Picovoice Leopard | 5.3% | 11.3% | 7.2% | 16.2% | 10.0% |
To obtain the Core-Hour results below, we ran the benchmark across the entire TED-LIUM dataset and recorded the processing time. The measurement was carried out on an Ubuntu 22.04 machine with an AMD Ryzen 9 5900X CPU (12 cores @ 3.70 GHz), 64 GB of RAM, and NVMe storage, using 10 cores simultaneously. We omit Whisper Large (Multilingual) from this benchmark.
Engine | Core-Hour | Model Size / MB |
---|---|---|
Whisper Medium | 1.50 | 1457 |
Whisper Small | 0.89 | 462 |
Whisper Base | 0.28 | 139 |
Whisper Tiny | 0.15 | 73 |
Picovoice Leopard | 0.05 | 36 |
Picovoice Cheetah | 0.09 | 31 |