Skip to content

Picovoice/speech-to-text-benchmark

Repository files navigation

Speech-to-Text Benchmark

Made in Vancouver, Canada by Picovoice

This repo is a minimalist and extensible framework for benchmarking different speech-to-text engines.

Table of Contents

Data

Metrics

Word Error Rate

Word error rate (WER) is the ratio of edit distance between words in a reference transcript and the words in the output of the speech-to-text engine to the number of words in the reference transcript.

Core-Hour

The Core-Hour metric is used to evaluate the computational efficiency of the speech-to-text engine, indicating the number of CPU hours required to process one hour of audio. A speech-to-text engine with lower Core-Hour is more computationally efficient. We omit this metric for cloud-based engines.

Model Size

The aggregate size of models (acoustic and language), in MB. We omit this metric for cloud-based engines.

Engines

Usage

This benchmark has been developed and tested on Ubuntu 22.04.

  • Install FFmpeg
  • Download datasets.
  • Install the requirements:
pip3 install -r requirements.txt

In the following, we provide instructions for running the benchmark for each engine. The supported datasets are: COMMON_VOICE, LIBRI_SPEECH_TEST_CLEAN, LIBRI_SPEECH_TEST_OTHER, or TED_LIUM.

Amazon Transcribe Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with path to dataset, and ${AWS_PROFILE} with the name of AWS profile you wish to use.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AMAZON_TRANSCRIBE \
--aws-profile ${AWS_PROFILE}

Azure Speech-to-Text Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with path to dataset, ${AZURE_SPEECH_KEY} and ${AZURE_SPEECH_LOCATION} information from your Azure account.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AZURE_SPEECH_TO_TEXT \
--azure-speech-key ${AZURE_SPEECH_KEY}
--azure-speech-location ${AZURE_SPEECH_LOCATION}

Google Speech-to-Text Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with path to dataset, and ${GOOGLE_APPLICATION_CREDENTIALS} with credentials download from Google Cloud Platform.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine GOOGLE_SPEECH_TO_TEXT \
--google-application-credentials ${GOOGLE_APPLICATION_CREDENTIALS}

IBM Watson Speech-to-Text Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with path to dataset, and ${WATSON_SPEECH_TO_TEXT_API_KEY}/${${WATSON_SPEECH_TO_TEXT_URL}} with credentials from your IBM account.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine IBM_WATSON_SPEECH_TO_TEXT \
--watson-speech-to-text-api-key ${WATSON_SPEECH_TO_TEXT_API_KEY}
--watson-speech-to-text-url ${WATSON_SPEECH_TO_TEXT_URL}

OpenAI Whisper Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with path to dataset, and ${WHISPER_MODEL} with the whisper model type (WHISPER_TINY, WHISPER_BASE, WHISPER_SMALL, WHISPER_MEDIUM, or WHISPER_LARGE)

python3 benchmark.py \
--engine ${WHISPER_MODEL} \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \

Picovoice Cheetah Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with path to dataset, and ${PICOVOICE_ACCESS_KEY} with AccessKey obtained from Picovoice Console.

python3 benchmark.py \
--engine PICOVOICE_CHEETAH \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}

Picovoice Leopard Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with path to dataset, and ${PICOVOICE_ACCESS_KEY} with AccessKey obtained from Picovoice Console.

python3 benchmark.py \
--engine PICOVOICE_LEOPARD \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}

Results

Word Error Rate (WER)

Engine LibriSpeech test-clean LibriSpeech test-other TED-LIUM CommonVoice Average
Amazon Transcribe 2.6% 5.6% 3.8% 8.7% 5.2%
Azure Speech-to-Text 2.8% 6.2% 4.6% 8.9% 5.6%
Google Speech-to-Text 10.8% 24.5% 14.4% 31.9% 20.4%
Google Speech-to-Text Enhanced 6.2% 13.0% 6.1% 18.2% 10.9%
IBM Watson Speech-to-Text 10.9% 26.2% 11.7% 39.4% 22.0%
Whisper Large (Multilingual) 3.7% 5.4% 4.6% 9.0% 5.7%
Whisper Medium 3.3% 6.2% 4.6% 10.2% 6.1%
Whisper Small 3.3% 7.2% 4.8% 12.7% 7.0%
Whisper Base 4.3% 10.4% 5.4% 17.9% 9.5%
Whisper Tiny 5.9% 13.8% 6.5% 24.4% 12.7%
Picovoice Cheetah 5.6% 12.1% 7.7% 17.5% 10.7%
Picovoice Leopard 5.3% 11.3% 7.2% 16.2% 10.0%

Core-Hour & Model Size

To obtain these results, we ran the benchmark across the entire TED-LIUM dataset and recorded the processing time. The measurement is carried out on an Ubuntu 22.04 machine with AMD CPU (AMD Ryzen 9 5900X (12) @ 3.70GHz), 64 GB of RAM, and NVMe storage, using 10 cores simultaneously. We omit Whisper Large (Multilingual) from this benchmark.

Engine Core-Hour Model Size / MB
Whisper Medium 1.50 1457
Whisper Small 0.89 462
Whisper Base 0.28 139
Whisper Tiny 0.15 73
Picovoice Leopard 0.05 36
Picovoice Cheetah 0.09 31