Speech recognition accepts raw audio samples and produces a corresponding character transcription, without an external language model.
Open run.sh. Set the stage variable to "-1". Set "work_dir" to a
path backed by a disk with at least 30 GB of space. Most space is used
by loadgen logs, not the data or model. You need conda and a C/C++
compiler on your PATH. I used conda 4.8.2. This script is responsible
for downloading dependencies, data, and the model.
Run ./run.sh
from this directory. Note that stage 3 runs all of the
scenarios for the reference implementation, which will take a long
time, so you may want to exit before then.
As you complete individual stages, you can set the "stage" variable to a higher number to restart from a later stage.
"OpenSLR LibriSpeech Corpus" provides over 1000 hours of speech data in the form of raw audio. We use dev-clean, which is approximately 5 hours. We remove all samples with a length exceeding 15 seconds.
Log filterbanks of size 80 are extracted every 10 milliseconds, from windows of size 20 milliseconds. Note that every three filterbanks are concatenated together ("feature splicing"), so the model's effective frame rate is actually 30 milliseconds.
No dithering takes place.
This is not typical preprocessing, since it takes place as part of the model's measured runtime, not before the model runs.
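As a rough illustration, the following is a minimal sketch of that feature pipeline using torchaudio; the FFT size, log offset, and the exact splicing details are assumptions, not the reference implementation's precise settings.

```python
# Hedged sketch of the described feature extraction (not the reference code).
# Assumed details: n_fft=512, log offset 1e-20, non-overlapping 3-frame splicing.
import torch
import torchaudio

SAMPLE_RATE = 16000
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=512,                              # assumption
    win_length=int(0.02 * SAMPLE_RATE),     # 20 ms windows
    hop_length=int(0.01 * SAMPLE_RATE),     # one frame every 10 ms
    n_mels=80,                              # 80 filterbanks
)

def extract_features(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: float tensor of shape (1, num_samples); no dithering is applied."""
    feats = torch.log(mel(waveform) + 1e-20)     # (1, 80, T) log filterbanks
    feats = feats.squeeze(0).transpose(0, 1)     # (T, 80)
    t = feats.shape[0] - feats.shape[0] % 3
    # "Feature splicing": concatenate every three consecutive frames, giving
    # 240-dim features at an effective frame rate of 30 ms.
    return feats[:t].reshape(t // 3, 3 * 80)
```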
Look at dev-clean-wav.json generated by run.sh. It looks like this:
[
  {
    "files": [
      {
        "channels": 1,
        "sample_rate": 16000.0,
        "bitrate": 16,
        "duration": 6.59,
        "num_samples": 105440,
        "encoding": "Signed Integer PCM",
        "silent": false,
        "fname": "dev-clean-wav/2277/149896/2277-149896-0000.wav",
        "speed": 1
      }
    ],
    "original_duration": 6.59,
    "original_num_samples": 105440,
    "transcript": "he was in a fevered state of mind owing to the blight his wife's action threatened to cast upon his entire future"
  },
  {
    "files": [
      {
        "channels": 1,
        "sample_rate": 16000.0,
        "bitrate": 16,
        "duration": 7.145,
        "num_samples": 114320,
        "encoding": "Signed Integer PCM",
        "silent": false,
        "fname": "dev-clean-wav/2277/149896/2277-149896-0001.wav",
        "speed": 1
      }
    ],
    "original_duration": 7.145,
    "original_num_samples": 114320,
    "transcript": "he would have to pay her the money which she would now regularly demand or there would be trouble it did not matter what he did"
  },
  ...
]
The data is loaded into memory. Then all samples with a duration above 15 seconds are filtered out. The first remaining object in the array is assigned query id 0, the second query id 1, and so on. The unfiltered file is provided in the same directory as this README, in case you do not want to regenerate it.
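A minimal sketch of that flow, using hypothetical variable names rather than the reference implementation's actual identifiers:

```python
# Hypothetical sketch of manifest loading, filtering, and query id assignment.
import json

MAX_DURATION_S = 15.0

with open("dev-clean-wav.json") as f:
    manifest = json.load(f)

# Drop utterances longer than 15 seconds, then number the survivors in order:
# the first remaining entry gets query id 0, the next gets 1, and so on.
samples = [entry for entry in manifest if entry["original_duration"] <= MAX_DURATION_S]
query_id_to_sample = dict(enumerate(samples))
```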
This is a variant of the model described in sections 3.1 and 6.2 of:
@article{he2018streaming,
  title={STREAMING END-TO-END SPEECH RECOGNITION FOR MOBILE DEVICES},
  author={Yanzhang He and Tara N. Sainath and Rohit Prabhavalkar and Ian McGraw and Raziel Alvarez and Ding Zhao and David Rybach and Anjuli Kannan and Yonghui Wu and Ruoming Pang and Qiao Liang and Deepti Bhatia and Yuan Shangguan and Bo Li and Golan Pundak and Khe Chai Sim and Tom Bagby and Shuo-yiin Chang and Kanishka Rao and Alexander Gruenstein},
  journal={arXiv preprint arXiv:1811.06621},
  year={2018}
}
The differences are as follows:
- The model has 45.3 million parameters, rather than 120 million parameters.
- The LSTMs are not followed by projection layers.
- No layer normalization is used.
- Hidden dimensions are smaller.
- The prediction network is made of two LSTMs, rather than seven.
- The labels are characters, rather than word pieces.
- No quantization is done at this time for inference.
- A greedy decoder is used, rather than a beam search decoder. This greatly reduces inference complexity; a sketch of the greedy decoding loop follows this list.
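To make the last point concrete, here is a hedged sketch of greedy RNN-T decoding; the `predict` and `joint` callables, the blank index, and the per-frame symbol cap are assumptions, not the reference implementation's API.

```python
# Hedged sketch of greedy RNN-T (transducer) decoding, not the reference code.
import torch

BLANK = 28          # assumed index of the blank symbol in the character vocabulary
MAX_SYMBOLS = 30    # assumed cap on symbols emitted per encoder frame

def greedy_decode(encoder_out: torch.Tensor, predict, joint) -> list:
    """encoder_out: (T, H) encoder activations for a single utterance."""
    hyp = []
    g, state = predict(None, None)                # prediction net on the start token
    for t in range(encoder_out.shape[0]):
        emitted = 0
        while emitted < MAX_SYMBOLS:
            logits = joint(encoder_out[t], g)     # combine acoustic and label context
            k = int(logits.argmax())
            if k == BLANK:
                break                             # blank: advance to the next frame
            hyp.append(k)                         # non-blank: emit a character...
            g, state = predict(k, state)          # ...and update the prediction net
            emitted += 1
    return hyp
```

Unlike beam search, this keeps a single hypothesis and needs only one joint-network evaluation per emitted symbol, which is where the reduction in decoding complexity comes from.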
7.452253714852645% Word Error Rate (WER) across all words in the output text of all samples less than 15 seconds in length in the dev-clean set, using a greedy decoder and a full-precision (FP32) model.
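For reference, WER here is the word-level edit distance (substitutions, insertions, and deletions) summed over all utterances, divided by the total number of reference words. A small self-contained sketch, not the reference scoring code:

```python
# Sketch of WER computation: Levenshtein distance over words, pooled across utterances.
def word_errors(ref: str, hyp: str):
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))                      # DP row for the empty reference
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = min(d[j] + 1,                      # deletion
                      d[j - 1] + 1,                  # insertion
                      prev + (r[i - 1] != h[j - 1])) # substitution (or match)
            prev, d[j] = d[j], cur
    return d[len(h)], len(r)

def wer(refs, hyps):
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        e, n = word_errors(ref, hyp)
        errors, words = errors + e, words + n
    return 100.0 * errors / words                    # reported as a percentage
```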