End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures (arXiv:1911.08460)

In the paper we consider:

  • different architectures for acoustic modeling:
    • ResNet
    • TDS
    • Transformer
  • different criteria:
    • Seq2Seq
    • CTC
  • different settings:
    • supervised LibriSpeech 1k hours
    • supervised LibriSpeech 1k hours + unsupervised LibriVox 57k hours (for LibriVox we generate pseudo-labels and use them as targets)
  • and different language models:
    • word-piece (ngram, ConvLM)
    • word-based (ngram, ConvLM, transformer)

Data preparation

Prepare the data and auxiliary files (lexicon, token set, etc.). Set the necessary paths instead of [...]: --data_dst is the path where the data will be stored, --model_dst is the path where the auxiliary files will be stored.

pip install sentencepiece==0.1.82
python3 ../../utilities/prepare_librispeech_wp_and_official_lexicon.py --data_dst [...] --model_dst [...] --nbest 10 --wp 10000
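
The rest of this recipe refers to the two destinations via environment variables (a convention assumed here; set them to any writable directories):

# placeholders: set to your own locations
export DATA_DST=[...]    # where the LibriSpeech data is stored
export MODEL_DST=[...]   # where auxiliary files (tokens, lexicons, LMs, models) are stored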

Besides the data itself, auxiliary files for acoustic and language model training/evaluation will be generated:

cd $MODEL_DST
tree -L 2
.
├── am
│   ├── librispeech-train-all-unigram-10000.model
│   ├── librispeech-train-all-unigram-10000.tokens
│   ├── librispeech-train-all-unigram-10000.vocab
│   ├── librispeech-train+dev-unigram-10000-nbest10.lexicon
│   ├── librispeech-train-unigram-10000-nbest10.lexicon
│   └── train.txt
└── decoder
    ├── 4-gram.arpa
    ├── 4-gram.arpa.lower
    └── decoder-unigram-10000-nbest10.lexicon
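
Once generated, the word-piece model can be sanity-checked with the sentencepiece package installed above; a minimal sketch (the sample sentence is arbitrary):

cd $MODEL_DST/am
python3 -c "
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load('librispeech-train-all-unigram-10000.model')
# prints the word-piece segmentation of the sample sentence
print(sp.EncodeAsPieces('the cat sat on the mat'))
"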

Instructions to reproduce training and decoding

  • Detailed language model recipes can be found in the lm directory.
  • To reproduce acoustic model training on LibriSpeech (1k hours), go to the librispeech directory (training is launched as sketched after this list).
  • For models trained on LibriSpeech 1k hours plus unsupervised LibriVox data (with generated pseudo-labels), we currently release the models themselves, the arch files, and the train configs (full details are coming soon); check the librivox directory.
  • Rescoring steps are also coming soon (with a Transformer language model for rescoring).
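
Across the recipes, training is launched with the Train binary and a flagsfile, mirroring the decoding command below; a sketch assuming the standard wav2letter build layout (each recipe directory provides its own train config):

[...]/wav2letter/build/Train train --flagsfile path/to/necessary/train/config --minloglevel=0 --logtostderr=1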

Beam-search decoding

  • Fix the paths inside decode*.cfg.
  • Run decoding with decode*.cfg:
[...]/wav2letter/build/Decoder --flagsfile path/to/necessary/decode/config --minloglevel=0 --logtostderr=1
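
A decode*.cfg flagsfile is a plain list of --flag=value lines. The sketch below is an assumption for illustration only: the flag names follow the wav2letter decoder conventions of this release, but the authoritative flag set and the tuned hyperparameter values are those in the released decode*.cfg files.

# hypothetical decode flagsfile sketch; every [...] is a placeholder
# pre-trained acoustic model
--am=[...]
# token set and lexicon generated during data preparation
--tokensdir=[...]/am
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=[...]/decoder/decoder-unigram-10000-nbest10.lexicon
# word-level n-gram LM decoding
--lmtype=kenlm
--lm=[...]/decoder/4-gram.arpa
--decodertype=wrd
# beam and LM weights: take the tuned values from the released decode*.cfg
--lmweight=[...]
--wordscore=[...]
--beamsize=[...]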

Tokens and Lexicon sets

Lexicon | Tokens | Beam-search lexicon

The tokens and lexicon files generated in $MODEL_DST/am/ and $MODEL_DST/decoder/ are the same as those in the table.

Pre-trained acoustic models

Below is information about the pre-trained acoustic models, which can be used, for example, to reproduce the decoding step.

Dataset | Acoustic model (dev-clean) | Acoustic model (dev-other) | Architecture
LibriSpeech | ResNet CTC | ResNet CTC | Archfile
LibriSpeech + LibriVox | ResNet CTC | ResNet CTC | Archfile
LibriSpeech | TDS CTC | TDS CTC | Archfile
LibriSpeech + LibriVox | TDS CTC | TDS CTC | Archfile
LibriSpeech | Transformer CTC | Transformer CTC | Archfile
LibriSpeech + LibriVox | - | Transformer CTC | Archfile
LibriSpeech | TDS Seq2Seq | TDS Seq2Seq | Archfile
LibriSpeech + LibriVox | TDS Seq2Seq | TDS Seq2Seq | Archfile
LibriSpeech | Transformer Seq2Seq | Transformer Seq2Seq | Archfile
LibriSpeech + LibriVox | - | Transformer Seq2Seq | Archfile

The architecture files here are the same as the corresponding *.arch files.

Pre-trained language models

LM type | Language model | Vocabulary | Architecture | LM (fairseq) | Dict (fairseq)
ngram | word 4-gram | - | - | - | -
ngram | wp 6-gram | - | - | - | -
GCNN | word GCNN | vocabulary | Archfile | fairseq | fairseq dict
GCNN | wp GCNN | vocabulary | Archfile | fairseq | fairseq dict
Transformer | - | - | - | fairseq | fairseq dict

To reproduce the decoding step from the paper, download these models into $MODEL_DST/am/ and $MODEL_DST/decoder/ as appropriate.
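
For example (a sketch; substitute the actual download links from the tables above, which are not reproduced here):

wget -P $MODEL_DST/am/ [acoustic model link from the table]
wget -P $MODEL_DST/decoder/ [language model link from the table]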

Results

Data | Model | dev-clean WER % | test-clean WER % | dev-other WER % | test-other WER % | LM
LibriSpeech | CTC ResNet | 3.93 | 4.08 | 10.13 | 10.03 | -
LibriSpeech | CTC ResNet | 3.29 | 3.68 | 8.56 | 8.69 | word 4-gram
LibriSpeech | CTC ResNet | 3.00 | 3.29 | 7.50 | 7.53 | word GCNN
LibriSpeech + LibriVox | CTC ResNet | 3.08 | 3.37 | 7.80 | 8.19 | -
LibriSpeech + LibriVox | CTC ResNet | 2.89 | 3.27 | 6.97 | 7.52 | word 4-gram
LibriSpeech | CTC TDS | 4.22 | 4.63 | 11.16 | 11.16 | -
LibriSpeech | CTC TDS | 3.49 | 3.98 | 9.18 | 9.53 | word 4-gram
LibriSpeech | CTC TDS | 2.92 | 3.40 | 7.52 | 8.05 | word GCNN
LibriSpeech + LibriVox | CTC TDS | 3.01 | 3.37 | 7.92 | 8.23 | -
LibriSpeech + LibriVox | CTC TDS | 2.87 | 3.38 | 7.22 | 7.63 | word 4-gram
LibriSpeech | CTC Transformer | 2.99 | 3.09 | 7.31 | 7.40 | -
LibriSpeech | CTC Transformer | 2.63 | 2.86 | 6.20 | 6.72 | word 4-gram
LibriSpeech | CTC Transformer | 2.35 | 2.57 | 5.29 | 5.85 | word GCNN
LibriSpeech + LibriVox | CTC Transformer | - | - | 6.10 | 6.51 | -
LibriSpeech + LibriVox | CTC Transformer | - | - | 5.69 | 6.18 | word 4-gram
LibriSpeech | Seq2Seq TDS | 3.20 | 3.43 | 8.20 | 8.30 | -
LibriSpeech | Seq2Seq TDS | 2.76 | 3.18 | 7.01 | 7.16 | wp 6-gram
LibriSpeech | Seq2Seq TDS | 2.54 | 2.93 | 6.30 | 6.43 | wp GCNN
LibriSpeech + LibriVox | Seq2Seq TDS | 2.00 | 2.36 | 4.90 | 5.27 | -
LibriSpeech + LibriVox | Seq2Seq TDS | 1.95 | 2.33 | 4.55 | 5.16 | wp 6-gram
LibriSpeech + LibriVox | Seq2Seq TDS | 1.87 | 2.20 | 4.17 | 4.59 | wp GCNN
LibriSpeech | Seq2Seq Transformer | 2.54 | 2.89 | 6.67 | 6.98 | -
LibriSpeech | Seq2Seq Transformer | 2.29 | 2.72 | 5.81 | 6.23 | wp 6-gram
LibriSpeech | Seq2Seq Transformer | 2.12 | 2.40 | 5.20 | 5.70 | wp GCNN
LibriSpeech + LibriVox | Seq2Seq Transformer | - | - | 4.83 | 5.20 | -
LibriSpeech + LibriVox | Seq2Seq Transformer | - | - | 4.45 | 4.97 | wp 6-gram
LibriSpeech + LibriVox | Seq2Seq Transformer | - | - | 3.92 | 4.55 | wp GCNN

Rescoring is coming soon.

Citation

@article{synnaeve2019end,
  title={End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures},
  author={Synnaeve, Gabriel and Xu, Qiantong and Kahn, Jacob and Grave, Edouard and Likhomanenko, Tatiana and Pratap, Vineel and Sriram, Anuroop and Liptchinsky, Vitaliy and Collobert, Ronan},
  journal={arXiv preprint arXiv:1911.08460},
  year={2019}
}