About • Installation • How To Use • Credits • License
This repository is a system for training the DeepSpeech2 model for an ASR (automatic speech recognition) task.
Follow these steps to install the project:
- (Optional) Create and activate a new environment using `conda` or `venv` (+ `pyenv`).

  a. `conda` version:

  ```bash
  # create env
  conda create -n project_env python=PYTHON_VERSION

  # activate env
  conda activate project_env
  ```

  b. `venv` (+ `pyenv`) version:

  ```bash
  # create env
  ~/.pyenv/versions/PYTHON_VERSION/bin/python3 -m venv project_env

  # alternatively, using default python version
  python3 -m venv project_env

  # activate env
  source project_env/bin/activate
  ```
- Install all required packages:

  ```bash
  pip install -r requirements.txt
  ```
- Install `pre-commit`:

  ```bash
  pre-commit install
  ```
To train a model, run the following command:

```bash
python3 train.py -cn=deepspeech2
```
The model trains for 50 epochs on all of the LibriSpeech datasets.
To run inference (evaluate the model or save predictions):
Download the model:

```bash
python3 download_model.py
```
To get predictions on the test-clean dataset:

```bash
python3 inference.py -cn=inference
```
To get predictions on the test-other dataset:

```bash
python3 inference.py -cn=inference_other
```
To calculate CER/WER:

```bash
python3 calc_wer_cer.py --dir_path dir
```

where `dir` is the path to your predictions directory (for example, `/ASR/data/saved/predict/test`).
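For reference, WER and CER are edit distances normalized by the reference length at the word and character level. The script's internals are not shown in this README, so here is a minimal illustrative sketch using the `editdistance` package (an assumption; `calc_wer_cer.py` may compute these differently):

```python
# Illustrative WER/CER computation; not necessarily how calc_wer_cer.py does it.
import editdistance  # pip install editdistance

def wer(reference: str, hypothesis: str) -> float:
    # word-level Levenshtein distance, normalized by reference word count
    ref_words = reference.split()
    return editdistance.eval(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference: str, hypothesis: str) -> float:
    # character-level Levenshtein distance, normalized by reference length
    return editdistance.eval(reference, hypothesis) / max(len(reference), 1)

print(wer("the cat sat", "the cat sad"))  # 1 wrong word out of 3  -> 0.333...
print(cer("the cat sat", "the cat sad"))  # 1 wrong char out of 11 -> 0.0909...
```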
All graphs from the experiments that led to my solution can be found here (there are also separate conclusions for each of the augmentations).
I will describe my course of action in the same order in which the graphs are arranged. First of all, I built a baseline and ran a one-batch test (which improves later), changed the max learning rate to 1e-3, added log-scaling to the spectrograms (which at least makes them easier to read), and implemented a self-written beam search. In the corresponding graph you can see how it works: it goes through all possible options. As proof that my beam search works correctly, I have displayed its output in every training run.
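The repository's beam-search code is not reproduced in this README, so here is a minimal sketch of CTC prefix beam search in the same spirit (the blank index of 0 and `beam_size=10` are illustrative assumptions, not this repo's exact implementation):

```python
import math
from collections import defaultdict

NEG_INF = -math.inf

def log_add(*xs):
    """Numerically stable log(sum(exp(x)))."""
    xs = [x for x in xs if x > NEG_INF]
    if not xs:
        return NEG_INF
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_beam_search(log_probs, alphabet, blank=0, beam_size=10):
    """log_probs: T rows of per-frame log-probabilities over len(alphabet) symbols."""
    # prefix -> (log P(prefix ends in blank), log P(prefix ends in a symbol))
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        new_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for s, lp in enumerate(frame):
                if s == blank:
                    b, nb = new_beams[prefix]
                    new_beams[prefix] = (log_add(b, p_b + lp, p_nb + lp), nb)
                    continue
                extended = prefix + (s,)
                b, nb = new_beams[extended]
                if prefix and prefix[-1] == s:
                    # a repeated symbol only extends the prefix through a blank
                    new_beams[extended] = (b, log_add(nb, p_b + lp))
                    b2, nb2 = new_beams[prefix]
                    new_beams[prefix] = (b2, log_add(nb2, p_nb + lp))
                else:
                    new_beams[extended] = (b, log_add(nb, p_b + lp, p_nb + lp))
        # prune: keep only the beam_size most probable prefixes
        beams = dict(sorted(new_beams.items(),
                            key=lambda kv: log_add(*kv[1]), reverse=True)[:beam_size])
    best = max(beams, key=lambda p: log_add(*beams[p]))
    return "".join(alphabet[s] for s in best)
```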
You can see that all of these graphs show a strange loss and bad metrics: the mistake was that I calculated the lengths of the output probability sequences incorrectly.
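The fix boils down to applying the standard convolution length formula to the input lengths before passing them to the CTC loss. A minimal sketch, assuming a typical DeepSpeech2-style conv front-end (the kernel/stride/padding values are assumptions, not read from this repository's config):

```python
# How the time axis shrinks through a conv front-end; layer shapes are assumed,
# not taken from this repo's config.
def conv_output_length(length: int, kernel_size: int, stride: int, padding: int) -> int:
    # standard Conv length formula (dilation = 1)
    return (length + 2 * padding - kernel_size) // stride + 1

def transform_input_lengths(spectrogram_lengths):
    # e.g. two conv layers over time, the first with stride 2
    lengths = [conv_output_length(l, kernel_size=11, stride=2, padding=5)
               for l in spectrogram_lengths]
    lengths = [conv_output_length(l, kernel_size=11, stride=1, padding=5)
               for l in lengths]
    return lengths  # these must be passed to nn.CTCLoss as input_lengths

print(transform_input_lengths([100, 250]))  # -> [50, 125]
```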
At this point, my training setup had the following hyperparameters (a scheduler sketch follows the list):

- start lr: 1e-4
- max lr: 1e-3
- num epochs: 50 (200 iterations)
- batch size: 10
- train dataset: clean-100
- beam size: 10
- model parameters: 28,086,844
I added 4 augmentations: LowPassFilter, HighPassFilter, Color Noise, and BandPassFilter; each one is triggered with probability about 1/4 (see the sketch below). The result on clean data turned out slightly worse than without them, but I accepted that trade-off: the model would then handle "other" data a little better, and there would be less overfitting in the future.
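A minimal sketch of such a pipeline, assuming the `torch_audiomentations` package (the repository's actual augmentation wrappers may differ); each transform fires independently with probability 0.25, as described:

```python
import torch
from torch_audiomentations import (
    AddColoredNoise, BandPassFilter, Compose, HighPassFilter, LowPassFilter,
)

# each transform is applied independently with probability ~1/4
augment = Compose(transforms=[
    LowPassFilter(p=0.25),
    HighPassFilter(p=0.25),
    AddColoredNoise(p=0.25),  # the "Color Noise" augmentation from the list
    BandPassFilter(p=0.25),
])

waveform = torch.randn(1, 1, 16000)  # (batch, channels, time): 1 s at 16 kHz
augmented = augment(waveform, sample_rate=16000)
```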
My next and final step was to expand the amount of data (using all three datasets) and to increase the batch size to 64.
This repository is based on a PyTorch Project Template.