The code is based on the paper LipNet: End-to-End Sentence-level Lipreading (Assael et al., 2016). LipNet uses 3D convolutions followed by recurrent units to make sentence-level predictions by extracting features from the lip movement in the input frames. This implementation provides a 3DConv-Bi-LSTM model in place of the paper's 3DConv-GRU, along with a few other models of varying complexity. CTC loss is used to handle the variable-length alignments (spoken sentences). The model weights are initialized with the same He initialization proposed in the paper.
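As a reference for the architecture described above, here is a minimal sketch of the 3DConv + Bi-LSTM + CTC pipeline, assuming a PyTorch implementation. The layer sizes, the adaptive pooling step, and the dummy shapes in the usage lines are illustrative assumptions, not the exact configuration used in this repository.

import torch
import torch.nn as nn

class LipNetLSTM(nn.Module):
    """Illustrative 3D-conv frontend + bidirectional LSTM + per-frame character logits."""
    def __init__(self, vocab_size=27, hidden_size=256):
        super().__init__()
        # Spatiotemporal feature extractor: stacked 3D convolutions over (T, H, W).
        # Adaptive pooling fixes the spatial size so the sketch runs for any crop size.
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(64, 96, kernel_size=(3, 3, 3), padding=(1, 1, 1)), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 2)),
        )
        # Two bidirectional LSTM layers over the time dimension.
        self.rnn = nn.LSTM(input_size=96 * 4 * 2, hidden_size=hidden_size,
                           num_layers=2, bidirectional=True, batch_first=True)
        # Per-timestep character logits (+1 for the CTC blank symbol).
        self.classifier = nn.Linear(2 * hidden_size, vocab_size + 1)
        # He (Kaiming) initialization for the convolutional layers.
        for m in self.modules():
            if isinstance(m, nn.Conv3d):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")

    def forward(self, x):                      # x: (B, 3, T, H, W)
        feats = self.frontend(x)               # (B, 96, T, 4, 2)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.rnn(feats)               # (B, T, 2 * hidden_size)
        logits = self.classifier(out)          # (B, T, vocab_size + 1)
        return logits.permute(1, 0, 2).log_softmax(dim=-1)   # (T, B, classes) for CTC

# Usage with dummy data: 2 clips of 75 frames of 64x128 mouth crops (shapes are made up).
model = LipNetLSTM(vocab_size=27, hidden_size=256)
log_probs = model(torch.randn(2, 3, 75, 64, 128))           # (75, 2, 28)
targets = torch.randint(1, 28, (2, 30))                     # dummy character labels
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((2,), 75), torch.full((2,), 30))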
A virtual environment (venv) is suggested:
python -m venv lipenv
source lipenv/bin/activate
Install gdown to download the dataset from Google Drive:
pip install gdown
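gdown can then fetch the dataset archive by its Google Drive file ID; the ID below is a placeholder, substitute the actual ID or share link for this dataset:

gdown "https://drive.google.com/uc?id=<FILE_ID>"

With the dataset in place, training is launched with a command along the lines of: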
python main.py --epoch 300 \
--lr 0.001 \
--hidden_size 256 \
--model lipnet-lstm \
--batch 16 \
--workers 4
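The flags above imply that main.py exposes a command-line interface; the following is a minimal argparse sketch of that assumed interface (flag names mirror the command above, but the actual parser in this repository may differ):

import argparse

def get_args():
    # Hypothetical parser mirroring the flags shown in the training command above.
    parser = argparse.ArgumentParser(description="Train a sentence-level lipreading model")
    parser.add_argument("--epoch", type=int, default=300, help="number of training epochs")
    parser.add_argument("--lr", type=float, default=0.001, help="learning rate")
    parser.add_argument("--hidden_size", type=int, default=256, help="hidden size of the recurrent layers")
    parser.add_argument("--model", type=str, default="lipnet-lstm", help="which model variant to train")
    parser.add_argument("--batch", type=int, default=16, help="batch size")
    parser.add_argument("--workers", type=int, default=4, help="number of DataLoader workers")
    return parser.parse_args()

if __name__ == "__main__":
    print(get_args())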
@article{assael2016lipnet,
title={LipNet: End-to-End Sentence-level Lipreading},
author={Assael, Yannis M. and Shillingford, Brendan and Whiteson, Shimon and de Freitas, Nando},
journal={arXiv preprint arXiv:1611.01599},
year={2016}
}