GitHub - Amg9794/Utterance-speech--to-Phoneme-Mapping: Developed a model for predicting phonemes in speech utterances with unaligned phoneme labels. • Used beam search for decoding the prediction and evaluated performance using character-level string edit distance.

Run:

Please run python hw3.py --train --bi. The rest hyperparameters are set to default in my script.

The best performance is achieved from the following architecture:

A wide variant of ResNet: the first convolutional layer is with stride 1 and kernel size 3x3, the output dimension is 512; I used 3 sets of layers, expanding the channel dimension from 512 to 2048.
A 1D bidirectional rnn operating on the time dimension. The hidden size is 1024, with a dropout probability of 0.5.
A final dense layer.

I used CTCLoss and the Adam Optimizer.

I used ctcdecoder (beam search).

The learning rate is 2e-3, the weight decay is 5e-5; batch size is 32, I also used a scheduler to cut the lr in half every 5 epochs.

I tested different model architectures, optimizers, and various hyperparameters. The models I tested can be found in models.py

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.gitattributes		.gitattributes
README.md		README.md
base.py		base.py
check_data.py		check_data.py
hw3.py		hw3.py
models.py		models.py
phoneme_list.py		phoneme_list.py