
Developed a model for predicting phonemes in speech utterances with unaligned phoneme labels. Used beam search to decode the predictions and evaluated performance using character-level string edit distance.
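The evaluation metric mentioned above, string edit distance, can be illustrated with a minimal sketch. This is the standard Levenshtein dynamic program written in plain Python; the function name `edit_distance` is hypothetical and is not taken from the repository, and the actual evaluation code may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] is the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty prefix of hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Because the function only compares items for equality, it works the same way on a string of phoneme characters or on a list of phoneme labels.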


Amg9794/Utterance-speech--to-Phoneme-Mapping


Run:

Run python hw3.py --train --bi. All other hyperparameters default to the values set in the script.

Architecture:

The best performance was achieved with the following architecture:

  • A wide variant of ResNet: the first convolutional layer uses stride 1 and a 3x3 kernel, with an output dimension of 512. I used 3 sets of layers, expanding the channel dimension from 512 to 2048.

  • A 1D bidirectional RNN operating on the time dimension, with a hidden size of 1024 and a dropout probability of 0.5.

  • A final dense layer.

Loss Function and Optimizer:

I used CTCLoss and the Adam optimizer.
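To show what a CTC loss measures, here is a minimal plain-Python sketch of the CTC forward algorithm. It sums the probability of every frame-level alignment that collapses to the target and returns the negative log of that sum. It is an illustration only, not the CTCLoss implementation used in training: it works on raw probabilities rather than log-probabilities, handles a single utterance, and assumes blank index 0. The name `ctc_neg_log_likelihood` is hypothetical.

```python
import math

def ctc_neg_log_likelihood(probs, target, blank=0):
    """probs: list of T frames, each a list of per-symbol probabilities.
    target: list of symbol ids without blanks."""
    # Extended target with blanks interleaved: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for s in target:
        ext += [s, blank]
    size = len(ext)

    # alpha[i] = total probability of all partial alignments ending at ext[i]
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for frame in probs[1:]:
        new = [0.0] * size
        for i in range(size):
            total = alpha[i]                 # stay on the same position
            if i >= 1:
                total += alpha[i - 1]        # advance by one position
            # Skip over a blank only when the symbols on both sides differ.
            if i >= 2 and ext[i] != blank and ext[i] != ext[i - 2]:
                total += alpha[i - 2]
            new[i] = total * frame[ext[i]]
        alpha = new

    # Valid alignments end on the last symbol or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return math.inf if likelihood == 0.0 else -math.log(likelihood)
```

For example, with two frames that each assign 0.6 to blank and 0.4 to symbol 1, the target [1] has three valid alignments with total probability 0.64, so the loss is -log(0.64).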

Decoder

I used ctcdecoder (beam search).
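For readers unfamiliar with CTC beam search, the idea can be sketched in plain Python as a prefix beam search. This is not the ctcdecoder API and not the code used in this repository; the function name `ctc_beam_search`, the beam width, and the blank index 0 are assumptions made for illustration. It works on raw probabilities and uses no language model.

```python
from collections import defaultdict

def ctc_beam_search(probs, beam_width=3, blank=0):
    """probs: list of T frames, each a list of per-symbol probabilities.
    Returns (best label sequence, its total probability)."""
    # Each prefix maps to (prob of alignments ending in blank,
    #                      prob of alignments ending in a non-blank).
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for s, p in enumerate(frame):
                if s == blank:
                    # A blank leaves the prefix unchanged.
                    next_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and s == prefix[-1]:
                    # Repeat without a blank in between collapses into the prefix.
                    next_beams[prefix][1] += p_nb * p
                    # Repeat after a blank starts a new copy of the symbol.
                    next_beams[prefix + (s,)][1] += p_b * p
                else:
                    next_beams[prefix + (s,)][1] += (p_b + p_nb) * p
        # Keep only the beam_width most probable prefixes.
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(v) for prefix, v in ranked[:beam_width]}
    best_prefix, best_probs = max(beams.items(), key=lambda kv: sum(kv[1]))
    return list(best_prefix), sum(best_probs)
```

With two frames that each assign 0.6 to blank and 0.4 to symbol 1, picking the most likely symbol per frame yields the empty sequence (probability 0.36), while the search above returns [1], whose three alignments sum to 0.64. Summing over alignments in this way is the reason to prefer beam search over frame-wise argmax.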

Other Hyperparameters

The learning rate is 2e-3, the weight decay is 5e-5, and the batch size is 32. I also used a scheduler to cut the learning rate in half every 5 epochs.
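The halving schedule can be written out as a small function. This is a sketch that assumes epochs are counted from 0 and that the first halving happens at epoch 5; the function name `learning_rate` is hypothetical, and the script may use a built-in scheduler instead.

```python
def learning_rate(epoch, base_lr=2e-3, step_size=5, gamma=0.5):
    """Learning rate for a 0-indexed epoch under a step decay schedule."""
    # Multiply by gamma once for every completed block of step_size epochs.
    return base_lr * gamma ** (epoch // step_size)

# Print the rate at the start of each 5-epoch block.
for epoch in (0, 5, 10, 15):
    print(epoch, learning_rate(epoch))
```

In PyTorch, the same behaviour is available as torch.optim.lr_scheduler.StepLR with step_size=5 and gamma=0.5.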

Other efforts

I tested different model architectures, optimizers, and hyperparameters. The models I tested can be found in models.py.
