Alignments

Problem with alignments

  • Alignments that have no sil at the beginning, even though there is silence at the beginning of the waveform
    • Sometimes the context-dependent phones eat up the adjacent silence. One way to discourage this is to use the --boost-silence option, e.g. 1.25 (see the sketch below this item). And make sure you aren't using too many context-dependent phones (num-leaves too large) compared to your data.
    • The WER of an sMBR model can be fine even though its alignments are not optimal. Models that get the best WER do not necessarily produce the most precise alignments; that's true especially for discriminatively trained models and RNNs.
    • --boost-silence 1.5 helps to encourage the silence model to eat up more of the data, to avoid silence getting modeled inappropriately by context-dependent phones.
    • Alignments are important; use a strong GMM- or DNN-based system to generate them.
    • Alignments derived from clean training data also help.
    • Unless you have some human reference in mind, there is no way to evaluate alignment accuracy.
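A minimal sketch of boosting silence during GMM training and alignment; the paths, --nj value, and leaf/Gaussian counts are illustrative, not from these notes:

```bash
# Boost silence-model likelihoods so silence is not absorbed by
# context-dependent phones (paths and sizes are placeholders).
steps/train_mono.sh --boost-silence 1.25 --nj 10 \
  data/train data/lang exp/mono
steps/align_si.sh --boost-silence 1.25 --nj 10 \
  data/train data/lang exp/mono exp/mono_ali
# Keep num-leaves modest relative to the amount of data:
steps/train_deltas.sh --boost-silence 1.25 2000 10000 \
  data/train data/lang exp/mono_ali exp/tri1
```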
  • DTW vs. HMM-GMM Forced Alignment
    • DTW does not perform forced alignment in the HMM sense: it will not tell you the HMM state corresponding to each frame, since there is no HMM involved in that algorithm.
    • aeneas is one such DTW-based synchronization tool (see the sketch below).
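A hedged example of DTW-based synchronization with the aeneas command-line tool; the filenames and language code are placeholders:

```bash
# aeneas aligns audio to text with DTW: it outputs fragment-level time
# intervals, not per-frame HMM states (filenames are placeholders).
pip install aeneas
python -m aeneas.tools.execute_task \
  audio.mp3 transcript.txt \
  "task_language=eng|is_text_type=plain|os_task_file_format=json" \
  syncmap.json
```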
  • phoneme to grapheme alignment
  • Disable optional silence
    • Possibly during training you did not put silence in your transcripts in all the places where silence actually appears (including at the beginning and end of utterances). This would force the non-silence phones to learn to model silence as well. It might be better to train the system using optional silence, then prepare a different lang directory where you disable optional silence, and use that one to do the final alignment (see the sketch below). Just make sure phones.txt is the same.
    • Sometimes training with e.g. --boost-silence 1.25 can help to avoid the non-silence phones modeling silence. And make sure you're extracting the alignments correctly (this should be OK if you're relying on the phoneme boundaries from the alignments, but might not be if you're relying on word labels in lattices and forgetting to run lattice-align-words or lattice-align-words-lexicon).
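A sketch of the approach above, assuming a standard Kaldi recipe layout (data/local/dict, exp/tri3b, and the other paths are examples):

```bash
# Build a parallel lang directory with optional silence disabled
# (--sil-prob 0.0 leaves no optional-silence arcs in L.fst).
utils/prepare_lang.sh --sil-prob 0.0 \
  data/local/dict "<UNK>" data/local/lang_tmp_nosil data/lang_nosil
# phones.txt must match the lang directory used for training:
diff data/lang/phones.txt data/lang_nosil/phones.txt
# Do the final alignment pass with the no-optional-silence lang directory:
steps/align_fmllr.sh --nj 10 data/train data/lang_nosil \
  exp/tri3b exp/tri3b_ali_nosil
```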
  • Align transcription with a pre-trained model
    • Chain models are not recommended for alignment, and in general you do not see much difference between standard frame-rate DNN models and SAT GMM models for the task of forced alignment. Kaldi's standard training framework is to train a GMM SAT system and to align with it in order to train a subsequent DNN model (chain or not); see the sketch below.
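A minimal sketch of that standard flow: align with a trained SAT GMM, then read phone-level boundaries from the alignments. Model and data paths are examples:

```bash
# fMLLR (SAT) alignment with a trained GMM system:
steps/align_fmllr.sh --nj 10 \
  data/train data/lang exp/tri3b exp/tri3b_ali
# Convert one alignment archive to CTM-style phone start/duration entries:
ali-to-phones --ctm-output exp/tri3b/final.mdl \
  "ark:gunzip -c exp/tri3b_ali/ali.1.gz |" tri3b_ali.1.ctm
```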
  • Data Alignment
    • Indeed, it appears chain models are not the best choice for alignments.
  • nnet3 alignment issues
    • It's about the fact that chain models are not good for alignment, since the objective function they are trained with does not force them to produce good alignments, and (b)LSTMs do not always produce good alignments either. Regardless, for forced alignment I would generally recommend using a GMM-based model: it is much faster, and alignment quality is not very sensitive to how well a model performs in decoding.
  • WARNING: optional-silence SIL is seen only 69.4736842105% of the time at utterance end. This may not be optimal
  • context dependency
    • Things like context dependency could in theory make the alignments different from what a human would say.
  • GMM-HMM
    • GMM-HMM has its own limitations, and 1k hours of data won't help much.
  • Lattices for MMI training (see the sketch below)
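The notes don't elaborate on this item; a minimal sketch of generating denominator lattices and running MMI training in Kaldi, with example paths, assuming an existing SAT system and alignments:

```bash
# Generate denominator lattices for discriminative (MMI) training
# (exp/tri3b and the alignment directory are example paths).
steps/make_denlats.sh --nj 10 --sub-split 10 \
  data/train data/lang exp/tri3b exp/tri3b_denlats
# MMI training on top of the SAT system:
steps/train_mmi.sh data/train data/lang \
  exp/tri3b_ali exp/tri3b_denlats exp/tri3b_mmi
```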