TsimaneForcedAligner

A forced aligner for the Tsimane language. The repository also contains other Tsimane resources, such as a phonemizer and a phonetic dictionary, which can be used for purposes other than alignment.

Working environment

Clone this GitHub repository:

git clone https://github.com/yaya-sy/TsimaneForcedAligner.git

and move to it:

cd TsimaneForcedAligner

If you want to download the bible corpus, create the conda environment:

conda env create -f environment.yml

and activate it:

conda activate tsimane-scraper

Aligning the bible corpus

We release the file data/timemarks.txt containing audio timemarks for each verse of the bible corpus. It's a tab-separated file:

filename    verse_line_id   onset   offset

Lines with onset = offset = 0.0 are unaligned verses; you can ignore them.
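
As an illustration of the format described above, here is a minimal sketch of a reader that keeps only the aligned verses. The names `Timemark` and `read_aligned_timemarks` are hypothetical and not part of this repository. The sketch assumes four tab-separated fields per line, keeps `verse_line_id` as a string because its exact type is not specified here, and silently skips any row whose onset or offset is not a number (for example a header row, if the file has one).

```python
import csv
from typing import Iterable, List, NamedTuple


class Timemark(NamedTuple):
    """One row of the timemarks file (hypothetical helper type)."""
    filename: str
    verse_line_id: str
    onset: float
    offset: float


def read_aligned_timemarks(lines: Iterable[str]) -> List[Timemark]:
    """Parse tab-separated timemark lines and drop unaligned verses."""
    aligned = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 4:
            continue  # blank or malformed line
        filename, verse_line_id, onset, offset = row
        try:
            mark = Timemark(filename, verse_line_id, float(onset), float(offset))
        except ValueError:
            continue  # non-numeric onset/offset, e.g. a header row
        if mark.onset == 0.0 and mark.offset == 0.0:
            continue  # unaligned verse
        aligned.append(mark)
    return aligned
```

The function accepts any iterable of lines, so it can be called with an open file handle, for example `read_aligned_timemarks(open("data/timemarks.txt", encoding="utf-8", newline=""))`.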

You can download the bible corpus with the script scripts/download_bible.py:

python scripts/download_bible.py --page live.bible.is/bible/CASNTM/MRK/1 --output-directory data

Note that the source code of the web page or the links may change, so this scraper may become obsolete.

Align your own corpus

To align a corpus you need:

  • a speech corpus: a folder containing your audio files and their corresponding texts (each pair must share the same filename).
  • an acoustic model: we release a pretrained acoustic model for aligning a new corpus. It was trained on the bible corpus and is located at models/all_non_merged_glottal.zip.
  • a phonetic dictionary: a vocabulary of the language that maps each word to its phonetic realization. A dictionary built from the Tsimane bible corpus is available at data/vocabularies/bible_vocabulary.dict. You can also phonemize your own vocabulary with the script scripts/phonemizer.py.
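
Because each audio file needs a text file with the same name, it can help to check the pairing before running the aligner. The following is a sketch only: `find_unpaired_files` is a hypothetical helper that is not shipped with this repository, and the extension sets are assumptions to adjust to your own corpus.

```python
from pathlib import Path
from typing import Dict, Set

# Assumed extensions; change these to match your corpus.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac"}
TEXT_EXTENSIONS = {".txt", ".lab"}


def find_unpaired_files(corpus_dir: str) -> Dict[str, Set[str]]:
    """Return the names (without extension) that lack an audio or text partner."""
    root = Path(corpus_dir)
    audio_names: Set[str] = set()
    text_names: Set[str] = set()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Key on the path relative to the corpus root, minus the extension,
        # so files in different subfolders are not confused with each other.
        name = path.relative_to(root).with_suffix("").as_posix()
        suffix = path.suffix.lower()
        if suffix in AUDIO_EXTENSIONS:
            audio_names.add(name)
        elif suffix in TEXT_EXTENSIONS:
            text_names.add(name)
    return {
        "audio_without_text": audio_names - text_names,
        "text_without_audio": text_names - audio_names,
    }
```

If both returned sets are empty, every audio file found has a matching text file and vice versa.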

To align your speech corpus, you will need to install the Montreal Forced Aligner.

After installation, you can align your corpus:

mfa align <your-speech-corpus> <your-phonetic-dictionary> models/tsimane_acoustic_model.zip <output-folder> --clean --overwrite --temp_directory aligners/wnh_tsimane --num_jobs 1
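
If you prefer to launch the alignment from Python, the command above can be assembled as an argument list. This is a sketch, not part of the repository: `build_mfa_align_command` is a hypothetical helper, the flags are copied verbatim from the command above, and whether they are accepted depends on the version of the Montreal Forced Aligner you installed.

```python
import shlex
from typing import List


def build_mfa_align_command(
    corpus_dir: str,
    dictionary_path: str,
    acoustic_model_path: str,
    output_dir: str,
    temp_dir: str = "aligners/wnh_tsimane",
    num_jobs: int = 1,
) -> List[str]:
    """Build the argument list for the `mfa align` call shown above."""
    return [
        "mfa", "align",
        str(corpus_dir),
        str(dictionary_path),
        str(acoustic_model_path),
        str(output_dir),
        "--clean",
        "--overwrite",
        "--temp_directory", str(temp_dir),
        "--num_jobs", str(num_jobs),
    ]


def format_command(command: List[str]) -> str:
    """Render the argument list as a shell-quoted string for logging."""
    return shlex.join(command)
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the aligner is installed. Passing a list rather than a single string avoids quoting problems when paths contain spaces.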