Implementation of the Biron Method for automatic prosodic segmentation of spontaneous speech.
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0250969
First of all, we would like to thank the annotators of the TaRSila project, who were tireless in reviewing the automatic transcriptions and in training and testing the models for various speech processing systems. This work was carried out at the Artificial Intelligence Center (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP grant nº 2019/07665-4) and IBM Corporation. We also acknowledge the support of the Center of Excellence in Artificial Intelligence (CEIA), funded by the Goiás State Foundation (FAPEG grant nº 201910267000527), the São Paulo University Support Foundation (FUSP), and the National Council for Scientific and Technological Development (CNPq-PQ scholarship, process 304961/2021-3). This project was also supported by the Ministry of Science, Technology and Innovation, with resources from Law nº 8,248 of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex, under the published call Residência em TIC 13, DOU 01245.010222/2022-44.
Giovana Meloni Craveiro, Vinícius Gonçalves Santos, Gabriel Jose Pellisser Dalalana, Flaviane R. Fernandes Svartman, and Sandra Maria Aluísio. 2024. Simple and Fast Automatic Prosodic Segmentation of Brazilian Portuguese Spontaneous Speech. In Proceedings of the 16th International Conference on Computational Processing of Portuguese (PROPOR 2024), Santiago de Compostela, Galicia. Association for Computational Linguistics. To appear.
- kaldi
- python 3.10.12
- chardet 5.2.0
- tgt 1.4.4
- ufpalign
- re (Python standard library)
- os (Python standard library)
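To quickly confirm that the third-party Python dependencies are importable, you can run a small check like the sketch below (it does not cover Kaldi or UFPAlign, which are external tools installed separately):

```python
# Quick check that the third-party Python dependencies above are importable
# (re and os ship with Python; Kaldi and UFPAlign are external tools).
import importlib

for module in ("chardet", "tgt"):
    try:
        importlib.import_module(module)
        print(f"{module}: OK")
    except ImportError:
        print(f"{module}: missing (try 'pip install {module}')")
```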
The following image illustrates the pipeline adopted in this work:
To use the prosodic segmentation code on a given audio segment, you must provide:
1 - a .txt file containing the transcription of the audio, in which every utterance is separated by speaker
2 - a .TextGrid file containing the timestamps (beginning and end) of each phone, in a tier called "fonemas", and of each word, in a tier called "palavras-grafemas". We suggest generating this file with UFPAlign (https://github.com/falabrasil/ufpalign/); a quick sanity check of its tiers is sketched below.
Batista, C., Dias, A.L. & Neto, N. Free resources for forced phonetic alignment in Brazilian Portuguese based on Kaldi toolkit. EURASIP J. Adv. Signal Process. 2022, 11 (2022). https://doi.org/10.1186/s13634-022-00844-9
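Once you have a TextGrid from the aligner, you can confirm it contains the two tiers the segmenter reads with the tgt package listed above. A minimal sketch (the file name is an example taken from the dataset described below):

```python
# Minimal sanity check of an aligner TextGrid using the tgt package.
# The file name is an example from the dataset described below.
import tgt

tg = tgt.io.read_textgrid("SP_D2_012_clipped_1.TextGrid")
for name in ("fonemas", "palavras-grafemas"):
    if name not in tg.get_tier_names():
        raise ValueError(f"missing tier: {name}")
print("tiers found:", tg.get_tier_names())
```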
To use UFPAlign (Linux only), go to its GitHub repository (https://github.com/falabrasil/ufpalign/) and follow the download instructions. To configure it successfully on my machine, I followed these steps:
0 - Install Kaldi and successfully run one of Kaldi's examples
1 - Clone the UFPAlign GitHub repository into "kaldi/egs"
2 - Download the file path.sh from this address: https://github.com/falabrasil/kaldi-br/blob/master/fb-ufpalign/path.sh
3 - Move it to kaldi/egs/ufpalign
4 - Modify the line that contains the path to the folder "ufpalign" so that it matches the path on your machine
5 - Go to the command line and run `source path.sh`
6 - Then the example command from UFPAlign's GitHub will work with a single modification, like this: `bash ufpalign.sh demo/ex.wav demo/ex.txt mono`
7 - Great! Now you can create a folder inside the "ufpalign" folder containing your .wav audio file (mono, 16 kHz) and its .txt transcription file, and run a command like `bash ufpalign.sh yourFolder/yourFile.wav yourFolder/yourFile.txt mono`
Every time you open a new terminal window, you should run `source path.sh` again before running your actual UFPAlign command.
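If you have many segments to align, the manual steps above can be scripted. Below is a hedged sketch; the Kaldi path and the folder layout are assumptions about your machine, and `path.sh` is sourced in each subshell, mirroring step 5 and the note above:

```python
# Hypothetical batch helper for UFPAlign. UFPALIGN_DIR and the folder
# layout are assumptions; adjust them to your machine.
import pathlib
import subprocess

UFPALIGN_DIR = pathlib.Path("~/kaldi/egs/ufpalign").expanduser()  # assumption

def align_folder(folder: str) -> None:
    """Run UFPAlign on every .wav/.txt pair inside UFPALIGN_DIR/folder."""
    for wav in sorted((UFPALIGN_DIR / folder).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if not txt.exists():
            print(f"skipping {wav.name}: no matching .txt")
            continue
        # path.sh must be sourced in each new shell, as noted above.
        cmd = (f"source path.sh && "
               f"bash ufpalign.sh {folder}/{wav.name} {folder}/{txt.name} mono")
        subprocess.run(["bash", "-c", cmd], cwd=UFPALIGN_DIR, check=True)

align_folder("yourFolder")  # hypothetical folder name from step 7
```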
In case UFPAlign fails to process your files, there are a few things you can try:
1- Make sure your audio file is 16 kHz and mono (see the check sketched after this list)
2- Make sure your transcription file does not contain double spaces or any punctuation; UFPAlign will indicate the offending words or characters at the command line
3- Use the parameter --no-bypass true
4- Use the parameters --beam 40 --retry-beam 100 (and gradually increase them, or try different values)
5- If the audio is too noisy or too long, UFPAlign may still have problems with it, so try cutting it into shorter pieces or enhancing the audio
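Items 1 and 2 can be checked programmatically before calling UFPAlign. A small sketch using only Python's standard library (the file names are placeholders):

```python
# Pre-flight checks for UFPAlign input (file names are placeholders).
import re
import wave

# 1 - the audio must be mono and sampled at 16 kHz
with wave.open("yourFile.wav", "rb") as w:
    assert w.getnchannels() == 1, "audio must be mono"
    assert w.getframerate() == 16000, "audio must be 16 kHz"

# 2 - the transcription must have no punctuation and no double spaces
text = open("yourFile.txt", encoding="utf-8").read()
text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation (keeps accented letters)
text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
with open("yourFile_clean.txt", "w", encoding="utf-8") as f:
    f.write(text)
```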
To run the prosodic segmenter on your local machine with your own data:
0- Download the file "segmentador_biron.py"
1- In the same folder where you place it, create a folder called "Data"; inside it, create a folder for each inquiry, named after the inquiry + "_segmentado". Each inquiry folder must contain a folder for each part, holding that part's .TextGrid with timestamps and its diarized .txt transcription file (a layout sketch follows step 3 below).
2- Inside the code, set the line `inq = "SP_DID_242"` to the name of your inquiry, then set the line `segments_quantity = 4` to the number of parts you have, and you should be good to go. There may be additional path errors when you run it on different inquiries, but the file names or paths that must be corrected will be indicated in the terminal.
3- Then run:
`python3 segmentador_biron.py`
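A hedged sketch of the layout check implied by steps 1 and 2 (the per-part folder naming is an assumption, since only the inquiry folder naming is fixed above; adapt it to your own folder names):

```python
# Checks the Data layout described in step 1. The per-part folder naming
# (inquiry name + "_" + part number) is an assumption; adapt as needed.
import os

inq = "SP_DID_242"        # your inquiry name (step 2)
segments_quantity = 4     # number of parts (step 2)

base = os.path.join("Data", inq + "_segmentado")
for n in range(1, segments_quantity + 1):
    part_dir = os.path.join(base, f"{inq}_{n}")  # assumed naming scheme
    files = os.listdir(part_dir)
    has_grid = any(f.endswith(".TextGrid") for f in files)
    has_txt = any(f.endswith(".txt") for f in files)
    print(part_dir, "OK" if has_grid and has_txt else "MISSING FILES")
```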
We are building a web application to host the prosodic segmentation and forced alignment functionalities. When it is ready, we will add the link here.
There are 5 inquiries available (SP_D2_012, SP_D2_255, SP_D2_360, SP_DID_242, SP_EF_156).
Each inquiry was divided into segments of around 10 minutes; thus, its folder contains a folder for each of its segments. The file names always start with the name of the inquiry and indicate the number of the segment, if applicable.
Each segment folder contains:
- the audio file (e.g. SP_D2_012_1.wav)
- the transcription file (e.g. SP_D2_012_1_clipped.txt)
- the .TextGrid file generated by UFPAlign (e.g. SP_D2_012_clipped_1.TextGrid)
- a readMe file containing the command used to generate the .TextGrid file with UFPAlign (it contains specific parameters)
- the .txt file that contains each utterance divided by speaker. Its name ends with "locutores" (e.g. SP_D2_012_clipped_1_locutores.txt)
- (possibly) a .txt file with a line for every word and its respective speaker. This is an auxiliary file that the prosodic segmentation code generates to help its processing, identified by the ending "locutores_palavras" (e.g. SP_D2_012_clipped_1_locutores_palavras.txt)
- (possibly) the output .TextGrid file in which the utterances are prosodically segmented. Its name ends with OUTPUT. (e.g. SP_D2_012_clipped_1_OUTPUT.TextGrid)
Outside of the segment folders, there are also files that reference all of the segments. These are:
- the manually segmented .TextGrid file used as reference (e.g. SP_D2_255.TextGrid)
- a .txt file containing the full transcription of the inquiry (e.g. SP_D2_012.txt)
- a .TextGrid file in which all of the partial .TextGrid files from the segments were united into a single file (e.g. SP_D2_012_concatenated.TextGrid); a concatenation sketch appears after the notes below
- a .txt file containing the utterances from all the segments by speaker (e.g. SP_D2_012_locutores.txt)
- a .txt file containing the utterances from all the segments by speaker, with each word on a new line (e.g. SP_D2_012_locutores_palavras.txt)
- the output .TextGrid file in which the utterances are prosodically segmented. It corresponds to the whole inquiry and its name ends with "OUTPUT" (e.g. SP_D2_255_OUTPUT.TextGrid)
- (if applicable) the output prosodically segmented file obtained using only the first heuristic (e.g. SP_D2_255_OUTPUT_ONLY_H1.TextGrid)
- (if applicable) the output prosodically segmented file obtained using only the first and second heuristics (e.g. SP_D2_255_OUTPUT_ONLY_H1_H2.TextGrid)
- (if applicable) the output prosodically segmented file obtained using only the silence heuristic (e.g. SP_D2_255_OUTPUT_ONLY_SIL.TextGrid)
- (if applicable) a .csv file containing metrics obtained using all of the parameters (e.g. SP_D2_255_metrics.csv)
- (if applicable) a .csv file containing metrics obtained using only the first heuristic (e.g. SP_D2_255_metrics_ONLY_H1.csv)
- (if applicable) a .csv file containing metrics obtained using only the first and second heuristics (e.g. SP_D2_255_metrics_ONLY_H1_H2.csv)
- (if applicable) a .csv file containing metrics obtained using only the silence heuristic (e.g. SP_D2_255_metrics_ONLY_SIL.csv)
Note: The files marked with "v2" in the inquiry SP_DID_242 are those generated after a manual revision of the transcription of its audio.
Note: Some fields are marked with "(possibly)" because in some cases the segments were processed individually and their results then concatenated, while in other cases the segments were concatenated first and only processed afterwards.
Note: Some fields are marked with "(if applicable)" because those files only exist for the inquiries that have a reference TextGrid.
Note: In SP_D2_012, the indication of speakers was confusing (doc., doc.f, doc.m, inf., inf.f, inf.m), and when standardizing speaker names we were not sure how many speakers there actually were, so we experimented with uniting the utterances of speakers who seemed to be the same person (files indicated by "3loc", "4loc", "6loc"). The expert who manually segmented the inquiry chose the version with 3 speakers ("3loc") for the article.
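For reference, a *_concatenated.TextGrid like the ones above can be produced by shifting each segment's timestamps by the total duration of the preceding segments. Below is a minimal sketch with the tgt package; it is an illustration of the idea, not the project's own script, and it assumes every partial file shares the same tier names:

```python
# Illustrative concatenation of per-segment TextGrids (not the project's
# script). Assumes all partial files share the same tier names.
import tgt

def concatenate(paths, out_path):
    merged = tgt.core.TextGrid()
    offset = 0.0
    for path in paths:
        tg = tgt.io.read_textgrid(path)
        for tier in tg.tiers:
            if tier.name not in merged.get_tier_names():
                merged.add_tier(tgt.core.IntervalTier(name=tier.name))
            out_tier = merged.get_tier_by_name(tier.name)
            for iv in tier.intervals:
                out_tier.add_interval(tgt.core.Interval(
                    iv.start_time + offset, iv.end_time + offset, iv.text))
        offset += tg.end_time  # shift the next segment by this one's duration
    tgt.io.write_to_file(merged, out_path, format="long")

concatenate([f"SP_D2_012_clipped_{n}.TextGrid" for n in (1, 2, 3)],
            "SP_D2_012_concatenated.TextGrid")
```

tgt also ships a concatenation utility (tgt.util.concatenate_textgrids) that may cover this case directly; the sketch above just makes the time-offset logic explicit.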