The process of building a language model consists of the following steps:
- Data collection
- Data cleanup
- Model training
- Testing
In this project we focus on the first two steps. To build a statistical language model, we first need a large collection of clean text. You can collect text transcriptions from projects like LibriVox or transcribed podcasts, set up web data collection, transcribe existing recordings, or generate text artificially with scripts. You can also contribute to VoxForge. Real-life data remains the most valuable, however. Our project targets French conversational speech during meetings, so we tried to collect corpora suited to this context; we share them here, along with their preparation.
We have grouped the following corpora, specifying their license and their respective characteristics. Movie subtitles are also a good source of spoken language.
Corpora | Constructed at | Licence | Hours/words | Speakers | Database Type |
---|---|---|---|---|---|
ASCYNT | Université Jean Jaurès - Toulouse | Creative Commons | 9 H / 124,000 words | 2 males - 21 females | Oral reading of a conference text (17,858 words), monologue presentations (19,575 words) and guided interviews (86,584 words). Audio files and PRAAT transcriptions (TextGrid format) |
TCOF: Traitement de Corpus Oraux en Français | ATILF (Analyse et Traitement Informatique de la Langue Française) - Nancy | Creative Commons, freely available for non-commercial use | 124 H | 1,365 speakers | Spontaneous speech. Transcriber and WAV files |
CFPP2000: Corpus de Français Parlé Parisien des années 2000 | Université Paris 3 Sorbonne Nouvelle | Creative Commons, freely available for non-commercial use | 49 H | Unknown | Interviews |
ESLO | Laboratoire Ligérien de Linguistique of the Université d'Orléans, in partnership with the CNRS, the Ministère de la Culture and the Région Centre | Creative Commons | 800 H / around 5 million words | Unknown | Calls, interviews, visits, meetings, dinners |
Movie Subtitles | https://www.sous-titres.eu/ | Free for use | A lot | Plenty | Movie subtitles |
CLAPI | ICAR http://ircom.huma-num.fr/site/description_projet.php?projet=CLAPI | Free for use | 400 hours | 1000 | social interactions |
Accueil UBS | https://www.ortolang.fr/market/corpora/sldr000890/v1 | Creative Commons CC-BY-SA | 10,000 words | Plenty | Dialogues |
LibriVox | https://librivox.org/ | Public domain | 10,000 words | 8,897 | Acoustical liberation of books in the public domain |
Other text corpora that may also be useful:
- DEFT (DÉfi Fouille de Textes): https://deft.limsi.fr/
- CoMeRe: https://repository.ortolang.fr/api/content/comere/v3.2/comere.html, https://www.ortolang.fr/market/corpora/comere
- ubuntu-fr-cmc (French corpus of computer-mediated communication) http://www.lina.univ-nantes.fr/?-ODISAE-.html
- Corpus journalistique issu de l'Est Républicain https://www.ortolang.fr/market/corpora/est_republicain
- Projet OFROM http://www11.unine.ch/index.php?page=telechargement
- Projet Wortschatz, Universität Leipzig (Germany): http://wortschatz.uni-leipzig.de/ws_fra/, http://corpora2.informatik.uni-leipzig.de/download.html
- Athena e-texts http://athena.unige.ch/athena/admin/ath_txt.html
- RFI radio news corpus: https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-fran%C3%A7ais-facile. Transcriptions + audio
- List of spoken and written corpora: https://apps.atilf.fr/homepages/apotheloz/corpus/
Links to download the corpora are in the following table.
Corpora Name | Download Link |
---|---|
ASCYNT | https://www.ortolang.fr/market/corpora/sldr000832 |
TCOF | https://www.ortolang.fr/market/corpora/tcof |
CFPP2000 | http://cfpp2000.univ-paris3.fr/Corpus.html |
ESLO | http://ct3.ortolang.fr/data2/eslo/ |
Movie Subtitles | https://www.sous-titres.eu/ |
For movie subtitles, you can download the .zip files and then extract them.
The corpora that we have gathered use different formats of transcripts. We have the following formats:
- Transcriber format (TCOF, CFPP2000)
- PRAAT (textgrid format) (ASCYNT)
- SubRip and SubStation Alpha subtitle text file format (Movie Subtitles)
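To illustrate the simplest of these formats, here is a minimal sketch (not the project's `parseSubtitles.py`) of how SubRip content can be reduced to plain dialogue text by dropping cue indices, timestamps and formatting tags:

```python
import re

def srt_to_text(srt_content):
    """Extract plain dialogue lines from SubRip (.srt) content,
    dropping cue indices, timestamp lines and basic formatting tags."""
    lines = []
    for line in srt_content.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isdigit():            # cue index line, e.g. "1"
            continue
        if "-->" in line:             # timestamp line
            continue
        line = re.sub(r"<[^>]+>", "", line)   # strip <i>, <b>... tags
        lines.append(line)
    return " ".join(lines)

srt = """1
00:00:01,000 --> 00:00:03,000
<i>Bonjour tout le monde.</i>

2
00:00:04,000 --> 00:00:06,000
Comment ça va ?"""
print(srt_to_text(srt))  # Bonjour tout le monde. Comment ça va ?
```

Transcriber and TextGrid files are XML/structured formats and need a real parser rather than line filtering.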
When downloading the ESLO corpus, you get a "raw" version of the transcription. You can synchronize the transcriptions with the sound using the Transcriber software, which allows segmentation at several levels: sections, speakers and speaking turns. The ESLO website gives more details about its transcription format.
In this folder you will find all the scripts we used to generate the Kaldi files and to clean up these different corpora: expanding abbreviations, converting numbers to words, removing non-word items, replacing the silence character with , replacing n successive spaces with a single space...
- `runPrepare.sh` is called by the project's main script to run the right preparation according to the corpus passed as a parameter (parameter order: corpus name, path to the corpus, path to the output directory).
- `dataPrepare.sh` prepares the `segments`, `utt2spk`, `text` and `wav.scp` files and calls the parse script to clean up the text.
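These are the standard Kaldi data-preparation files; their line formats can be sketched as below (the speaker, utterance and recording IDs here are hypothetical, chosen only to show the layout):

```python
# One illustrative utterance; IDs and paths are hypothetical.
spk, utt = "spk001", "spk001-utt0001"
rec, start, end = "rec001", 12.34, 15.60

kaldi_files = {
    # text: utterance id followed by its cleaned transcription
    "text":     f"{utt} bonjour tout le monde",
    # utt2spk: maps each utterance to its speaker
    "utt2spk":  f"{utt} {spk}",
    # segments: utterance id, recording id, start and end times (seconds)
    "segments": f"{utt} {rec} {start:.2f} {end:.2f}",
    # wav.scp: recording id and the path (or command) producing its audio
    "wav.scp":  f"{rec} /path/to/{rec}.wav",
}
for name, line in kaldi_files.items():
    print(f"{name}: {line}")
```

Note that utterance IDs conventionally start with the speaker ID so that Kaldi's sorted-file requirements hold.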
The following scripts clean up the transcription text according to the transcription format (and thus to the corpus):
- `parseTcofSync.py`
- `parseAscynt.py`
- `parseCfppSync.py`
- `parseEsloSync.py`
- `parseSubtitles.py`
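As an example of what these parsers do, the Transcriber format stores each speaker turn as a `<Turn>` element whose text hangs off `<Sync>` time markers. A minimal sketch with `xml.etree.ElementTree` (the sample document below is simplified; real `.trs` files carry a DOCTYPE and more attributes):

```python
import xml.etree.ElementTree as ET

# Simplified Transcriber-style document (hypothetical content).
TRS = """<Trans>
  <Episode><Section type="report" startTime="0" endTime="10">
    <Turn speaker="spk1" startTime="0" endTime="10">
      <Sync time="0.0"/>bonjour tout le monde
      <Sync time="4.2"/>comment ça va
    </Turn>
  </Section></Episode>
</Trans>"""

def parse_turns(trs_string):
    """Yield (speaker, text) pairs from a Transcriber-style document."""
    root = ET.fromstring(trs_string)
    for turn in root.iter("Turn"):
        # the spoken text is stored in the tails of the <Sync> markers
        parts = [sync.tail.strip() for sync in turn.iter("Sync") if sync.tail]
        yield turn.get("speaker"), " ".join(p for p in parts if p)

for speaker, text in parse_turns(TRS):
    print(speaker, ":", text)
```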
Python modules used by our scripts (`num2words` and `unidecode` must be installed, e.g. with pip; the others are part of the standard library):
- `xml.etree.ElementTree`: the ElementTree XML API
- `sys`: system-specific parameters and functions
- `num2words`: convert numbers to words in multiple languages
- `unidecode`: ASCII transliterations of Unicode text
- `re`: regular expression operations
- `os.path`: common pathname manipulations
We used Python 3.5.2. If you are using an older version of Python you may experience encoding problems especially with the French language (accents).
Original Author and Development Lead
- Sonia BADENE ([email protected])
- Tom JORQUERA ([email protected])
- Abdelwahab HEBA ([email protected])
Want to contribute? Great! Share your work with us via the following link:
GNU AFFERO GENERAL PUBLIC LICENSE v3