Create Kallaama speech dataset

gauthelo · Mar 28, 2024 · e42242b
commit e42242b

Showing 902 changed files with 762,778 additions and 0 deletions.
diff --git a/LICENSE b/LICENSE
diff --git a/README.md b/README.md
@@ -0,0 +1,55 @@
+# KALLAAMA
+
+This repository contains data gathered for the KALLAAMA project.    
+This project was funded for 1 year in 2023 by [Lacuna Fund](https://lacunafund.org/).    
+It was led by [Jokalante](https://jokalante.com/) (Dakar, Senegal). [Orange Innovation](https://www.orange.com/) (Lannion, France) and [Ecole Polytechnique de Thiès](https://ept.sn/) (Thiès, Senegal) were also involved as stakeholders.    
+
+### Project description 
+KALLAAMA was a collaborative project that aimed to create the resources required to develop speech technologies. It focused exclusively on the three most widely spoken languages in Senegal: Wolof, Pulaar and Sereer.    
+
+This work was carried out to create resources that will one day make it possible to access information, and all the digital resources available today, simply by querying one's device with one's voice, in one's language of daily use. This can be achieved by developing robust speech-to-text (ASR) and text-to-speech (TTS) models, which are the fundamental building blocks of voicebots.     
+Currently, many Senegalese people are excluded from digital information because of the lack of development in this field.    
+
+To that end, this repository provides the datasets created for the ASR modeling process.    
+The resources include speech recordings with orthographic transcriptions, an open-source text collection gathered from the Web, and wordlists with phonetic transcriptions. We also provide a grapheme-to-phoneme model trained to phonetize out-of-vocabulary words for Wolof.    
+
+### Data description
+The main topic of the recordings is agriculture.   
+Audio files initially belong to Jokalante SARL.     
+- The Wolof (ISO 639-3 code: wol) speech dataset contains 55 hours of transcribed speech, including almost 13 hours whose transcriptions were checked by an expert.    
+- The Pulaar (ISO 639-3 code: fuc) speech dataset contains nearly 32 hours of transcribed speech, including almost 11 hours whose transcriptions were checked by an expert.    
+- The Sereer (ISO 639-3 code: srr) speech dataset contains 38 hours of transcribed speech, including almost 11 hours whose transcriptions were checked by an expert.    
+
+In total, we provide 125 hours of transcribed speech, including 35 hours of checked transcriptions.    
+
+### Citation
+See the following publication for more details on data collection (please cite the BibTeX entry below if you use the data):    
+```
+@inproceedings{kallaama2024dataset,    
+  title={Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal},    
+  author={Gauthier, Elodie and Ndiaye, Aminata and Guissé, Abdoulaye},    
+  booktitle={Proceedings of the Fifth Workshop on Resources for African Indigenous Languages (RAIL 2024)},    
+  year={2024}    
+}  
+```  
+
+### Repository structure
+
+.    
+├── LICENSE    
+├── README.md    
+└── data/    
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ├── README.md    
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ├── lexicons/    
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ├── text_corpora/    
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; └── transcriptions/    
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ├── checked/    
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; └── raw/    
+
+
+### Contacts
+- Aminata NDIAYE DIALLO (Jokalante, Dakar, Senegal - [email protected])
+- Elodie GAUTHIER (Orange Innovation, Lannion, France - [email protected])
+- Abdoulaye GUISSÉ (École Polytechnique de Thiès, Thiès, Senegal - [email protected])
+
+
diff --git a/data/README.md b/data/README.md
@@ -0,0 +1,37 @@
+# Data
+
+### Description
+This directory contains all the data collected during the KALLAAMA project.    
+It is composed of texts gathered from the Web, orthographic transcriptions of audio, and phonetic transcriptions of words in the 3 most widely spoken languages in Senegal: Wolof, Pulaar and Sereer.    
+It is organized as follows:    
+
+.    
+├── lexicons/    
+├── text_corpora/    
+└── transcriptions/    
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ├── checked/    
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; └── raw/    
+
+       
+### Composition
+- **lexicons/**    
+    - This folder contains the data related to word pronunciation.     
+    The lexicon file is a pronunciation lexicon, a resource often used in ASR and TTS modeling.     
+    Each line consists of a word followed by its phonetic transcription (the pronunciation).    
+    We also provide the grapheme-to-phoneme (G2P) models that were trained to generate phonetic transcriptions from a list of words.    
+
+- **text_corpora/**    
+    - This folder contains texts gathered from the Web.    
+    There is one file per language.    
+
+- **transcriptions/**    
+    - This folder contains transcriptions of recordings (hosting URL to come).        
+    Transcriptions validated by supervisors (someone other than the transcriber) are in the *checked/* subfolder.    
+    Original transcriptions made by native speaker linguists are in the *raw/* subfolder.    
+    Within each subfolder, transcriptions are organized by language, in separate folders (i.e., `transcriptions_fuc` for Pulaar, `transcriptions_srr` for Sereer, `transcriptions_wol` for Wolof).    
+
+
+### Notes
+- Textual dataset creation was carried out by Boubacar DIALLO (Université Assane Seck, Ziguinchor, Sénégal) during a final-year internship in NLP at Jokalante (Dakar, Senegal) in summer 2023. Boubacar speaks Pulaar and Wolof.   
+
+- Transcription work was carried out by Maimouna DIALLO (Université Cheikh Anta Diop, Dakar, Sénégal) for Wolof, Houleye Amadou KANE (Université Cheikh Anta Diop, Dakar, Sénégal) for Pulaar and Fatou DIOUF (Université Cheikh Anta Diop, Dakar, Sénégal) for Sereer, during their summer internship, as part of their linguistics studies, in 2023 at Jokalante.    
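
Editorial note (not part of the commit): the folder layout described in `data/README.md` can be turned into paths programmatically. The sketch below is only an illustration under that description; the function name `transcription_dirs` and the `data` root argument are hypothetical, and the content and format of the files inside each folder are not assumed.

```python
from pathlib import Path
from typing import Dict, Tuple

# Language codes and subfolder names as listed in data/README.md.
LANGUAGES = {"fuc": "Pulaar", "srr": "Sereer", "wol": "Wolof"}
SUBSETS = ("checked", "raw")


def transcription_dirs(data_root: Path) -> Dict[Tuple[str, str], Path]:
    """Return the expected folder for every (subset, language code) pair.

    Hypothetical helper: it only builds paths and does not touch the disk.
    """
    return {
        (subset, code): data_root / "transcriptions" / subset / f"transcriptions_{code}"
        for subset in SUBSETS
        for code in LANGUAGES
    }


if __name__ == "__main__":
    for (subset, code), folder in transcription_dirs(Path("data")).items():
        print(f"{LANGUAGES[code]:<7} {subset:<8} {folder.as_posix()}")
```

Keeping the subset and language as a tuple key makes it easy to select, for example, only the expert-checked Wolof folder.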
diff --git a/data/lexicons/README.md b/data/lexicons/README.md
@@ -0,0 +1,3 @@
+## Pronunciation lexicon ##
+
+We only provide a file for Wolof, because no digitized references were found for the other languages.
diff --git a/data/lexicons/wolof/README.md b/data/lexicons/wolof/README.md
@@ -0,0 +1,22 @@
+## Pronunciation lexicon ##
+
+### Folder tree
+
+.    
+├── README.md    
+├── wol\_g2p\_ipa.fst    
+├── wol\_g2p\_xsampa.fst    
+└── wol\_lexicon.xsampa.txt       
+
+### File description
+
+- wol\_g2p\_xsampa.fst: the G2P model trained from the lexicon provided in the OpenSLR archive [SLR25](https://www.openslr.org/resources/25/).     
+It was trained using the [Phonetisaurus](https://github.com/AdolfVonKleist/Phonetisaurus) toolkit. 
+Phonetisaurus makes it possible to train grapheme-to-phoneme FST models and to use them to phonetize wordlists.     
+This model is used to generate phonetic transcriptions of words from a list.     
+Output phonetic symbols are in the [X-SAMPA alphabet](https://en.wikipedia.org/wiki/X-SAMPA).    
+
+- wol\_g2p\_ipa.fst: another G2P model, whose output phonetic symbols are [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) characters.    
+
+- wol\_lexicon.xsampa.txt: 49,132 phonetized entries from the Wolof text corpus and transcriptions (phonemes in X-SAMPA characters).
+
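
Editorial note (not part of the commit): the READMEs describe the lexicon as one word per line followed by its phonetic transcription. A minimal sketch of reading such a file follows, assuming the word and the phoneme symbols are separated by whitespace (the exact delimiter is not stated in the source). The function name `parse_lexicon` and the sample lines are invented placeholders, not entries from `wol_lexicon.xsampa.txt`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_lexicon(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Map each word to its pronunciations, each a list of phoneme symbols.

    Assumption: fields are whitespace-separated, the first field is the word
    and the remaining fields are its phonemes.
    """
    lexicon: Dict[str, List[List[str]]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and entries without a pronunciation
        word, phonemes = fields[0], fields[1:]
        if phonemes not in lexicon[word]:  # keep distinct variants only
            lexicon[word].append(phonemes)
    return dict(lexicon)


if __name__ == "__main__":
    # Placeholder lines, NOT real Wolof entries.
    sample = [
        "word1\tp1 p2 p3",
        "word2 p4 p5",
        "",
        "word1 p1 p2 p3",
        "word1 p6 p7",
    ]
    for word, variants in parse_lexicon(sample).items():
        print(word, variants)
```

With a real file, the same function could be called on an open file handle, for example `parse_lexicon(open(path, encoding="utf-8"))`, since it accepts any iterable of lines.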
diff --git a/data/lexicons/wolof/wol_g2p_ipa.fst b/data/lexicons/wolof/wol_g2p_ipa.fst
diff --git a/data/lexicons/wolof/wol_g2p_xsampa.fst b/data/lexicons/wolof/wol_g2p_xsampa.fst