Skip to content

Latest commit

 

History

History
69 lines (48 loc) · 3.61 KB

README.md

File metadata and controls

69 lines (48 loc) · 3.61 KB

Diacritic-Restoration

A project on diacritic restoration of languages...

Cotributors

Contributor Social Media post Article
Alao David LinkedIn
Alagbe Seun LinkedIn Medium

The Problem:

Yoruba or more specifically "Èdè Yorùbá" is a major language, spoken by the indigenous people of the Yoruba tribe in Nigeria (more than 45 million people globally).

The significance of solving this problem is evidenced by this analysis below:
“Yoruba is a tonal language and Diacritics are used to mark the tone of a vowel. The dot accent or alternatively the vertical line below, are used to mark open variants of vowels, namely ẹ and ọ for [ɛ] and [ɔ]; or below the s which transcribes the [ʃ], a post-alveolar consonant like 'sh' in English.”

Given below are Yoruba vowel classification:

a  =>  (á à ā)  
e => (é è ē ẹ/e̩ ẹ́/é̩ ẹ̀/è̩ ẹ̄/ē̩  
i =>  (í ì ī)  
o =>  (ó ò ō ọ/o̩ ọ́/ó̩ ọ̀/ò̩ ọ̄/ō̩)  
s  => (ṣ s)  
u =>  (ú ù ū)

Using these letters interchangeably, gives different meanings to words as they are pronounced differently.
For example :

A) oro  = oró  
As in “oh-row” 		(oró is “sting” when translated to English)  
  
B)oro  =  ọ̀rọ̀  
As in “aw-raw” 		(ọ̀rọ̀ is “text” or "word" when translated to English)   

C)	 ojo  = òjò  
As in “oh-joe” 		(òjò is “rain” when translated to English)  

D)	ojo  =  ọjọ́  
As in “or-jaw”		   (ọjọ́ is “day” when translated to English)  

It's awesome right?

Unfortunately, there's a problem. Words might be easy to pronounce (maybe not so much for beginners), but when it comes to writing or typing "Yoruba" text, we often disregard the "TONE MARKS", leading to the following:

  1. We end up fumbling the tone marks for each words, we intend on typing (even if we know how to pronounce it well)
  2. Some people just get lazy when putting the "TONE MARKS" because it takes quite a while for them to transcribe plain words and add the "TONE MARKS".

The Solution:

[Excerpt from Alao David's - post]

Being Yorùbá, working on this project was a thing of delight for I and Oluwaseun Alagbe, and definitely an avenue to uplift our culture. 🌻

What we've built here is an A.I. (Natural Language Processing [NLP]) model which utilizes Artificial Neural Networks 🤖, to learn patterns in use for Diacritical (tonal) marks/"Àmì ohùn", in thousands of Yorùbá words, which is possible thanks to the abundant availability of Yorùbá text data, that we mined from open-source Yorùbá literary texts i.e. Books, News articles, etc.

This model when deployed, would go a long way in aiding proper communication in our mother tongue. ✍🏾

Please note

Notebook 1 - Data Mining
Notebook 2 - Text Processing
Notebook 3 - Training and Evaluation