Arabic transliteration in Python. Similar to Yamli.com, Google Ta3reeb, and Microsoft Maren.
This project exists because there isn't an open source transliteration project available, and it's not that hard!
I'm sure there are some corner cases that make it harder and harder to reach 100% accuracy, but it seems fairly easy to get to 80%.
- Given a list of simple mappings from one or two English letters to a single Arabic letter.
- Append vowels to the English letter keys in the mapping, so that the Harakaat are simply ignored.
- Given an English word phonetically representing an Arabic word.
- Construct the set of all possible Arabic words (valid or not) using a recursive search algorithm.
- Use word frequency to get the most likely word to occur out of the list.
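
The steps above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the mapping, the frequency counts, and the names `BASE_MAPPING`, `extend_with_vowels`, `candidates`, and `transliterate` are all made up for the example, and the real mapping and corpus are far larger.

```python
# Hypothetical toy mapping: one or two English letters -> possible Arabic letters.
BASE_MAPPING = {
    "k": ["ك"],
    "t": ["ت", "ط"],
    "b": ["ب"],
    "sh": ["ش"],
    "a": ["ا"],
}
VOWELS = "aeiou"

# Hypothetical frequency list (word -> count); the numbers are invented.
FREQUENCIES = {"كتاب": 300, "كتب": 120}


def extend_with_vowels(mapping):
    """Add keys like 'ka', 'ki', 'ku' that map to the same letters as 'k',
    so short vowels (Harakaat) in the input are swallowed."""
    extended = {key: list(letters) for key, letters in mapping.items()}
    for key, letters in mapping.items():
        if key[-1] in VOWELS:
            continue  # do not stack a vowel on top of a vowel key
        for vowel in VOWELS:
            bucket = extended.setdefault(key + vowel, [])
            for letter in letters:
                if letter not in bucket:
                    bucket.append(letter)
    return extended


def candidates(word, mapping):
    """Recursively build every Arabic spelling (valid or not) of `word`."""
    if not word:
        return {""}
    longest = max(map(len, mapping), default=0)
    results = set()
    for size in range(1, min(longest, len(word)) + 1):
        # Try each prefix that is a key, then recurse on the remainder.
        for letter in mapping.get(word[:size], []):
            for rest in candidates(word[size:], mapping):
                results.add(letter + rest)
    return results


def transliterate(word, mapping, frequencies):
    """Pick the candidate with the highest frequency, or None if there is none."""
    options = candidates(word.lower(), mapping)
    if not options:
        return None
    return max(options, key=lambda option: frequencies.get(option, 0))


MAPPING = extend_with_vowels(BASE_MAPPING)
print(transliterate("kitab", MAPPING, FREQUENCIES))  # كتاب
```

The candidate set grows quickly with word length, since every ambiguous letter multiplies the number of spellings. A real implementation would probably want to prune candidates against the word list during the search rather than after it.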
I'm very pleased, even surprised, with the initial results. With a better training corpus and some simple tweaks to the rules, we can reach at least 80% of the accuracy of Yamli or similar services. The current training corpus is a frequency list based on words from opensubtitles.org, and it is mostly classical Arabic.
See TODO.txt