Language detection library in Python. The implementation is based on n-gram text categorization, following the article *N-Gram-Based Text Categorization*.

Given a text, it returns a list of `MAX_RESULTS` tuples, sorted by the probability of the text belonging to each language. The tuples have the form `(language_code, probability)`, and the language codes follow the ISO 639-1 standard.
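As background, the core idea of the article is to rank the most frequent character n-grams of a text and compare that ranking against precomputed per-language rankings using an "out-of-place" distance. The following is a minimal sketch of that technique only, not the library's actual code; all names and parameter values are illustrative:

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Rank the most frequent character n-grams (lengths 1..n_max) of a text."""
    counts = Counter()
    for n in range(1, n_max + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place_distance(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language
    profile get the maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.get(gram, max_penalty))
               for gram, rank in doc_profile.items())
```

A document is assigned to the language whose profile yields the smallest distance; the library reports probabilities rather than raw distances, as shown in the usage example below.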
- Library usage:
```python
>>> text = """Automatic summarization is the process of reducing a text document with a
computer program in order to create a summary that retains the most important points
of the original document. As the problem of information overload has grown, and as
the quantity of data has increased, so has interest in automatic summarization.
Technologies that can make a coherent summary take into account variables such as
length, writing style and syntax. An example of the use of summarization technology
is search engines such as Google. Document summarization is another."""
>>> import langdetect as ld
>>> print(ld.detect_language(text))
[('en', 0.9201609943007083), ('fr', 0.07217134307468472), ('ro', 0.0076676626246070185)]
```
Since building the models for every language is a time-consuming operation, when performing many detections it is better to build the profiles once and reuse them:
```python
>>> profiles = ld.create_languages_profiles()
>>> ld.detect_language(text, profiles)
```
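For instance, to classify many documents while building the language models only once (the `texts` list below is just illustrative):

```python
>>> texts = ["First document ...", "Deuxième document ..."]
>>> profiles = ld.create_languages_profiles()  # built once, reused for every call
>>> [ld.detect_language(t, profiles) for t in texts]
```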
- Command-line usage:

```bash
cd path/to/folder/langdetect/
python langdetect.py -f FILE
```
The datasets used to train, validate and test the software were collected from Wikipedia articles with this scraper.

After cloning the repository, the tests can be run with:

```bash
python test_langdetect.py
```

This prints the detection precision on the train, validation and test datasets for every language, which is useful for checking the results after changing the training dataset or adjusting the parameters of the algorithm.
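As a rough sketch of what measuring per-language precision can look like (the helper below is an assumption for illustration, not the test script's actual code; it relies on the per-language-code directory layout described in the section on adding languages below):

```python
import os
import langdetect as ld

def precision_for_language(code, dataset_dir, profiles):
    """Fraction of files under dataset_dir/code whose top-ranked
    detection matches the expected language code."""
    lang_dir = os.path.join(dataset_dir, code)
    files = os.listdir(lang_dir)
    hits = 0
    for name in files:
        with open(os.path.join(lang_dir, name)) as f:
            results = ld.detect_language(f.read(), profiles)
        if results and results[0][0] == code:  # results are sorted by probability
            hits += 1
    return hits / float(len(files))
```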
The supported languages are:

Code | Language |
---|---|
ar | Arabic |
cs | Czech |
da | Danish |
en | English |
et | Estonian |
fi | Finnish |
fr | French |
de | German |
el | Greek |
he | Hebrew |
hu | Hungarian |
it | Italian |
lv | Latvian |
lt | Lithuanian |
no | Norwegian |
fa | Persian |
pl | Polish |
pt | Portuguese |
ro | Romanian |
ru | Russian |
sk | Slovak |
es | Spanish |
sv | Swedish |
To add a new language:

- Add the dataset for the new language inside the `datasets/train` directory. The dataset should be text files of the given language, placed inside a directory named after the language code. According to the article, 100 kilobytes of text is enough.
- Add the language code and name to the `LANGUAGES` dictionary in the `langdetect.py` file (see the sketch after this list).
- OPTIONAL: if you want to test the new language, add a test dataset under the `datasets/test` directory. Then add the language code to the `TESTING_LANGUAGES` dictionary in the `test_langdetect.py` file.
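As a rough illustration of the second and third steps, the edits might look like this (the exact structure of both dictionaries is an assumption inferred from the language table above; `'tr': 'Turkish'` is a hypothetical new entry):

```python
# In langdetect.py -- register the new language.
# NOTE: the code -> name mapping is an assumption; check the actual file.
LANGUAGES = {
    # ... existing entries, e.g. 'en': 'English', 'fr': 'French' ...
    'tr': 'Turkish',
}

# In test_langdetect.py -- enable the new language in the test run.
TESTING_LANGUAGES = {
    # ... existing entries ...
    'tr': 'Turkish',
}
```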