Using the pretrained BERT Multilingual model, a language detection model was devised. The model was fine-tuned on the Wiki-40B multilingual dataset, which contains Wikipedia entries in 41 different languages; the model was trained on 16 of those languages. You may find the dataset here.
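For reference, a single language configuration of the dataset can be loaded through TensorFlow Datasets roughly as follows. This is a minimal sketch: the language code, split slice, and printing are only illustrative, and the actual training pipeline lives in modelling.py.

import tensorflow_datasets as tfds

# Load a small slice of the English portion of Wiki-40B; each example
# exposes a "text" field with the Wikipedia article.
# Depending on your TFDS version you may need to pass try_gcs=True.
ds = tfds.load("wiki40b/en", split="train[:1%]")

for example in ds.take(1):
    print(example["text"].numpy()[:200])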
- TensorFlow:
>> pip install tensorflow
- TensorFlow Hub:
>> pip install tensorflow-hub
- TensorFlow Datasets:
>> pip install tensorflow-datasets
- TensorFlow Text:
>> pip install tensorflow-text --no-dependencies
Please note that we are making use of the --no-dependencies flag because of an error that TensorFlow Text throws, as described in this GitHub issue. If you have already installed TensorFlow Text, it is recommended that you uninstall and reinstall it.
Please also note that after installing TensorFlow Text with this flag, you will need to import the package so that it registers a few ops, as highlighted here (see the import snippet after this list).
- scikit-learn:
>> pip install scikit-learn
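After installation, a single import of the tensorflow_text package at the top of your script is enough to register those ops. A minimal sketch (the alias is just a convention and the name need not be used directly):

import tensorflow as tf
import tensorflow_hub as hub

# Importing tensorflow_text registers the custom ops used by the BERT
# preprocessing and tokenization layers loaded from TF Hub.
import tensorflow_text as text  # noqa: F401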
- Download the complete repository
- Under the same file hierarchy as lang_finder.py, download and save the trained model from this link
- Import the file lang_finder.py and call the function lang_finder.find_language([str]), which accepts a list of strings as input and returns a list of the languages they were written in (see the example after the note below)
NOTE: If you changed the set of languages being used in modelling.py for custom training, please update the list of languages specified in lang_finder.py as well for it to run correctly.
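For example, a minimal usage sketch; the sample sentences are illustrative, and the exact labels in the returned list depend on the languages configured in lang_finder.py:

import lang_finder

samples = [
    "The quick brown fox jumps over the lazy dog.",
    "Le renard brun saute par-dessus le chien paresseux.",
]

# find_language takes a list of strings and returns a list with the detected
# language for each input, in the same order.
detected = lang_finder.find_language(samples)
print(detected)  # e.g. something like ['en', 'fr'], depending on the label format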
Download the whole repository and run the file modelling.py with the command
>> python modelling.py
You can find the list of languages available in the Wiki40B dataset at this link. Simply add the languages to the list list_languages in the file modelling.py, update the list in lang_finder.py as well, and run

>> python modelling.py
Everything else is configured to work automatically; just make sure that lang_finder.py lists the same languages as modelling.py if you make any changes.
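For instance, if the model should cover only English, French, and German, both files would carry the same list, roughly as below. The variable name list_languages comes from modelling.py, while the language codes are shown here only as an example of the Wiki40B configuration names:

# In modelling.py (and mirrored in lang_finder.py):
list_languages = ["en", "fr", "de"]  # Wiki40B language codes to train on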