
Language Detection Using BERT - Base, Cased Multilingual

Overview

This repository implements a language detection model built by fine-tuning the pretrained multilingual BERT (base, cased) model on the Wikipedia 40 Billion (Wiki-40B) multilingual dataset, which contains Wikipedia entries in 41 different languages; the model was trained on 16 of them. You may find the dataset here.
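For reference, Wiki-40B is available through TensorFlow Datasets; a minimal loading sketch (the 'en' configuration and the split are just examples, and the splits actually used in modelling.py may differ) looks like this:

    import tensorflow_datasets as tfds

    # Load the English portion of Wiki-40B; each example carries a 'text' field.
    ds = tfds.load('wiki40b/en', split='train')
    for example in ds.take(1):
        print(example['text'])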

Usage

Prerequisites

  • TensorFlow: >> pip install tensorflow
  • TensorFlow Hub: >> pip install tensorflow-hub
  • TensorFlow Datasets: >> pip install tensorflow-datasets
  • TensorFlow Text: >> pip install tensorflow-text --no-dependencies

Please note that the --no-dependencies flag is used because of an error that TensorFlow Text throws, as described in this GitHub issue. If you have already installed TensorFlow Text without the flag, it is recommended that you uninstall and reinstall it.

Please also note that after installing TensorFlow Text with this specific flag, you will need to import the library in your code to register a few ops, as highlighted here (see the import sketch after the prerequisites list).

  • scikit-learn: >> pip install scikit-learn
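As a minimal sketch of what the imports look like in practice (module names are the standard ones for these packages; importing tensorflow_text is what registers the custom ops):

    import tensorflow as tf
    import tensorflow_hub as hub
    import tensorflow_datasets as tfds
    import tensorflow_text  # imported for its side effect: registering the TF Text ops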

If you want to perform inference, i.e. simply find out what language a given document is written in:

  • Download the complete repository
  • In the same directory as lang_finder.py, download and save the trained model from this link
  • Import lang_finder.py and call lang_finder.find_language([str]), which accepts a list of strings as input and returns a list of the languages they were written in
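For illustration, a usage sketch (find_language and its list-in, list-out contract are as described above; the example strings and the exact label format returned are assumptions):

    import lang_finder

    docs = [
        "The quick brown fox jumps over the lazy dog.",
        "Der schnelle braune Fuchs springt über den faulen Hund.",
    ]

    # One predicted language per input string, in the same order.
    languages = lang_finder.find_language(docs)
    print(languages)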

NOTE: If you change the set of languages used in modelling.py for custom training, please update the list of languages in lang_finder.py as well for inference to run correctly.

If you want to train a new model directly within Google Colaboratory:

Link To Google Colab

If you want to train a new model locally

Download the whole repository and run the file modelling.py with the command:

>> python modelling.py
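To give a rough idea of what the fine-tuning setup looks like, here is a sketch of loading the multilingual BERT encoder and its preprocessor from TensorFlow Hub and adding a classification head; the exact Hub handles, versions, and number of output languages used in modelling.py may differ, so treat this as illustrative only:

    import tensorflow as tf
    import tensorflow_hub as hub
    import tensorflow_text  # registers the ops needed by the preprocessing model

    # Assumed Hub handles for BERT base, cased multilingual.
    PREPROCESS_HANDLE = "https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3"
    ENCODER_HANDLE = "https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/4"
    NUM_LANGUAGES = 16  # number of languages the classifier distinguishes

    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
    encoder_inputs = hub.KerasLayer(PREPROCESS_HANDLE)(text_input)
    outputs = hub.KerasLayer(ENCODER_HANDLE, trainable=True)(encoder_inputs)
    pooled = outputs["pooled_output"]  # [batch_size, 768] sentence representation
    logits = tf.keras.layers.Dense(NUM_LANGUAGES)(pooled)
    model = tf.keras.Model(text_input, logits)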

If you want to train it on more or different languages

You can find the list of languages available in the Wiki40B dataset at this link. Simply add the language codes to the list_languages list in modelling.py, update the list in lang_finder.py as well, and run

>> python modelling.py

Everything else is configured to work automatically; just make sure that lang_finder.py lists the same languages as modelling.py whenever you make changes.
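As a sketch of the kind of edit involved (the language codes shown are illustrative Wiki-40B configuration names, not the repository's actual training set):

    # In modelling.py (and mirrored in lang_finder.py):
    list_languages = [
        'en', 'de', 'fr', 'es', 'ru',  # languages already in the list (example)
        'hi', 'ja',                    # newly added Wiki-40B language codes
    ]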
