This project trains a classifier to identify the language of the text that the user types in.
The algorithms I used were:
- K Nearest Neighbors
- Multinomial Naive Bayes
- Random Forest
Of these three algorithms, Random Forest performed the best (although Multinomial Naive Bayes was a close second).
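As a rough illustration of how the three algorithms listed above can be fit on the same text data, here is a minimal sketch using scikit-learn. The toy sentences, the character n-gram TF-IDF features, and every parameter value are assumptions made for this example; they are not the data, features, or tuned settings used in this repository.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration only (not the project's dataset).
texts = [
    "the weather is nice today",
    "i would like a cup of coffee",
    "where is the nearest library",
    "el clima está agradable hoy",
    "me gustaría una taza de café",
    "dónde está la biblioteca más cercana",
    "il fait beau aujourd'hui",
    "je voudrais une tasse de café",
    "où se trouve la bibliothèque la plus proche",
]
labels = ["English"] * 3 + ["Spanish"] * 3 + ["French"] * 3

# Placeholder hyperparameters, chosen only so the sketch runs quickly.
classifiers = {
    "K Nearest Neighbors": KNeighborsClassifier(n_neighbors=1),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

fitted = {}
for name, clf in classifiers.items():
    # Character n-grams are an assumed feature choice for language identification.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)), clf
    )
    model.fit(texts, labels)
    fitted[name] = model

# Ask each fitted model to label a new sentence.
sample = ["where is the train station"]
predictions = {name: model.predict(sample)[0] for name, model in fitted.items()}
for name, predicted in predictions.items():
    print(f"{name}: {predicted}")
```

With so little data the predictions are not meaningful; the point is only that all three classifiers share the same fit/predict interface, which makes a side-by-side comparison straightforward.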
Note: In case you're wondering how I chose the parameters for each algorithm, I used RandomizedSearchCV
from the scikit-learn library to arrive at them. I did not include the search code because each run took a long time.
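Since the search code is not included, the following is a minimal sketch of what a RandomizedSearchCV run can look like for one of the models. The toy data, the pipeline step names (`tfidf`, `clf`), the search ranges, and the values of `n_iter` and `cv` are all assumptions for illustration, not the settings actually used to tune this project.

```python
from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration only (not the project's dataset).
texts = [
    "the weather is nice today",
    "i would like a cup of coffee",
    "where is the nearest library",
    "this book is very interesting",
    "el clima está agradable hoy",
    "me gustaría una taza de café",
    "dónde está la biblioteca más cercana",
    "este libro es muy interesante",
    "il fait beau aujourd'hui",
    "je voudrais une tasse de café",
    "où se trouve la bibliothèque la plus proche",
    "ce livre est très intéressant",
]
labels = ["English"] * 4 + ["Spanish"] * 4 + ["French"] * 4

pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer(analyzer="char_wb")),
        ("clf", MultinomialNB()),
    ]
)

# Hypothetical search space: n-gram range for the features and the
# smoothing strength (alpha) for Multinomial Naive Bayes.
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "clf__alpha": loguniform(1e-3, 1e0),
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,        # number of random parameter settings to try
    cv=2,            # small fold count because the toy corpus is tiny
    random_state=0,  # make the sampled settings reproducible
)
search.fit(texts, labels)

best_params = search.best_params_
print(best_params)
print(search.best_score_)
```

On a real dataset, `n_iter` and `cv` would normally be larger, which is what makes the search slow to run.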
There are two ways to run the Streamlit web application:
The simplest way to view the web app is by visiting the following link: https://share.streamlit.io/johng034/language-classifier/app.py
If you wish to run the application on your machine, then complete the following steps:
- Clone the repository (see GitHub's documentation for instructions on how to clone a repository)
- Open the folder of this repository in your editor of choice (or in the terminal/command prompt)
- In the terminal, install the packages (you may want to do this inside a virtual environment) with
pip install -r requirements.txt
- Once the packages are installed, run the Streamlit application by typing
streamlit run app.py
in the terminal
I am currently considering adding text from Wikipedia pages or tweets so that the algorithm is trained on a wider range of data.