NLP-Multilingual-Document-Classification
Introduction

This repository presents the results and the methodology used to achieve the objectives of the technical test. The tasks are based on natural language processing and were developed in Python.

Getting Started

  1. Installation process:

    • Install conda
    • conda create -n testea python=3.6
    • conda activate testea
    • pip install -r requirements.txt
    • python -m nltk.downloader all
    • python -m spacy download en_core_web_sm
    • python -m spacy download fr_core_news_sm
    • python -m spacy download es_core_news_sm
  2. Software dependencies

    • All Python packages needed are listed in requirements.txt
  3. Generate Dataset

    • Run the notebook Generate Dataframe.ipynb to create the dataset
    • Verify that the dataset was written to data\dataset_multilng.csv
  4. Test the model:

    • If steps 1 and 3 completed successfully, you can run the commands below to deploy a server and test the model.
    • Go to the app folder
    cd code\app
    
    • Run the following command in cmd
    python server.py
    

Methodology

The repository is structured in the following folders:

  • code: contains the Python scripts and a folder called "notebooks".

  • data: contains the document folders, organized by category, and the CSV file generated in the Generate Dataset step.

  • templates: contains the templates for the server.

Generate Dataset

  1. A dataset was created from all the files in the data folder
    • To see how the dataset is built, open Generate Dataframe.ipynb; a sketch of loading the result follows this list
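
As an illustration (not code from the repository), the generated CSV can be loaded and checked with pandas; the column names inspected below depend on the notebook and are assumptions:

    import pandas as pd

    # Load the CSV produced by Generate Dataframe.ipynb.
    df = pd.read_csv(r"data\dataset_multilng.csv")

    print(df.shape)         # number of documents and columns
    print(df.head())        # first rows; column names depend on the notebook
    print(df.isna().sum())  # null values per column, removed later in pre-processing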

OBJECTIVE 1: Create a document categorization classifier for the different languages (English, Spanish, French)

Pre-process

To perform pre-processing, a class was developed in "preprocess_data.py". This class has several methods:

  • clean_dataframe: cleans/pre-processes the data by removing null values, duplicate rows, stopwords of the three languages, and other noise such as digits and punctuation marks. Words with fewer than three characters are also removed. A sketch of these rules follows this list.
  • To view the other methods, see the notebook Pipeline classifier.ipynb
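
The exact implementation is in preprocess_data.py; the following is only a minimal sketch of the cleaning rules described above, assuming NLTK stopword lists (downloaded during installation) and a hypothetical "text" column:

    import re
    import pandas as pd
    from nltk.corpus import stopwords

    # Stopwords of the three languages, as described above.
    STOPWORDS = set(
        stopwords.words("english")
        + stopwords.words("spanish")
        + stopwords.words("french")
    )

    def clean_text(text):
        """Remove digits, punctuation, stopwords, and words under 3 characters."""
        text = re.sub(r"[\d\W_]+", " ", text.lower())
        return " ".join(
            t for t in text.split() if t not in STOPWORDS and len(t) >= 3
        )

    def clean_dataframe(df, text_col="text"):
        """Drop nulls and duplicates, then clean the text column."""
        df = df.dropna().drop_duplicates()
        df[text_col] = df[text_col].map(clean_text)
        return df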

Model

Classifier:

For this task, TF-IDF (term frequency – inverse document frequency) was used to transform the text into features for training the model. This method builds a document-term matrix in which each word's frequency in a document is weighted down by how common the word is across all documents.
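
As an illustration of this step, scikit-learn's TfidfVectorizer can build such a matrix (the notebook's exact parameters may differ; the vocabulary size below is an assumption):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Each row of X is a document, each column a term; entries are
    # term frequency weighted by inverse document frequency.
    vectorizer = TfidfVectorizer(max_features=5000)  # vocabulary size is an assumption
    X = vectorizer.fit_transform(df["text"])         # sparse document-term matrix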

  • Two neural networks were trained: an MLP (multi-layer perceptron) and an LSTM (long short-term memory). The performance of both models can be examined in the notebook Pipeline classifier.ipynb; the comparison uses the precision-recall curve.

Comparing the two models, the LSTM achieved better results for all classes on this metric, so it was chosen as the final model and deployed on the server.
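
A minimal sketch of such a per-class precision-recall comparison, assuming scikit-learn; the trained models and the real curves are in Pipeline classifier.ipynb:

    from sklearn.metrics import precision_recall_curve, auc
    from sklearn.preprocessing import label_binarize

    def pr_auc_per_class(y_true, y_score, classes):
        """Print the area under the precision-recall curve of each class."""
        y_bin = label_binarize(y_true, classes=classes)  # one-vs-rest labels
        for i, cls in enumerate(classes):
            precision, recall, _ = precision_recall_curve(y_bin[:, i], y_score[:, i])
            print(f"{cls}: PR-AUC = {auc(recall, precision):.3f}")

    # y_score would be the class probabilities predicted by the MLP or the LSTM.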

OBJECTIVE 2: Perform a topic model analysis on the provided documents

Pre-process

To perform pre-processing, the same class from "preprocess_data.py" was used.

Model

Topic Modeling:

For this task, LDA (Latent Dirichlet Allocation) was used to find the different topics in the dataset. To reproduce the results, see Topic Modeling LDA.ipynb
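
A minimal sketch of this step using scikit-learn's LatentDirichletAllocation and a recent scikit-learn (the notebook may use a different library); the 10-topic count is taken from the conclusion below:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    # LDA works on raw term counts rather than TF-IDF weights.
    counts = CountVectorizer(max_features=5000)
    X_counts = counts.fit_transform(df["text"])

    lda = LatentDirichletAllocation(n_components=10, random_state=0)  # 10 topics
    lda.fit(X_counts)

    # Show the top words of each discovered topic.
    terms = counts.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = topic.argsort()[-8:][::-1]
        print(f"Topic {k}:", ", ".join(terms[i] for i in top))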

CONCLUSION

In accordance with the objectives established in the test: for objective 1, two types of neural networks were used to classify the text regardless of language. The results showed that the LSTM is the better choice for this kind of text because it models sequences, and in this case the text follows different sequential patterns in each language.

On the other hand, for objective 2, 10 different topics were found for each language with respect to the document categories. This helps to identify a document according to its context.

It should be noted that the dataset was not balanced, so the model was expected to perform better on the most frequent class; for this reason, the precision-recall curve was chosen to measure each class properly. As a next step, the classes could be balanced using SMOTE.
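
A minimal sketch of that balancing step with the imbalanced-learn package (not part of this repository's requirements), on toy data standing in for the TF-IDF features:

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Toy imbalanced data standing in for the TF-IDF matrix and labels.
    X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
    print("before:", Counter(y))

    # SMOTE synthesizes new minority-class samples until classes are balanced.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
    print("after: ", Counter(y_bal))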

To display the results interactively and make it easier to test the model, a website was created in Python using the Tornado framework. The user can enter any text and the page returns the classification result.
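
The actual server is code\app\server.py; the sketch below only illustrates the general shape of a Tornado handler that receives text and returns a prediction (the route, handler, and predict function are assumptions, not repository code):

    import tornado.ioloop
    import tornado.web

    def predict(text):
        # Placeholder for the trained LSTM classifier (assumption, not repo code).
        return "unknown"

    class ClassifyHandler(tornado.web.RequestHandler):
        def post(self):
            text = self.get_body_argument("text")  # text submitted from the web form
            self.write({"label": predict(text)})   # Tornado serializes dicts to JSON

    if __name__ == "__main__":
        app = tornado.web.Application([(r"/classify", ClassifyHandler)])
        app.listen(8888)  # port is an assumption
        tornado.ioloop.IOLoop.current().start()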
