Sentiment Classification

Movie Review Classification

This repository implements two deep learning models for binary sentiment analysis:

  • A Dynamic CNN model, also known as Kim CNN (a minimal sketch follows this list):

    This model uses an embedding layer to map every token in the dataset to a numerical vector, forming a matrix. The CNN can thus treat the input not as natural language data but as an image built from the tokens' vector representations, so it performs spatial analysis of the dataset. To make training more effective, we initialize the model with pretrained GloVe word embeddings.

  • A Long Short-Term Memory (LSTM) model (sketched after this list):

    This model uses a particular RNN cell architecture, the LSTM cell. These cells correlate data patterns and capture temporal dependencies; LSTMs are explicitly designed to avoid the long-term dependency problem and can retain information over long stretches of a sequence. To use the model's full potential, we add a Keras embedding layer, which learns custom word embeddings during training.
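As an illustration, here is a minimal Kim-style CNN in Keras. It is a sketch rather than the exact architecture in cnn.py: the vocabulary size, sequence length, filter settings, and the random stand-in for the GloVe matrix are all placeholder assumptions.

    import numpy as np
    from tensorflow.keras import initializers, layers, models

    # Placeholder hyperparameters; cnn.py may use different values.
    VOCAB_SIZE = 20000   # tokens kept from the dataset
    EMBED_DIM = 100      # GloVe vector dimensionality
    MAX_LEN = 400        # padded review length

    # Stand-in for a real GloVe matrix with one pretrained vector per token.
    embedding_matrix = np.random.rand(VOCAB_SIZE, EMBED_DIM)

    inputs = layers.Input(shape=(MAX_LEN,))
    x = layers.Embedding(
        VOCAB_SIZE,
        EMBED_DIM,
        embeddings_initializer=initializers.Constant(embedding_matrix),
        trainable=False,                              # freeze the pretrained embeddings
    )(inputs)
    # Parallel convolutions over 3-, 4-, and 5-token windows (the Kim CNN pattern).
    branches = []
    for kernel_size in (3, 4, 5):
        conv = layers.Conv1D(100, kernel_size, activation="relu")(x)
        branches.append(layers.GlobalMaxPooling1D()(conv))
    x = layers.Concatenate()(branches)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # binary sentiment score

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Freezing the embedding layer keeps the GloVe vectors fixed while the convolutional filters learn; making it trainable instead would fine-tune the vectors on the sentiment task.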

Therefore, while the Dynamic CNN performs spatial analysis, the LSTM performs temporal analysis.
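A comparable minimal LSTM sketch, again with placeholder sizes rather than the settings used in lstm.py; here the embedding layer is trainable, so the word vectors are learned from the data itself:

    from tensorflow.keras import layers, models

    VOCAB_SIZE = 20000  # placeholder vocabulary size
    EMBED_DIM = 100     # placeholder embedding width
    MAX_LEN = 400       # placeholder padded review length

    model = models.Sequential([
        layers.Input(shape=(MAX_LEN,)),
        # Embeddings are learned from scratch here, unlike the GloVe-based CNN above.
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.LSTM(128),                        # the LSTM cell summarizes the sequence
        layers.Dense(1, activation="sigmoid"),   # binary sentiment probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])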

The models were evaluated on two datasets:

  • The IMDB dataset for binary sentiment classification
  • The Cornell movie review dataset for binary sentiment classification

Installation

To install the project, first install Anaconda and then execute:

  • git clone https://github.com/andreasceid/sentiment_classification.git to clone the repository
  • cd sentiment_classification to access the project directory
  • pip install -r requirements.txt to install all dependencies
  • python cnn.py to execute the CNN model
  • python lstm.py to execute the LSTM model

We recommend running both the installation and the scripts with administrator privileges. The default training dataset is the Cornell movie review dataset; to switch datasets, edit the dataset path in the code, as sketched below.
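For example, assuming the path is kept in a variable near the top of cnn.py or lstm.py (the variable name here is hypothetical), the change would look like:

    # Hypothetical excerpt; the actual variable name in cnn.py / lstm.py may differ.
    DATASET_PATH = "dataset/MoviesDataset.csv"   # default: Cornell movie review dataset
    # DATASET_PATH = "dataset/IMDB.csv"          # switch to the IMDB dataset
    # DATASET_PATH = "path/to/your_dataset.csv"  # or point to a custom CSV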

Project Structure

The major project files are:

  • cnn.py and
  • lstm.py

There are also Jupyter notebooks that explain the developers' thought process and provide a better understanding of the models:

  • cnn.ipynb for the CNN model
  • lstm.ipynb for the LSTM model

The datasets used to test the models' performance can be found under the dataset directory of the project (a quick loading sketch follows this list):

  • The Cornell movie review dataset is named MoviesDataset.csv
  • The IMDB dataset is named IMDB.csv
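Either file can be inspected with pandas; the column names are whatever the CSV headers define, so check them before relying on any particular name:

    import pandas as pd

    # Load the Cornell dataset; swap in dataset/IMDB.csv for the IMDB data.
    df = pd.read_csv("dataset/MoviesDataset.csv")
    print(df.shape)   # number of reviews and columns
    print(df.head())  # preview the first rows and the actual column names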

There is also a directory named docs, which holds the documentation generated with Sphinx; the documentation is also available on the project's Read the Docs page.

Results

The models were evaluated on both datasets; the expected results are shown below. Note that the CNN was trained on a GPU while the LSTM ran on a CPU, so the per-epoch times are not directly comparable.


Cornell Dataset


                    CNN                     LSTM
Device              NVIDIA GTX 1660 Ti      Intel Core i7-9750H
Number of Epochs    20                      5
Time per Epoch      2 sec.                  7 sec.
Loss Function       Binary Cross Entropy    Binary Cross Entropy
Test Loss           0.519                   0.7698
Test Accuracy       74.57 %                 75.34 %

IMDB Dataset


                    CNN                     LSTM
Device              NVIDIA GTX 1660 Ti      AMD Ryzen 5 3600
Number of Epochs    20                      5
Time per Epoch      ~19 min.                142 sec.
Loss Function       Binary Cross Entropy    Binary Cross Entropy
Test Loss           0.259                   0.3021
Test Accuracy       90.41 %                 89.34 %