This program consists of a classifier that is able to decide whether a movie review is good or bad. The classifier is trained on an IMDB movie review dataset of 50k reviews. The program also spits out the accuracy and precision of the classifier.
- Numpy
- Pandas
- NLTK
- BeautifulSoup
- SciKit Learn
- Matplotlib
The data we used is taken from Kaggle - IMDB 50K Movie Reviews, which was in turn taken from a Stanford dataset (look to References for more).
In particular, merged_data.csv
is the the combination of the 'train' and 'test' files into one big file. We decided to merge them to be able to manage the proportion of the train and test set.
Instead, toy_data.csv
is just a smaller version of the data.
First of all, the required libraries have to be installed on the local machine to be able to run the program.
With pip, it can be done by
$ pip install -r requirements.txt
.
The program is stored as sentiment_classifier.py
, so running it with python is all you need to do.
We decided to use NLTK because, being a toolkit released in 2001, it can give us insights in how things changed through time, to be able to compare it to the newer models that other courses will delve into.
The classifier training has various steps:
- Loading and cleaning the data
- Extract features from the reviews
- Feed the features to the classifier
- Analyze its accuracy
The most fundamental step was the Feature Extraction, as it required us to define what is important from the text and what is not.
We also decided to create features with the Bag of Words method: we do not have a specific list of words we are looking for, but we decided to look into various grammatical categories (Nouns, Verbs, Adjectives, Adverbs and Numbers) and lemmatize them (reduce them to their dictionary form) to have a manageable size of the features.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).