Sentiment Analysis

Two different models for Twitter sentiment analysis - CNN and Naive Bayes Classifier

DATASET

SOURCE: http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip

  • the original training and test files were merged so that the split could be done manually with sklearn's train_test_split (see the sketch after this list)
  • the merged set contains over 1.6 million Tweets annotated with positive, negative and neutral sentiment
  • for the purposes of these models, the 169 occurrences with neutral sentiment were removed
  • the dataset turns out to be perfectly balanced, with an exact 1:1 ratio between positive and negative labels
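
A minimal sketch of this preparation step, assuming the standard Sentiment140 file names and column layout (target, id, date, flag, user, text; 0 = negative, 2 = neutral, 4 = positive) - the exact split ratio and random seed are illustrative assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

COLUMNS = ["target", "id", "date", "flag", "user", "text"]

# Assumed Sentiment140 file names from the linked archive.
train_df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                       encoding="latin-1", names=COLUMNS)
test_df = pd.read_csv("testdata.manual.2009.06.14.csv",
                      encoding="latin-1", names=COLUMNS)

# Merge both files so the split can be done manually.
df = pd.concat([train_df, test_df], ignore_index=True)

# Drop the neutral Tweets (target == 2), keeping 0 = negative and 4 = positive.
df = df[df["target"] != 2]

# Manual train/test split with sklearn.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["target"], test_size=0.2, random_state=42,
    stratify=df["target"])
```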

DATA PREPROCESSING

  • to practice manual implementation, preprocessing was done by removing junk with regex and NLTK's stopwords list rather than relying on the Tokenizer's built-in filters
  • based on Camacho-Collados & Pilehvar's findings in their paper On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis, I decided to skip stemming/lemmatization and proceed directly to tokenization
  • two different tokenizers were used, one for each model, since the output of tf's Tokenizer is not directly supported by Multinomial Naive Bayes, which therefore requires a Bag-of-Words representation instead (both paths are sketched below). This rules out a strictly like-for-like comparison between the models, but since that was not the primary goal, I do not consider it a hindrance. The same issue is likely to recur if I later add models that expect different inputs
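
A hedged sketch of the two preprocessing paths, continuing from the X_train / X_test split above; the specific regex patterns, vocabulary size and sequence length are assumptions rather than the repository's exact values:

```python
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def clean(tweet: str) -> str:
    """Remove URLs, @mentions, non-letter characters and stopwords."""
    tweet = re.sub(r"https?://\S+|www\.\S+", " ", tweet)   # URLs
    tweet = re.sub(r"@\w+", " ", tweet)                    # mentions
    tweet = re.sub(r"[^a-zA-Z\s]", " ", tweet).lower()     # junk characters
    return " ".join(w for w in tweet.split() if w not in STOPWORDS)

train_texts = [clean(t) for t in X_train]
test_texts = [clean(t) for t in X_test]

# Bag-of-Words features for Multinomial Naive Bayes.
vectorizer = CountVectorizer(max_features=20000)
X_train_bow = vectorizer.fit_transform(train_texts)
X_test_bow = vectorizer.transform(test_texts)

# Integer sequences for the CNN, produced by tf's Tokenizer.
tokenizer = Tokenizer(num_words=20000, oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)
X_train_seq = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=40)
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(test_texts), maxlen=40)
```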

MULTINOMIAL NAIVE BAYES

  • the Multinomial variant was chosen based on reports that it outperforms the Gaussian variant on text classification tasks - this has yet to be tested personally on this specific dataset
  • the model performed reasonably well, reaching 74% accuracy, with precision and recall at comparable values; it also correctly predicted the sentiment of a single test Tweet (a minimal sketch of this path follows the list)
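
A minimal sketch of the Naive Bayes path, reusing the Bag-of-Words features from the preprocessing sketch; hyperparameters are left at sklearn's defaults and the sample Tweet is hypothetical:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

nb = MultinomialNB()
nb.fit(X_train_bow, y_train)

# Accuracy, precision and recall on the held-out split.
print(classification_report(y_test, nb.predict(X_test_bow)))

# Sanity check on a single hand-written Tweet.
sample = vectorizer.transform([clean("I absolutely love this new phone")])
print(nb.predict(sample))  # expected: the positive label (4)
```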

CONVOLUTIONAL NEURAL NETWORK

  • the CNN uses a relatively simple architecture with two convolutional layers and one dense layer (sketched below)
  • since the model was overfitting massively, batch normalization and an increasing dropout rate were introduced to get it under control. For the same reason, a few layers were dropped from the original architecture and all the hyperparameters were scaled down - for example, the embedding dimension started at 128 and is now 32. To shorten training time the batch size was initially 1000, but that also contributed heavily to overfitting, so it was reduced to 100
  • after 5 epochs the model's accuracy and loss stabilize at around 80% and 0.46 respectively, and training is halted by an EarlyStopping callback
  • on the test set the model performed only marginally better than Naive Bayes, which raises the question of whether the CNN's significantly longer training time is worth a couple more percentage points of accuracy
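
A sketch of a CNN along the lines described above - embedding dimension 32, two Conv1D blocks with batch normalization and increasing dropout, a single dense output layer, EarlyStopping and batch size 100. The filter counts, kernel sizes, pooling choice and patience value are assumptions, not the repository's exact settings; it reuses X_train_seq / X_test_seq from the preprocessing sketch:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000
EMBED_DIM = 32  # reduced from 128 to curb overfitting

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Conv1D(64, 5, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Conv1D(32, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=2, restore_best_weights=True)

# Remap the positive label 4 -> 1 for the sigmoid output.
y_train_bin = (y_train == 4).astype(int).values
y_test_bin = (y_test == 4).astype(int).values

model.fit(X_train_seq, y_train_bin,
          validation_split=0.1, epochs=20, batch_size=100,
          callbacks=[early_stop])

model.evaluate(X_test_seq, y_test_bin)
```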
