This repository contains the source code of the final project for the course "Text Analytics" at the University of Pisa.
The work in this repository aims to evaluate reviews from web users using Artificial Intelligence (AI) and Natural Language Processing (NLP) techniques, with the ultimate goal of developing classification models useful for prediction tasks.
The dataset was retrieved through a challenge proposed on Kaggle.
The challenge was originally aimed at predicting user ratings; it was instead decided to extend it into a genre prediction task, with the objective of predicting a book's genre from its reviews alone. Genres were extracted from the original source that inspired the challenge: Goodreads.com.
Finally, the dataset consists of 10 categorical and numerical features, but only `review_text` was used for the scope of the project, together with the `genre` target variable.
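The column selection can be sketched with pandas; the miniature DataFrame below is a hypothetical stand-in for the real Kaggle file, which is not included here:

```python
import pandas as pd

# Hypothetical miniature of the dataset: the real file has 10 categorical
# and numerical columns, but only two matter for this project.
df = pd.DataFrame({
    "user_id": [1, 2],
    "rating": [5, 3],
    "review_text": [
        "A gripping mystery with a twist ending.",
        "Loved the world-building in this fantasy epic.",
    ],
    "genre": ["mystery", "fantasy"],
})

# Keep only the review text (feature) and the genre (target).
data = df[["review_text", "genre"]]
print(data.shape)
```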
After verifying the quality of the data, a pre-processing phase takes place, in which different techniques are applied to the reviews.
The NLTK library was the main one adopted to clean the texts: `nltk.word_tokenize`, `nltk.corpus.stopwords`, `nltk.stem.wordnet.WordNetLemmatizer`, `nltk.pos_tag`, and more were used to make the reviews usable by classification models.
Each pre-processed version of the reviews has been vectorized with the following:
- `Tokenizer` (provided by Keras, generally used for neural networks)
- `CountVectorizer`
- TF-IDF
- Top2Vec, creating a new filtered column based on words similar to the different genre classes
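The scikit-learn vectorizers from the list can be sketched as follows (the Keras `Tokenizer` and Top2Vec are omitted here to keep the example light); the toy reviews are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reviews = [
    "a gripping mystery with a shocking twist",
    "epic fantasy full of dragons and magic",
    "a slow mystery but a satisfying ending",
]

# Bag-of-words: raw term counts per review.
bow = CountVectorizer()
X_counts = bow.fit_transform(reviews)

# TF-IDF: reweights the counts by how informative each term is.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(reviews)

print(X_counts.shape, X_tfidf.shape)
```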
In this last phase, several classifiers were tested, and their performance compared across the different embeddings and strategies. In particular, the following classifiers were used for the classification task: LSTM, SVM, Random Forest, and Naive Bayes.
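A sketch of how the scikit-learn classifiers can be compared on TF-IDF features (the LSTM, being a Keras model, is left out; the tiny review/genre pairs are hypothetical placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real review/genre pairs.
reviews = [
    "a gripping mystery with a detective and a twist",
    "dragons and wizards in an epic fantasy quest",
    "the detective solved the murder mystery",
    "a magical fantasy world full of elves",
]
genres = ["mystery", "fantasy", "mystery", "fantasy"]

# Fit each classifier on the same TF-IDF representation and predict.
for clf in (MultinomialNB(), LinearSVC(), RandomForestClassifier(random_state=0)):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(reviews, genres)
    print(type(clf).__name__, model.predict(["a detective mystery novel"]))
```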
Finally, a state-of-the-art Transformer (BERT) was used to compare the performance of classic ML models against text-specific pre-trained models.
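The starting point of such a comparison can be sketched with the Hugging Face `transformers` library; this assumes access to the `bert-base-uncased` checkpoint on the Hub, and only the tokenization step is executed, with fine-tuning outlined in comments:

```python
from transformers import AutoTokenizer

# Standard English BERT checkpoint (downloaded from the Hugging Face hub).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

reviews = ["A gripping mystery.", "An epic fantasy adventure."]
enc = tokenizer(reviews, padding=True, truncation=True, max_length=64,
                return_tensors="pt")

# For fine-tuning, a classification head sized to the number of genres
# would be attached, e.g.:
#   from transformers import AutoModelForSequenceClassification
#   model = AutoModelForSequenceClassification.from_pretrained(
#       "bert-base-uncased", num_labels=n_genres)
# followed by a standard training loop (or transformers.Trainer).
print(enc["input_ids"].shape)
```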
The main text tools adopted in this project are:
- NLTK: a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, etc.
- Gensim: a Python library for topic modelling, document indexing, and similarity retrieval with large corpora
- spaCy: for NER text representation
- PyTorch: to deal with tensors
- Transformers: for pre-trained models
- Scikit-Learn: machine learning models and utility functions