Skip to content

Language processing classification and prediciton ML project

Notifications You must be signed in to change notification settings

Dpm-a/ML-NLP-Goodreads-Book-Classification

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

56 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ww

NLP & Text Analytics


Goodreads Book Reviews Classification

This repository contains the source code of the final project for the course " Text Analytics" at the University of Pisa.

The work done in this repository aims to evaluate reviews from web users using Artificial Intelligence (AI) and natural language processing (NLP) techniques with the ultimate goal of developing classification models useful for prediction tasks.

Phase 1: Retrieving Data

The dataset was retrieved through a challenge proposed on Kaggle.

The latter was aimed towards the prediction of user ratings, instead it was decided to further develop the challenge into a genre prediction task with the objective of evaluating a book genre just from its reviews. Genres have been extracted from the original source inspired by the challenge: Goodreads.com

FInally, the datasets consist of 10 categorical and numerical features, but only review_text was used for the scope of the project, by also considering the genre target variable.

Phase 2: Data Exploration and Preprocessing

After making sure of the goodness of the data, a pre-processing phase takes place, and different techniques have been applied to the reviews.

NLTK library was the one mostly adopted to fix texts: indeed nltk.word_tokenize, nltk.corpus.stopwords, nltk.stem.wordnet.WordNetLemmatizer, nltk.pos_tag and more have been used to make the reviews operable with classification models.

Each pre-processed version of the reviews has been been vectorized by the following:

  • Tokenizer (provided by Keras, in general used for NNs)
  • CountVectorizer
  • Tf-idf
  • Top2Vec creating a new filtered column based on words similar to the different genre classes

Phase 3: ML modelling

In this last phase several classifiers were tested and different performances have been identified based on the embeddings and strategies. In particular, the following classifiers were exploited for the classification task: LSTM, SVM, Random Forests, Naive Bayes.

Finally a state of the art Transformers (BERT) has been used to compare performance between classic ML models and text-ad hoc pretrained models.


Tools

The main text tools adopted in this project are:

  • NLTK a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing etc.
  • Gensim a Python library for topic modelling, document indexing and similarity retrieval with large corpora
  • Spacy for NER text representation
  • PyTorch to deal with Tensors
  • Transformers for pre-trained models.
  • Scikit-Learn provides machine learning models and utility functions.

About

Language processing classification and prediciton ML project

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 100.0%