Amazon Review Helpfulness Prediction App

This repository contains the Streamlit front-end application for predicting the helpfulness of Amazon reviews with a machine learning model. The app includes both a training pipeline and a prediction pipeline, and it is part of a larger project hosted on Google Cloud Platform (GCP).

App Overview

The application is a Streamlit front-end designed to perform two main functions:

Training Pipeline:
- Data Ingestion: Reads cleaned review data from a BigQuery table.
- Data Processing: Performs feature engineering, text preprocessing, and transformations.
- Model Training: Trains a classification model to predict review helpfulness.
- Logging: Logs all models to MLflow. If the model outperforms previous versions, it is saved to the Vertex AI model registry.
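
The promotion rule in the Logging step can be sketched as a small comparison. This is a minimal illustration, not the app's actual code: the function name `should_promote`, the use of a single higher-is-better score, the strict inequality, and promoting the first model when no earlier runs exist are all assumptions.

```python
from typing import Iterable


def should_promote(new_score: float, previous_scores: Iterable[float]) -> bool:
    """Decide whether a newly trained model should be promoted.

    Assumption: scores are higher-is-better (for example F1 or accuracy)
    and were read back from earlier logged runs.
    """
    previous = list(previous_scores)
    # Assumption: with no earlier runs, the first model is promoted by default.
    if not previous:
        return True
    # Promote only when the new model strictly beats the best earlier score.
    return new_score > max(previous)


if __name__ == "__main__":
    print(should_promote(0.81, [0.74, 0.79]))  # new model beats both earlier runs
    print(should_promote(0.79, [0.74, 0.79]))  # a tie is not promoted
```

Every run would still be logged either way; only the registry upload is gated by this check.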
Prediction Pipeline:
- Loads the saved TF-IDF model, numerical transformers, and the trained classification model.
- Takes user input (a review) and returns a prediction on whether the review is likely helpful.
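
The order of operations in the prediction pipeline can be sketched as below. The three callables are stand-ins for the loaded TF-IDF model, the numerical transformers, and the classifier; how the real app wraps those objects is not documented here. The helper names, the two numeric features (character count and word count), and the convention that label `1` means "helpful" are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def numeric_features(review: str) -> List[float]:
    """Hypothetical numeric features: character count and word count."""
    return [float(len(review)), float(len(review.split()))]


def predict_helpful(
    review: str,
    vectorize: Callable[[str], Vector],
    scale: Callable[[Vector], Vector],
    classify: Callable[[Vector], int],
) -> bool:
    """Apply the saved transformers in order, then classify the combined row."""
    text_part = list(vectorize(review))                   # text -> TF-IDF-style vector
    numeric_part = list(scale(numeric_features(review)))  # scaled numeric features
    # Assumption: the classifier returns 1 for "helpful" and 0 otherwise.
    return classify(text_part + numeric_part) == 1


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any saved artifacts.
    toy_vectorize = lambda text: [1.0 if "battery" in text.lower() else 0.0]
    toy_scale = lambda values: [v / 100.0 for v in values]
    toy_classify = lambda row: 1 if row[0] > 0.5 else 0
    print(predict_helpful("Battery lasts two days.", toy_vectorize, toy_scale, toy_classify))
```

Loading the artifacts once at startup and passing them into a function like this keeps each prediction request cheap.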

Architecture Summary

The app is part of a larger architecture that performs the following steps:

Data Collection: A Dataproc PySpark job, triggered daily by Cloud Scheduler, scrapes Amazon reviews and publishes them as messages to a Pub/Sub topic.
Data Storage: The messages are streamed to BigQuery via a Pub/Sub subscription.
Data Cleaning: A scheduled query processes the raw data, cleans it, and writes it to a BigQuery table.
Model Training & Prediction: The Streamlit app (this repository) reads the cleaned data, trains the model, and logs it to MLflow. The app also generates predictions from new user input.
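
Pub/Sub message data is a byte string, so the collection step has to serialize each review before publishing. The sketch below shows one way to do that with JSON; the field names (`review_id`, `review_text`, `helpful_votes`) and the helper names are hypothetical, and the actual message schema used by the Dataproc job is not described in this README.

```python
import json


def encode_review(review: dict) -> bytes:
    """Serialize a scraped review into a UTF-8 JSON payload for publishing."""
    # sort_keys keeps the payload stable, which helps when comparing messages.
    return json.dumps(review, ensure_ascii=False, sort_keys=True).encode("utf-8")


def decode_review(payload: bytes) -> dict:
    """Reverse of encode_review, as a subscriber would do on receipt."""
    return json.loads(payload.decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical record; the real fields depend on the scraper.
    record = {"review_id": "R1", "review_text": "Works well.", "helpful_votes": 3}
    payload = encode_review(record)
    print(payload)
    print(decode_review(payload) == record)
```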

Technologies Used

Google Cloud Platform:
- BigQuery: For storing and managing the cleaned review data.
- Pub/Sub: For data streaming and message passing.
MLflow: For experiment tracking and model registry.
Streamlit: To provide an interactive front-end for training and predictions.
PySpark: Used in the larger architecture for initial data scraping and processing.

Setup Instructions

Clone the Repository

git clone https://github.com/yourusername/amazon-review-helpfulness-app.git
cd amazon-review-helpfulness-app
