This repository contains the Streamlit front-end application for predicting the helpfulness of Amazon reviews using a machine learning model. The app includes both a training pipeline and a prediction pipeline, and is part of a larger project hosted on Google Cloud Platform (GCP).
- App Overview
- Architecture Summary
- Technologies Used
- Setup Instructions
- App Usage
- Future Improvements
The application is a Streamlit front-end designed to perform two main functions:
- Training Pipeline:
  - Data Ingestion: Reads cleaned review data from a BigQuery table.
  - Data Processing: Performs feature engineering, text preprocessing, and transformations.
  - Model Training: Trains a classification model to predict review helpfulness.
  - Logging: Logs all models to MLflow. If the model outperforms previous versions, it is saved to the Vertex AI model registry.
- Prediction Pipeline:
  - Loads the saved TF-IDF model, numerical transformers, and the trained classification model.
  - Takes user input (a review) and returns a prediction on whether the review is likely helpful.
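
The registration rule in the training pipeline's Logging step (every model is logged, but only a model that outperforms previous versions is saved to the registry) can be sketched as a simple comparison. This is a minimal illustration, not the app's actual code: the function name `should_register` is hypothetical, and it assumes a single evaluation metric where higher is better and where a tie does not replace the existing model. The MLflow and Vertex AI calls themselves are left out.

```python
from typing import Iterable, Optional


def should_register(new_score: float, previous_scores: Iterable[float]) -> bool:
    """Return True when the new model beats every previously logged model.

    Assumption: a higher score is better (for example F1 or accuracy
    measured on a held-out set).
    """
    best_previous: Optional[float] = max(previous_scores, default=None)
    # With no earlier runs, the first model is registered by default.
    return best_previous is None or new_score > best_previous


if __name__ == "__main__":
    history = [0.71, 0.74]  # hypothetical scores from earlier runs
    print(should_register(0.78, history))  # True
    print(should_register(0.74, history))  # False: a tie does not replace
```

Keeping the decision in one small function makes the promotion rule easy to test separately from the tracking and registry calls.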
The app is part of a larger architecture that performs the following steps:
- Data Collection: A Dataproc PySpark job, triggered by a Cloud Scheduler job, scrapes Amazon reviews daily and publishes messages to a Pub/Sub topic.
- Data Storage: The messages are streamed to BigQuery via a Pub/Sub subscription.
- Data Cleaning: A scheduled query processes the raw data, cleans it, and writes it to a BigQuery table.
- Model Training & Prediction: The Streamlit app (this repository) reads the cleaned data, trains the model, and logs it to MLflow. The app is also used to generate predictions based on new user input.
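
As an illustration of the Data Collection step, the sketch below shows one way a scraped review could be serialized before it is published. Pub/Sub message payloads are byte strings, so the record is encoded as UTF-8 JSON. The field names (`review_id`, `text`, `helpful_votes`, `scraped_at`) and both helper functions are assumptions made for this example; the real message schema is defined by the scraping job, not by this repository.

```python
import json
from datetime import datetime, timezone


def encode_review_message(review_id: str, text: str, helpful_votes: int) -> bytes:
    """Serialize one scraped review as UTF-8 JSON bytes.

    All field names are hypothetical placeholders for the real schema.
    """
    record = {
        "review_id": review_id,
        "text": text,
        "helpful_votes": helpful_votes,
        # Timestamp in UTC so downstream cleaning queries can order records.
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def decode_review_message(payload: bytes) -> dict:
    """Inverse of encode_review_message, e.g. for a subscriber or a test."""
    return json.loads(payload.decode("utf-8"))
```

With the `google-cloud-pubsub` client, bytes like these would typically be passed as the `data` argument of `PublisherClient.publish(topic_path, data)`; the topic path depends on the project configuration.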
- Google Cloud Platform:
  - BigQuery: For storing and managing the cleaned review data.
  - Pub/Sub: For data streaming and message passing.
- MLflow: For model registry and tracking.
- Streamlit: To provide an interactive front-end for training and predictions.
- PySpark: Used in the larger architecture for initial data scraping and processing.
- Clone the Repository
  git clone https://github.com/yourusername/amazon-review-helpfulness-app.git
  cd amazon-review-helpfulness-app