rigvedrs/Big-Data-E-commerce-Analytics

# 🛒 E-commerce Clickstream Analytics & Purchase Prediction


**Big Data Analytics Project**: Analyzing e-commerce user behavior and predicting purchase likelihood using machine learning on large-scale clickstream data.

## 📊 Project Overview

This project leverages big data technologies and machine learning to analyze e-commerce user behavior patterns and predict purchase likelihood from a massive clickstream dataset. Built with PySpark for distributed computing and Streamlit for interactive visualization.

### 🎯 Key Objectives

- 📈 Analyze product category and brand performance
- 🛒 Evaluate cart abandonment behavior patterns
- ⏰ Identify temporal shopping trends and peak hours
- 🤖 Build an ML classification model for purchase prediction
- 📱 Develop an interactive dashboard for real-time insights

๐Ÿ—๏ธ Architecture & Technologies

graph TD
    A[Raw Data 5.6GB] --> B[PySpark Processing]
    B --> C[Feature Engineering]
    C --> D[Random Forest Model]
    C --> E[Analytics Engine]
    D --> F[Streamlit Dashboard]
    E --> F
    F --> G[Real-time Predictions]
    F --> H[Interactive Analytics]
Loading

๐Ÿ› ๏ธ Tech Stack

Component Technology Purpose
Data Processing PySpark 3.5.1 Distributed computing for large datasets
Machine Learning Spark MLlib Classification model training
Web Framework Streamlit Interactive dashboard
Visualization Plotly, Matplotlib Charts and graphs
Data Analysis Pandas, NumPy Data manipulation
Environment Python 3.8+ Core runtime

๐Ÿ“ Project Structure

SP25-CS-GY-6513-Team-14/
โ”œโ”€โ”€ ๐Ÿ“ฑ app.py                          # Streamlit Dashboard Application
โ”œโ”€โ”€ ๐Ÿ“Š Notebook/
โ”‚   โ”œโ”€โ”€ Big_Data_Analytics.ipynb       # Main Analytics Notebook
โ”‚   โ””โ”€โ”€ Training(1).ipynb              # ML Model Training
โ”œโ”€โ”€ ๐Ÿ–ผ๏ธ imgs/                           # Results & Screenshots
โ”‚   โ”œโ”€โ”€ results_1.png
โ”‚   โ””โ”€โ”€ results_2.png
โ”œโ”€โ”€ ๐Ÿค– RF_model/                       # Trained Random Forest Model
โ”‚   โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ metadata/
โ”‚   โ””โ”€โ”€ treesMetadata/
โ”œโ”€โ”€ ๐Ÿ“‹ requirements.txt                # Python Dependencies
โ”œโ”€โ”€ ๐Ÿ“„ data_without_purchase_event.csv # Processed Dataset
โ””โ”€โ”€ ๐Ÿ“‚ extras/                         # Additional Scripts & Data

## 🚀 Quick Start

### 🔧 Environment Setup

1. **Install ASDF** (if not already installed)

   ```shell
   # Follow the installation guide: https://asdf-vm.com/guide/getting-started.html
   ```

2. **Install Required Plugins**

   ```shell
   asdf plugin add python
   asdf plugin add java
   ```

3. **Install Runtime Versions**

   ```shell
   asdf install
   ```

4. **Install Python Dependencies**

   ```shell
   pip install -r requirements.txt
   ```

๐Ÿƒโ€โ™‚๏ธ Running the Application

  1. Launch Streamlit Dashboard

    streamlit run app.py
  2. Access the Dashboard

    • Open your browser to http://localhost:8501
    • Navigate between Predictions and Analytics tabs

## 📊 Key Features & Insights

### 🎯 Purchase Prediction Model

- **Algorithm**: Random Forest Classifier
- **Features**: User behavior, product categories, temporal patterns
- **Performance**: Optimized for real-time predictions
- **Deployment**: Integrated Streamlit interface

### 📈 Analytics Dashboard

๐Ÿ† Top Performing Categories

Rank Category Browse Count Purchase Count
1 Electronics 15.7M 423K
2 Appliances 4.9M 75K
3 Computers 2.3M 28K
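The browse-to-purchase conversion implied by the table above is easy to compute. A quick illustrative sketch (the counts are the table's rounded figures, and the `conversion_rate` helper is ours, so the rates are approximate):

```python
# Browse and purchase counts taken from the table above (rounded figures).
category_stats = {
    "Electronics": (15_700_000, 423_000),
    "Appliances": (4_900_000, 75_000),
    "Computers": (2_300_000, 28_000),
}

def conversion_rate(browses: int, purchases: int) -> float:
    """Percentage of browse events that ended in a purchase."""
    return purchases / browses * 100

rates = {cat: round(conversion_rate(b, p), 2)
         for cat, (b, p) in category_stats.items()}
print(rates)  # Electronics converts at roughly 2.7%
```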

#### 🛒 Cart Abandonment Analysis

- **Electronics**: 37.3% abandonment rate
- **Xiaomi**: 54.9% abandonment rate
- **Sony**: 35.7% abandonment rate
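The rates above follow the standard formula 1 − purchases/carts, the same arithmetic the Spark query later in this README applies per category. A minimal sketch; the cart and purchase counts below are hypothetical, chosen only to reproduce the Electronics figure:

```python
def abandonment_rate(cart_count: int, purchase_count: int) -> float:
    """Share of carted items that were never purchased, as a percentage."""
    return (1 - purchase_count / cart_count) * 100

# Hypothetical counts that illustrate the 37.3% Electronics rate above.
print(round(abandonment_rate(1000, 627), 1))  # → 37.3
```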

โฐ Peak Shopping Hours

  • Prime Time: 16:00 (4 PM)
  • Recommendation: Flash sales 13:00-16:00 for maximum impact
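Peak-hour detection reduces to counting events per hour and taking the maximum. A pure-Python sketch with made-up timestamps (the project does this with PySpark over the full clickstream, not with a Python list):

```python
from collections import Counter
from datetime import datetime

# Hypothetical event timestamps; the real pipeline extracts the hour
# from the clickstream's event time column at Spark scale.
events = [
    datetime(2019, 10, 5, 13, 12), datetime(2019, 10, 5, 16, 1),
    datetime(2019, 10, 5, 16, 40), datetime(2019, 10, 5, 16, 55),
    datetime(2019, 10, 5, 21, 3),
]

hour_counts = Counter(e.hour for e in events)
peak_hour, peak_events = hour_counts.most_common(1)[0]
print(peak_hour)  # → 16
```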

## 📸 Visual Results

### 📹 Live Demo of App

Screen.Recording.2025-09-15.at.1.21.30.AM.mov

*Interactive demonstration of the Streamlit dashboard with real-time predictions and analytics*

### 🎨 Some Results

![Analytics](imgs/results_1.png)

![Analytics](imgs/results_2.png)

## 📈 Key Findings

๐Ÿช Product Performance

  • Top Category: Electronics dominates with 15.7M browses and 423K purchases
  • Conversion Leader: Notebooks show highest cart-to-purchase conversion
  • Brand Insights: Samsung leads electronics browsing, Apple follows closely

### 🛒 Shopping Behavior

- **Peak Activity**: 16:00 shows maximum user engagement
- **Temporal Trends**: Mid-month (days 11-16) sees the highest purchase intent
- **Cart Abandonment**: Varies significantly by category (18.8%–54.9%)

### 💡 Business Recommendations

1. **Flash Sales**: Target 13:00-16:00 for maximum impact
2. **Mid-Month Promotions**: Capitalize on the purchase peaks on days 11-16
3. **Category Focus**: Prioritize electronics and appliances inventory
4. **Abandonment Recovery**: Implement targeted campaigns for high-abandonment categories

## 🔬 Technical Implementation

### 🧠 Machine Learning Pipeline

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Feature engineering: index the categorical columns
categoryIdxer = StringIndexer(inputCol='category', outputCol='category_idx')
event_typeIdxer = StringIndexer(inputCol='event_type', outputCol='event_type_idx')
brandIdxer = StringIndexer(inputCol='brand', outputCol='brand_idx')

# One-hot encoding
one_hot_encoder_category = OneHotEncoder(inputCol="category_idx", outputCol="category_vec")

# Vector assembly: combine categorical and numeric features
assembler = VectorAssembler(inputCols=["features_cat", "features_num"], outputCol="features")

# Random Forest model
rf = RandomForestClassifier(featuresCol="features", labelCol="label")
```
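What `StringIndexer` does in the pipeline above can be illustrated without Spark: it assigns each distinct label a numeric index ordered by descending frequency (Spark's default `frequencyDesc` ordering, with ties broken alphabetically). A pure-Python mimic, not the MLlib implementation:

```python
from collections import Counter

def string_index(labels):
    """Mimic Spark's StringIndexer with the default 'frequencyDesc'
    ordering: the most frequent label gets index 0.0, ties sort
    alphabetically, and indices are floats like MLlib's output."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return [mapping[label] for label in labels]

events = ["view", "cart", "view", "purchase", "view", "cart"]
print(string_index(events))  # → [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
```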

### 📊 Data Processing Highlights

- **Dataset Size**: 5.6GB (scalable to 60GB yearly)
- **Records Processed**: 42M+ clickstream events
- **Unique Visitors**: 3M+ in October 2019
- **Processing Engine**: PySpark distributed computing

## 🎯 Usage Examples

### 🔮 Making Predictions

```python
from pyspark.sql.functions import col, when

# Load your data
df = spark.read.option("header", "true").csv("your_data.csv")

# Apply the preprocessing pipeline
df_transformed = pipeline.fit(df).transform(df)

# Generate predictions
predictions = model.transform(df_transformed)

# Get user-level predictions
user_predictions = predictions.select(
    "user_id",
    when(col("prediction") == 1, "Will Purchase").otherwise("Won't Purchase").alias("prediction")
)
```
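The final `when`/`otherwise` step is just a conditional on the predicted class. The same mapping in plain Python terms, with hypothetical prediction values:

```python
def label_prediction(prediction: float) -> str:
    # Mirrors when(col("prediction") == 1, "Will Purchase")
    #         .otherwise("Won't Purchase") from the snippet above.
    return "Will Purchase" if prediction == 1 else "Won't Purchase"

# MLlib emits class predictions as floats (0.0 / 1.0).
labels = [label_prediction(p) for p in [1.0, 0.0, 1.0]]
```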

### 📊 Analytics Queries

```python
from pyspark.sql.functions import col, desc

# Top categories by purchase count
top_categories = df_purchase.groupBy("category").count().orderBy(desc("count"))

# Cart abandonment rate calculation
cart_abandonment = cart_purchase_df.withColumn(
    "abandonment_rate",
    (1 - col("purchase_count") / col("cart_count")) * 100
)
```

## 📂 Data Source

**Original Dataset**: E-commerce Behavior Data

- **Provider**: REES46 for eCommerce
- **Size**: 5.6GB clickstream data
- **Period**: October 2019
- **Events**: View, Cart, Purchase

## 🚧 Challenges & Solutions

| Challenge | Solution |
|---|---|
| Memory issues | Adopted PySpark for distributed processing |
| Missing data | Implemented UDFs with fallback strategies |
| Performance | Used Spark's lazy evaluation and caching |
| Visualization | Aggregated data before plotting |
| Deployment | Streamlit for interactive dashboard |

## 🔮 Future Enhancements

### 🎯 Planned Features

- **User Segmentation**: K-means clustering for behavior analysis
- **Real-time Analytics**: Kafka + Spark Streaming integration
- **Sequential Mining**: PrefixSpan for navigation pattern analysis
- **Advanced ML**: XGBoost and deep learning models
- **Geographic Analysis**: Location-based insights

### 📊 Advanced Analytics

- **Cohort Analysis**: User retention tracking
- **A/B Testing**: Feature impact measurement
- **Recommendation Engine**: Collaborative filtering
- **Anomaly Detection**: Fraud and bot detection
