Big Data Analytics Project: Analyzing e-commerce user behavior and predicting purchase likelihood using machine learning on large-scale clickstream data.
This project leverages big data technologies and machine learning to analyze e-commerce user behavior patterns and predict purchase likelihood from a massive clickstream dataset. Built with PySpark for distributed computing and Streamlit for interactive visualization.
- 📊 Analyze product category and brand performance
- 🛒 Evaluate cart abandonment behavior patterns
- ⏰ Identify temporal shopping trends and peak hours
- 🤖 Build an ML classification model for purchase prediction
- 📱 Develop an interactive dashboard for real-time insights
```mermaid
graph TD
    A[Raw Data 5.6GB] --> B[PySpark Processing]
    B --> C[Feature Engineering]
    C --> D[Random Forest Model]
    C --> E[Analytics Engine]
    D --> F[Streamlit Dashboard]
    E --> F
    F --> G[Real-time Predictions]
    F --> H[Interactive Analytics]
```
| Component | Technology | Purpose |
|---|---|---|
| Data Processing | PySpark 3.5.1 | Distributed computing for large datasets |
| Machine Learning | Spark MLlib | Classification model training |
| Web Framework | Streamlit | Interactive dashboard |
| Visualization | Plotly, Matplotlib | Charts and graphs |
| Data Analysis | Pandas, NumPy | Data manipulation |
| Environment | Python 3.8+ | Core runtime |
```
SP25-CS-GY-6513-Team-14/
├── 📱 app.py                            # Streamlit Dashboard Application
├── 📓 Notebook/
│   ├── Big_Data_Analytics.ipynb         # Main Analytics Notebook
│   └── Training(1).ipynb                # ML Model Training
├── 🖼️ imgs/                             # Results & Screenshots
│   ├── results_1.png
│   └── results_2.png
├── 🤖 RF_model/                         # Trained Random Forest Model
│   ├── data/
│   ├── metadata/
│   └── treesMetadata/
├── 📄 requirements.txt                  # Python Dependencies
├── 📊 data_without_purchase_event.csv   # Processed Dataset
└── 📁 extras/                           # Additional Scripts & Data
```
1. Install ASDF (if not already installed)

   Follow the installation guide: https://asdf-vm.com/guide/getting-started.html

2. Install the required plugins

   ```bash
   asdf plugin add python
   asdf plugin add java
   ```

3. Install the runtime versions

   ```bash
   asdf install
   ```

4. Install the Python dependencies

   ```bash
   pip install -r requirements.txt
   ```

5. Launch the Streamlit dashboard

   ```bash
   streamlit run app.py
   ```

6. Access the dashboard
   - Open your browser to http://localhost:8501
   - Navigate between the Predictions and Analytics tabs
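Note that `asdf install` reads pinned runtimes from a `.tool-versions` file in the repository root. If the repository does not ship one, a minimal example looks like the sketch below; the exact version strings are illustrative assumptions, not the project's pins.

```
# .tool-versions (illustrative versions; match them to requirements.txt)
python 3.8.18
java openjdk-11.0.2
```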
- Algorithm: Random Forest Classifier
- Features: User behavior, product categories, temporal patterns
- Performance: Optimized for real-time predictions
- Deployment: Integrated Streamlit interface
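As a sketch of how the dashboard could load the persisted model from the `RF_model/` directory (the path and session name below are assumptions based on the repository layout; `app.py` may differ):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassificationModel

# Spark session for the dashboard process
spark = SparkSession.builder.appName("purchase-dashboard").getOrCreate()

# Load the Random Forest saved by the training notebook (assumed path)
model = RandomForestClassificationModel.load("RF_model")
```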
| Rank | Category | Browse Count | Purchase Count |
|---|---|---|---|
| 1 | Electronics | 15.7M | 423K |
| 2 | Appliances | 4.9M | 75K |
| 3 | Computers | 2.3M | 28K |
- Electronics: 37.3% abandonment rate
- Xiaomi: 54.9% abandonment rate
- Sony: 35.7% abandonment rate
- Prime Time: 16:00 (4 PM)
- Recommendation: Flash sales 13:00-16:00 for maximum impact
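The peak-hour numbers come from bucketing events by hour of day; a minimal sketch of that aggregation, assuming `event_time` has already been parsed to a timestamp column:

```python
from pyspark.sql.functions import hour, col

# Count events per hour of day to locate peak engagement windows
hourly_activity = (
    df.withColumn("hour", hour(col("event_time")))
      .groupBy("hour")
      .count()
      .orderBy("hour")
)
hourly_activity.show(24)
```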
Demo video (`Screen.Recording.2025-09-15.at.1.21.30.AM.mov`): an interactive demonstration of the Streamlit dashboard with real-time predictions and analytics.
- Top Category: Electronics dominates with 15.7M browses and 423K purchases
- Conversion Leader: Notebooks show highest cart-to-purchase conversion
- Brand Insights: Samsung leads electronics browsing, Apple follows closely
- Peak Activity: 16:00 shows maximum user engagement
- Temporal Trends: Mid-month (days 11-16) see highest purchase intent
- Cart Abandonment: Varies significantly by category (18.8% - 54.9%)
- Flash Sales: Target 13:00-16:00 for maximum impact
- Mid-Month Promotions: Capitalize on days 11-16 purchase peaks
- Category Focus: Prioritize electronics and appliances inventory
- Abandonment Recovery: Implement targeted campaigns for high-abandonment categories
```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Feature engineering pipeline: index the categorical columns
categoryIdxer = StringIndexer(inputCol='category', outputCol='category_idx')
event_typeIdxer = StringIndexer(inputCol='event_type', outputCol='event_type_idx')
brandIdxer = StringIndexer(inputCol='brand', outputCol='brand_idx')

# One-hot encode the indexed category
one_hot_encoder_category = OneHotEncoder(inputCol="category_idx", outputCol="category_vec")

# Assemble categorical and numeric features into a single vector
assembler = VectorAssembler(inputCols=["features_cat", "features_num"], outputCol="features")

# Random Forest model
rf = RandomForestClassifier(featuresCol="features", labelCol="label")
```
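These stages are typically chained into a single Spark ML `Pipeline` so the same fitted transformations apply at training and inference time. A sketch under that assumption (`train_df` and the intermediate `features_cat`/`features_num` assembly are placeholders; the notebook's exact wiring may differ):

```python
from pyspark.ml import Pipeline

# Fit the full chain: indexers learn string-to-index maps, the encoder and
# assembler build feature vectors, and the Random Forest trains last
pipeline = Pipeline(stages=[
    categoryIdxer, event_typeIdxer, brandIdxer,
    one_hot_encoder_category, assembler, rf,
])
pipeline_model = pipeline.fit(train_df)
```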
- Dataset Size: 5.6GB (scalable to 60GB yearly)
- Records Processed: 42M+ clickstream events
- Unique Visitors: 3M+ in October 2019
- Processing Engine: PySpark distributed computing
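Working through 5.6GB locally generally means giving the driver enough memory and a sensible shuffle partition count; an illustrative session setup (these values are assumptions, not the project's configuration):

```python
from pyspark.sql import SparkSession

# Local session sized for a multi-gigabyte CSV; tune to available RAM
spark = (
    SparkSession.builder
    .appName("ecommerce-clickstream")
    .config("spark.driver.memory", "8g")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)
```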
```python
from pyspark.sql.functions import col, when

# Load your data
df = spark.read.option("header", "true").csv("your_data.csv")

# Apply the preprocessing pipeline
df_transformed = pipeline.fit(df).transform(df)

# Generate predictions with the trained model
predictions = model.transform(df_transformed)

# Get user-level predictions
user_predictions = predictions.select(
    "user_id",
    when(col("prediction") == 1, "Will Purchase")
        .otherwise("Won't Purchase")
        .alias("prediction"),
)
```
```python
from pyspark.sql.functions import col, desc

# Top categories by purchase count
top_categories = df_purchase.groupBy("category").count().orderBy(desc("count"))

# Cart abandonment rate: share of carted items never purchased
cart_abandonment = cart_purchase_df.withColumn(
    "abandonment_rate",
    (1 - col("purchase_count") / col("cart_count")) * 100,
)
```
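Here `cart_purchase_df` must hold per-category cart and purchase counts; one plausible construction (a sketch, since the notebook's exact code isn't shown here):

```python
from pyspark.sql.functions import col

# Count cart and purchase events per category, then join the two
cart_counts = (
    df.filter(col("event_type") == "cart")
      .groupBy("category").count()
      .withColumnRenamed("count", "cart_count")
)
purchase_counts = (
    df.filter(col("event_type") == "purchase")
      .groupBy("category").count()
      .withColumnRenamed("count", "purchase_count")
)
cart_purchase_df = cart_counts.join(purchase_counts, "category", "left")
```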
Original Dataset: E-commerce Behavior Data
- Provider: REES46 for eCommerce
- Size: 5.6GB clickstream data
- Period: October 2019
- Events: View, Cart, Purchase
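Declaring the schema up front skips Spark's inference pass over the 5.6GB file; a sketch using the REES46 column names (the file name is a placeholder, and `event_time` is read as a string to be cast with `to_timestamp` afterwards):

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, DoubleType)

# REES46 clickstream columns; event_time stays a string until cast
schema = StructType([
    StructField("event_time", StringType()),
    StructField("event_type", StringType()),
    StructField("product_id", LongType()),
    StructField("category_id", LongType()),
    StructField("category_code", StringType()),
    StructField("brand", StringType()),
    StructField("price", DoubleType()),
    StructField("user_id", LongType()),
    StructField("user_session", StringType()),
])

df = spark.read.option("header", "true").schema(schema).csv("2019-Oct.csv")
```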
| Challenge | Solution |
|---|---|
| Memory Issues | Adopted PySpark for distributed processing |
| Missing Data | Implemented UDFs with fallback strategies |
| Performance | Used Spark's lazy evaluation and caching |
| Visualization | Aggregated data before plotting |
| Deployment | Streamlit for interactive dashboard |
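For the caching row, the usual pattern is to materialize a DataFrame that several aggregations share so Spark computes it only once; a minimal sketch:

```python
from pyspark.sql.functions import col

# Cache the filtered events once; later aggregations reuse the cached data
events = df.filter(col("event_type").isin("view", "cart", "purchase")).cache()

views_per_brand = events.filter(col("event_type") == "view").groupBy("brand").count()
purchases_per_brand = events.filter(col("event_type") == "purchase").groupBy("brand").count()
```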
- User Segmentation: K-means clustering for behavior analysis
- Real-time Analytics: Kafka + Spark Streaming integration
- Sequential Mining: PrefixSpan for navigation pattern analysis
- Advanced ML: XGBoost and deep learning models
- Geographic Analysis: Location-based insights
- Cohort Analysis: User retention tracking
- A/B Testing: Feature impact measurement
- Recommendation Engine: Collaborative filtering
- Anomaly Detection: Fraud and bot detection
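As a starting point for the user-segmentation item, Spark MLlib's `KMeans` would fit the existing stack; the sketch below assumes a hypothetical `user_stats` DataFrame with per-user `views`, `carts`, and `purchases` counts:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Assemble per-user behavior counts into a feature vector
assembler = VectorAssembler(inputCols=["views", "carts", "purchases"],
                            outputCol="features")
user_features = assembler.transform(user_stats)

# Cluster users into behavior-based segments
kmeans = KMeans(k=4, seed=42, featuresCol="features", predictionCol="segment")
segments = kmeans.fit(user_features).transform(user_features)
```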