End-to-end analytics portfolio project β transforming 99k Brazilian e-commerce orders into actionable business intelligence using Python, Pandas, Plotly, and Streamlit.
πΈ Dashboard Preview - Streamlit app -> https://ecommerceanalytics-fmxbstdwrxkesvrzmc8u2b.streamlit.app/
The dashboard ships with 7 fully interactive pages:
| Page | What you'll find |
|---|---|
| π Overview | KPI scorecards + revenue trend + segment distribution + 6 key insights |
| π Revenue & Trends | Monthly revenue/orders, hourly & day-of-week heatmaps, payment mix |
| π¦ Product Categories | Top 15 categories by revenue & AOV, bubble chart |
| π Delivery Performance | Status breakdown, review scores per status, violin plot, star distribution |
| πΊοΈ Geographic Analysis | State-level revenue bar chart, share pie, sortable data table |
| π― RFM Segmentation | Customer scatter map, treemap, segment deep-dive table |
| π Cohort Analysis | Retention heatmap, average retention curve |
eCommerceAnalytics/
β
βββ app.py # Streamlit dashboard (entry point)
β
βββ data/ # Raw Olist CSV files (not committed β see below)
β βββ olist_customers_dataset.csv
β βββ olist_orders_dataset.csv
β βββ olist_order_items_dataset.csv
β βββ olist_order_payments_dataset.csv
β βββ olist_order_reviews_dataset.csv
β βββ olist_products_dataset.csv
β βββ olist_sellers_dataset.csv
β βββ olist_geolocation_dataset.csv
β βββ product_category_name_translation.csv
β
βββ processed/ # Pre-computed Parquet files (output of notebooks)
β βββ master.parquet # 99,441 orders β all tables joined + features
β βββ delivered.parquet # 96,478 delivered orders only
β βββ rfm.parquet # 93,358 customers with RFM scores & segments
β
βββ data_setup.ipynb # ETL: load, clean, join, feature engineer
βββ eda.ipynb # Exploratory data analysis
βββ RFM_seg_coh.ipynb # RFM scoring, segmentation & cohort analysis
β
βββ myvenv/ # Python virtual environment
Source: Olist Brazilian E-Commerce Public Dataset (Kaggle)
| Table | Rows | Description |
|---|---|---|
| Orders | 99,441 | Transaction header β timestamps, status, delivery dates |
| Order Items | 112,650 | Line items β products, prices, freight |
| Payments | 103,886 | Payment method and value (multiple per order) |
| Reviews | 99,224 | Customer ratings (1β5 stars) |
| Customers | 99,441 | Demographics β state, city |
| Products | 32,951 | Catalog β categories, dimensions, weight |
| Sellers | 3,095 | Seller information |
| Geolocation | 1,000,163 | Lat/long mapping for CEP codes |
Coverage: September 2016 β October 2018 Β· R$ 13.6M total revenue Β· 96k unique customers
Late deliveries average 1.86 β versus 4.31 β for early deliveries β a 2.5Γ gap. With only 4.9% of orders arriving late, logistics improvements have an outsized impact on platform reputation.
SΓ£o Paulo + Rio de Janeiro generate 54% of total revenue. States like Minas Gerais, Bahia, and ParanΓ‘ are underserved relative to their population β clear expansion targets.
Month 1 retention averages just ~10%. Typical of a marketplace (vs. subscription), but even a 5-point improvement would materially lift customer lifetime value.
November 2017 (Black Friday) was the single strongest month β 3Γ the average. Q4 logistics planning is critical to avoid late deliveries exactly when acquisition is highest.
Only 7% of customers qualify as Champions, yet they generate 13% of revenue. Retention spend on this cohort delivers the highest marginal ROI.
Peak purchasing occurs MondayβTuesday at 20:00β21:00 local time. Email and push campaigns timed to this window will see higher conversion rates.
- Loaded and audited 8 source tables with 1M+ rows
- Aggregated payments (multi-payment orders β single row)
- Deduplicated reviews (kept most recent per order)
- Engineered 7 analytical features:
delivery_days,delivery_delta,delivery_status,purchase_month/year/dayofweek/hour,main_category - Persisted to Parquet for fast downstream reads
- Revenue trend analysis with dual-axis monthly view
- Category performance matrix (revenue, volume, AOV)
- Delivery status impact on customer ratings
- State-level geographic breakdown
- Purchase timing patterns (hourly + day-of-week)
- Scored 93k customers on Recency, Frequency, Monetary (quintile 1β5 scale)
- Classified customers into 8 actionable segments using RFM thresholds
- Built 23 Γ 20 cohort retention matrix with month-over-month retention rates
- Python 3.10+
- The Olist dataset downloaded from Kaggle into
data/
# Clone the repository
git clone https://github.com/YOUR_USERNAME/eCommerceAnalytics.git
cd eCommerceAnalytics
# Create and activate virtual environment
python -m venv myvenv
source myvenv/bin/activate # macOS/Linux
myvenv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txtjupyter notebook data_setup.ipynb # Step 1 β ETL
jupyter notebook eda.ipynb # Step 2 β EDA
jupyter notebook RFM_seg_coh.ipynb # Step 3 β Segmentationstreamlit run app.pyThen open http://localhost:8501 in your browser.
pandas
numpy
plotly
streamlit
pyarrow # Parquet I/O
jupyter
This project was built as a data analytics portfolio piece to demonstrate:
- Data Engineering β joining and cleaning multi-table relational datasets at scale
- Exploratory Analysis β deriving business insights from raw transactional data
- Customer Analytics β RFM segmentation and cohort retention modelling
- Data Visualisation β interactive Plotly charts with a production-quality Streamlit UI
- Storytelling β translating analysis into prioritised, actionable business recommendations
Feel free to connect or reach out if you have questions about the methodology or want to discuss the findings.





