Sauti East Africa requested a segmentation analysis of all of its users' behavior. Sauti wishes to optimize its menu design and explore the feasibility of smart menus based on predicted user behavior.

Jchow2/python-market-segmentation-analysis

Market Segmentation Clustering Analysis

A non-profit social enterprise is focused on improving the livelihoods of traders and farmers, particularly women, in East Africa. It provides them with real-time market data through online digital resources. It also collects demographic data on the traders who use these solutions and develops visuals for researchers.


📖 Project Description

This project performs a limited aggregate analysis of the behavior of all of the platform's users, on a project-evaluation basis. A better understanding of user interactions will support optimizing the menu design and exploring the feasibility of smart menus based on predicted user behavior.

Cluster Segmentation Model

A clustering segmentation model was used in this project to identify distinct user segments based on demographic data and interaction behavior sourced from the non-profit's platform. The platform offers a range of information services to users in Kenya, Uganda, Rwanda, and Tanzania via a cellular network. Users access these services by dialing a shortcode and navigating through numbered menus. The platform, available in multiple languages, updates hourly with current information covering:

  • Market Prices
  • Virtual Marketplace
  • Currency Exchange Rates
  • Weather Forecasts
  • Trade and Tax Information
  • Financial Management Services
  • Agricultural Services
  • Business Operations Information
  • Legal and Anti-Corruption Information
  • COVID-19 Updates
  • Health Information
  • Corruption Reporting

This project's goal is to bridge information gaps for micro, small, and medium enterprises (MSMEs) by enhancing access to timely information. A clustering segmentation model can help teams better understand their diverse user base and tailor services to the specific needs of different market segments.

Methodology

  1. Data Transformation:

    • Sentence Embeddings: User demographic data is converted into text representations that the Sentence Transformer model can encode.
    • The following features are included in the text representation:
      • Age
      • Border
      • Occupation
      • Gender
      • Education
      • Crossing Frequency
      • Produce
      • Commodity Product
      • Commodity Market
      • Language
      • Procedure Destination
      • Country Code
  2. Embedding Generation:

    • Sentence Transformer Model: The text representations are encoded into dense numerical embeddings using the sentence-transformers/paraphrase-MiniLM-L3-v2 model.
      • This transforms the categorical demographic data into a high-dimensional numeric format suitable for clustering.
      • Optimal Batch Size: Embeddings are generated with an optimal batch size to ensure efficient processing.
  3. Clustering Algorithm:

    • KMeans Clustering: The KMeans algorithm is applied to the sentence embeddings to identify distinct user segments.
    • Number of Clusters: The optimal number of clusters is determined through experimentation, aiming to balance within-cluster similarity and between-cluster distinctiveness.
  4. Cluster Assignment:

    • Each user is assigned to one of the identified clusters based on the similarity of their embeddings to the cluster centroids.
    • The cluster labels are added to the original dataset for further analysis and visualization.
  5. Dimensionality Reduction and Visualization:

    • PCA (Principal Component Analysis): Reduces the dimensionality of the data to two principal components for initial visualization and interpretation.
    • t-SNE (t-Distributed Stochastic Neighbor Embedding): Provides a non-linear dimensionality reduction to visualize local structures and relationships in the data.
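
Steps 1 to 4 above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the dictionary keys in `FEATURES`, the helper names `to_text`, `embed`, and `choose_k`, and the use of the silhouette score to pick the number of clusters are all assumptions (the README only says the number was chosen through experimentation). The `embed` function needs the sentence-transformers package and a model download, so the demonstration at the bottom runs KMeans on synthetic stand-in vectors instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical field names mirroring the feature list in step 1.
FEATURES = ["age", "border", "occupation", "gender", "education",
            "crossing_frequency", "produce", "commodity_product",
            "commodity_market", "language", "procedure_destination",
            "country_code"]

def to_text(user):
    """Join the available demographic fields into one sentence-like string."""
    parts = [f"{name.replace('_', ' ')}: {user[name]}"
             for name in FEATURES if user.get(name) not in (None, "")]
    return ", ".join(parts)

def embed(texts, batch_size=64):
    """Encode texts with the model named in step 2 (not called in this demo)."""
    from sentence_transformers import SentenceTransformer  # lazy import
    model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L3-v2")
    return model.encode(texts, batch_size=batch_size, show_progress_bar=False)

def choose_k(embeddings, candidates=range(2, 7), seed=0):
    """Fit KMeans for each candidate k and keep the best silhouette score."""
    best = None
    for k in candidates:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best

# Made-up example users; missing fields are simply skipped.
users = [
    {"age": "30-40", "gender": "Female", "occupation": "Trader", "language": "Swahili"},
    {"age": "20-30", "gender": "Male", "occupation": "Farmer", "country_code": "KEN"},
]
texts = [to_text(u) for u in users]
print(texts[0])

# Synthetic stand-in for real embeddings: three well-separated groups.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 32))
stand_in = np.vstack([c + rng.normal(size=(40, 32)) for c in centers])

k, score, labels = choose_k(stand_in)
print(k, np.bincount(labels))
```

With real data, the output of `embed(texts)` would take the place of `stand_in`, and `labels` could be written back to the dataset as a new column, as described in step 4.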

🌟 Requirements

Prerequisites

  • Python 3.7 or higher
  • Pandas
  • NumPy
  • Scikit-learn
  • Matplotlib
  • Datetime
  • Plotly
  • SHAP
pandas==1.2.3
matplotlib==3.3.4
seaborn==0.11.1
geopy==2.1.0
pydrive==1.3.1
plotly==4.14.3
scipy==1.6.0
numpy==1.19.5

⚙️ Installation

  1. Clone the repository:
git clone https://github.com/Jchow2/python-market-segmentation-analysis.git
  2. Change to the project directory:
cd python-market-segmentation-analysis
  3. Install the required Python packages:
pip install -r requirements.txt

Procedure

This market segmentation analysis was run with the scripts outlined below:

  1. Data Preparation: Prepare and clean the data using the data_preparation.py script:
python src/data_preparation.py
  2. Feature Engineering: Perform feature engineering and transformation using the feature_engineering.py script:
python src/feature_engineering.py
  3. Clustering Analysis: Run the clustering analysis using the clustering_analysis.py script:
python src/clustering_analysis.py
  4. Cluster Visualizations: Visualize the clustering results using the pca_visualization.py and tsne_visualization.py scripts:
python src/pca_visualization.py
python src/tsne_visualization.py
  5. Classification Model Evaluation: Evaluate the classification model using the classification_model.py script:
python src/classification_model.py

The results, including visualizations and data analysis, were generated and saved in this GitHub repository.


Project Structure

/project-root
    ├── notebooks                # Jupyter notebooks for exploration and analysis
    │   ├── sauti-exploratory-data-analysis.ipynb  # Notebook for exploratory data analysis
    ├── src                      # Source code for the project
    │   ├── data_preparation.py  # Script for data preparation and cleaning
    │   ├── feature_engineering.py # Script for feature engineering and transformation
    │   ├── clustering_analysis.py # Script for clustering analysis
    │   ├── classification_model.py # Script for classification model evaluation
    │   ├── pca_visualization.py  # Script for PCA visualization
    │   ├── tsne_visualization.py  # Script for t-SNE visualization
    │   ├── utils.py             # Utility functions for the project
    ├── tests                    # Directory for test scripts
    │   ├── test_data_preparation.py # Test script for data preparation
    │   ├── test_feature_engineering.py # Test script for feature engineering
    │   ├── test_clustering_analysis.py # Test script for clustering analysis
    │   ├── test_classification_model.py # Test script for classification model
    ├── results                  # Directory to save results and visualizations
    ├── README.md                # Project README file
    ├── requirements.txt         # List of dependencies

Result

PCA of Clusters

PCA of clusters

The PCA plot shows how the data points (traders and farmers) are distributed in the new coordinate system. Clusters in the PCA plot indicate groups of data points that are similar to each other. The separation between clusters suggests distinct segments within the data.

  • Clusters: There are two distinct clusters of data points, one colored blue and labeled "second" and the other colored green and labeled "first".
  • Separation: The clusters are well-separated, indicating that the PCA has effectively reduced the dimensionality of the data while preserving the separation between the clusters.
  • Cluster Density: Both clusters are densely packed, indicating that the data points within each cluster are similar to each other.
  • Cluster Size: The blue cluster appears to be larger than the green cluster, suggesting that there are more data points in the "second" group compared to the "first" group.
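
For readers who want to reproduce this kind of view, the following is a minimal sketch of a two-component PCA projection of clustered data. It runs on synthetic vectors, not the project's embeddings; the group sizes, dimensions, and variable names are assumptions chosen only to mimic two clusters of unequal size.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the embeddings: two groups of unequal size.
group_a = rng.normal(loc=0.0, size=(80, 32))
group_b = rng.normal(loc=4.0, size=(40, 32))
embeddings = np.vstack([group_a, group_b])

# Cluster in the original space, then project to 2-D only for plotting.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(embeddings)

print(coords.shape)                   # one (x, y) pair per data point
print(pca.explained_variance_ratio_)  # share of variance kept by each component
print(np.bincount(labels))            # number of points in each cluster
```

The `coords` array and `labels` can be passed to a scatter plot (for example Matplotlib's `scatter` with `c=labels`). The explained variance ratio is worth checking before reading much into visual separation, since a 2-D projection can hide or exaggerate structure present in the full space.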

t-SNE with Different Perplexity and Learning Rates

t-SNE with Different Perplexity and Learning Rates

  • At lower perplexity values (5), the clusters are more distinct and well-separated, especially at lower learning rates (10 and 100). However, as the learning rate increases, the clusters become more elongated and less distinct.
  • At higher perplexity values (30 and 50), the clusters are less distinct and more scattered, especially at higher learning rates (100 and 200).
  • Lower learning rates (10) tend to produce more compact and well-defined clusters across all perplexity values.
  • Higher learning rates (200) tend to produce more scattered and less distinct clusters, indicating that the learning rate might be too high for effective clustering.

Based on the t-SNE training set:

  • Lower perplexity values (5) seem to produce more meaningful and distinct clusters, especially at lower learning rates.
  • Higher perplexity values (30 and 50) result in more scattered and less distinct clusters, suggesting that these values might be too high for the given data.
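
A comparison like the one above can be produced with a small grid search over the two t-SNE settings. The sketch below uses the perplexity and learning-rate values mentioned in this section, but it runs on synthetic data, and the variable names and the choice to record the final KL divergence are assumptions, not the project's code.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in data: two groups of 60 points in 16 dimensions.
data = np.vstack([
    rng.normal(loc=0.0, size=(60, 16)),
    rng.normal(loc=4.0, size=(60, 16)),
])

perplexities = [5.0, 30.0, 50.0]      # must stay below the number of samples
learning_rates = [10.0, 100.0, 200.0]

results = {}
for perplexity in perplexities:
    for learning_rate in learning_rates:
        tsne = TSNE(n_components=2, perplexity=perplexity,
                    learning_rate=learning_rate, init="pca", random_state=0)
        coords = tsne.fit_transform(data)
        # kl_divergence_ is the final value of the t-SNE objective.
        results[(perplexity, learning_rate)] = (coords, float(tsne.kl_divergence_))

for (perplexity, learning_rate), (coords, kl) in results.items():
    print(perplexity, learning_rate, coords.shape, round(kl, 3))
```

Each `coords` array can be drawn in its own subplot to compare layouts side by side. KL divergence values are not directly comparable across different perplexities, so visual inspection of the panels remains the main tool here.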

t-SNE of Clusters

t-SNE of clusters

In this project, we utilize t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize clusters in a lower-dimensional space, which is enhanced by first applying Principal Component Analysis (PCA) to reduce dimensionality. This combination leverages PCA to preserve global structure and t-SNE to capture local relationships, providing a detailed visualization of similar data point groups.

  • Clusters: The t-SNE plot shows distinct clusters, indicating clear groupings within the data.
  • Cluster Separation: The clear separation suggests that t-SNE has effectively identified distinct groups, crucial for understanding different segments of traders and farmers.
  • Cluster Density: Densely packed clusters indicate that data points within each cluster share similar characteristics or behaviors.
  • Cluster Size: Varying cluster sizes suggest differences in the number of data points within each group, providing insights into the distribution of traders and farmers. Larger clusters may indicate more prevalent segments.
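
The PCA-then-t-SNE combination described above can be sketched as a two-stage reduction. This is a hedged illustration on synthetic data: the intermediate size of 20 components and the perplexity of 30 are arbitrary assumptions, not settings confirmed by the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for the embeddings: two groups in 64 dimensions.
embeddings = np.vstack([
    rng.normal(loc=0.0, size=(60, 64)),
    rng.normal(loc=3.0, size=(60, 64)),
])

# Stage 1: PCA keeps the directions with the most variance
# (20 components is an arbitrary choice for this sketch).
reduced = PCA(n_components=20, random_state=0).fit_transform(embeddings)

# Stage 2: t-SNE maps the PCA output to two dimensions for plotting.
coords = TSNE(n_components=2, perplexity=30.0, init="pca",
              random_state=0).fit_transform(reduced)

print(reduced.shape, coords.shape)
```

Running PCA first also reduces noise and speeds up t-SNE on wide inputs. Distances between far-apart groups in the final t-SNE plot should be read with caution, since t-SNE mainly preserves local neighborhoods.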

License

This project is licensed under the MIT License.

License & Usage Disclaimer

This repository's code is licensed under the MIT License. This license applies only to the software and methodology contained within the code files. The project utilized historical data provided for a competition and is intended strictly for portfolio demonstration and educational purposes. The MIT License does not extend to the competition dataset or grant rights to any proprietary data or content belonging to the non-profit or partnering organizations.

Acknowledgements

We would like to thank the following individuals and organizations for their support and contributions to this project:

  • Contributors: Justin Chow, Kyle Kehoe, Anoushka Nayah, Priyansha Rastogi
  • Organizations: Sauti East Africa, Miller Center of Social Entrepreneurship, Santa Clara University
  • Tools and Libraries: We are grateful to the Sauti developers of the Sauti Platform Demonstration that made this project possible: Interactive Demo
  • Mentors and Advisors: Special thanks to Lance Hadley, CEO of Sauti East Africa, for his guidance and advice.
  • Community: We appreciate the support and feedback from the Santa Clara University MS of Business Analytics community.
