Market Segmentation Clustering Analysis

A non-profit social enterprise focuses on improving the livelihoods of traders and farmers, particularly women, in East Africa. It provides them with real-time market data through online digital resources, collects demographic data on these traders, and develops visuals for researchers.

📖 Project Description

This project performs a limited aggregate analysis of user behavior on the platform, on a project evaluation basis. A better understanding of user interactions will support optimizing the menu design and exploring the feasibility of smart menus based on predicted user behavior.

Cluster Segmentation Model

A clustering segmentation model was used in this project to identify distinct user segments based on demographic data and interaction behavior sourced from the non-profit's platform. The platform offers a range of information services to users in Kenya, Uganda, Rwanda, and Tanzania via a cellular network. Users access these services by dialing a shortcode and navigating through numbered menus. The platform, available in multiple languages, updates hourly with current information covering:

Market Prices
Virtual Marketplace
Currency Exchange Rates
Weather Forecasts
Trade and Tax Information
Financial Management Services
Agricultural Services
Business Operations Information
Legal and Anti-Corruption Information
COVID-19 Updates
Health Information
Corruption Reporting

This project's goal is to bridge information gaps for micro, small, and medium enterprises (MSMEs), enhancing access to timely information. A clustering segmentation model allows teams to better understand their diverse user base and tailor services to the specific needs of different market segments.

Methodology

Data Transformation:
- Sentence Embeddings: User demographic data is first converted into text representations that the Sentence Transformer model can encode.
- The following features are included in the text representation:
  - Age
  - Border
  - Occupation
  - Gender
  - Education
  - Crossing Frequency
  - Produce
  - Commodity Product
  - Commodity Market
  - Language
  - Procedure Destination
  - Country Code
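
The text-representation step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names in `FEATURES`, the helper names `row_to_text` and `build_text_representations`, the "name: value" sentence format, and the toy rows are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical column names mirroring the feature list above;
# the real dataset's column names may differ.
FEATURES = [
    "age", "border", "occupation", "gender", "education",
    "crossing_frequency", "produce", "commodity_product",
    "commodity_market", "language", "procedure_destination", "country_code",
]


def row_to_text(row: pd.Series) -> str:
    """Join the non-missing features of one user into a single string."""
    parts = []
    for col in FEATURES:
        value = row.get(col)  # None if the column is absent
        if pd.notna(value):
            parts.append(f"{col.replace('_', ' ')}: {value}")
    return ", ".join(parts)


def build_text_representations(df: pd.DataFrame) -> pd.Series:
    """Return one text representation per user (one per row)."""
    return df.apply(row_to_text, axis=1)


# Toy rows for illustration only (not taken from the project dataset).
users = pd.DataFrame([
    {"age": "30-40", "gender": "Female", "occupation": "Trader",
     "language": "Swahili", "country_code": "KE"},
    {"age": "20-30", "gender": "Male", "occupation": "Farmer",
     "language": None, "country_code": "UG"},
])
texts = build_text_representations(users)
```

Skipping missing values keeps empty fields from adding noise to the text that is later embedded; whether the project handles missing values this way is not stated in this README.
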
Embedding Generation:
- Sentence Transformer Model: The text representations are encoded into dense numerical embeddings using the sentence-transformers/paraphrase-MiniLM-L3-v2 model.
  - Transforms the categorical demographic data into a high-dimensional numeric format suitable for clustering.
  - Batch Size: Embeddings are generated in batches, with the batch size chosen for efficient processing.
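
A sketch of the embedding step, assuming the `sentence-transformers` package is installed. The model name comes from this README; the helper names `load_encoder` and `embed_texts` and the batch size of 64 are assumptions for illustration, since the batch size actually used is not stated here.

```python
import numpy as np

# Model name as given in this README.
MODEL_NAME = "sentence-transformers/paraphrase-MiniLM-L3-v2"


def load_encoder(model_name: str = MODEL_NAME):
    """Load the Sentence Transformer model (downloads weights on first use)."""
    # Imported lazily so the rest of the module works without the package.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)


def embed_texts(texts, encoder, batch_size: int = 64) -> np.ndarray:
    """Encode texts into a 2-D array of shape (n_texts, embedding_dim)."""
    embeddings = encoder.encode(
        list(texts),
        batch_size=batch_size,      # assumed value; tune to available memory
        show_progress_bar=False,
        convert_to_numpy=True,
    )
    return np.asarray(embeddings)
```

With the package installed, `embeddings = embed_texts(texts, load_encoder())` would return one dense vector per text representation.
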
Clustering Algorithm:
- KMeans Clustering: The KMeans algorithm is applied to the sentence embeddings to identify distinct user segments.
- Number of Clusters: The optimal number of clusters is determined through experimentation, aiming to balance within-cluster similarity and between-cluster distinctiveness.
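
One way to run the experimentation described above is to fit KMeans for several candidate cluster counts and compare a compactness measure (inertia) with a separation measure (silhouette score). The README does not say which criterion the project used, so the silhouette-based choice, the candidate range, and the helper names below are assumptions; the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def score_cluster_counts(embeddings, candidate_ks=range(2, 7), random_state=42):
    """Fit KMeans for each candidate k and record inertia and silhouette."""
    scores = {}
    for k in candidate_ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = km.fit_predict(embeddings)
        scores[k] = {
            "inertia": float(km.inertia_),  # within-cluster sum of squares
            "silhouette": float(silhouette_score(embeddings, labels)),
        }
    return scores


def pick_k(scores):
    """Pick the k with the highest silhouette score (an assumed criterion)."""
    return max(scores, key=lambda k: scores[k]["silhouette"])


# Synthetic stand-in for the sentence embeddings: two well-separated blobs.
rng = np.random.default_rng(0)
demo_embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 8)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 8)),
])
scores = score_cluster_counts(demo_embeddings)
best_k = pick_k(scores)
```

Inertia always decreases as k grows, so it is usually read as an elbow curve, while the silhouette score peaks where clusters are both tight and well separated.
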
Cluster Assignment:
- Each user is assigned to one of the identified clusters based on the similarity of their embeddings to the cluster centroids.
- The cluster labels are added to the original dataset for further analysis and visualization.
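
The assignment step can be sketched like this. The function name `assign_clusters`, the `cluster` and `user_id` column names, and the synthetic data are assumptions for illustration; the final lines simply verify that each label corresponds to the nearest centroid.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def assign_clusters(df, embeddings, n_clusters, random_state=42):
    """Fit KMeans on the embeddings and attach labels to a copy of df."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(embeddings)
    out = df.copy()
    out["cluster"] = labels  # hypothetical column name
    return out, km


# Synthetic stand-in for the embeddings: two well-separated blobs.
rng = np.random.default_rng(1)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 8)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 8)),
])
users = pd.DataFrame({"user_id": range(len(embeddings))})
labeled, model = assign_clusters(users, embeddings, n_clusters=2)

# Each user's label is the index of the closest centroid (Euclidean distance).
distances = np.linalg.norm(
    embeddings[:, None, :] - model.cluster_centers_[None, :, :], axis=2
)
nearest = distances.argmin(axis=1)
```

Working on a copy leaves the original dataset untouched, and `labeled.groupby("cluster").size()` then gives the segment sizes for further analysis.
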
Dimensionality Reduction and Visualization:
- PCA (Principal Component Analysis): Reduces the dimensionality of the data to two principal components for initial visualization and interpretation.
- t-SNE (t-Distributed Stochastic Neighbor Embedding): Provides a non-linear dimensionality reduction to visualize local structures and relationships in the data.
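
Both reductions are available in scikit-learn. The sketch below projects synthetic stand-in embeddings to two dimensions with each method; the helper names, the perplexity of 30, and the random seed are assumptions, not the project's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def pca_2d(embeddings, random_state=42):
    """Linear projection onto the first two principal components."""
    pca = PCA(n_components=2, random_state=random_state)
    coords = pca.fit_transform(embeddings)
    return coords, pca.explained_variance_ratio_


def tsne_2d(embeddings, perplexity=30.0, random_state=42):
    """Non-linear 2-D embedding that emphasizes local neighborhoods."""
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,  # must be smaller than the number of samples
        init="pca",
        random_state=random_state,
    )
    return tsne.fit_transform(embeddings)


# Synthetic stand-in for the sentence embeddings.
rng = np.random.default_rng(2)
demo_embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 8)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 8)),
])
pca_coords, variance_ratio = pca_2d(demo_embeddings)
tsne_coords = tsne_2d(demo_embeddings)
```

The explained variance ratio shows how much of the original variance the two PCA axes retain; t-SNE has no comparable measure, so its axes should be read only for relative neighborhood structure.
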

🌟 Requirements

Prerequisites

Python 3.7 or higher
Pandas
NumPy
Scikit-learn
Matplotlib
Datetime (Python standard library)
Plotly
SHAP

pandas==1.2.3
matplotlib==3.3.4
seaborn==0.11.1
geopy==2.1.0
pydrive==1.3.1
plotly==4.14.3
scipy==1.6.0
numpy==1.19.5

⚙️ Installation

Clone the repository:

git clone https://github.com/Jchow2/python-market-segmentation-analysis.git

Change to the project directory:

cd python-market-segmentation-analysis

Install the required Python packages:

pip install -r requirements.txt

Procedure

This market segmentation analysis was run with the scripts outlined below:

Data Preparation: Prepare and clean the data using the data_preparation.py script:

python src/data_preparation.py

Feature Engineering: Perform feature engineering and transformation using the feature_engineering.py script:

python src/feature_engineering.py

Clustering Analysis: Run the clustering analysis using the clustering_analysis.py script:

python src/clustering_analysis.py

Cluster Visualizations: Visualize the clustering results using the pca_visualization.py and tsne_visualization.py scripts:

python src/pca_visualization.py
python src/tsne_visualization.py

Classification Model Evaluation: Evaluate the classification model using the classification_model.py script:

python src/classification_model.py

The results, including visualizations and data analysis, were generated and saved in this GitHub repository.

Project Structure

/project-root
    ├── notebooks                # Jupyter notebooks for exploration and analysis
    │   ├── sauti-exploratory-data-analysis.ipynb  # Notebook for exploratory data analysis
    ├── src                      # Source code for the project
    │   ├── data_preparation.py  # Script for data preparation and cleaning
    │   ├── feature_engineering.py # Script for feature engineering and transformation
    │   ├── clustering_analysis.py # Script for clustering analysis
    │   ├── classification_model.py # Script for classification model evaluation
    │   ├── pca_visualization.py  # Script for PCA visualization
    │   ├── tsne_visualization.py  # Script for t-SNE visualization
    │   ├── utils.py             # Utility functions for the project
    ├── tests                    # Directory for test scripts
    │   ├── test_data_preparation.py # Test script for data preparation
    │   ├── test_feature_engineering.py # Test script for feature engineering
    │   ├── test_clustering_analysis.py # Test script for clustering analysis
    │   ├── test_classification_model.py # Test script for classification model
    ├── results                  # Directory to save results and visualizations
    ├── README.md                # Project README file
    ├── requirements.txt         # List of dependencies

Result

PCA of Clusters

The PCA plot shows how the data points (traders and farmers) are distributed in the new coordinate system. Clusters in the PCA plot indicate groups of data points that are similar to each other. The separation between clusters suggests distinct segments within the data.

Clusters: There are two distinct clusters of data points, one colored blue and labeled "second" and the other colored green and labeled "first".
Separation: The clusters are well-separated, indicating that the PCA has effectively reduced the dimensionality of the data while preserving the separation between the clusters.
Cluster Density: Both clusters are densely packed, indicating that the data points within each cluster are similar to each other.
Cluster Size: The blue cluster appears to be larger than the green cluster, suggesting that there are more data points in the "second" group compared to the "first" group.

t-SNE with Different Perplexity and Learning Rates

At lower perplexity values (5), the clusters are more distinct and well-separated, especially at lower learning rates (10 and 100). As the learning rate increases, however, the clusters become more elongated and less distinct.
At higher perplexity values (30 and 50), the clusters are less distinct and more scattered, especially at higher learning rates (100 and 200).
Lower learning rates (10) tend to produce more compact and well-defined clusters across all perplexity values.
Higher learning rates (200) tend to produce more scattered and less distinct clusters, indicating that the learning rate might be too high for this data.

Based on the t-SNE training set:

Lower perplexity values (5) seem to produce more meaningful and distinct clusters, especially at lower learning rates.
Higher perplexity values (30 and 50) result in more scattered and less distinct clusters, suggesting that these values might be too high for the given data.
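
A comparison like the one above can be produced by fitting t-SNE once per combination of perplexity and learning rate. The sketch below uses the values discussed in this section (perplexity 5, 30, 50 and learning rate 10, 100, 200); the function name `tsne_grid`, the PCA initialization, the seed, and the synthetic data are assumptions.

```python
from itertools import product

import numpy as np
from sklearn.manifold import TSNE

# Values discussed in this section.
PERPLEXITIES = (5.0, 30.0, 50.0)
LEARNING_RATES = (10.0, 100.0, 200.0)


def tsne_grid(embeddings, perplexities=PERPLEXITIES,
              learning_rates=LEARNING_RATES, random_state=42):
    """Return {(perplexity, learning_rate): 2-D coordinates} for each setting."""
    results = {}
    for perplexity, learning_rate in product(perplexities, learning_rates):
        tsne = TSNE(
            n_components=2,
            perplexity=perplexity,  # must be smaller than the number of samples
            learning_rate=learning_rate,
            init="pca",
            random_state=random_state,
        )
        results[(perplexity, learning_rate)] = tsne.fit_transform(embeddings)
    return results


# Synthetic stand-in for the sentence embeddings (60 samples, so that a
# perplexity of 50 is still valid).
rng = np.random.default_rng(3)
demo_embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(30, 8)),
    rng.normal(loc=5.0, scale=0.3, size=(30, 8)),
])
grid = tsne_grid(demo_embeddings)
```

Each entry of `grid` can be drawn in its own subplot, colored by cluster label, to build a 3 x 3 comparison figure. Fixing `random_state` keeps the panels comparable, since t-SNE results otherwise vary between runs.
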

t-SNE of Clusters

In this project, we utilize t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize clusters in a lower-dimensional space, which is enhanced by first applying Principal Component Analysis (PCA) to reduce dimensionality. This combination leverages PCA to preserve global structure and t-SNE to capture local relationships, providing a detailed visualization of similar data point groups.
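
The PCA-then-t-SNE combination can be sketched as a two-stage function. The number of PCA components (50), the default perplexity and learning rate, and the function name `pca_then_tsne` are assumptions for illustration; the README does not state the values the project used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def pca_then_tsne(embeddings, n_pca_components=50, perplexity=5.0,
                  learning_rate=10.0, random_state=42):
    """Reduce with PCA first, then embed the result in 2-D with t-SNE."""
    # PCA cannot keep more components than samples or features.
    n_components = min(n_pca_components, embeddings.shape[0], embeddings.shape[1])
    reduced = PCA(
        n_components=n_components, random_state=random_state
    ).fit_transform(embeddings)
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        learning_rate=learning_rate,
        init="pca",
        random_state=random_state,
    )
    return tsne.fit_transform(reduced)


# Synthetic stand-in for the sentence embeddings.
rng = np.random.default_rng(4)
demo_embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(30, 8)),
    rng.normal(loc=5.0, scale=0.3, size=(30, 8)),
])
coords = pca_then_tsne(demo_embeddings)
```

Running PCA first removes low-variance directions and shortens the t-SNE computation on high-dimensional embeddings.
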

Clusters: The t-SNE plot shows distinct clusters, indicating clear groupings within the data.
Cluster Separation: The clear separation suggests that t-SNE has effectively identified distinct groups, crucial for understanding different segments of traders and farmers.
Cluster Density: Densely packed clusters indicate that data points within each cluster share similar characteristics or behaviors.
Cluster Size: Varying cluster sizes suggest differences in the number of data points within each group, providing insights into the distribution of traders and farmers. Larger clusters may indicate more prevalent segments.

License

This project is licensed under the MIT License.

License & Usage Disclaimer: This repository's code is licensed under the MIT License. This license applies only to the software and methodology contained within the code files. The project utilized historical data provided for a competition and is intended strictly for portfolio demonstration and educational purposes. The MIT License does not extend to the competition dataset or grant rights to any proprietary data or content belonging to the non-profit or partnering organizations.

Acknowledgements

We would like to thank the following individuals and organizations for their support and contributions to this project:

Contributors: Justin Chow, Kyle Kehoe, Anoushka Nayah, Priyansha Rastogi
Organizations: Sauti East Africa, Miller Center of Social Entrepreneurship, Santa Clara University
Tools and Libraries: We are grateful to the Sauti developers for the Sauti Platform Demonstration (Interactive Demo), which made this project possible.
Mentors and Advisors: Special thanks to Lance Hadley, CEO of Sauti East Africa, for his guidance and advice.
Community: We appreciate the support and feedback from the Santa Clara University MS of Business Analytics community.
