A non-profit social enterprise is focused on improving the livelihoods of traders and farmers, particularly women, in East Africa. It provides them with real-time market data through online digital resources. It also collects demographic data on these traders and develops visuals for researchers.
- Project Description
- Requirements
- Installation
- Procedure
- Project Structure
- Results
- License
- Acknowledgements
This project performs limited aggregate analysis of user behavior on a project-evaluation basis. A better understanding of user interactions will allow the organization to optimize its menu design and explore the feasibility of smart menus based on predicted user behavior.
This project uses a clustering segmentation model to identify distinct user segments based on demographic data and interaction behavior sourced from the non-profit's platform. The platform offers a range of information services to users in Kenya, Uganda, Rwanda, and Tanzania via a cellular network. Users access these services by dialing a shortcode and navigating through numbered menus. The platform, available in multiple languages, updates hourly with current information covering:
- Market Prices
- Virtual Marketplace
- Currency Exchange Rates
- Weather Forecasts
- Trade and Tax Information
- Financial Management Services
- Agricultural Services
- Business Operations Information
- Legal and Anti-Corruption Information
- COVID-19 Updates
- Health Information
- Corruption Reporting
This project's goal is to bridge information gaps for micro, small, and medium enterprises (MSMEs) by improving access to timely information. A clustering segmentation model helps teams better understand their diverse user base and tailor services to the specific needs of different market segments.
- Data Transformation:
  - Sentence Embeddings: User demographic data is converted into text representations that the Sentence Transformer model can encode.
  - The following features are included in the text representation:
    - Age
    - Border
    - Occupation
    - Gender
    - Education
    - Crossing Frequency
    - Produce
    - Commodity Product
    - Commodity Market
    - Language
    - Procedure Destination
    - Country Code
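As an illustration of the text-representation step, the sketch below joins a user's demographic fields into a single sentence. The column names and the `name: value` template are assumptions made for this example; the project's actual field names and wording may differ.

```python
# Hypothetical column names mirroring the feature list above;
# the real dataset may label these fields differently.
FEATURES = [
    "age", "border", "occupation", "gender", "education",
    "crossing_frequency", "produce", "commodity_product",
    "commodity_market", "language", "procedure_destination", "country_code",
]


def profile_to_text(record, features=FEATURES):
    """Join the non-missing demographic fields into one 'name: value' sentence."""
    parts = []
    for name in features:
        value = record.get(name)
        if value is None or str(value).strip() == "":
            continue  # skip missing or empty fields
        label = name.replace("_", " ")
        parts.append(f"{label}: {value}")
    return ", ".join(parts)


# Made-up record used only to show the output format.
example = {
    "age": "30-40",
    "gender": "Female",
    "occupation": "Trader",
    "language": "Swahili",
    "education": "",
}
print(profile_to_text(example))
```

Fields are emitted in the order of `FEATURES`, so every user's sentence follows the same layout regardless of how the record was built.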
- Embedding Generation:
  - Sentence Transformer Model: The text representations are encoded into dense numerical embeddings using the `sentence-transformers/paraphrase-MiniLM-L3-v2` model. This transforms the categorical demographic data into a high-dimensional numeric format suitable for clustering.
  - Optimal Batch Size: Embeddings are generated in batches, with the batch size chosen to keep processing efficient.
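A minimal sketch of the encoding step, assuming the `sentence-transformers` package is installed. The function name `encode_profiles` and the batch size of 64 are illustrative choices, not values taken from the project.

```python
MODEL_NAME = "sentence-transformers/paraphrase-MiniLM-L3-v2"


def encode_profiles(texts, model=None, batch_size=64):
    """Encode profile sentences into dense vectors, batch_size sentences at a time.

    batch_size=64 is an assumed default; a suitable value depends on
    the available memory and hardware.
    """
    if model is None:
        # Imported lazily so the function can also be called with a stand-in model.
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer(MODEL_NAME)
    return model.encode(list(texts), batch_size=batch_size, show_progress_bar=False)
```

Calling `encode_profiles(sentences)` downloads the model on first use and returns one embedding per input sentence. Passing a preloaded `model` avoids reloading it on every call.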
- Clustering Algorithm:
  - KMeans Clustering: The KMeans algorithm is applied to the sentence embeddings to identify distinct user segments.
  - Number of Clusters: The optimal number of clusters is determined through experimentation, aiming to balance within-cluster similarity and between-cluster distinctiveness.
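The text does not say which criterion was used to choose the number of clusters. One common option, shown here purely as an assumption, is to fit KMeans for several candidate values of k and compare silhouette scores, which reward tight clusters that sit far from their neighbors. The synthetic blobs stand in for the real embeddings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def pick_k_by_silhouette(embeddings, candidate_ks=range(2, 7), seed=0):
    """Fit KMeans for each candidate k and return the k with the highest silhouette score."""
    scores = {}
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
        scores[k] = silhouette_score(embeddings, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for the sentence embeddings: three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
best_k, scores = pick_k_by_silhouette(X)
print(best_k, scores)
```

The candidate range `2..6` is arbitrary. On real embeddings the scores are usually closer together, so the choice often also depends on how interpretable the resulting segments are.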
- Cluster Assignment:
  - Each user is assigned to one of the identified clusters based on the similarity of their embeddings to the cluster centroids.
  - The cluster labels are added to the original dataset for further analysis and visualization.
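The assignment rule can be sketched without any library: each embedding is given the label of the closest centroid under squared Euclidean distance, which is the rule KMeans applies. The record layout and the `cluster` field name are assumptions for this example.

```python
def nearest_centroid(vector, centroids):
    """Return the index of the centroid with the smallest squared Euclidean distance."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)), key=lambda i: sq_dist(vector, centroids[i]))


def label_users(users, embeddings, centroids):
    """Return copies of the user records with a 'cluster' field added."""
    return [
        {**user, "cluster": nearest_centroid(vec, centroids)}
        for user, vec in zip(users, embeddings)
    ]


# Toy data: two centroids and two users, one near each centroid.
centroids = [[0.0, 0.0], [5.0, 5.0]]
users = [{"id": 1}, {"id": 2}]
embeddings = [[0.2, -0.1], [4.8, 5.3]]
labeled = label_users(users, embeddings, centroids)
print(labeled)
```

Returning copies leaves the original records untouched, so the labeled dataset can be written out separately for analysis and plotting.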
- Dimensionality Reduction and Visualization:
  - PCA (Principal Component Analysis): Reduces the dimensionality of the data to two principal components for initial visualization and interpretation.
  - t-SNE (t-Distributed Stochastic Neighbor Embedding): Provides a non-linear dimensionality reduction to visualize local structures and relationships in the data.
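A short sketch of the PCA projection using scikit-learn. Random vectors stand in for the real embeddings, and `project_2d` is a hypothetical helper name.

```python
import numpy as np
from sklearn.decomposition import PCA


def project_2d(embeddings, seed=0):
    """Project embeddings onto their first two principal components."""
    pca = PCA(n_components=2, random_state=seed)
    coords = pca.fit_transform(embeddings)
    # explained_variance_ratio_ shows how much variance each component keeps.
    return coords, pca.explained_variance_ratio_


# Random stand-in data: 100 "users" with 16-dimensional embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 16))
coords, variance_ratio = project_2d(fake_embeddings)
print(coords.shape, variance_ratio)
```

Checking the explained variance ratio is a quick way to judge how much of the original structure a two-dimensional plot can show.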
- Python 3.7 or higher
- Pandas
- NumPy
- Scikit-learn
- Matplotlib
- Datetime
- Plotly
- SHAP
pandas==1.2.3
matplotlib==3.3.4
seaborn==0.11.1
geopy==2.1.0
pydrive==1.3.1
plotly==4.14.3
scipy==1.6.0
numpy==1.19.5
- Clone the repository:
git clone https://github.com/Jchow2/python-market-segmentation-analysis.git
- Change to the project directory:
cd python-market-segmentation-analysis
- Install the required Python packages:
pip install -r requirements.txt
This market segmentation analysis was run with the scripts outlined below:
- Data Preparation: Prepare and clean the data using the `data_preparation.py` script:
python src/data_preparation.py
- Feature Engineering: Perform feature engineering and transformation using the `feature_engineering.py` script:
python src/feature_engineering.py
- Clustering Analysis: Run the clustering analysis using the `clustering_analysis.py` script:
python src/clustering_analysis.py
- Cluster Visualizations: Visualize the clustering results using the pca_visualization.py and tsne_visualization.py scripts:
python src/pca_visualization.py
python src/tsne_visualization.py
- Classification Model Evaluation: Evaluate the classification model using the `classification_model.py` script:
python src/classification_model.py
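For convenience, the steps above could be chained from one small runner. This is a hypothetical helper, not a file in the repository. It assumes it is run from the project root and that each script runs without arguments.

```python
import subprocess
import sys

# Script paths in the order listed above.
SCRIPTS = [
    "src/data_preparation.py",
    "src/feature_engineering.py",
    "src/clustering_analysis.py",
    "src/pca_visualization.py",
    "src/tsne_visualization.py",
    "src/classification_model.py",
]


def build_commands(python=sys.executable, scripts=SCRIPTS):
    """Return one [interpreter, script] command per pipeline step."""
    return [[python, script] for script in scripts]


def run_pipeline(commands):
    """Run each step in order; check=True stops at the first failing script."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_pipeline(build_commands())` would execute the steps in sequence with the current interpreter.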
The results, including visualizations and data analysis, were generated and saved in this GitHub repository.
/project-root
├── notebooks                                  # Jupyter notebooks for exploration and analysis
│   └── sauti-exploratory-data-analysis.ipynb  # Notebook for exploratory data analysis
├── src                                        # Source code for the project
│   ├── data_preparation.py                    # Script for data preparation and cleaning
│   ├── feature_engineering.py                 # Script for feature engineering and transformation
│   ├── clustering_analysis.py                 # Script for clustering analysis
│   ├── classification_model.py                # Script for classification model evaluation
│   ├── pca_visualization.py                   # Script for PCA visualization
│   ├── tsne_visualization.py                  # Script for t-SNE visualization
│   └── utils.py                               # Utility functions for the project
├── tests                                      # Directory for test scripts
│   ├── test_data_preparation.py               # Test script for data preparation
│   ├── test_feature_engineering.py            # Test script for feature engineering
│   ├── test_clustering_analysis.py            # Test script for clustering analysis
│   └── test_classification_model.py           # Test script for classification model
├── results                                    # Directory to save results and visualizations
├── README.md                                  # Project README file
└── requirements.txt                           # List of dependencies
The PCA plot shows how the data points (traders and farmers) are distributed in the new coordinate system. Clusters in the PCA plot indicate groups of data points that are similar to each other. The separation between clusters suggests distinct segments within the data.
- Clusters: There are two distinct clusters of data points, one colored blue and labeled "second" and the other colored green and labeled "first".
- Separation: The clusters are well-separated, indicating that the PCA has effectively reduced the dimensionality of the data while preserving the separation between the clusters.
- Cluster Density: Both clusters are densely packed, indicating that the data points within each cluster are similar to each other.
- Cluster Size: The blue cluster appears to be larger than the green cluster, suggesting that there are more data points in the "second" group compared to the "first" group.
- At lower perplexity values (5), the clusters are more distinct and well-separated, especially at lower learning rates (10 and 100). However, as the learning rate increases, the clusters become more elongated and less distinct.
- At higher perplexity values (30 and 50), the clusters are less distinct and more scattered, especially at higher learning rates (100 and 200).
- Lower learning rates (10) tend to produce more compact and well-defined clusters across all perplexity values.
- Higher learning rates (200) tend to produce more scattered and less distinct clusters, indicating that the learning rate might be too high for effective clustering.
Based on the t-SNE training set:
- Lower perplexity values (5) seem to produce more meaningful and distinct clusters, especially at lower learning rates.
- Higher perplexity values (30 and 50) result in more scattered and less distinct clusters, suggesting that these values might be too high for the given data.
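A comparison like the one above can be produced by fitting t-SNE once per combination of perplexity and learning rate. The sketch below assumes scikit-learn's `TSNE` and uses the perplexity and learning-rate values quoted in the observations as defaults. The helper name `tsne_grid` and the random stand-in data are illustrative only.

```python
from itertools import product

import numpy as np
from sklearn.manifold import TSNE

PERPLEXITIES = (5, 30, 50)
LEARNING_RATES = (10, 100, 200)


def tsne_grid(embeddings, perplexities=PERPLEXITIES, learning_rates=LEARNING_RATES, seed=0):
    """Fit one 2-D t-SNE layout per (perplexity, learning_rate) pair."""
    layouts = {}
    for perplexity, learning_rate in product(perplexities, learning_rates):
        tsne = TSNE(
            n_components=2,
            perplexity=perplexity,  # must be smaller than the number of samples
            learning_rate=learning_rate,
            init="pca",
            random_state=seed,
        )
        layouts[(perplexity, learning_rate)] = tsne.fit_transform(embeddings)
    return layouts


# Small random stand-in so the demo runs quickly; a reduced grid is used here.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(60, 8))
layouts = tsne_grid(fake_embeddings, perplexities=(5, 30), learning_rates=(10, 200))
print(sorted(layouts))
```

Each entry in `layouts` can then be drawn as one panel of a grid of scatter plots, colored by cluster label, to compare settings side by side.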
In this project, we utilize t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize clusters in a lower-dimensional space, which is enhanced by first applying Principal Component Analysis (PCA) to reduce dimensionality. This combination leverages PCA to preserve global structure and t-SNE to capture local relationships, providing a detailed visualization of similar data point groups.
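The two-stage reduction can be sketched as follows, assuming scikit-learn. The number of intermediate PCA components (50) and the t-SNE settings are assumptions for illustration, since the text does not state the values the project used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def pca_then_tsne(embeddings, n_pca_components=50, perplexity=5, learning_rate=10, seed=0):
    """Reduce with PCA first, then map the reduced data to 2-D with t-SNE."""
    # PCA cannot keep more components than there are samples or features.
    n_components = min(n_pca_components, embeddings.shape[0], embeddings.shape[1])
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(embeddings)
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        learning_rate=learning_rate,
        init="pca",
        random_state=seed,
    )
    return tsne.fit_transform(reduced)


# Random stand-in for the sentence embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(80, 64))
layout = pca_then_tsne(fake_embeddings)
print(layout.shape)
```

Running PCA first removes low-variance directions and shrinks the input, which typically makes the t-SNE step faster and less sensitive to noise.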
- Clusters: The t-SNE plot shows distinct clusters, indicating clear groupings within the data.
- Cluster Separation: The clear separation suggests that t-SNE has effectively identified distinct groups, crucial for understanding different segments of traders and farmers.
- Cluster Density: Densely packed clusters indicate that data points within each cluster share similar characteristics or behaviors.
- Cluster Size: Varying cluster sizes suggest differences in the number of data points within each group, providing insights into the distribution of traders and farmers. Larger clusters may indicate more prevalent segments.
This project is licensed under the MIT License.
License & Usage Disclaimer: This repository's code is licensed under the MIT License. This license applies only to the software and methodology contained within the code files. The project utilized historical data provided for a competition and is intended strictly for portfolio demonstration and educational purposes. The MIT License does not extend to the competition dataset or grant rights to any proprietary data or content belonging to the non-profit or partnering organizations.
We would like to thank the following individuals and organizations for their support and contributions to this project:
- Contributors: Justin Chow, Kyle Kehoe, Anoushka Nayah, Priyansha Rastogi
- Organizations: Sauti East Africa, Miller Center of Social Entrepreneurship, Santa Clara University
- Tools and Libraries: We are grateful to the Sauti developers, whose Sauti Platform Demonstration (interactive demo) made this project possible.
- Mentors and Advisors: Special thanks to Lance Hadley, CEO of Sauti East Africa, for his guidance and advice.
- Community: We appreciate the support and feedback from the Santa Clara University MS of Business Analytics community.