Segmenting demographics data samples with PCA and k-means and building a LightGBM model for a targeted mailout marketing campaign.
This project works with data on customers of a mail-order sales company in Germany, aiming to:
- Analyze customer demographics against demographics information for the general population, and create population clusters with specific characteristics using unsupervised learning techniques.
- Use this population segmentation on a third dataset to help a supervised learning model predict which individuals are most likely to become customers of the company.
In this context, what we specifically wanted to look for in these datasets is:
- The dominant characteristics (attributes) that can be used to define population segments.
- How the defined clusters compare between the two population datasets (general vs. customers) in terms of population distribution.
- The cluster(s) that can be specifically targeted with a mail-out campaign.
- Given the above, whether we can predict the probability of being a potential customer for individuals in a third population demographics dataset.
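The cluster comparison between the general population and the customers can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's code: the feature names, the number of components and clusters, and the helper `cluster_share` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the cleaned, numeric datasets (hypothetical data).
columns = [f"f{i}" for i in range(6)]
general = pd.DataFrame(rng.normal(size=(1000, 6)), columns=columns)
customers = pd.DataFrame(rng.normal(loc=0.8, size=(300, 6)), columns=columns)

N_CLUSTERS = 4  # assumed value, chosen only for illustration

# Fit scaling, PCA and k-means on the general population only.
segmenter = make_pipeline(
    StandardScaler(),
    PCA(n_components=3, random_state=0),
    KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0),
)
general_labels = segmenter.fit_predict(general)
# Assign customers to the clusters learned from the general population.
customer_labels = segmenter.predict(customers)


def cluster_share(labels, n_clusters):
    """Percentage of rows that fall into each cluster (hypothetical helper)."""
    counts = np.bincount(labels, minlength=n_clusters)
    return 100 * counts / counts.sum()


shares = pd.DataFrame({
    "general_pct": cluster_share(general_labels, N_CLUSTERS),
    "customers_pct": cluster_share(customer_labels, N_CLUSTERS),
})
# Positive values mark clusters that are over-represented among customers.
shares["difference"] = shares["customers_pct"] - shares["general_pct"]
print(shares.round(1))
```

Fitting the pipeline on the general population and only predicting on the customers keeps the cluster definitions identical for both datasets, so the percentages are directly comparable.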
For a presentation and analysis of the results, please read my related article, "Population clustering and marketing campaign target prediction using LightGBM".
An .html version of the Jupyter notebook is included in the repository files.
There are four data files associated with this project:
- Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).
These datasets are provided by Udacity and Arvato Financial Services for this capstone project and are protected by specific terms and conditions prohibiting sharing.
In addition, two Excel spreadsheets were provided to facilitate data understanding and exploration:
- DIAS Information Levels - Attributes 2017.xlsx: a top-level list of attributes and descriptions, organized by informational category.
- DIAS Attributes - Values 2017.xlsx: a detailed mapping of data values for each feature in alphabetical order.
The project was developed in Google Colab, with Python 3.7 and the most common packages for data science:
- jupyter
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
In addition, the LightGBM, kneed and Kaggle Python API packages were used.
The datasets we had to work with were quite large and required a lot of cleaning and fixing before they could be used in our unsupervised and supervised learning models. Segmentation and, finally, the comparison of supervised learning models were certainly more fun. The results may be briefly summed up as follows:
- Dominant attributes defining population segments are identified with PCA. In a dataframe of the components, each principal component (PC) is renamed using the names of its two most influential attributes, which gives an immediate view of the major characteristics in a PC.
- Cluster population distribution per dataset. Figure shows a comparative visualization of populations' distribution percentage per cluster:
- Clusters specifically chosen to target for a mail-out campaign. We may propose at least two clusters that may be stronger candidates for a mail-out marketing campaign:
- Probability prediction for potential customers from a third population demographics dataset. Using the LightGBM Classifier for the final probability predictions on the mailout_test dataset produced an overall test AUC score of 0.74–0.76.
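The idea of renaming each principal component after its two most influential attributes can be illustrated with a short sketch. The attribute names, the random data and the helper `top_two_name` below are assumptions made for the example and do not come from the project data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical attribute names standing in for the real demographic features.
features = ["attr_a", "attr_b", "attr_c", "attr_d", "attr_e"]
data = pd.DataFrame(rng.normal(size=(500, len(features))), columns=features)

# Standardize first so that no attribute dominates because of its scale.
scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=3, random_state=0).fit(scaled)

# One row per principal component, one column per original attribute.
loadings = pd.DataFrame(pca.components_, columns=features)


def top_two_name(row):
    """Join the names of the two attributes with the largest absolute weight."""
    top = row.abs().nlargest(2).index
    return " | ".join(top)


# Rename each component after its two most influential attributes.
loadings.index = [top_two_name(row) for _, row in loadings.iterrows()]
print(loadings.round(2))
```

Absolute weights are used for the ranking because a strongly negative weight influences a component as much as a strongly positive one.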
The project concludes with the submission of the predicted probabilities to the relevant Kaggle competition. However, by the time I was able to make these predictions with my selected model, the competition had closed and was no longer accepting submissions.
Important files:
- Arvato Project Workbook.ipynb: the Jupyter notebook containing the analysis.
- Arvato Project Workbook.html: a snapshot of the notebook for presentation purposes.
- DIAS Attributes - Values 2017.xlsx: a detailed mapping of data values for each feature in alphabetical order.
- DIAS Information Levels - Attributes 2017.xlsx: a top-level list of attributes and descriptions, organized by informational category.
- img: directory with results images saved from Arvato Project Workbook.ipynb
- arvato_data_terms_and_conditions: directory with the terms and conditions for Arvato's data.
- This project has been created as part of the Udacity Data Science Nanodegree course.
- Data is provided by Arvato Financial Services.