Bertelsmann - Arvato challenge capstone project

Segmenting demographics data samples with PCA and k-means and building a LightGBM model for a targeted mailout marketing campaign.

Motivation

This project works with data on customers of a mail-order sales company in Germany, aiming to:

Analyze customer demographics, compare them against demographics information for the general population, and create population clusters with specific characteristics using unsupervised learning techniques.
Use this population segmentation on a third dataset to help a supervised learning model predict which individuals are most likely to become customers of the company.

In this context, what we specifically wanted to look for in these datasets is:

The dominant characteristics (attributes) that can be used to define population segments.
How the defined clusters compare between the two population datasets (general vs. customers) in terms of population distribution.
The cluster(s) that can be chosen as specific targets for a mail-out campaign.
Given the above, whether we can predict probabilities for potential customers from a third population demographics dataset.

Presentation

Please read my related article, which presents and analyzes the results: "Population clustering and marketing campaign target prediction using LightGBM". An .html version of the Jupyter notebook is included in the repository files.

The data

There are four data files associated with this project:

Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 features (columns).
Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 features (columns).
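
Since the four files do not share the same number of columns, a natural first check is to list which columns one dataset has that another lacks. The following is a minimal sketch of that check on toy dataframes. The helper `extra_columns` and all column names are hypothetical and stand in for the real (non-shareable) data.

```python
import pandas as pd


def extra_columns(left: pd.DataFrame, right: pd.DataFrame) -> list:
    """Return the columns present in `left` but missing from `right`, in left's order."""
    return [col for col in left.columns if col not in right.columns]


# Toy stand-ins for two of the datasets; the column names are made up for illustration.
general = pd.DataFrame({"ID": [1, 2], "AGE_BAND": [3, 4]})
customers = pd.DataFrame({"ID": [7], "AGE_BAND": [2], "CUSTOMER_TYPE": ["A"]})

print(extra_columns(customers, general))  # columns only the customers data has
print(extra_columns(general, customers))  # columns only the general data has
```

Columns that appear in only one dataset usually need to be dropped or handled separately before both datasets can go through the same cleaning and clustering steps.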

These datasets are provided by Udacity and Arvato Financial Services for this capstone project and are protected by specific terms and conditions prohibiting sharing.

In addition, two Excel spreadsheets were provided to facilitate data understanding and exploration:

DIAS Information Levels - Attributes 2017.xlsx: a top-level list of attributes and descriptions, organized by informational category.
DIAS Attributes - Values 2017.xlsx: a detailed mapping of data values for each feature in alphabetical order.

Environment

The development environment used for the project was Google Colab, with Python 3.7 and the most common data science packages:

jupyter
pandas
numpy
matplotlib
seaborn
scikit-learn

In addition, the LightGBM and kneed packages and the Kaggle Python API were used.

Results

The datasets were quite large and required a lot of cleaning and fixing to prepare them for the unsupervised and supervised learning models. Segmentation and, finally, comparing supervised learning models were certainly more fun. The results may be summed up briefly as follows:

Dominant attributes defining population segments. These are identified with PCA: each principal component (PC) in the resulting dataframe is renamed after its two most influential attributes, which gives an immediate view of the major characteristics of that PC.
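
The renaming idea can be sketched as follows. This is an illustration on synthetic data, not the notebook's code: the helper `name_components`, the feature names, and the choice of absolute loading as the measure of "influence" are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def name_components(pca, feature_names, top_n=2):
    """Label each principal component with its top_n features by absolute loading."""
    labels = []
    for i, loadings in enumerate(pca.components_):
        # Indices of the features with the largest absolute weight in this component.
        top_idx = np.argsort(np.abs(loadings))[::-1][:top_n]
        labels.append(f"PC{i + 1}: " + " / ".join(feature_names[j] for j in top_idx))
    return labels


# Synthetic data: "income" and "wealth" are strongly correlated on purpose.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
data = pd.DataFrame({
    "income": base[:, 0],
    "wealth": base[:, 0] + 0.1 * rng.normal(size=200),
    "age": base[:, 1],
    "noise": rng.normal(size=200),
})

scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=2).fit(scaled)
labels = name_components(pca, list(data.columns))

# Principal component scores, with columns renamed after their dominant attributes.
scores = pd.DataFrame(pca.transform(scaled), columns=labels)
print(scores.columns.tolist())
```

Because the two correlated features carry the most shared variance, the first component's label is built from them, which is the kind of at-a-glance reading the renamed dataframe is meant to give.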

Cluster population distribution per dataset. The corresponding figure shows a comparative visualization of the percentage of each population that falls into each cluster.
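
A comparison of this kind can be sketched by fitting k-means on the general population, assigning customers to the same clusters, and tabulating the share of each population per cluster. The data below is synthetic, and the helper `cluster_shares` and the choice of three clusters are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def cluster_shares(labels, n_clusters):
    """Percentage of samples assigned to each cluster (position = cluster id)."""
    counts = np.bincount(labels, minlength=n_clusters)
    return 100.0 * counts / counts.sum()


rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])

# General population: evenly spread over three groups.
general = np.vstack([c + rng.normal(scale=0.5, size=(300, 2)) for c in centers])
# Customers: concentrated in the group around (5, 5).
customers = np.vstack([
    centers[1] + rng.normal(scale=0.5, size=(240, 2)),
    centers[0] + rng.normal(scale=0.5, size=(30, 2)),
    centers[2] + rng.normal(scale=0.5, size=(30, 2)),
])

# Fit on the general population, then assign customers to the same clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(general)
comparison = pd.DataFrame({
    "general_pct": cluster_shares(kmeans.labels_, 3),
    "customers_pct": cluster_shares(kmeans.predict(customers), 3),
})
comparison["difference"] = comparison["customers_pct"] - comparison["general_pct"]
print(comparison.round(1))
```

Clusters with a large positive difference are over-represented among customers, which is the signal used to pick candidate clusters for a campaign.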

Clusters chosen as specific targets for a mail-out campaign. At least two clusters appear to be stronger candidates for a mail-out marketing campaign.

Probability prediction for potential customers from a third population demographics dataset. A LightGBM classifier was used for the final probability predictions on the mailout_test dataset and produced an overall test AUC score of 0.74–0.76.

Kaggle Competition

The project concludes with the submission of the predicted probabilities to the relevant Kaggle competition. However, by the time I was able to make these predictions with my selected model, the competition had closed and was no longer accepting submissions.

Files and project file structure

Important files:

Arvato Project Workbook.ipynb: the Jupyter notebook containing the analysis.
Arvato Project Workbook.html: a snapshot of the notebook for presentation purposes.
DIAS Attributes - Values 2017.xlsx: a detailed mapping of data values for each feature in alphabetical order.
DIAS Information Levels - Attributes 2017.xlsx: a top-level list of attributes and descriptions, organized by informational category.
img: directory with result images saved from Arvato Project Workbook.ipynb.
arvato_data_terms_and_conditions: directory with the terms and conditions for Arvato's data.

Acknowledgments

This project was created as part of the Udacity Data Science Nanodegree course.
The data is provided by Arvato Financial Services.

chrisliatas/dsnd-customer-segmentation
