
Marketing Campaign Response Prediction Project

CRISP-DM Methodology

CRISP-DM is our roadmap for exploring and understanding data. It breaks complex data projects down into manageable steps, from understanding the business need all the way to deploying the solution. It keeps the work organized and aligned with the project goals, and because it is not tied to any one industry, it can be applied to all sorts of projects. In short, it is our guide to navigating a data mining project.

CRISP-DM process diagram

  • Business Understanding: determine business objectives; assess situation; determine data mining goals; produce project plan
  • Data Understanding: collect initial data; describe data; explore data; verify data quality
  • Data Preparation (generally, the most time-consuming phase): select data; clean data; construct data; integrate data; format data
  • Modeling: select modeling technique; generate test design; build model; assess model
  • Evaluation: evaluate results; review process; determine next steps
  • Deployment: plan deployment; plan monitoring and maintenance; produce final report; review project (deployment was not required for this project)

Reference

Overview

The task is to develop a prediction model that allows the Marketing Department of a retail company to predict which customers are likely to respond to a marketing campaign, based on information from a previous campaign.

A response model can provide a significant boost to the efficiency of a marketing campaign by increasing responses or reducing expenses.

What is needed for the model to be considered successful?

Product manager Sarah is striving to optimize the company's marketing campaigns. With a keen eye on metrics, she wants the campaigns to reach a broad audience (recall > 0.75) while maintaining precision in targeting (F1 score > 0.5).
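As a rough sketch (assuming predictions `y_pred` and true labels `y_true` for the positive "responder" class are already available), these criteria translate into a simple check on scikit-learn's metrics:

```python
from sklearn.metrics import recall_score, f1_score

def meets_success_criteria(y_true, y_pred, recall_target=0.75, f1_target=0.5):
    """Check the campaign success criteria on the positive (responder) class."""
    recall = recall_score(y_true, y_pred)  # share of actual responders captured
    f1 = f1_score(y_true, y_pred)          # balance of precision and recall
    return recall > recall_target and f1 > f1_target
```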

Model building

To achieve the objective, a systematic approach (CRISP-DM) is followed through its stages, and four different machine learning models are trained and evaluated with their best parameters:

  • Decision Tree
  • Random Forest
  • XGBoost
  • Logistic Regression

Models from different families are explored: bagging techniques such as Random Forest, boosting methods such as XGBoost, and Logistic Regression as a linear model. Logistic Regression is selected as the best model.
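A minimal sketch of this model-selection step is shown below. The parameter grids are illustrative placeholders rather than the exact grids used in the notebook, and `X_train`/`y_train` are assumed to be the prepared training data.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate models with illustrative (not the project's actual) parameter grids
candidates = {
    "Decision Tree": (DecisionTreeClassifier(random_state=2024),
                      {"max_depth": [3, 5, 10]}),
    "Random Forest": (RandomForestClassifier(random_state=2024),
                      {"n_estimators": [100, 300], "max_depth": [5, 10]}),
    "XGBoost": (XGBClassifier(random_state=2024, eval_metric="logloss"),
                {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]}),
    "Logistic Regression": (LogisticRegression(max_iter=1000, random_state=2024),
                            {"C": [0.01, 0.1, 1, 10]}),
}

best_models = {}
for name, (estimator, grid) in candidates.items():
    # Recall is used as the tuning metric, in line with the business criteria
    search = GridSearchCV(estimator, grid, scoring="recall", cv=5)
    search.fit(X_train, y_train)
    best_models[name] = search.best_estimator_
```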

Questions or additional information

For any additional questions, don't hesitate to get in touch with the instructors. The instructors also act as the company's business/project stakeholders.

Dataset Description


| Feature | Description |
| --- | --- |
| AcceptedCmp1 | 1 if customer accepted the offer in the 1st campaign, 0 otherwise |
| AcceptedCmp2 | 1 if customer accepted the offer in the 2nd campaign, 0 otherwise |
| AcceptedCmp3 | 1 if customer accepted the offer in the 3rd campaign, 0 otherwise |
| AcceptedCmp4 | 1 if customer accepted the offer in the 4th campaign, 0 otherwise |
| AcceptedCmp5 | 1 if customer accepted the offer in the 5th campaign, 0 otherwise |
| Response (target) | 1 if customer accepted the offer in the last campaign, 0 otherwise |
| Complain | 1 if customer complained in the last 2 years |
| DtCustomer | Date of customer’s enrolment with the company |
| Education | Customer’s level of education |
| Marital | Customer’s marital status |
| Kidhome | Number of small children in customer’s household |
| Teenhome | Number of teenagers in customer’s household |
| Income | Customer’s yearly household income |
| MntFishProducts | Amount spent on fish products in the last 2 years |
| MntMeatProducts | Amount spent on meat products in the last 2 years |
| MntFruits | Amount spent on fruit products in the last 2 years |
| MntSweetProducts | Amount spent on sweet products in the last 2 years |
| MntWines | Amount spent on wine products in the last 2 years |
| MntGoldProds | Amount spent on gold products in the last 2 years |
| NumDealsPurchases | Number of purchases made with discount |
| NumCatalogPurchases | Number of purchases made using catalogue |
| NumStorePurchases | Number of purchases made directly in stores |
| NumWebPurchases | Number of purchases made through company’s web site |
| NumWebVisitsMonth | Number of visits to company’s web site in the last month |
| Recency | Number of days since the last purchase |
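For orientation, a hedged loading step might look like the snippet below; the file name/path is an assumption and may differ from the actual data file in the repository.

```python
import pandas as pd

# Assumed file name/path; adjust to the actual data file in the repository
df = pd.read_csv("data/marketing_campaign.csv")

# DtCustomer is the enrolment date; parse it so tenure-style features can be built
df["DtCustomer"] = pd.to_datetime(df["DtCustomer"], dayfirst=True, errors="coerce")

print(df.shape)
print(df["Response"].value_counts(normalize=True))  # class balance of the target
```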

Imports

This project uses the following libraries:

# Data Manipulation and Analysis
import pandas as pd
import numpy as np
import re
import collections
from datetime import datetime

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz

# Data Preprocessing
import imblearn
import sklearn
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.preprocessing import MinMaxScaler
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

# Machine Learning Model

from sklearn.linear_model import LogisticRegression
from yellowbrick.model_selection import RFECV

# Model Selection and Evaluation
from sklearn.model_selection import GridSearchCV, train_test_split
from yellowbrick.model_selection import CVScores, LearningCurve
from yellowbrick.classifier import DiscriminationThreshold, ClassPredictionError, PrecisionRecallCurve, ROCAUC
from sklearn import metrics 
import statsmodels.api as sm

# Save the model
import pickle

# Assign random state for reproducibility
random_state=2024

# Set seaborn style
sns.set_style("whitegrid")

# Set the max_columns option to None to display all columns
pd.set_option('display.max_columns', None)

Versions

pandas version: 2.1.4
numpy version: 1.23.5
matplotlib version: 3.8.0
seaborn version: 0.13.2
scikit-learn version: 1.3.2
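These versions can be verified in the environment with each library's `__version__` attribute:

```python
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
print("matplotlib version:", matplotlib.__version__)
print("seaborn version:", sns.__version__)
print("scikit-learn version:", sklearn.__version__)
```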

Results

Train and test results:

| Metric | Training Set | Test Set |
| --- | --- | --- |
| Accuracy | 79.2% | 77.6% |
| Precision | 79.1% | 38.2% |
| Recall | 79.5% | 79.5% |
| F1-score | 79.3% | 51.6% |
| Specificity | 79.0% | 77.3% |
| G-Mean | 79.2% | 78.4% |

Accuracy: The model's accuracy on the training set is 79.2%, while on the test set, it is slightly lower at 77.6%. This indicates that the model performs fairly consistently across both datasets, with a slight drop in the test set, which is expected.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision: Precision drops sharply from 79.1% on the training set to 38.2% on the test set. In other words, on the test set a large share of the customers the model flags as likely responders do not actually respond (false positives), whereas positive predictions on the training set are far more reliable.

Precision = TP / (TP + FP)

Recall: Recall remains constant at 79.5% for both the training and test sets. This consistency indicates that the model is equally effective at capturing true positive instances across both sets.

Recall = TP / (TP + FN)

F1-score: The F1-score, which balances precision and recall, drops from 79.3% on the training set to 51.6% on the test set. This decrease mirrors the drop in precision and reflects the trade-off the model makes: recall is kept high at the cost of precision.

F1-score = 2 * (Precision * Recall) / (Precision + Recall)

Specificity: Specificity, or the ability to correctly identify negative instances, is 79.0% on the training set and 77.3% on the test set. This minor decrease indicates the model's performance in recognizing negatives is relatively stable across both sets.

Specificity = TN / (TN + FP)

G-mean: The geometric mean (G-Mean), which combines the true positive rate (recall) and the true negative rate (specificity), is 79.2% on the training set and 78.4% on the test set. This shows a balanced performance in identifying both positives and negatives, with only a slight reduction on the test set.

G-mean = sqrt(Recall * Specificity)
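All of the formulas above can be computed from a single confusion matrix; the helper below is a small sketch using scikit-learn's convention that the binary confusion matrix is ordered as [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def classification_summary(y_true, y_pred):
    """Compute the reported metrics from one confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    g_mean = np.sqrt(recall * specificity)
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall,
            "F1-score": f1, "Specificity": specificity, "G-Mean": g_mean}
```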


Feature Importance:

| Feature | Coefficient |
| --- | --- |
| total_amount_spent | 4.608256 |
| total_campaigns_accepted | 3.859582 |
| NumWebVisitsMonth | 3.167518 |
| Recency | 3.036887 |
| family_size | 2.903364 |
| income_level | 1.880346 |
| marital_status | 1.698048 |

Top 5 Features:

  • total_amount_spent
  • total_campaigns_accepted
  • NumWebVisitsMonth
  • Recency
  • family_size

Significant Features: income_level and marital_status also show notable impacts on the target variable.
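A coefficient table like the one above can be read from the fitted model. The sketch below assumes a fitted `LogisticRegression` named `logreg` and a list `feature_names` of the selected features; the coefficient magnitudes are only comparable because the features were scaled with MinMaxScaler.

```python
import pandas as pd

coef_table = (
    pd.DataFrame({
        "Feature": feature_names,          # selected feature names (assumed)
        "Coefficient": logreg.coef_[0],    # fitted LogisticRegression (assumed)
    })
    .assign(abs_coef=lambda d: d["Coefficient"].abs())
    .sort_values("abs_coef", ascending=False)
    .drop(columns="abs_coef")
)
print(coef_table.head(7))
```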

Learning Curve:

The learning curve shows the Logistic Regression model's performance on both the training and cross-validation data. The two curves converge, indicating consistent performance on unseen data and suggesting good generalization without significant underfitting or overfitting.
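A sketch of how such a curve can be drawn with the yellowbrick LearningCurve visualizer imported earlier (the selected model `logreg` and training data are assumed):

```python
from yellowbrick.model_selection import LearningCurve

viz = LearningCurve(logreg, scoring="f1", cv=5, random_state=2024)
viz.fit(X_train, y_train)  # plots training vs. cross-validation score per training size
viz.show()
```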

ROC-AUC:

ROC AUC of 0.85 for both class 0 and class 1 indicates good discriminative power. The micro-average ROC AUC confirms that the model performs well on average across all classes.
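A hedged sketch using the ROCAUC visualizer imported above (model and train/test split assumed from earlier steps):

```python
from yellowbrick.classifier import ROCAUC

roc = ROCAUC(logreg, classes=["no response", "response"])
roc.fit(X_train, y_train)
roc.score(X_test, y_test)  # computes per-class and micro-average ROC AUC
roc.show()
```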

Precision-Recall Curve:

An average precision (AP) of 0.55 indicates moderate performance in balancing precision and recall across all thresholds.
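The corresponding sketch for the PrecisionRecallCurve visualizer imported above:

```python
from yellowbrick.classifier import PrecisionRecallCurve

pr = PrecisionRecallCurve(logreg)
pr.fit(X_train, y_train)
pr.score(X_test, y_test)  # reports average precision (AP) on the test set
pr.show()
```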

Confusion Matrix:

Among the candidate models, Logistic Regression achieved the highest share of true positives (11.9% of test samples) and true negatives (65.7% of test samples). Given the importance of recall (TPR) for the business need, this model performs well in identifying true positives.
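Cell percentages like those quoted above correspond to a confusion matrix normalized by the total number of test samples, for example:

```python
from sklearn.metrics import confusion_matrix

# Percentage of all test samples falling into each confusion-matrix cell
cm_pct = confusion_matrix(y_test, logreg.predict(X_test), normalize="all") * 100
print(cm_pct.round(1))  # [[TN%, FP%], [FN%, TP%]]
```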

Overall Assessment

Tree-based models did not generalize well, showing large gaps between train and test scores. The Logistic Regression model demonstrates strong predictive performance with small differences between train and test results, indicating good generalization. It achieves the best results while being the simplest model and using the fewest features.

Understanding the Trade-Off

High Recall, Lower Precision (business need):

  • Scenario: You send the campaign to a broad audience to capture nearly all potential responders.
  • Pros: Maximum reach, ensuring most potential customers are contacted.
  • Cons: Higher marketing costs, increased likelihood of annoying uninterested customers, lower overall campaign efficiency.

High Precision, Lower Recall:

  • Scenario: You send the campaign to a targeted, highly likely-to-respond audience.
  • Pros: Cost-efficient, higher customer satisfaction due to relevant offers, better use of marketing resources.
  • Cons: Misses some potential responders, possibly reducing the total number of responses.
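This trade-off can be explored directly by moving the decision threshold on the predicted probabilities (0.5 by default); the DiscriminationThreshold visualizer imported earlier serves the same purpose. A minimal sketch, assuming the fitted model and test split from earlier:

```python
from sklearn.metrics import precision_score, recall_score

proba = logreg.predict_proba(X_test)[:, 1]   # probability of responding
for threshold in [0.3, 0.5, 0.7]:
    pred = (proba >= threshold).astype(int)  # lower threshold -> higher recall
    print(f"threshold={threshold:.1f}  "
          f"recall={recall_score(y_test, pred):.2f}  "
          f"precision={precision_score(y_test, pred, zero_division=0):.2f}")
```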
