
Cold Call Analysis

SMU DSA211-Statistical Analysis with R Project
Grade: A-

Executive Summary

We aimed to model cold-calling success rates in the context of selling insurance. Marketing managers often use cold calling as part of their sales campaigns to sell insurance policies to newly generated leads. However, cold callers typically face a success rate of 1-2%; a model that predicts the probability of a successful cold call, given the recipient's characteristics, would raise that success rate and, in turn, the success of the marketing campaign.


Traditionally, a brute-force approach is employed, which not only wastes organisational resources and time but can also erode a caller's long-term effectiveness through call reluctance, as callers grow wary of rejection. We believe our model demonstrates that a data-centric approach to predicting the success of cold calls can help alleviate this industry challenge.

The data was sourced from a data mining competition dataset from the Technical University of Munich and was provided by an anonymised bank that sells car insurance to clients through cold calling. Since the bank has information on prospective clients, the data can be used to more accurately identify clients who are willing or unwilling to purchase car insurance, increasing the effectiveness of the bank's cold-calling campaign.

Multiple GLM models, built using different methods for selecting significant variables, and a classification tree were evaluated on their accuracy in identifying clients who were interested in purchasing car insurance. From our evaluation, the most accurate model was the GLM:

log[p(success)/p(failure)] = -2.14 - 1.67·Entrepreneur1 + 1.46·HHInsurance0 - 0.62·CarLoan - 0.08·NoOfContacts + 1.89·PrevSucc1 + 0.30·CallDuration

Our practical recommendations are as follows:

  1. Firstly, callers should try to engage more with their recipients, since a longer call duration improves the chance of success.
  2. Callers should not approach the same recipient repeatedly, as a higher number of contacts lowers the chance of a successful cold call.
  3. Companies should not target entrepreneurs, as they may be more willing to take risks and thus less willing to buy insurance.
  4. Customers who have a car loan may be less willing to buy insurance because of the added cost.
  5. Conversely, companies should target customers who already hold complementary products; in this case, customers who already have household insurance will be more willing to purchase car insurance after the call.
  6. Customers who have bought other products during previous marketing campaigns will also be more likely to buy after the cold call.

We believe that while this predictive model is an important step towards the digital transformation of traditional marketing, it can be improved further through actual deployment and by obtaining more data tailored to the context of specific companies, which would increase its predictive accuracy and consistency.

Dataset

The dataset is from one bank in the United States that provides car insurance services. The bank organises regular campaigns to attract new customers and has access to potential customers' data. The dataset has 4,000 rows, each representing a potential customer and whether or not they purchased car insurance. An overview of the features is as follows:

(Image: overview of the dataset's features)

We will omit variables such as Id, the day of contact and the month of contact, as we do not believe any actionable recommendations can result from including them in the analysis.
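Once the data has been loaded (see Data Preparation below), these columns could be dropped with a call along the following lines; this is only a sketch, and the exact column names (Id, LastContactDay, LastContactMonth) are assumptions based on the feature overview above.

library(dplyr)
# Drop the identifier and contact-date columns, which we do not use for modelling
# (column names are assumptions based on the dataset description)
coldcall <- coldcall %>% select(-Id, -LastContactDay, -LastContactMonth)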

Data Preparation

library(readr)
# Load the training data from the competition dataset
coldcall <- read_csv("./carInsurance_train.csv")
# Inspect the structure: column types and a preview of the values
str(coldcall)

Loading the data in RStudio, we notice that it originally contains some rows with NA values.
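A quick way to see where the missing values are concentrated is a per-column count, for example:

# Number of NA values in each column
colSums(is.na(coldcall))
# Number of rows that would remain if incomplete rows were dropped
sum(complete.cases(coldcall))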

Secondly, we realise that the start and end times of a cold call are more informative when expressed as a call duration, so we used dplyr and lubridate to create a variable called CallDuration.

library(dplyr); library(lubridate)
coldcall <- coldcall %>%
  na.omit() %>%                                                              # drop rows with missing values
  mutate(CallDuration = time_length(interval(CallStart, CallEnd),
                                    unit = "minute")) %>%                    # call length in minutes
  select(-CallStart, -CallEnd)                                               # raw timestamps no longer needed
summary(coldcall)

This results in the output:

(Image: summary() output of the prepared dataset)

Exploratory Data Analysis

To automate the EDA process, we made use of the DataExplorer package:
library(DataExplorer)
create_report(coldcall)

This allows us to obtain a report on the dataset for inspecting missingness, bar charts, histograms, scatterplots, QQ plots, correlation matrices and much more.
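If a full report is more than needed, the same package also exposes the individual plots; a minimal sketch:

library(DataExplorer)
plot_missing(coldcall)      # share of missing values per variable
plot_bar(coldcall)          # bar charts of the discrete variables
plot_histogram(coldcall)    # histograms of the continuous variables
plot_correlation(coldcall)  # correlation heatmap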

Data Modelling

We attempted several machine learning models, including:
  1. a simple logistic regression model
  2. a best subset selection model
  3. ridge and lasso regression models
  4. decision tree models
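The rest of this section walks through the logistic regression; for reference, the penalised and tree models could be fit roughly as follows. This is a hedged sketch rather than our exact code: the glmnet and tree packages, the use of cv.glmnet for tuning, and the train index defined in the next snippet are all assumptions. Best subset selection for a binomial GLM could be handled with, for example, the bestglm package.

library(glmnet); library(tree); library(dplyr)

# Design matrix for glmnet (dummy-codes the categorical variables; drop the intercept column)
x <- model.matrix(CarInsurance ~ ., data = coldcall)[, -1]
y <- coldcall$CarInsurance

# Ridge (alpha = 0) and lasso (alpha = 1), with lambda chosen by cross-validation
cv.ridge <- cv.glmnet(x[train, ], y[train], alpha = 0, family = "binomial")
cv.lasso <- cv.glmnet(x[train, ], y[train], alpha = 1, family = "binomial")
ridge.prob <- predict(cv.ridge, x[-train, ], s = "lambda.min", type = "response")
lasso.prob <- predict(cv.lasso, x[-train, ], s = "lambda.min", type = "response")

# Classification tree: tree() expects factors, so convert the character columns
# and the 0/1 response first
coldcall.f <- coldcall %>%
  mutate(across(where(is.character), as.factor),
         CarInsurance = as.factor(CarInsurance))
m.tree <- tree(CarInsurance ~ ., data = coldcall.f[train, ])
tree.pred <- predict(m.tree, coldcall.f[-train, ], type = "class")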

To create a simple logistic regression model, we used the following code snippet:

 RNGkind(sample.kind="Rounding")  # reproduce the sampling behaviour of older R versions
 set.seed(1)
 # 80/20 train-test split by row index
 train<-sample(1:nrow(coldcall),round(0.8*nrow(coldcall)))
 test<-(-train)                   # negative indices select the held-out rows
 
 # Logistic regression of CarInsurance on all remaining predictors
 m1<-glm(CarInsurance~.,data=coldcall[train,],family="binomial")
 summary(m1)

Since not every variable is statistically significant, we re-ran the model using only the variables that are statistically significant at the 95% confidence level:

 # Binary indicators for the significant levels of Job and Outcome
 coldcall$Entrepreneur<-as.factor(ifelse(coldcall$Job=="entrepreneur",1,0))
 coldcall$PrevSucc<-as.factor(ifelse(coldcall$Outcome=="success",1,0))
 
 # Refit the logistic regression on the significant variables only
 m1_1<-glm(CarInsurance~Entrepreneur+HHInsurance+CarLoan+NoOfContacts+PrevSucc+CallDuration,data=coldcall[train,],family="binomial")
 coef(m1_1)
 
 # Predicted probabilities on the test set, classified with a 0.5 threshold
 m1_1.prob<-predict(m1_1,coldcall[test,],type="response")
 m1_1.pred<-rep('0',nrow(coldcall[test,]))
 m1_1.pred[m1_1.prob>0.5]<-'1'
 table(m1_1.pred,coldcall[test,]$CarInsurance)

The confusion matrix of our model is as follows:

(Image: confusion matrix of predictions against actual outcomes on the test set)

With this model, we achieve a True Positive Rate of 92.7%, True Negative Rate of 90.3% and an overall error rate of 8.3%.
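These rates can also be computed directly from the table() output above; a short sketch, assuming predictions in the rows and actual outcomes in the columns:

cm <- table(m1_1.pred, coldcall[test,]$CarInsurance)
TN <- cm["0","0"]; FP <- cm["1","0"]
FN <- cm["0","1"]; TP <- cm["1","1"]
TPR <- TP/(TP+FN)              # true positive rate (sensitivity)
TNR <- TN/(TN+FP)              # true negative rate (specificity)
error <- (FP+FN)/sum(cm)       # overall error rate
c(TPR=TPR, TNR=TNR, error=error)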

Conclusion and Discussion

The simple logistic regression model achieved the best accuracy among the models attempted. This is likely because the number of variables included in the decision tree, best subset, ridge and even lasso models was greater than in the logistic regression model restricted to statistically significant variables. With fewer variables, the model has lower variance and is less prone to overfitting, which may explain its better test accuracy.

Given the equation of our model, one conclusion we can draw, for example, is:

If the duration of the call increases by 1 minute, the log-odds of success increase by 0.30. Hence, callers should try to engage more with their recipients, since a longer call duration improves the chance of success.
The other recommendations that we have made follow the same line of reasoning.
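To make the interpretation concrete, the change in log-odds can be converted into an odds multiplier:

# A 1-minute increase in call duration multiplies the odds of success by exp(0.30) ≈ 1.35,
# i.e. roughly a 35% increase in the odds; the same conversion applies to the other coefficients.
exp(0.30)
exp(coef(m1_1))   # odds multipliers for all predictors in the final model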

Our project has several limitations including:

  1. Severe reduction of the dataset, as a large number of rows had missing values. We could look into treating NA as a category rather than removing rows entirely (see the sketch after this list).
  2. Lack of consideration for interaction or polynomial effects.
  3. Possible confounding factors were not accounted for. For example, household income could affect the willingness to buy insurance and is not captured by Job.
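On the first point, one possible approach, sketched below under the assumption that the missing values sit in the character-valued columns, would be to recode NA as an explicit "unknown" level instead of dropping the rows:

library(dplyr)
# Recode NA in the character columns as an explicit "unknown" category,
# so the affected rows can be kept for modelling
coldcall <- coldcall %>%
  mutate(across(where(is.character), ~ifelse(is.na(.x), "unknown", .x)))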
