πŸ₯πŸš— HEALTH INSURANCE CROSS-SELL πŸš—πŸ₯

project_cover_image

📖 Background

The following context is completely fictional.

Insurance All is a company that provides health insurance to its customers, and the product team is analyzing the possibility of offering policyholders a new product: vehicle insurance. Customers of this new product would pay an annual amount to Insurance All in exchange for an amount insured by the company, intended to cover the costs of a possible accident or damage to the vehicle.

Last year, Insurance All surveyed over 380,000 customers about their interest in joining a new vehicle insurance product. The responses were saved in a database along with other customer attributes.


📌 Problem Statement

Ranking customers interested in purchasing vehicle insurance.

The survey obtained feedback from 304,000 customers about their interest in purchasing vehicle insurance. The new insurance product was developed and is being offered to the interested respondents. However, more than 76,000 customers did not respond to the survey, and the already busy call center has the capacity to contact only 20,000 of these potential customers. Therefore, we need a list of these 76k customers ordered by their likelihood of interest, in order to maximize the company's conversion and revenue.


💾 Data Understanding

- Tools Used:

  • Python Version: 3.10
  • Packages: Jupyter, Pandas, Numpy, Matplotlib, Seaborn, Scikit-Learn, among others (please check the full list in the repository)
  • Coggle Mindmaps
  • SweetViz
  • Optuna - A hyperparameter optimization framework
  • Frontend API: Google Sheets Script
  • Backend: Heroku

- Importing Dataset:
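
A minimal loading sketch; the file name and path (`data/train.csv`) are assumptions and may differ from the repository's actual structure:

```python
import pandas as pd

# Assumed location of the raw survey data (adjust to the repository's structure).
df_raw = pd.read_csv("data/train.csv")

print(df_raw.shape)    # roughly 380k rows and 12 columns are expected
print(df_raw.dtypes)   # quick type overview before the Data Description step
```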


- Data Dictionary

| Variable | Description |
|---|---|
| Id | Unique customer identifier. |
| Gender | Customer's gender. |
| Age | Customer's age. |
| Driving License | Indicates whether the customer has a driving license. |
| Region Code | Customer's region code. |
| Previously Insured | Indicates whether the customer already had auto insurance. |
| Vehicle Age | Age of the customer's vehicle. |
| Vehicle Damage | Indicates whether the customer's vehicle was previously damaged. |
| Annual Premium | Total annual amount the customer pays for the current health insurance. |
| Policy Sales Channel | Anonymized code for the customer contact channel. |
| Vintage | Number of days the customer has been associated with the company through the health insurance purchase. |
| Response | Indicator of interest in auto insurance (survey response). |

- Business Assumptions

The Data Science team has to answer the following questions:

  • What percentage of customers interested in purchasing auto insurance will the sales team be able to contact by making 20,000 calls?
  • If the sales team capacity increases to 40,000 calls, what percentage of customers interested in purchasing auto insurance will the sales team be able to contact?
  • How many calls does the sales team need to make to contact 80% of customers interested in purchasing auto insurance?

We have the following assumptions:

  • The policy sales channels used were SMS, e-mail, and phone calls.
  • All customers were above the minimum driving age.

🧾 Evaluation Metric

Here we use two approaches to evaluate our models: Precision and Recall at K, combined with the ROC AUC.

"Precision at K" is the proportion of recommended customers in the top-K set that are relevant. "Recall at K" is the proportion of relevant customers found in the top-K recommendations.

Precision@K = (interested customers in the top-K list) / K
Recall@K = (interested customers in the top-K list) / (total interested customers)
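
As a rough illustration (not the notebook's exact code), these metrics can be computed from the ranked validation set as follows; `y_true` and `y_score` are placeholder names:

```python
import numpy as np

def precision_recall_at_k(y_true, y_score, k=20_000):
    """Precision@K and Recall@K for a list of customers ranked by score."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_score)[::-1]        # highest propensity first
    hits_in_top_k = y_true[order][:k].sum()  # interested customers inside the top-K
    return hits_in_top_k / k, hits_in_top_k / y_true.sum()

# Toy usage: 6 customers, K = 3
print(precision_recall_at_k([0, 1, 0, 1, 1, 0], [0.9, 0.8, 0.1, 0.7, 0.3, 0.2], k=3))
```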

ROC is a probability curve and AUC represents the degree or measure of separability. The ROC curve is plotted with TPR (True Positive Rate) against the FPR (False Positive Rate) where TPR is on the y-axis and FPR is on the x-axis.

AUC = area under the ROC curve
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
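
A quick sketch of how the TPR/FPR curve and the AUC can be obtained with Scikit-Learn; the toy arrays below stand in for the real validation labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy stand-ins for the real validation labels and model probabilities.
y_val = np.array([0, 0, 1, 1, 0, 1])
y_proba = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.65])

fpr, tpr, thresholds = roc_curve(y_val, y_proba)  # TPR = TP/(TP+FN), FPR = FP/(FP+TN) per threshold
print(roc_auc_score(y_val, y_proba))              # area under the ROC curve
```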


🔬 Solution Approach

The approach used to solve this task follows the CRISP-DM¹ methodology, divided into the following steps:

  1. Data Description: understanding of the status of the database and dealing with missing values properly. Basic statistics metrics furnish an overview of the data.
  2. Feature Engineering: derivation of new attributes based on the original variables aiming to better describe the phenomenon that will be modeled, and to supply interesting attributes for the Exploratory Data Analysis.
  3. Feature Filtering: filtering of records and removal of attributes that carry no information for modeling or that fall outside the scope of the business problem.
  4. Exploratory Data Analysis (EDA): exploration of the data searching for insights and seeking to understand the impact of each variable on the upcoming machine learning modeling.
  5. Data Preparation: preprocessing stage required prior to the machine learning modeling step.
  6. Feature Selection: selection of the most significant attributes for training the model.
  7. Machine Learning Modeling: implementation of a few algorithms appropriate to the task at hand. In this case, classification models whose predicted probabilities are used to rank customers by their propensity to buy the new insurance.
  8. Hyperparameter Fine Tuning: search for the best values for each of the parameters of the best performing model(s) selected from the previous step.
  9. Statistical Error Analysis: conversion of the performance metrics of the Machine Learning model to a more tangible business result.
  10. Production Deployment: deployment of the model in a cloud environment (Heroku), using Flask connected to our model stored in a pickle file (a minimal sketch of such an endpoint is shown right after this list).
  11. Google Sheets Script: a script that presents our business results for sample customers in a Google Sheet. See the "Deployment" section.
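
A minimal sketch of what such a Flask handler might look like; the file name `model/lgbm_model.pkl` and the assumption that the pickle already contains the full preprocessing + model pipeline are illustrative, not the repository's actual code:

```python
# handler.py -- minimal sketch of a Flask endpoint serving the pickled model.
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed path; the pickle is assumed to bundle preprocessing and the classifier.
with open("model/lgbm_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                  # one customer dict or a list of them
    df = pd.DataFrame(payload if isinstance(payload, list) else [payload])
    df["score"] = model.predict_proba(df)[:, 1]   # propensity to buy vehicle insurance
    ranked = df.sort_values("score", ascending=False)
    return jsonify(ranked.to_dict(orient="records"))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```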

πŸ•΅πŸ½β€β™‚οΈ Exploratory Data Analysis & Main Insights

Hypothesis Creation Map

- Numerical Attributes Correlation

- Categorical Attributes Correlation

- Main Insights

Insights are pieces of information that are new or that challenge beliefs previously held by the business team. They are also actionable, enabling decisions that drive future results.

  • Hypothesis 1: Higher interest in FEMALE customers.

A: False. We observed significantly higher interest among male customers.

  • Hypothesis 2: Higher interest in customers who had VEHICLE PREVIOUSLY DAMAGED.

A: True. Almost no customers whose vehicles were never damaged showed any interest in the insurance.

  • Hypothesis 3: Higher interest in LONGER-STANDING CUSTOMERS.

A: False. Interest showed no correlation with how long the customer has been with the company.


💻 Machine Learning Modeling & Evaluation

To measure the performance of the models we use cross-validation, which gives a more reliable estimate of how each model behaves on data it has never seen before. The @K for the Learning-to-Rank metrics is 20,000, which is explained in the business results in the next section.
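
For illustration, a minimal sketch of how a cross-validated Recall@K could be produced; the synthetic stand-in data only mimics the class imbalance, and the actual notebook uses the preprocessed training set and may scale K differently:

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

def recall_at_k(y_true, y_score, k):
    order = np.argsort(y_score)[::-1]
    return y_true[order][:k].sum() / y_true.sum()

# Synthetic stand-in for the preprocessed training set (~12% positive class).
X, y = make_classification(n_samples=20_000, n_features=10, weights=[0.88], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_recalls = []
for train_idx, val_idx in skf.split(X, y):
    model = LGBMClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[val_idx])[:, 1]
    k = int(0.26 * len(val_idx))  # same proportion as 20,000 calls out of ~76k customers
    fold_recalls.append(recall_at_k(y[val_idx], proba, k))

print(f"Recall@K mean over folds: {np.mean(fold_recalls):.4f}")
```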

| Model Name | Balanced Accuracy | Precision @K Mean | Recall @K Mean | ROC AUC Score | Top K Score |
|---|---|---|---|---|---|
| LGBM Classifier | 0.501066 | 0.307935 | 0.828112 | 0.853336 | 0.877706 |
| CatBoost Classifier | 0.507893 | 0.305995 | 0.822895 | 0.850966 | 0.876744 |
| XGB Classifier | 0.511982 | 0.305255 | 0.820905 | 0.849166 | 0.876341 |
| Random Forest Classifier | 0.542660 | 0.289416 | 0.778310 | 0.829445 | 0.865662 |
| GaussianNB | 0.783939 | 0.288646 | 0.776239 | 0.825829 | 0.637886 |
| Logistic Regression | 0.500000 | 0.274926 | 0.739344 | 0.817501 | 0.878030 |
| K-Nearest Neighbors Classifier | 0.557291 | 0.268607 | 0.722349 | 0.752549 | 0.856038 |

In all scenarios, the LGBM, CatBoost, and XGBoost classifiers had the best performance, so we chose the model with the best size-speed ratio: the LGBM model. We then proceeded to the Hyperparameter Fine-Tuning step using the Optuna framework.

| Model Name | Recall @K Mean | ROC AUC Score | Top K Score |
|---|---|---|---|
| LGBM Classifier Tuned | 0.827252 | 0.852081 | 0.877505 |

Note that for this optimization we used recall as the metric, to better find the interested customers.
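
A sketch of what the Optuna search might look like, maximizing cross-validated recall; the parameter ranges and the synthetic stand-in data are illustrative assumptions, not the values actually used:

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the project uses the preprocessed training set.
X, y = make_classification(n_samples=20_000, n_features=10, weights=[0.88], random_state=42)

def objective(trial):
    # Illustrative search space for LightGBM
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = LGBMClassifier(random_state=42, **params)
    # Recall as the optimization metric, as noted above
    return cross_val_score(model, X, y, cv=3, scoring="recall").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```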


📉 Business Performance

We gathered information to answer the initial business questions.

  • What percentage of customers interested in purchasing auto insurance will the sales team be able to contact by making 20,000 calls?

A: The sales team would be able to reach 61.88% of the people interested in purchasing the new car insurance by making 20,000 calls, which corresponds to 26.24% of our validation dataset, a performance 2.35 times better than a random choice.

  • If the sales team capacity increases to 40,000 calls, what percentage of customers interested in purchasing auto insurance will the sales team be able to contact?

A: The sales team would be able to reach 99.33% of the people interested in purchasing the new car insurance by making 40,000 calls, which corresponds to 52.48% of our validation dataset. Here our model is 1.89 times better than a random model.

  • How many calls does the sales team need to make to contact 80% of customers interested in purchasing auto insurance?

A: To contact 80% of the customers interested in purchasing auto insurance, the sales team needs to make 26,010 calls, which corresponds to 34.12% of the validation dataset.
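
These answers come from walking down the customer list ranked by the model's scores; a minimal sketch of that computation, where `y_val` and `y_proba` are placeholder names for the validation labels and model scores:

```python
import numpy as np

def calls_analysis(y_true, y_score, n_calls=None, target_recall=None):
    """Walk down the customer list ranked by score and report cumulative interest."""
    y_true = np.asarray(y_true)
    ranked = y_true[np.argsort(y_score)[::-1]]     # customers in calling order
    cum_interested = np.cumsum(ranked)
    total_interested = ranked.sum()
    if n_calls is not None:                        # e.g. 20,000 or 40,000 calls
        return cum_interested[n_calls - 1] / total_interested
    if target_recall is not None:                  # e.g. 0.80 of interested customers
        return int(np.searchsorted(cum_interested, target_recall * total_interested) + 1)

# Usage with the validation labels and model scores (names are placeholders):
#   calls_analysis(y_val, y_proba, n_calls=20_000)      -> share of interested reached
#   calls_analysis(y_val, y_proba, target_recall=0.80)  -> calls needed for 80%
```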


  • Revenue Results

The validation dataset contains a total of 76,222 clients. Below we calculate the revenue each business-question scenario would generate from this dataset, assuming a fixed insurance price of US$ 2,000.00 per year and ignoring the cost of each call made to reach a client.

For comparison purposes we use a "random" model as a baseline, which could be a simple sorting of the list by age or some other single attribute.

Revenue for 20,000 calls

| Model | People Reached | Total People Interested | Revenue |
|---|---|---|---|
| Random Model | 2,428 | 9,523 | US$ 4,996,000 |
| Suggested Model | 5,892 | 9,523 | US$ 11,784,000 |
| Difference Between Models | 3,464 | ---- | US$ 6,788,000 |

Revenue for 40,000 calls

| Model | People Reached | Total People Interested | Revenue |
|---|---|---|---|
| Random Model | 4,997 | 9,523 | US$ 9,994,000 |
| Suggested Model | 9,469 | 9,523 | US$ 18,938,000 |
| Difference Between Models | 4,472 | ---- | US$ 8,944,000 |

Revenue for reaching 80% of the interested people in the dataset

Reaching this 80% would require 23,350 calls.

| Model | People Reached | Total People Interested | Revenue |
|---|---|---|---|
| Random Model | 2,916 | 9,523 | US$ 5,832,000 |
| Suggested Model | 7,618 | 9,523 | US$ 15,236,000 |
| Difference Between Models | 4,702 | ---- | US$ 9,404,000 |

💡 Conclusions

Gathering the results, we can conclude that a model ranking the most promising clients clearly reduces the sales team's cost and effort. In addition, the spreadsheet deployment makes it much easier to simulate customer profiles, a feature of great value to the company.


👣 Next steps

  • Collect more data from clients
  • Extract more significant features for the next cycle
  • Keep improving the model's hyperparameters, since Optuna's search is stochastic and does not always find the best combination

🚀 Deployment

Google Sheets

scorevideo

If you can't see it properly, right-click the GIF, choose "Open in a new tab", and zoom in for a better view.


🔗 References

  1. Data Science Process Alliance - What is CRISP-DM
  2. Meaningful Metrics: Cumulative Gains and Lyft Charts
  3. Precision and Recall At K for Recommender Systems

If you have any suggestions or questions, feel free to contact me via LinkedIn.


✍🏽 Author

linkedin gmail kaggle

MIT License


💪 How to contribute

  1. Fork the project.
  2. Create a new branch with your changes: git checkout -b my-feature
  3. Save your changes and create a commit message describing what you did: git commit -m "feature: My new feature"
  4. Submit your changes: git push origin my-feature
