Car Insurance Claim Prediction ML

Business Understanding

The task at hand is to build a classifier for a car insurance company that predicts whether a client will claim insurance: an ML model that takes customer details as input and outputs a binary prediction of whether the customer will make a claim.

The data provided in the dataset_prepared.csv file consists of 1000 records of customer information from the car insurance company. The dataset includes the following relevant attributes: gender, age, race, driving experience, education, income, credit score, vehicle details, marital status, number of children, country, post code, annual mileage, vehicle type, speeding violations, driving under influence, and past accidents. The target variable is HAS_CLAIMED_INSURANCE, which indicates whether a customer has claimed insurance.

The finished system will require a classifier model capable of predicting whether or not a customer will claim insurance based on the attributes listed above. The system will take into consideration customer details (input) and will produce a binary prediction (output) of “claim” (1) or “no claim” (0).

This task suits the requirements of a Data Mining Approach for the following reasons:

  • Large dataset: the dataset provides a substantial amount of data from which to extract patterns and insights.
  • Complex relationships: insurance-claiming decisions tend to depend on factors such as customer demographics, driving history, and policy details; data mining techniques can analyse these relationships and surface patterns that are hard to discern manually.
  • Classification problem: predicting whether or not a customer will claim insurance is a classification problem, a common application of data mining techniques.
  • Decision support: an accurate classifier will help the car insurance company assess risks, optimize premiums, and make informed decisions based on customer data.

The following terminology can be found throughout the report:

  • Attribute: also known as feature or variable, it is a specific characteristic or property of data. Within this report, attribute will refer to the various attributes of the csv file dataset which help describe and categorize the customers.
  • Binary prediction: prediction or classification with two and only two outcomes or categories. Within this report, binary prediction will refer to the model’s prediction as to whether or not the customer will claim insurance. It will be represented as “1” for “claim” and “0” for “not claim”.
  • Input: independent variables/features. Within this report, input will refer to the customer attributes provided in the csv file to be used in the model for prediction.
  • ML Model (or simply Model): a mathematical representation of the relationships between input attributes and the target variable.
  • Output: prediction or result generated by a ML model. Within this report, output will refer to the binary prediction of whether or not a customer will claim insurance as represented by the variable “HAS_CLAIMED_INSURANCE”.
  • Predictive Model: similar to Model. A mathematical or statistical model that uses historical data to make inferences about future outcomes or events by identifying relationships and patterns in the input data and applying them to predict the output variable.
  • Target variable: also known as dependent variable or response variable, it is the variable that will be predicted. Within this report, target variable will refer to the “HAS_CLAIMED_INSURANCE” column within the csv file which indicates whether or not a customer has claimed insurance.
  • Task: the task at hand, to build a classifier ML model that predicts whether or not a customer will claim insurance based on their attributes as listed in a csv file.
  • Variable: an attribute or feature of a dataset. E.g.: age, gender, country, ...
    • Categorical: see Nominal.
    • Continuous: it refers to a numeric variable that can be any value within a given range. It can be fractional and/or decimal and it has an infinite number of possible values. E.g.: income, annual mileage …
    • Discrete: it refers to a numeric variable that can only take certain values or integers. E.g.: past accidents, speeding violations, …
    • Nominal: also known as Categorical, it represents categorised/labelled data without any inherent order or numeric value and is treated as labels that are later used to represent different groups/categories. It can have a finite number of distinct values. E.g.: gender, race, education, …
    • Numeric: it refers to data that is represented by numeric values, categorized into continuous and discrete variables. E.g.: credit score, speed violations, …

With regards to project methodology, I will be using the Cross-Industry Standard Process for Data Mining (CRISP-DM), which ensures a structured and systematic approach to the project, allowing me to conduct reliable analysis and make informed decisions based on the output of the predictive model. CRISP-DM consists of the following 6 stages:

  1. Business Understanding: acknowledging business goals and defining the problem.
  2. Data Understanding: exploring and understanding the data provided. Assess data quality and identify relevant attributes.
  3. Data Preparation: readying data for modelling by addressing missing values, data cleaning, normalisation, feature selection, and splitting the data into training, validation, and test datasets.
  4. Modelling: choosing the appropriate ML model algorithms, training them using the training dataset, and tuning hyperparameters for optimal performance.
  5. Evaluation: evaluating the model’s performance through appropriate evaluation metrics and selecting the best-performing model.
  6. Deployment: documenting findings, preparing a summary of the project, and presenting said report to stakeholders.

Data Understanding

To aid in building the solution, I have listed the variables found in the dataset alongside their data type and suitability for the project:

| Variable | Data Type | Suitability |
| --- | --- | --- |
| ID | Nominal – unique identifier for each customer | Not considered; it has no predictive value |
| AGE | Nominal – categorical age range | Considered; treat as a discrete nominal variable |
| GENDER | Nominal – categorical gender | Considered; treat as a discrete nominal variable |
| RACE | Nominal – categorical race | Considered; treat as a discrete nominal variable |
| DRIVING_EXPERIENCE | Nominal – categorical driving experience range | Considered; treat as a discrete nominal variable |
| EDUCATION | Nominal – categorical education level | Considered; treat as a discrete nominal variable |
| INCOME | Nominal – categorical income level | Considered; treat as a discrete nominal variable |
| CREDIT_SCORE | Numeric – continuous credit score | Considered; treat as a continuous variable |
| HAS_OWN_VEHICLE | Nominal – categorical vehicle ownership | Considered; treat as a discrete nominal variable |
| VEHICLE_YEAR | Nominal – categorical vehicle year range | Considered; treat as a discrete nominal variable |
| IS_MARRIED | Nominal – categorical marital status | Considered; treat as a discrete nominal variable |
| HAS_CHILDREN | Nominal – categorical presence of children | Considered; treat as a discrete nominal variable |
| COUNTRY | Nominal – categorical country of residence | Considered; treat as a discrete nominal variable |
| POSTAL_CODE | Nominal – categorical postal code | Not considered; specific to customer location, it has no predictive value |
| ANNUAL_MILEAGE | Numeric – continuous estimated annual mileage | Considered; treat as a continuous variable |
| VEHICLE_TYPE | Nominal – categorical type of vehicle | Considered; treat as a discrete nominal variable |
| SPEEDING_VIOLATIONS | Numeric – discrete count of speeding violations | Considered; treat as a discrete variable |
| DRIVING_UNDER_INFLUENCE | Nominal – categorical driving under the influence | Considered; treat as a discrete nominal variable |
| PAST_ACCIDENTS | Numeric – discrete count of past accidents | Considered; treat as a discrete variable |
| HAS_CLAIMED_INSURANCE | Nominal – categorical insurance claim status | Target variable, to be predicted |

Note that “AGE” could also be classified as an ordinal variable rather than nominal, as it represents data with a meaningful order but without a consistent numeric difference between categories. However, since some ML algorithms treat ordinal variables as nominal during modelling, and since in practice the treatment of ordinal variables varies according to the goals of the project at hand, it made more sense to consider “AGE” nominal.

In sum, all variables are suitable for the project except for “ID” and “POSTAL_CODE”.

With regards to inputs and outputs, I selected input variables based on the assumption that they may have predictive value in determining whether or not a customer will claim insurance. The output variable is determined by the nature of the task – to build a classifier to predict the status of insurance claims – and the model will be trained to predict its binary status based on the input attributes.

| Input | Output |
| --- | --- |
| AGE, GENDER, RACE, DRIVING_EXPERIENCE, EDUCATION, INCOME, CREDIT_SCORE, HAS_OWN_VEHICLE, VEHICLE_YEAR, IS_MARRIED, HAS_CHILDREN, COUNTRY, ANNUAL_MILEAGE, VEHICLE_TYPE, SPEEDING_VIOLATIONS, DRIVING_UNDER_INFLUENCE, PAST_ACCIDENTS | HAS_CLAIMED_INSURANCE |

These decisions were made with the task goal in mind and will contribute to the building of a predictive model that can effectively classify whether or not a customer will claim car insurance.
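
A minimal pandas sketch of this input/output selection; the file name follows the report, and the column names are assumed to match the CSV headers exactly:

```python
import pandas as pd

# Load the prepared dataset (file name as given in the report)
df = pd.read_csv("dataset_prepared.csv")

# Drop the two columns judged to have no predictive value and
# separate the target variable from the input attributes
X = df.drop(columns=["ID", "POSTAL_CODE", "HAS_CLAIMED_INSURANCE"])
y = df["HAS_CLAIMED_INSURANCE"]
```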

Data Preparation

After loading the dataset into a pandas DataFrame using pandas’ ‘pd.read_csv()’, I performed the following data pre-processing steps (sketched in code after the list):

  1. Dropped rows and columns with missing values using ‘dropna()’.
  2. Filled missing values with their respective median using ‘fillna(df.median())’.
  3. Checked for duplicate rows and removed them using ‘drop_duplicates()’.
  4. Corrected mis-typed entries for the GENDER and EDUCATION variables using ‘replace()’. In GENDER, ‘m’ and ‘f’ were replaced with ‘male’ and ‘female’, and in EDUCATION, ‘hs’ and ‘na’ were replaced with ‘high school’ and ‘none’.
  5. For data transformation and scaling, I one-hot encoded discrete nominal variables with ‘pd.get_dummies()’ to represent them as binary indicators, normalized continuous variables with min-max scaling using scikit-learn’s ‘MinMaxScaler()’, and label encoded discrete variables using ‘LabelEncoder()’.
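
A sketch of these five steps in pandas/scikit-learn; the exact column lists are assumptions for illustration, not taken from the project code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.read_csv("dataset_prepared.csv")

# Steps 1-2: drop fully empty rows/columns (an assumed reading of step 1),
# then fill remaining numeric gaps with the column median
df = df.dropna(how="all").dropna(axis=1, how="all")
df = df.fillna(df.median(numeric_only=True))

# Step 3: remove duplicate records
df = df.drop_duplicates()

# Step 4: correct mis-typed categorical entries
df["GENDER"] = df["GENDER"].replace({"m": "male", "f": "female"})
df["EDUCATION"] = df["EDUCATION"].replace({"hs": "high school", "na": "none"})

# Step 5: one-hot encode nominal variables (assumed subset shown),
# min-max scale continuous ones, and label encode ordered categories
df = pd.get_dummies(df, columns=["GENDER", "RACE", "EDUCATION", "INCOME", "VEHICLE_TYPE"])
df[["CREDIT_SCORE", "ANNUAL_MILEAGE"]] = MinMaxScaler().fit_transform(
    df[["CREDIT_SCORE", "ANNUAL_MILEAGE"]]
)
df["AGE"] = LabelEncoder().fit_transform(df["AGE"])
```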

The AGE variable shows the most obvious changes in its histogram before and after pre-processing. Both histograms use a y-axis scaled from 0 to 3000 in increments of 500.

[Figure: AGE histograms before and after pre-processing]

From the histograms, we can gain the following insights into the effect of data transformation on the variable:

  • Shape of the Distribution: the before and after histograms have a similar shape, with two peaks at ages 40-64 and 26-39, suggesting a bimodal distribution. Ordering the bars from younger to older age groups produces a somewhat bell-shaped histogram, suggesting an approximately normal distribution in the data.
  • Central Tendency: the mode, median, and mean are all higher before pre-processing and lower after.
  • Spread: both histograms show a higher number of customers in the 40-64 and 26-39 age groups, suggesting that customers aged 26-64 are significantly more likely to own a car than those outside this range.
  • Skewness and Kurtosis: both histograms satisfy the condition Mode < Median < Mean, suggesting that the distributions are positively skewed (a quick check is sketched below).
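
A quick way to verify the central-tendency ordering and skewness claims on the label-encoded AGE column (a sketch, assuming the encoding from the preparation step):

```python
age = df["AGE"]  # label-encoded age groups, assumed ordered young to old

print("Mode:  ", age.mode().iloc[0])
print("Median:", age.median())
print("Mean:  ", age.mean())
print("Skew:  ", age.skew())  # a positive value indicates positive (right) skew
```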

Data was split 50/50 using scikit-learn’s ‘train_test_split()’, producing equally sized training and test datasets. The test dataset was held out to evaluate the model’s performance on unseen data.
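
A sketch of the split, reusing the X and y objects from the selection step; stratification and the random seed are assumptions, as the report does not specify them:

```python
from sklearn.model_selection import train_test_split

# test_size=0.5 yields equally sized training and test sets;
# stratify=y keeps the claim/no-claim ratio the same in both halves (assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)
```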

Modelling

I have chosen to build models using Decision Trees, Random Forests, and Support Vector Machines (SVM). The following table lists the hyperparameters per model and the effect of each parameter; a tuning sketch follows the table:

| Model | Hyperparameter | Effect | Validation |
| --- | --- | --- | --- |
| Decision Trees | max_depth | Maximum depth of the tree | k-fold cross-validation |
| Decision Trees | min_samples_split | Minimum number of samples needed to split an internal node | |
| Decision Trees | min_samples_leaf | Minimum number of samples needed to be a leaf node | |
| Random Forests | n_estimators | Number of trees in the forest | Randomized search with cross-validation |
| Random Forests | max_depth | Maximum depth of each tree | |
| Random Forests | min_samples_split | Minimum number of samples needed to split an internal node | |
| Random Forests | min_samples_leaf | Minimum number of samples needed to be a leaf node | |
| Support Vector Machines (SVM) | C | Regularization parameter | Coarse-to-fine search |
| Support Vector Machines (SVM) | kernel | Kernel type used in the algorithm | |
| Support Vector Machines (SVM) | gamma | Kernel coefficient | |
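
A sketch of how the Random Forest search might be set up with scikit-learn; the candidate values mirror the results table below, while the number of iterations, folds, and scoring metric are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Candidate values reflecting the results table below;
# None corresponds to an unrestricted tree depth
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,           # number of sampled combinations (assumed)
    cv=5,                # k assumed for the cross-validation folds
    scoring="accuracy",  # scoring metric assumed
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```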

Summary of Decision Tree results:

[Figure: summary of Decision Tree cross-validation results]

Summary of Random Forest results:

n_estimators min_samples_split min_samples_leaf max_depth mean_test_score std_test_score
100 10 1 0 0.844607654 0.006080036
50 10 2 0 0.841540287 0.00696728
100 5 4 0 0.840925848 0.008252704
100 5 1 10 0.840158742 0.006032502
200 2 2 0 0.839546069 0.007032148
200 10 4 5 0.824359195 0.00974315

Based on these results, the Random Forest model performed best. Running it on all data yields the following metrics:

  • Accuracy: 0.8387
  • Precision: 0.7526
  • Recall: 0.7216
  • F1 score: 0.7367

These metrics further provide information on the model’s accuracy, ability to correctly identify positive cases (precision), ability to capture positive cases (recall), and the overall balance between precision and recall (F1 score).
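
These metrics can be computed directly with scikit-learn; a sketch assuming the fitted search object and split from the earlier steps:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_pred = search.best_estimator_.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1 score: ", f1_score(y_test, y_pred))
```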

Results and Errors

To evaluate how the final model performs, we test it on the test dataset.

The Random Forest model achieved an accuracy of approximately 0.844, meaning that it correctly predicted the target variable 84.4% of the time on the test set.

[Figure: confusion matrix for the Random Forest model on the test set]

The confusion matrix obtained shows that the model produces false positives (120) and false negatives (135) far less often than true positives (375) and true negatives (1000). Depending on the problem at hand, these errors have different implications. For the current task of predicting whether a customer will claim car insurance, false positives lead to unnecessary costs (expenses allocated to customers who would not actually claim) and can hurt customer experience (frustration or inconvenience for customers wrongly labelled as likely claimants), whereas false negatives represent missed opportunities (customers who go on to claim without being flagged) and can affect customer satisfaction (a customer’s potential to be identified as a possible claimant is overlooked). Based on this and my own experience, I would deem false positives more harmful for the insurance company.

The metrics of accuracy, precision, recall, and F1 score, as reported in the previous section, also hold valuable insights into the model:

  • Accuracy is the ratio of correctly predicted samples to the total number of samples; here the model achieved approximately 83.87%.
  • Precision is the ratio of true positive predictions to the total number of positive predictions; here approximately 75.26%.
  • Recall is the ratio of true positive predictions to the total number of actual positive samples; here approximately 72.16%.
  • F1 score is a balanced measure that considers both precision and recall, ranging from 0 to 1, where 1 is the best performance; here approximately 0.7367, indicating a reasonably balanced performance between precision and recall.
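
The counts discussed above can be pulled straight out of the confusion matrix, and the four metrics re-derived from them; a sketch reusing y_test and y_pred from the previous snippet:

```python
from sklearn.metrics import confusion_matrix

# scikit-learn orders the binary confusion matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")

# The same metrics, derived directly from the counts
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```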
