Prudential-Life-Insurance--Predictive-Modeling

The Prudential Life Insurance Assessment dataset is an effort toward predicting risk levels for new and existing customers. Every customer applying for life insurance is assessed and categorized into one of eight risk levels (1 through 8, with 8 being the highest risk) based on family history, health history, and other information recorded during the enrollment process. Done manually, this process takes an average of 30 days, is prone to human error, and hurts the firm's business: many customers lose interest because of the lengthy and tedious process. Prudential therefore wants to make premium quoting time-efficient and less labor-intensive for its customers while maintaining privacy boundaries.

The main objective of this project is to understand how Prudential Life Insurance works and the factors involved in quoting a premium for an existing or potential Prudential customer, and to develop predictive models for identifying the various risk levels. The team worked on a draft version of a model-ready dataset containing 128 variables with the ordinal target variable 'Response' identifying a customer's risk score, which calls for classification models.

We followed the SEMMA approach (Sample, Explore, Modify, Model, and Assess) for the data analysis. We first performed data-cleaning procedures: recoding, testing each variable for missing values, correlation analysis for the continuous variables, and chi-square tests to identify relationships between the categorical variables and the target. This reduced the dimensionality and complexity of the dataset. We eliminated 11 variables that had more than 30% missing data. Using multivariate outlier detection, we identified 24 records as outliers and excluded them from further analysis. With the help of feature engineering and information-gain values for the predictors, we reduced the variable count from 128 to 38. (Illustrative code sketches of these steps appear at the end of this README.)

With a stratified sample of 60% training, 20% validation, and 20% test data, we tried various classification models: Decision Trees, Neural Networks, Ordinal Regression, Ensemble Decision Trees, Ensemble Regression, Ensemble Neural Networks, and Bootstrap Forest. We used Area Under the Curve (AUC), precision, and recall values to gauge model performance. Based on our analysis, Bootstrap Forest had the highest accuracy and sensitivity; it is also the most economical of the models we tried.

Based on the challenges we faced, the following are areas with scope for improvement:

• The dataset's Medical Keywords and Medical History variables are very generic. More detailed information for these variables would pave the way for further feature engineering, which uses data-driven insights to identify variables correlated with the target.

• Variables such as gender and living status of customers were not considered in this study. Including such variables would help the models produce better, more refined results.

• Outliers are not always bad; they can bring useful variation to the data. Keeping this in mind, we did not eliminate outliers wholesale, but removed a few of the multivariate outliers to obtain better model accuracy.
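For readers who want to reproduce the pipeline, the sketches below walk through the main steps in Python. They are illustrative equivalents, not the team's original code (the analysis followed a SEMMA workflow, and "Bootstrap Forest" suggests JMP was used); file and column names such as `train.csv` and `Response` follow the Kaggle competition data and are assumptions here. First, the missing-value screen and the chi-square association tests:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Assumed file name: the Kaggle competition distributes the data as train.csv.
df = pd.read_csv("train.csv")

# Drop the variables with more than 30% missing values, as described above.
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.30].index)

# Chi-square test of independence between each categorical predictor and the
# ordinal target 'Response'. In the Kaggle data the categoricals are
# integer-coded, so this name-based selection is a hypothetical stand-in.
categorical_cols = [c for c in df.columns if c.startswith("Product_Info")]
for col in categorical_cols:
    table = pd.crosstab(df[col], df["Response"])
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"{col}: chi2={chi2:.1f}, p={p:.3g}")
```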
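Next, the multivariate outlier screen. The report does not name the exact method, so as an assumption this sketch uses the Mahalanobis distance on the numeric predictors, continuing from the `df` produced above:

```python
import numpy as np
from scipy.stats import chi2

# Squared Mahalanobis distance of each complete record from the mean.
continuous_cols = df.select_dtypes("number").columns.drop("Response")
X = df[continuous_cols].dropna().to_numpy()

mean = X.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
centered = X - mean
d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

# Flag records whose squared distance exceeds a chi-square cutoff; the
# 99.9th-percentile threshold is an illustrative choice, not the report's.
threshold = chi2.ppf(0.999, df=X.shape[1])
outlier_rows = np.where(d2 > threshold)[0]
print(f"{len(outlier_rows)} candidate outliers")
```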
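The information-gain ranking that cut the predictors from 128 to 38 can be approximated with scikit-learn's mutual information estimator (mutual information and information gain measure the same quantity); the median imputation here is an assumption:

```python
from sklearn.feature_selection import mutual_info_classif
import pandas as pd

# Assumes all remaining predictors are numeric (encode any string-coded
# categoricals first) and that simple median imputation is acceptable.
X = df.select_dtypes("number").drop(columns="Response")
X = X.fillna(X.median())
y = df["Response"]

# Rank predictors by mutual information with the target; keep the top 38,
# matching the count reported above.
mi = mutual_info_classif(X, y, random_state=0)
top38 = pd.Series(mi, index=X.columns).nlargest(38)
df_reduced = df[list(top38.index) + ["Response"]]
```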
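The stratified 60/20/20 split described above can be produced with two chained splits, stratifying on `Response` each time so that all eight risk levels keep their proportions in every partition:

```python
from sklearn.model_selection import train_test_split

X = df_reduced.drop(columns="Response")
y = df_reduced["Response"]

# First split off 60% for training, then halve the remainder into
# validation and test sets (20% each of the original data).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, train_size=0.60, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=0)
```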
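Finally, the winning model. "Bootstrap Forest" is JMP's name for a random forest, so scikit-learn's `RandomForestClassifier` is the closest open-source stand-in; the hyperparameters here are placeholders, not the team's settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, precision_score, recall_score

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Evaluate on the validation set with the same metrics the report used:
# AUC (one-vs-rest for the eight-class target), precision, and recall.
pred = model.predict(X_val)
proba = model.predict_proba(X_val)
print("AUC (one-vs-rest):", roc_auc_score(y_val, proba, multi_class="ovr"))
print("Precision (macro):", precision_score(y_val, pred, average="macro"))
print("Recall (macro):   ", recall_score(y_val, pred, average="macro"))
```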
