Skip to content

ohmatheus/Machine-Learning-study-BankClientScoring

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Machine learning study-BankClientScoring

Machine Learning Projet from Kaggle - application of several machine learning techniques to predict the capacity of clients to repay a loan at time.

Table of Contents
  1. About This Project
  2. Roadmap
  3. UNet
  4. Unet Presentation
  5. Contact
  6. references

About This Project

This project is from a competition in Kaggle called 'Home Credit Default Risk' where the goal is to predict how capable each applicant is of repaying a loan: Link to Kaggle Competition

The database used for this project is quite big, and uses several SQL table. Which can be represented by this image :

Our goal is here to identify and/or create variables that will permit us to build and train appropriate machine mearning models so we can predict as much as possible if a client is able to repay a load in time, or not.

Throughout this entiere process, we will use the ROC AUC metrics to compare models, as it is the same metric used by Kaggle for this competition. As a disclaimer, we reached 0.78613 on the ROC Auc for this project, best results are arount 0.805 last time i checked, wich leave us some room for many improvements.

EDA

The database is built upon raw data, meaning we have to clean it. If you want more details you can see the first notebook, but as part of the cleaning we :

  • Took care of missing values by removing them or imputation
  • Encoded (oneHot or Label encoding) some of the categorical strings
  • Took care of anomalies

After that we end up with a total ~350K clients (labelised) and 239 features usable for trainning. We split test/train to something arount 0.2/0.8

Reguarding the proportion of clients who repaid in time and those who didn't, we have :

Meaning we will need to adapt our algorithm to fit those unbalanced data.

Basic correlations

A simple correlation table (check 1st notebook for more infos) reveal that the 5 first features positively correlated to our TARGET are :

  • DAYS_BIRTH : "Client's age in days at the time of application" (Pearson 0.078)
  • DAYS_EMPLOYED : "How many days before the application the person started current employment" (Pearson 0.075)
  • REGION_RATING_CLIENT_W_CITY : "Our rating of the region where client lives with taking city into account (1,2,3)" (Pearson 0.061)
  • REGION_RATING_CLIENT : Our rating of the region where client lives (1,2,3) (Pearson 0.059)
  • NAME_EDUCATION_TYPE (encoded) : "Level of highest education the client achieved" (Pearson 0.054)

And the 5 most negatively correlated features are :

  • EXT_SOURCE_3 : "Normalized score from external data source"
  • EXT_SOURCE_2 : "Normalized score from external data source"
  • EXT_SOURCE_1 : "Normalized score from external data source"
  • AMT_GOODS_PRICE : For consumer loans it is the price of the goods for which the loan is given
  • NAME_EDUCATION_TYPE (encoded) : "Level of highest education the client achieved"

`EXT_SOURCE_X' 1 2 and 3 are data provided by the Home Credit, and we have no intel on their meaning.

DAYS_EMPLOYED

Distribution:

Distribution compared to target:

We clearly have a pattern here. People employed since less time seems to have a lower probability to repay the loan at time.

DAYS_BIRTH

Younger people seems to have a lower chance to repay the loan at time.

NAME_EDUCATION_TYPE

In the same way, people with a higher degree seems to have more chance to repay their loan.

EXT_SOURCE (1, 2, 3)

Let's do a correlation heatmap for those variable:

There is a negative correlation between our Target and all 3 EXT_SOURCE. Plus high correlation between DAYS_BIRTH and EXT_SOURCE_1

Here is the distribution of all EXT_Source_X compared to our Target:

We can see some sort of correlation between those features and the Target.

Feature Engineering

Now we need to create mores variables/features to see if we can get more correlated variable to our Target.

We can :

  • Use PolynomialFeatures from scikit-learn on all Ext_Source_X features -> created some interesting features added to our dataset
  • Create some other features using common sense :
    • CREDIT_INCOME_PERCENT : % of credit cost in comparision of revenus
    • ANNUITY_INCOME_PERCENT : % of yearly credit cost in comparision of revenus
    • CREDIT_TERM : credit time (in month)
    • DAYS_EMPLOYED_PERCENT : ratio between DAYS_EMPLOYED and DAYS_BIRTH

Nothing particularly special, but we still add them into our dataset.

Feature Aggregation, numeric and categorical

We aggregate the features present in all the tables relative to our Target to get the maximum values out of our database. After numerical and categorical aggregation (see 2nd notebook for info),

Note

There is some automatic feature engineering librairies like featuretools that we could use, but to stay simple for now we wont. We also can check different operation on some variables, like derivation (acceleration), to see if it could gives us more correlated features.

Feature Selection

After this engineering we end up with more than 1760 features, we need to reduce that number:

  • Cleaning again missing values, eventual duplicates, and ID feaures
  • removeing colinear variables (>850)

And we are going to select only the features that represents the first 95% of the cumulative importance relative to Target with LightGBM (lightGradientBoosting) :

It's interesting to see that in those feature, we can see the 3 EXT_SOURCE_X features that we already worked with, but also some feature that has been created with aggregation like 'burea u_DAYS_CREDIT_max' wich can be translated as 'the maximum number of days between each credit'

After selecting this 95% of cumulative importance, we end up with ~350 featuers, wich is what we need to train some models.

Note

ACP could have been used for dimentionnality reduction, but not optimal for features explanation ICA Manifold Learning ?

Modeling

That we get to the interesting stuff. We have ~350 features to predict a probability, between 0 and 1. We need to find the model that fit our needs.

Model Selection

Because we are in a unbalanced dataset, we perform an over-sampling using SMOTE, see details in notebook 3.

Then we do a first grid search on 3 model type : LGB, logistic regression and random forest, with some basic arguments for each of them and selecting best cadidates to finally compare those 3 models:

It seems in our case that LGB is the best candidate for our needs. We will now try to optimize this model for our data.

Note

There is a lot of models that could be tested, but because my computationnal resources are limited, i just kept it simple while trying to do things right.

Model Tuning

Throughout this entire process, early stopping is used with LGBM so we don't have to directly deal with the tree number argument. We do a first Baseline using cross validation, giving us a ROC auc of 0.77646 on our test set (std 0.00563)

To search for optimal argument, we do first a random search (check notebook 3 for details), still using SMOTE, trying a lot of differents arguments combination:

After that we perform another grid search, for a more detailed selection of arguments.

At the end we have a ROC AUC of 0.78520, wich is pretty good !

Threashold

Now that we have a model, we can't just say that prediction are split at the 0.5 values, we need a threshold to split our probabilities from 0->1 to a raw classification 0 or 1, the calculated theashold is 0.21:

Feature Explanation

We can simply use the feature importance from LGBM :

But we can also use shap to explain why a client is predicted as good or bad :

Here is an exemple of a good client :

Here is an exemple of a bad client :

Conclusion

We now have a LGB model trainned on our data, which scores 0.78613 on the ROC Auc on the Kaggle web site. We can also explain why a client is predicted as good or bad.

About

Classic Machine learning predictions + Feature Engi

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published