Skip to content

333Kenji/Fraud-Detection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Fraud Detection With Isolation Forest And Other Common Classifiers

In this project, I explore the implementation and evaluation of several algorithms used for fraud detection. My first and primary model will be created using the isolation forest algorithm which makes use of simple intuition. Used as an ensemble method, it randomly selects a feature to serve as the root of each decision tree and further splits the data based on random values taken from those features. By doing this a norm can be ascertained by finding the average depth required to isolate each point of data and computing an average depth. And since anonymous data does not trend with normal data, these transactions can be detected because their depths are outside of that norm.

Data

alt text

The data for this project is from Kaggle and contains 170885 rows of anonymized data represented by 28 features anonymized as floating-point values along with 'Time', 'Amount', and 'Class'. During EDA I won't find much of interest in 'Time' so that will be dropped, 'Amount' is simply the dollar amount associated with each transaction, and 'Class' is divided into the 360 labelled 1 for fraud, and the rest labelled as 0 for normal.

alt text

Nothing significant in time so I'll drop it and focus on dollar amounts..

alt text

'Amount' appears to be a strong indicator for fraud since the entire range of fraudulent transactions is almost perfectly 10 times smaller than normal transactions. Seeing as this is highly unlikely, I surmise that the dataset is either manufactured or trimmed to these dimensions. In any case, fraudulent transactions are generally much smaller than legitimate ones.

alt text Examining the correlation matrix with regard to data labelled as class 1 we can see that amount carries a bit more variance and some of the strongest correlations as well.

alt text

The data is also extremely imbalanced, with only 360 of 170k, or 0.21%, of transactions being labeled as fraud. This means I'll not be able to rely on accuracy scores to evaluate my models, instead, I'll monitor recall and precision. Also, there are no fraud loss/handling costs provided so finding a tradeoff won't need to be considered here.

alt text Principal component analysis also illustrates outliers.

Models

I chose to start my analysis with the Isolation Forest algorithm which is especially good at finding outliers. As an ensemble method, it randomly selects a feature for each tree's root and also selects a random value along with the range of those features values to split upon. The intuition follows that an outlier to the data will require a much shorter path to its isolation, i.e. a pure leaf. The difference between the length of this path and the average depth required by normal data is then computed as the score. To compare and explore evaluations I also created linear regression and random forest models, which performed much better than isolation forest. Something very curious about this algorithm is that it reads in the h2o data frame format instead of a pandas DataFrame.

Findings

I suspect that a major reason why the random forest and regression models outperformed the isolation forest is because of that h2o data frame format. A way to test this would be to use that library's implementations and data type.

alt text

Isolation Forest

alt text

Random Forest

alt text

Logistic Regression

Still, this was a great opportunity to use ROC and AUROC as evaluation metrics.

alt text

In particular, AUROC can handle highly imbalanced data, and the higher the AUROC score the better, with a score of one being perfect and .5 being the worst. Additionally, I use the Area Under Recall Curve Precision Score (AUCPR) although that particular metric still isn't as capable as the AUC in dealing with imbalanced data.

About

Implementation and Evaluation of Several Classification Algorithms Used For Fraud Detection

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published