
Financial_Fraud_Detection_ML


📌 Project Overview

This project focuses on building a machine learning pipeline to detect fraudulent financial transactions. It includes comprehensive data preprocessing, exploratory analysis, feature engineering, model training, evaluation, and hyperparameter tuning. Due to the high class imbalance in fraud detection, the project also applies SMOTE (Synthetic Minority Over-sampling Technique) to improve the performance of classification models.


📊 Dataset Summary

  • Rows: 1000+
  • Columns: 20
  • Target Variable: Is Fraudulent (0: Not Fraudulent, 1: Fraudulent)

Features Include:

  • Transaction attributes: Amount, Time of Day, Velocity
  • Customer details: Age, Income, Credit Score
  • Card info: Card Type, Card Limit
  • Merchant data: Merchant Reputation, Location
  • Behavioral traits: Spending Patterns, Online Transactions Frequency

🔍 Exploratory Data Analysis

Sanity Check

  • Verified presence of 19 feature columns and 1 target.
  • Identified null values in a single row, which was dropped.

Class Imbalance

  • 947 non-fraudulent vs. 53 fraudulent transactions (a fraud rate of ~5.3%).
  • This severe imbalance calls for oversampling or another rebalancing strategy.
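This check is easy to reproduce with pandas; the toy frame below uses the reported class counts and the target column name from the dataset summary:

```python
import pandas as pd

# Toy frame with the counts reported above (947 legitimate, 53 fraudulent)
df = pd.DataFrame({"Is Fraudulent": [0] * 947 + [1] * 53})

counts = df["Is Fraudulent"].value_counts()
fraud_rate = counts[1] / len(df)
print(counts.to_dict(), f"fraud rate={fraud_rate:.1%}")  # {0: 947, 1: 53} fraud rate=5.3%
```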

Visual Insights

  • Fraud more prevalent in Prepaid and Credit cards.
  • Higher velocity and amount variations noted in frauds.
  • Age distribution is slightly denser for fraudulent transactions between ages 30–65.
  • Fraud rates vary by Location and Card Type.

Correlation Analysis

  • No strong linear correlation with target (Is Fraudulent).
  • Indicates need for non-linear models or derived features.

🧹 Data Preprocessing

  • Dropped null rows and unnecessary columns.
  • Applied Z-score normalization on numeric features.
  • One-hot encoding for nominal categorical features.
  • Ordinal encoding for ordered categorical variables:
    • Merchant Reputation: Bad → 0, Average → 1, Good → 2
    • Online Transactions Frequency: Low → 0, Medium → 1, High → 2
  • Converted Date to derived features: DayOfWeek, Month, IsWeekend
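The preprocessing steps above can be sketched with pandas and scikit-learn; the column names follow the dataset summary, but the values here are a small toy frame, not the real data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the transaction data
df = pd.DataFrame({
    "Amount": [120.0, 35.5, 980.0],
    "Merchant Reputation": ["Good", "Bad", "Average"],
    "Card Type": ["Credit", "Prepaid", "Debit"],
    "Date": pd.to_datetime(["2023-01-07", "2023-03-15", "2023-06-04"]),
})

# Ordinal encoding for ordered categories (Bad -> 0, Average -> 1, Good -> 2)
df["Merchant Reputation"] = df["Merchant Reputation"].map(
    {"Bad": 0, "Average": 1, "Good": 2})

# One-hot encoding for nominal categories
df = pd.get_dummies(df, columns=["Card Type"])

# Derived date features
df["DayOfWeek"] = df["Date"].dt.dayofweek        # Monday = 0
df["Month"] = df["Date"].dt.month
df["IsWeekend"] = (df["DayOfWeek"] >= 5).astype(int)
df = df.drop(columns=["Date"])

# Z-score normalisation on a numeric feature
df["Amount"] = StandardScaler().fit_transform(df[["Amount"]]).ravel()
```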

🛠️ Feature Selection

  1. Correlation Analysis: no features were dropped, since no highly correlated pairs were found.
  2. Mutual Information:
    • Top features: MCC Category, Location, Spending Patterns, Balance Before Transaction
  3. Recursive Feature Elimination (RFE):
    • Final 10 features selected by importance to a decision tree model.
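The mutual-information and RFE steps can be sketched with scikit-learn. The data below is synthetic (`make_classification`), not the project's dataset, so the scores and selected indices are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 19 features, imbalanced binary target (~5% positives)
X, y = make_classification(n_samples=1000, n_features=19, n_informative=6,
                           weights=[0.95], random_state=42)

# Mutual information between each feature and the target
mi = mutual_info_classif(X, y, random_state=42)
top5 = np.argsort(mi)[::-1][:5]          # indices of the 5 strongest features

# RFE down to the 10 strongest features for a decision tree
rfe = RFE(DecisionTreeClassifier(random_state=42), n_features_to_select=10)
rfe.fit(X, y)
selected = np.flatnonzero(rfe.support_)  # column indices RFE kept
```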

🤖 Model Building

Train-Test Split

  • Used an 80/20 stratified split initially.
  • Also tried 90/10 for tuned model evaluations.

Models Used

  • Logistic Regression
  • Decision Tree Classifier

Baseline Results (without SMOTE)

| Model               | Accuracy | Fraud Recall | Comment                            |
|---------------------|----------|--------------|------------------------------------|
| Logistic Regression | 94.5%    | 0.00         | Completely failed to detect frauds |
| Decision Tree       | 93.5%    | 0.00         | Biased toward the majority class   |
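A baseline run like the one tabulated above can be sketched as follows. The data is synthetic, so the exact scores will differ from those reported, but the pattern (high accuracy, near-zero fraud recall) is typical for imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real transactions
X, y = make_classification(n_samples=1000, n_features=19, weights=[0.95],
                           random_state=0)

# 80/20 stratified split preserves the ~5% fraud rate in both sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)

# High accuracy can hide poor fraud recall on imbalanced data
print(f"accuracy={accuracy_score(y_te, pred):.3f}",
      f"fraud recall={recall_score(y_te, pred):.3f}")
```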

⚖️ Handling Class Imbalance with SMOTE

  • Applied SMOTE to synthetically generate fraud samples.
  • Rebalanced dataset allowed models to detect fraud more effectively.

Post-SMOTE Results (80/20)

| Model               | Accuracy | Fraud Recall | F1 Score |
|---------------------|----------|--------------|----------|
| Logistic Regression | 62.3%    | 63%          | 0.62     |
| Decision Tree       | 88.9%    | 90%          | 0.89     |

🔧 Hyperparameter Tuning

Used GridSearchCV on both models:

Best Parameters

  • Logistic Regression: C=1, solver='lbfgs'
  • Decision Tree: max_depth=None, min_samples_split=5, min_samples_leaf=2
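The grid search can be sketched as below. The parameter grid is an assumption chosen to include the best parameters listed above, and the data is synthetic, so the selected values may differ on the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the (SMOTE-resampled) training set
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# 5-fold cross-validated grid search optimising F1 (sensible for fraud)
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_)
```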

Tuned Results (90/10 split with SMOTE)

| Model               | Accuracy | Fraud Recall | F1 Score |
|---------------------|----------|--------------|----------|
| Logistic Regression | 63.1%    | 66%          | 0.63     |
| Decision Tree       | 88.4%    | 93%          | 0.88     |

📈 Performance Comparison

A grouped bar chart was generated to compare Logistic Regression and Decision Tree models across three scenarios:

  • Without SMOTE (80/20)
  • With SMOTE (80/20)
  • Tuned with SMOTE (90/10)

✅ Final Summary & Recommendations

Key Insights:

  • SMOTE significantly improves fraud detection.
  • Decision Tree outperforms Logistic Regression in all scenarios.
  • Feature engineering and model tuning are essential for imbalanced classification.

Final Recommendation:

Use the Decision Tree for deployment, given its strong fraud detection performance (high recall and F1 score in all scenarios).


🔮 Future Work

  • Apply ensemble models like Random Forest or XGBoost.
  • Explore cost-sensitive learning to further improve fraud recall.
  • Build a real-time fraud detection API.
  • Integrate model monitoring for drift detection in production.

👤 Author: Krunal Patel (https://github.com/Krunalscorp)
