This project focuses on building a machine learning pipeline to detect fraudulent financial transactions. It includes comprehensive data preprocessing, exploratory analysis, feature engineering, model training, evaluation, and hyperparameter tuning. Due to the high class imbalance in fraud detection, the project also applies SMOTE (Synthetic Minority Over-sampling Technique) to improve the performance of classification models.
- Rows: 1000+
- Columns: 20
- Target Variable: `Is Fraudulent` (0: Not Fraudulent, 1: Fraudulent)
- Transaction attributes: Amount, Time of Day, Velocity
- Customer details: Age, Income, Credit Score
- Card info: Card Type, Card Limit
- Merchant data: Merchant Reputation, Location
- Behavioral traits: Spending Patterns, Online Transactions Frequency
- Verified presence of 19 feature columns and 1 target.
- Identified null values in a single row, which was dropped.
- 947 non-fraudulent vs 53 fraudulent transactions.
- Severe imbalance demands oversampling.
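A minimal pandas sketch of these checks; the file name `fraud_transactions.csv` is an assumption, and the target column is taken to be `Is Fraudulent` as described above:

```python
import pandas as pd

# Load the dataset (file name is an assumption; adjust to the actual path)
df = pd.read_csv("fraud_transactions.csv")

# Missing values per column; the single affected row is dropped
print(df.isnull().sum())
df = df.dropna()

# Class distribution of the target (expected: 947 non-fraud vs 53 fraud)
print(df["Is Fraudulent"].value_counts())
```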
- Fraud more prevalent in Prepaid and Credit cards.
- Higher velocity and larger amount variations noted in fraudulent transactions.
- Age distribution is slightly denser for fraudulent transactions between 30 and 65.
- Fraud rates vary by Location and Card Type.
- No strong linear correlation between the features and the target (`Is Fraudulent`).
- Indicates the need for non-linear models or derived features.
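These observations can be reproduced with simple pandas aggregations; a sketch, assuming the column names match those listed in the dataset section:

```python
# Fraud rate by card type and by location
print(df.groupby("Card Type")["Is Fraudulent"].mean().sort_values(ascending=False))
print(df.groupby("Location")["Is Fraudulent"].mean().sort_values(ascending=False))

# Pearson correlation of numeric features with the target
print(df.corr(numeric_only=True)["Is Fraudulent"].sort_values())
```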
- Dropped null rows and unnecessary columns.
- Applied Z-score normalization on numeric features.
- One-hot encoding for nominal categorical features.
- Ordinal encoding for ordered categorical variables:
  - `Merchant Reputation`: Bad → 0, Average → 1, Good → 2
  - `Online Transactions Frequency`: Low → 0, Medium → 1, High → 2
- Converted `Date` into derived features: `DayOfWeek`, `Month`, `IsWeekend`
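A sketch of these preprocessing steps; the lists of numeric and nominal columns are illustrative, and in practice the scaler should be fit on the training split only to avoid leakage:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Ordinal encoding with the mappings described above
df["Merchant Reputation"] = df["Merchant Reputation"].map({"Bad": 0, "Average": 1, "Good": 2})
df["Online Transactions Frequency"] = df["Online Transactions Frequency"].map({"Low": 0, "Medium": 1, "High": 2})

# Derive calendar features from Date, then drop the raw column
df["Date"] = pd.to_datetime(df["Date"])
df["DayOfWeek"] = df["Date"].dt.dayofweek
df["Month"] = df["Date"].dt.month
df["IsWeekend"] = (df["DayOfWeek"] >= 5).astype(int)
df = df.drop(columns=["Date"])

# One-hot encode nominal categoricals (column list is an assumption)
df = pd.get_dummies(df, columns=["Card Type", "Location", "MCC Category"], drop_first=True)

# Z-score normalization of numeric features (column list is an assumption)
numeric_cols = ["Amount", "Velocity", "Age", "Income", "Credit Score", "Card Limit"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```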
- Correlation Analysis: No features dropped due to lack of high correlations.
- Mutual Information:
  - Top features: `MCC Category`, `Location`, `Spending Patterns`, `Balance Before Transaction`
- Recursive Feature Elimination (RFE):
  - Final 10 features selected based on importance to a decision tree model.
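A sketch of the two selection techniques with scikit-learn; the estimator and `random_state` values are assumptions:

```python
import pandas as pd
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

X = df.drop(columns=["Is Fraudulent"])
y = df["Is Fraudulent"]

# Mutual information between each (encoded) feature and the target
mi = pd.Series(mutual_info_classif(X, y, random_state=42), index=X.columns)
print(mi.sort_values(ascending=False).head(10))

# RFE keeping the 10 most important features for a decision tree
rfe = RFE(estimator=DecisionTreeClassifier(random_state=42), n_features_to_select=10)
rfe.fit(X, y)
print(X.columns[rfe.support_].tolist())
```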
- Used an 80/20 stratified split initially.
- Also tried 90/10 for tuned model evaluations.
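A sketch of the 80/20 split; `stratify=y` keeps the fraud ratio identical in both splits:

```python
from sklearn.model_selection import train_test_split

# X and y as prepared above; use test_size=0.1 for the 90/10 variant
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```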
- Logistic Regression
- Decision Tree Classifier
| Model | Accuracy | Fraud Recall | Comment |
|---|---|---|---|
| Logistic Regression | 94.5% | 0.00 | Completely failed to detect frauds |
| Decision Tree | 93.5% | 0.00 | Biased toward majority class |
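These baseline numbers can be reproduced roughly as follows, using scikit-learn defaults (exact settings are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

for name, model in [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Decision Tree", DecisionTreeClassifier(random_state=42)),
]:
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test), digits=3))
```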
- Applied SMOTE to synthetically generate fraud samples.
- Rebalanced dataset allowed models to detect fraud more effectively.
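A minimal sketch using the imbalanced-learn implementation of SMOTE, assuming oversampling is applied to the training split only so the test set remains untouched:

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Classes are now balanced in the resampled training set
print(y_train_res.value_counts())
```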
| Model | Accuracy | Fraud Recall | F1 Score |
|---|---|---|---|
| Logistic Regression | 62.3% | 0.63 | 0.62 |
| Decision Tree | 88.9% | 0.90 | 0.89 |
Used GridSearchCV on both models; best parameters found:
- Logistic Regression: `C=1`, `solver='lbfgs'`
- Decision Tree: `max_depth=None`, `min_samples_split=5`, `min_samples_leaf=2`
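A sketch of the search; the parameter grids and `f1` scoring shown here are assumptions that merely bracket the reported best values:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

lr_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10], "solver": ["lbfgs", "liblinear"]},
    scoring="f1", cv=5,
)
dt_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={
        "max_depth": [None, 5, 10],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    scoring="f1", cv=5,
)

# Fit on the SMOTE-resampled training data, then inspect the best parameters
lr_search.fit(X_train_res, y_train_res)
dt_search.fit(X_train_res, y_train_res)
print(lr_search.best_params_, dt_search.best_params_)
```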
| Model | Accuracy | Fraud Recall | F1 Score |
|---|---|---|---|
| Logistic Regression | 63.1% | 0.66 | 0.63 |
| Decision Tree | 88.4% | 0.93 | 0.88 |
A grouped bar chart was generated to compare Logistic Regression and Decision Tree models across three scenarios:
- Without SMOTE (80/20)
- With SMOTE (80/20)
- Tuned with SMOTE (90/10)
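A matplotlib sketch of such a chart, plotting fraud recall from the tables above (the original figure may also have compared accuracy or F1):

```python
import numpy as np
import matplotlib.pyplot as plt

scenarios = ["No SMOTE (80/20)", "SMOTE (80/20)", "Tuned + SMOTE (90/10)"]
lr_recall = [0.00, 0.63, 0.66]
dt_recall = [0.00, 0.90, 0.93]

x = np.arange(len(scenarios))
width = 0.35
plt.bar(x - width / 2, lr_recall, width, label="Logistic Regression")
plt.bar(x + width / 2, dt_recall, width, label="Decision Tree")
plt.xticks(x, scenarios)
plt.ylabel("Fraud Recall")
plt.title("Fraud recall by model and scenario")
plt.legend()
plt.tight_layout()
plt.show()
```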
- SMOTE significantly improves fraud detection.
- Decision Tree outperforms Logistic Regression in all scenarios.
- Feature engineering and model tuning are essential for imbalanced classification.
Use the Decision Tree for deployment, given its robust fraud detection capability (high fraud recall and F1 score).
- Apply ensemble models like Random Forest or XGBoost.
- Explore cost-sensitive learning to further improve fraud recall.
- Build a real-time fraud detection API.
- Integrate model monitoring for drift detection in production.
👤 Author: [Krunal Patel](https://github.com/Krunalscorp)