This project demonstrates a detailed analysis of credit approval data using machine learning models and data visualization techniques.
- Introduction
- Technologies Used
- Setup and Installation
- Data Preprocessing
- Model Training and Evaluation
- Results and Visualization
- Saving Results
This project analyzes a dataset of credit approvals, leveraging machine learning models to predict credit approval outcomes. The models used include:
- Decision Tree Classifier (with hyperparameter tuning via GridSearchCV)
- Logistic Regression (for comparison)
The dataset was preprocessed to handle missing values and categorical variables. Feature importance and correlations were visualized to gain insights into the data.
- Python 3.x
- Libraries:
  - numpy
  - pandas
  - matplotlib
  - seaborn
  - scikit-learn
  - json (part of the Python standard library)
This project was developed using Google Colab. To replicate the analysis:
- Open Google Colab.
- Upload the notebook file (Credit_Approval.ipynb) to your Colab environment.
- Upload the dataset (crx.data) to the Colab environment.
- Run the notebook cell by cell.
Google Colab comes with most of these dependencies pre-installed, so additional installation is usually not required. If any libraries are missing, install them using:
!pip install library_name
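Once crx.data is uploaded, it could be loaded along the following lines. This is only a sketch: it assumes the file is comma-separated, has no header row, and marks missing values with ?. The column names A1 to A16 and the helper load_credit_data are illustrative and not taken from the notebook, and the two sample rows are made up so the snippet runs without the real file.

```python
import io

import pandas as pd

# Hypothetical column names: A1..A15 for the features, A16 for the target.
COLUMNS = [f"A{i}" for i in range(1, 17)]


def load_credit_data(source):
    """Read a crx.data-style file (path or file-like object) into a DataFrame."""
    # Assumption: no header row, and "?" marks a missing value.
    return pd.read_csv(source, header=None, names=COLUMNS, na_values="?")


# Made-up rows in the assumed layout, used instead of the real file.
sample = io.StringIO(
    "b,30.5,0.0,u,g,w,v,1.25,t,t,1,f,g,200,0,+\n"
    "a,?,4.5,u,g,q,h,3.0,t,f,0,f,g,40,560,-\n"
)
df = load_credit_data(sample)
print(df.isna().sum())  # missing-value count per column
```

In Colab, passing the path "crx.data" instead of the in-memory sample would read the uploaded file.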
The dataset contains both numeric and categorical features. Missing values were handled as follows:
- Numeric columns: Replaced with the mean value.
- Categorical columns: Replaced with the most frequent value and encoded using LabelEncoder.
Features were scaled using StandardScaler. Scaling mainly benefits Logistic Regression; decision trees are insensitive to feature scale.
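The preprocessing steps described above can be sketched roughly as follows. The function name preprocess, the toy DataFrame, and its column names are assumptions made for illustration; the notebook's actual code may be organized differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler


def preprocess(df, target):
    """Impute, encode, and scale a mixed-type DataFrame; returns (X, y)."""
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric column: fill gaps with the column mean.
            df[col] = df[col].fillna(df[col].mean())
        else:
            # Categorical column: fill gaps with the most frequent value,
            # then map each category to an integer code.
            df[col] = df[col].fillna(df[col].mode()[0])
            df[col] = LabelEncoder().fit_transform(df[col])
    y = df[target].to_numpy()
    # Standardize features to zero mean and unit variance.
    X = StandardScaler().fit_transform(df.drop(columns=[target]))
    return X, y


# Toy data standing in for the credit approval dataset.
toy = pd.DataFrame({
    "age": [30.0, np.nan, 50.0, 40.0],
    "job": ["a", "b", None, "a"],
    "approved": ["+", "-", "+", "-"],
})
X, y = preprocess(toy, target="approved")
print(X.shape, y)
```

For brevity this sketch fits the imputation values and the scaler on the full dataset. Fitting them on the training split only and reusing them on the test split avoids leaking test-set statistics into training.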
- Decision Tree Classifier
  - Hyperparameter tuning with GridSearchCV: max_depth, min_samples_split, min_samples_leaf
  - Achieved an accuracy of X.XXXX.
- Logistic Regression
  - Comparison model.
  - Achieved an accuracy of X.XXXX.
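A minimal sketch of this training setup is shown below. The grid values, the 80/20 split, the 5-fold cross-validation, and the synthetic data from make_classification are all assumptions; the text above names only the three tuned parameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed credit data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Assumed candidate values for the three tuned parameters.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)
best_tree = search.best_estimator_  # refit on the full training split

# Logistic regression as the comparison model.
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Decision tree accuracy:", best_tree.score(X_test, y_test))
print("Logistic regression accuracy:", log_reg.score(X_test, y_test))
```

The accuracies printed here come from synthetic data and say nothing about the credit dataset itself.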
Evaluation metrics included:
- Accuracy
- Classification Report
- Confusion Matrix
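The three metrics listed above are available in sklearn.metrics. The short example below uses hand-written labels (not project data) to show how they are computed.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-written labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print("Accuracy:", acc)
print(classification_report(y_true, y_pred))  # precision, recall, F1 per class
print(cm)
```

Six of the eight predictions match, so the accuracy is 0.75, and the confusion matrix has 3 correct and 1 incorrect prediction in each row.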
- Decision Tree Visualization
  - A graphical representation of the optimized decision tree is provided.
- Feature Importance
  - Features were ranked based on their importance in the decision tree model.
- Correlation Matrix
  - Highlighted relationships between features.
- Histograms
  - Showed the distribution of numeric features, categorized by the target class.
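Three of these plots could be produced roughly as sketched below. The data is synthetic, the feature names f0 to f4 are placeholders, and the correlation heatmap is drawn with plain matplotlib to keep the example short; the project lists seaborn, whose heatmap function is a common alternative.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
names = [f"f{i}" for i in range(X.shape[1])]
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# 1. Structure of the fitted tree.
plot_tree(tree, feature_names=names, filled=True, ax=axes[0])
axes[0].set_title("Decision tree")

# 2. Feature importances, sorted ascending so the largest bar is on top.
importances = pd.Series(tree.feature_importances_, index=names).sort_values()
importances.plot.barh(ax=axes[1], title="Feature importance")

# 3. Pairwise correlations between features.
corr = pd.DataFrame(X, columns=names).corr()
im = axes[2].imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
axes[2].set_xticks(range(len(names)))
axes[2].set_xticklabels(names)
axes[2].set_yticks(range(len(names)))
axes[2].set_yticklabels(names)
axes[2].set_title("Correlation matrix")
fig.colorbar(im, ax=axes[2])

fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, fig.savefig or plt.show would be needed.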
Key results were saved to a JSON file (resultados_analise_credito.json) for easy sharing and further analysis. These include:
- Model accuracies
- Best hyperparameters for the Decision Tree
- Feature importances
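Saving such results with the json module might look like the sketch below. The dictionary keys and all values are placeholders chosen for illustration, not the project's actual output. The to_builtin helper is a hypothetical addition that converts NumPy values, which scikit-learn often returns, into types json can serialize.

```python
import json

# Placeholder values for illustration only; not results from this project.
results = {
    "accuracies": {"decision_tree": 0.0, "logistic_regression": 0.0},
    "best_params": {"max_depth": None, "min_samples_split": 2, "min_samples_leaf": 1},
    "feature_importances": {"A1": 0.0, "A2": 0.0},
}


def to_builtin(value):
    """Fallback for json.dump: convert NumPy scalars and arrays to Python types."""
    if hasattr(value, "tolist"):
        return value.tolist()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


with open("resultados_analise_credito.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, default=to_builtin)

# Read the file back to confirm it round-trips.
with open("resultados_analise_credito.json", encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded["best_params"])
```

The default hook is only called for objects json cannot handle on its own, so plain Python numbers and strings pass through unchanged.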