This project is a comprehensive machine learning pipeline designed for classifying colorectal polyp images. It incorporates data augmentation, feature extraction using DenseNet121, and classification with XGBoost. The goal is to achieve high accuracy and robustness in the classification task, even with imbalanced datasets.
- Overview
- Project Structure
- Installation
- Data Preprocessing
- Feature Extraction
- Model Training
- Evaluation
- Results
- Usage
- Acknowledgments
Colorectal polyps are abnormal growths in the colon that can potentially lead to cancer. Early detection and classification of these polyps are critical for effective treatment. This project employs a combination of deep learning and traditional machine learning techniques to classify colorectal polyp images into two classes.
Key Features:
- Data Augmentation: Ensures a balanced dataset by generating additional images.
- Feature Extraction: Uses DenseNet121 for feature extraction.
- Classification: Implements XGBoost with hyperparameter tuning and class balancing.
.
├── data2/ # Original dataset
├── augmented_data/ # Augmented dataset
├── scripts/ # Python scripts for preprocessing and training
├── xgboost_colorectal_polyp_classifier_augmented.pkl # Saved XGBoost model
└── README.md # Project documentation
- Python 3.7+
- TensorFlow 2.x
- NumPy
- Matplotlib
- XGBoost
- scikit-learn
- imbalanced-learn
- Clone this repository:
  git clone https://github.com/your-repo/colorectal-polyp-classification.git
  cd colorectal-polyp-classification
- Install the required dependencies:
  pip install -r requirements.txt
- Place your dataset in the data2/ directory.
The dataset is augmented to ensure class balance and improve model generalization. Augmentation techniques include:
- Rotation
- Width and height shifting
- Shearing
- Zooming
- Horizontal and vertical flipping
- Brightness adjustment
To augment the dataset:
augment_data(original_data_dir, augmented_data_dir, target_count=4500)
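As a rough illustration of what target_count=4500 implies, the sketch below computes how many extra images each class would need to reach the target. The helper name images_to_generate and the example class names and counts are hypothetical; the actual behaviour of augment_data is defined in the project scripts and may differ.

```python
def images_to_generate(class_counts, target_count=4500):
    """Return how many augmented images each class needs to reach target_count.

    Classes already at or above the target need none.
    """
    return {
        name: max(target_count - count, 0)
        for name, count in class_counts.items()
    }


# Hypothetical class sizes, for illustration only.
example_counts = {"class_a": 1200, "class_b": 3900}
needed = images_to_generate(example_counts)
print(needed)  # {'class_a': 3300, 'class_b': 600}
```

Generating more images for the smaller class is what brings the two classes to the same size.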
The dataset is split into training, validation, and test sets using ImageDataGenerator.
Features are extracted from the augmented dataset using DenseNet121, a pre-trained convolutional neural network:
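import tensorflow as tf

# Assumed setup: ImageNet-pretrained DenseNet121 without its classification head
base_model = tf.keras.applications.DenseNet121(weights="imagenet", include_top=False)
feature_extractor = tf.keras.Model(
    inputs=base_model.input,
    outputs=tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
)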
Extracted features are then used as input for the classifier.
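To make the pooling step concrete, here is a small NumPy sketch of what global average pooling does to a batch of feature maps: it averages over the two spatial axes, leaving one value per channel. The 7x7x1024 shape is what DenseNet121 produces when given 224x224 inputs, which is an assumed input size. The random array stands in for real activations, and the function name is illustrative.

```python
import numpy as np


def global_average_pool(feature_maps):
    """Average a (batch, height, width, channels) array over height and width."""
    return feature_maps.mean(axis=(1, 2))


# Stand-in for DenseNet121 activations on two images (random values, not real features).
rng = np.random.default_rng(0)
fake_maps = rng.random((2, 7, 7, 1024))

features = global_average_pool(fake_maps)
print(features.shape)  # (2, 1024)
```

Each image is therefore reduced to a single 1024-dimensional vector, which is the kind of fixed-length tabular input a tree-based classifier expects.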
The XGBoost classifier is trained with the extracted features. To address class imbalance, SMOTE (Synthetic Minority Oversampling Technique) is applied to the training data.
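The core idea of SMOTE can be sketched in a few lines of NumPy: each synthetic sample lies on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the imbalanced-learn implementation the project depends on; the function name, the parameter k, and the toy data are assumptions.

```python
import numpy as np


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances between minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # sample to start from
    pick = rng.integers(0, k, size=n_new)   # which neighbour to move toward
    partner = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[partner] - minority[base])


# Toy minority class with 4 samples and 2 features.
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(toy_minority, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Resampling of this kind should be applied to the training split only, so that validation and test scores reflect real images.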
Key hyperparameters for XGBoost:
- learning_rate=0.03
- max_depth=10
- n_estimators=300
- scale_pos_weight for class balancing
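For scale_pos_weight, the XGBoost documentation suggests the ratio of negative to positive training samples as a typical starting value. The sketch below computes that ratio and collects the hyperparameters listed above in a dictionary. The helper name and the toy labels are hypothetical, and the project's train_xgboost may set the value differently. If SMOTE has already balanced the training set, this ratio comes out at 1.0.

```python
def pos_weight_from_labels(labels):
    """Return (# negative) / (# positive) for binary labels encoded as 0 and 1."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("labels contain no positive samples")
    return negatives / positives


# Toy labels for illustration: 6 negatives, 2 positives.
toy_labels = [0, 0, 0, 1, 0, 0, 1, 0]

xgb_params = {
    "learning_rate": 0.03,
    "max_depth": 10,
    "n_estimators": 300,
    "scale_pos_weight": pos_weight_from_labels(toy_labels),
}
print(xgb_params["scale_pos_weight"])  # 3.0
```

These keys match keyword arguments of xgboost.XGBClassifier, so the dictionary could be passed as XGBClassifier(**xgb_params).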
Training script:
xgb_model = train_xgboost(train_features_resampled, train_labels_resampled)
The model is evaluated using the following metrics:
- Accuracy
- Precision
- Recall (Sensitivity)
- Specificity
- F1 Score
- AUC-ROC
Example:
evaluate_model(xgb_model, val_features, val_labels)
- Accuracy: 0.94
- Precision: 0.94
- Recall: 0.94
- Specificity: 0.93
- F1 Score: 0.94
- AUC: 0.98
- Confusion Matrix:
[[630  44]
 [ 47 720]]
- Accuracy: 0.92
- Precision: 0.94
- Recall: 0.92
- Specificity: 0.93
- F1 Score: 0.93
- AUC: 0.98
- Confusion Matrix:
[[383  30]
 [ 39 449]]
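The threshold-based metrics above can be recomputed from the two confusion matrices as a quick consistency check. The sketch assumes the scikit-learn layout (rows are true classes, columns are predicted classes, class 1 is the positive class); that layout and the helper name are assumptions. AUC is left out because it cannot be derived from a confusion matrix.

```python
def metrics_from_confusion(matrix):
    """Compute binary metrics from a matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


first = metrics_from_confusion([[630, 44], [47, 720]])
second = metrics_from_confusion([[383, 30], [39, 449]])
for name in first:
    print(f"{name}: {first[name]:.2f} / {second[name]:.2f}")
```

Under that layout assumption, both matrices reproduce the reported accuracy, precision, recall, specificity, and F1 values to two decimal places.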
- Train the model:
  python train.py
- Evaluate the model:
  python evaluate.py
- Use the saved model for inference:
  from joblib import load
  model = load("xgboost_colorectal_polyp_classifier_augmented.pkl")
This project builds on the DenseNet121 architecture and XGBoost. Special thanks to the open-source community for providing the tools and frameworks that made it possible.