A comprehensive diagnostic pipeline that uses symptom data to predict diseases using state-of-the-art machine learning models like SVM and XGBoost.

tanzealist/PredictMD-SymptomDiseasePredictor


PredictMD: Symptom-Based Disease Predictor

Overview

This repository hosts a machine learning project aimed at predicting diseases from a set of symptoms. The project applies several classification techniques, such as Decision Tree, Random Forest, SVM, and XGBoost, to a dataset of symptoms and their corresponding diseases.

Dataset

The dataset contains 132 symptoms as features and a target variable for prognosis, mapping to 42 different diseases. It is split into two CSV files: one for training the models and the other for testing their performance. The features underwent preprocessing and feature selection to identify those most relevant for disease classification.
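A minimal sketch of loading and label-encoding such a dataset. The CSV file names and column names below are hypothetical stand-ins (the real data has 132 symptom columns and a `prognosis` target); a tiny synthetic frame is used so the snippet runs on its own.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# In the repository the data would be read from the two CSV files, e.g.
# train_df = pd.read_csv("Training.csv")   # hypothetical file name
# A tiny synthetic frame stands in for the real 132-symptom dataset here.
train_df = pd.DataFrame({
    "itching":   [1, 0, 1, 0],
    "skin_rash": [1, 1, 0, 0],
    "vomiting":  [0, 1, 0, 1],
    "prognosis": ["Fungal Infection", "GERD", "Fungal Infection", "GERD"],
})

X_train = train_df.drop(columns=["prognosis"])
y_train = train_df["prognosis"]

# Encode the disease names as integer class labels (42 classes in the real data).
le = LabelEncoder()
y_enc = le.fit_transform(y_train)
print(list(le.classes_), list(y_enc))
```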

Top 5 Features for Each Target Variable

(figure: top five contributing features per prognosis class)

Number of Target Classes

(figure: distribution of the target classes)

Data Insights

Top Features for Disease Prediction

| Target Label | Top 5 Contributing Features |
| --- | --- |
| Chronic Cholestasis | malaise, chest_pain, excessive_hunger, dizziness, blurred_and_distorted_vision |
| Drug Reaction | irritability, muscle_pain, loss_of_balance, swelling_joints, stiff_neck |
| Fungal Infection | vomiting, chills, skin_rash, joint_pain, itching |
| GERD | nausea, loss_of_appetite, abdominal_pain, yellowing_of_eyes, yellowish_skin |
| Peptic Ulcer Disease | family_history, painful_walking, red_sore_around_nose, stomach_bleeding, coma |
| Allergy | fatigue, high_fever, headache, sweating, cough |

Methodology

  1. Data Preprocessing: The raw data was cleaned and preprocessed for analysis, including handling missing values, normalizing features, and encoding categorical variables.
  2. Exploratory Data Analysis (EDA): Univariate and multivariate analyses were performed to understand the relationships between features and the prognosis.
  3. Feature Selection: Recursive Feature Elimination (RFE) was utilized to reduce the number of features, focusing on those most impactful for predicting the outcome.
  4. Model Training: The models were trained on the training dataset, using cross-validation techniques to ensure robustness.
  5. Model Evaluation: The trained models were evaluated on a separate testing dataset. The performance metrics include accuracy, precision, recall, and F1-score.
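Steps 3 and 4 above can be sketched as follows. This is an illustrative outline, not the repository's exact code: the dataset is a synthetic stand-in and the feature count is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the symptom matrix (the real one has 132 features).
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           random_state=0)

# Step 3: Recursive Feature Elimination drops the least useful features.
selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=8)
X_sel = selector.fit_transform(X, y)

# Step 4: cross-validated training on the reduced feature set.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_sel, y, cv=5)
print(X_sel.shape, round(scores.mean(), 3))
```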

Models Implemented

  • Decision Tree
  • Random Forest
  • Support Vector Machine (SVM)
  • XGBoost
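The four models above map directly onto scikit-learn and xgboost estimators. A possible setup (hyperparameters here are defaults, not necessarily those used in the repository):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM (linear)": SVC(kernel="linear", random_state=42),
}

try:  # xgboost is an extra dependency and may not be installed
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(random_state=42)
except ImportError:
    pass

# Fit each model on a synthetic stand-in dataset and report training accuracy.
X, y = make_classification(n_samples=150, n_features=10, random_state=0)
train_acc = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
print(train_acc)
```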

Requirements

This project uses the following Python libraries:

  • Collections
  • Matplotlib
  • NumPy
  • Pandas
  • Seaborn
  • Scikit-learn
  • Warnings
  • XGBoost

Results and Comparison

The models were compared on their performance metrics and on feature importances as assessed by mutual information:
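The mutual-information ranking can be sketched as below. Column names are hypothetical; `mutual_info_classif` scores each feature's dependence on the prognosis label, and the top-scoring features are the ones reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in: 10 "symptom" columns, a few of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)
X = pd.DataFrame(X, columns=[f"symptom_{i}" for i in range(10)])

# Mutual information between each feature and the prognosis label.
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(mi.sort_values(ascending=False).head(5))
```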

Conclusions

  • Model Performance and Comparison: All models, namely Decision Tree (DT), Random Forest (RF), SVM Linear, and XGBoost, have an almost identical accuracy score on the test data, indicating that they perform equally well in predicting the target variable.


  • Overfitting: DT, RF, and SVM Linear show low overfitting scores and are not overfitting, while XGBoost has an overfitting score of 3.086 and is severely overfitting.

  • Training Accuracy: All models achieved a training accuracy of 1.0, indicating perfect fitting to the training data. However, this may not necessarily translate to performance on unseen data.

  • Model Complexity: The DT model is the simplest, with RF and XGBoost being more complex. SVM Linear has intermediate complexity. Balancing model complexity and performance is crucial to prevent overfitting and ensure good generalization.

  • Model Selection: Although all models perform similarly on the given data, XGBoost's overfitting indicates it may not generalize well. Therefore, careful evaluation using appropriate metrics is important before selecting the final model.
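The README does not define its "overfitting score"; one common proxy is the gap between training and test accuracy, sketched below on a synthetic stand-in dataset (an unpruned decision tree typically fits the training data perfectly, making the gap visible).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

clf = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)

# Gap between training and test accuracy: a larger gap means more overfitting.
gap = train_acc - test_acc
print(round(train_acc, 3), round(test_acc, 3), round(gap, 3))
```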

Future Work

  • Further hyperparameter tuning could improve model performance.
  • Investigating additional features and engineering new ones may provide better insights.
  • Expanding the dataset could enhance the model's ability to generalize to new data.
