Clinical Symptom Extraction and Diagnosis Prediction

This project explores the extraction of symptoms from clinical discharge notes (MIMIC-IV dataset) and uses them to predict ICD diagnosis codes. It compares the performance of rule-based/pipeline approaches (scispacy) against purely transformer-based approaches (BERT) for symptom extraction, followed by multi-label classification using machine learning models.

Project Summary

The goal of this project is to automate the identification of symptoms from unstructured clinical text and predict associated diagnosis codes (specifically ICD-10 codes R00-R09: Symptoms and signs involving the circulatory and respiratory systems).

The pipeline consists of:

  1. Preprocessing: Linking MIMIC-IV discharge notes with diagnosis codes.
  2. Symptom Extraction: Extracting clinical entities using two methods:
    • scispacy: A spaCy-based pipeline with biomedical models, enhanced with negation and family history detection.
    • BERT: A transformer-based approach using clinical-distilbert.
  3. Classification: Predicting diagnosis codes using Logistic Regression and Random Forest classifiers based on the extracted symptoms.

Pipeline Stages

1. Data Preprocessing

The first stage involves preparing the MIMIC-IV dataset for analysis. This includes linking discharge notes (discharge.csv) with diagnosis codes (diagnoses_icd.csv) and prescriptions. The data is filtered to focus specifically on ICD-10 codes R00-R09 (Symptoms and signs involving the circulatory and respiratory systems).
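
A minimal sketch of the linking step using pandas. Column names follow the public MIMIC-IV schema (`subject_id`, `hadm_id`, `icd_code`, `icd_version`), the prescriptions join is omitted, and the output file name is illustrative; the actual logic in src/mimic_preprocessing.ipynb may differ:

```python
import pandas as pd

# Load MIMIC-IV discharge notes and ICD diagnosis codes
notes = pd.read_csv("discharge.csv", usecols=["subject_id", "hadm_id", "text"])
dx = pd.read_csv("diagnoses_icd.csv",
                 usecols=["subject_id", "hadm_id", "icd_code", "icd_version"])

# Keep only ICD-10 codes in the R00-R09 range (symptoms and signs involving
# the circulatory and respiratory systems)
dx = dx[dx["icd_version"] == 10].copy()
dx["category"] = dx["icd_code"].str[:3]
dx = dx[dx["category"].between("R00", "R09")]

# One row per admission, with the set of R00-R09 categories as labels
labels = (dx.groupby("hadm_id")["category"]
            .agg(lambda s: sorted(set(s)))
            .reset_index(name="codes"))
linked = notes.merge(labels, on="hadm_id", how="inner")
linked.to_csv("linked_notes_r00_r09.csv", index=False)
```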

2. Symptom Extraction

Clinical entities are extracted from the unstructured text using two distinct approaches:

  • scispacy: A spaCy pipeline built on the en_ner_bc5cdr_md biomedical NER model, enhanced with negspacy to filter out negated symptoms (e.g., "no fever") and custom logic to handle family history mentions (see the sketch below).
  • BERT: A transformer-based approach utilizing the nlpie/clinical-distilbert-i2b2-2010 model to extract problem entities based on a confidence score threshold.
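
A minimal sketch of the scispacy route, assuming negspacy's `negex` component, the `DISEASE` entity label produced by en_ner_bc5cdr_md, and sentence boundaries from the model's parser; the family-history check is a crude stand-in for the notebook's custom logic:

```python
import spacy
from negspacy.negation import Negex  # registers the "negex" pipeline factory

nlp = spacy.load("en_ner_bc5cdr_md")
nlp.add_pipe("negex", config={"ent_types": ["DISEASE"]})

def extract_symptoms(text: str) -> list[str]:
    doc = nlp(text)
    symptoms = set()
    for ent in doc.ents:
        if ent.label_ != "DISEASE" or ent._.negex:
            continue  # skip non-symptom entities and negated mentions ("no fever")
        if "family history" in ent.sent.text.lower():
            continue  # crude filter for family-history mentions
        symptoms.add(ent.text.lower())
    return sorted(symptoms)

print(extract_symptoms("Patient reports chest pain and dyspnea but no fever."))
```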

Extracted features are cached to disk to allow for efficient experimentation without re-running the expensive extraction process.
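
For the BERT route and the feature cache, a sketch along these lines (the `problem` label name, the 0.7 confidence threshold, and the cache path are assumptions; long notes may need to be split to fit the model's context window):

```python
import json
import pathlib
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="nlpie/clinical-distilbert-i2b2-2010",
    aggregation_strategy="simple",
)

def extract_problems(text: str, threshold: float = 0.7) -> list[str]:
    # Keep entities tagged as clinical problems above the confidence threshold
    return sorted({
        ent["word"].lower()
        for ent in ner(text)
        if ent["entity_group"].lower() == "problem" and ent["score"] >= threshold
    })

CACHE = pathlib.Path("cache/bert_features.json")

def load_or_extract(notes: dict[str, str]) -> dict[str, list[str]]:
    # Reuse cached features so repeated runs skip the expensive NER pass
    if CACHE.exists():
        return json.loads(CACHE.read_text())
    feats = {hadm_id: extract_problems(text) for hadm_id, text in notes.items()}
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    CACHE.write_text(json.dumps(feats))
    return feats
```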

3. Model Training & Evaluation

Once symptoms are extracted, they are vectorized with multi-hot encoding to create a feature set for classification (a sketch follows the list below).

  • Classifiers: Logistic Regression and Random Forest models are trained to predict diagnosis codes.
  • Optimization: Decision thresholds are optimized for multi-label classification to maximize the F1-score.
  • Evaluation: Comprehensive reports are generated, including Precision, Recall, F1-score, and Hamming Loss.
  • Files: src/scispacy.ipynb
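
A minimal sketch of this stage with scikit-learn. `symptoms_per_note` and `codes_per_note` are placeholder lists of symptom strings and ICD categories per admission, and the thresholds are tuned on the test split only for brevity (in practice a validation split would be used); the notebook's exact procedure may differ:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, hamming_loss

X = MultiLabelBinarizer().fit_transform(symptoms_per_note)  # multi-hot features
mlb_codes = MultiLabelBinarizer()
Y = mlb_codes.fit_transform(codes_per_note)  # multi-hot labels; mlb_codes.classes_ maps columns to codes

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=42)

# One binary logistic regression per label (L2 penalty, C=0.1 as in the best run)
clf = OneVsRestClassifier(LogisticRegression(penalty="l2", C=0.1, max_iter=1000))
clf.fit(X_tr, Y_tr)
probs = clf.predict_proba(X_te)

# Per-label decision thresholds chosen to maximize F1
grid = np.linspace(0.1, 0.9, 17)
thresholds = np.array([
    grid[np.argmax([f1_score(Y_te[:, j], probs[:, j] >= t, zero_division=0) for t in grid])]
    for j in range(Y.shape[1])
])
pred = (probs >= thresholds).astype(int)

print("Macro F1:    ", f1_score(Y_te, pred, average="macro", zero_division=0))
print("Weighted F1: ", f1_score(Y_te, pred, average="weighted", zero_division=0))
print("Hamming loss:", hamming_loss(Y_te, pred))
```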

4. Analysis & Utilities

Tools are provided to inspect the intermediate states of the pipeline, such as the cached feature sets, and to create smaller data subsets for rapid development.
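
For instance, inspecting a cached feature file and carving out a development subset might look like this (file names are illustrative, not the repository's actual paths):

```python
import json
import pandas as pd

# Peek at a cached feature set
feats = json.loads(open("cache/scispacy_features.json").read())
print(f"{len(feats)} notes cached; example entry:", next(iter(feats.items())))

# Create a small random subset of the linked dataset for quick iteration
(pd.read_csv("linked_notes_r00_r09.csv")
   .sample(n=500, random_state=42)
   .to_csv("linked_notes_subset.csv", index=False))
```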

Results

The models were evaluated based on Macro F1-score, Weighted F1-score, and Hamming Loss.

Performance Overview

The scispacy pipeline with negation detection generally outperformed the BERT-based extraction and the baseline scispacy (without negation) for this specific task.

  • Best Performing Configuration: Logistic Regression (L2, C=0.1) using scispacy with negation.
    • Macro F1: 0.48
    • Weighted F1: 0.62
  • BERT Performance: The BERT-based models achieved a Macro F1 of 0.46 and Weighted F1 of 0.59, performing slightly worse than the negation-aware scispacy pipeline.
  • Impact of Negation: Adding negation detection improved the Macro F1 score from ~0.42 (baseline) to ~0.48.
  • Classifier Comparison: Logistic Regression consistently outperformed Random Forest in this high-dimensional, sparse feature space (RF Macro F1 ~0.36).

Visualizations

Per-Label F1 Scores

Performance varies significantly across different diagnosis codes. Common codes like R00 (Abnormalities of heart beat) and R07 (Pain in throat and chest) generally have better prediction performance.


Precision vs. Recall

The trade-off between precision and recall for different labels.


Model Comparison Heatmap

A comparison of different model configurations and their resulting metrics.


Usage

  1. Setup Environment: Install dependencies listed in requirements.in (or the notebook cells).
  2. Preprocess Data: Run src/mimic_preprocessing.ipynb to generate the linked dataset.
  3. Run Pipeline: Open src/scispacy.ipynb.
    • Configure PIPELINE_CONFIG to choose between 'scispacy' or 'bert'.
    • Run the notebook to extract symptoms, train models, and view results.
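
The exact shape of PIPELINE_CONFIG in the notebook may differ (it could be a dict of settings), but conceptually it selects the extraction backend, e.g.:

```python
# Choose the symptom-extraction backend before running the remaining cells
PIPELINE_CONFIG = "scispacy"  # or "bert"
```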
