This project explores the extraction of symptoms from clinical discharge notes (MIMIC-IV dataset) and uses them to predict ICD diagnosis codes. It compares the performance of rule-based/pipeline approaches (scispacy) against purely transformer-based approaches (BERT) for symptom extraction, followed by multi-label classification using machine learning models.
The goal of this project is to automate the identification of symptoms from unstructured clinical text and predict associated diagnosis codes (specifically ICD-10 codes R00-R09: Symptoms and signs involving the circulatory and respiratory systems).
The pipeline consists of:
- Preprocessing: Linking MIMIC-IV discharge notes with diagnosis codes.
- Symptom Extraction: Extracting clinical entities using two methods:
  - scispacy: A spaCy-based pipeline with biomedical models, enhanced with negation and family history detection.
  - BERT: A transformer-based approach using `clinical-distilbert`.
- Classification: Predicting diagnosis codes using Logistic Regression and Random Forest classifiers based on the extracted symptoms.
The first stage involves preparing the MIMIC-IV dataset for analysis. This includes linking discharge notes (`discharge.csv`) with diagnosis codes (`diagnoses_icd.csv`) and prescriptions. The data is filtered to focus specifically on ICD-10 codes R00-R09 (Symptoms and signs involving the circulatory and respiratory systems).
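As a rough illustration of this linking step, the sketch below assumes the standard MIMIC-IV column names (`hadm_id`, `text`, `icd_code`, `icd_version`); the actual file paths and output format used in `src/mimic_preprocessing.ipynb` may differ, and the prescriptions join is omitted.

```python
import pandas as pd

# Illustrative paths and columns; adjust to the local MIMIC-IV extract.
notes = pd.read_csv("discharge.csv", usecols=["hadm_id", "text"])
dx = pd.read_csv("diagnoses_icd.csv", usecols=["hadm_id", "icd_code", "icd_version"])

# Keep ICD-10 diagnoses in the R00-R09 block (symptoms and signs involving the
# circulatory and respiratory systems).
dx = dx[dx["icd_version"] == 10].copy()
dx["category"] = dx["icd_code"].str[:3]                      # e.g. "R079" -> "R07"
dx = dx[dx["category"].isin([f"R0{i}" for i in range(10)])]

# One row per admission with its set of R00-R09 categories as labels.
labels = (
    dx.groupby("hadm_id")["category"]
      .agg(lambda s: sorted(set(s)))
      .reset_index()
      .rename(columns={"category": "codes"})
)
linked = notes.merge(labels, on="hadm_id", how="inner")
linked.to_pickle("linked_notes.pkl")                         # cached for the extraction stage
```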
Clinical entities are extracted from the unstructured text using two distinct approaches:
- scispacy: A rule-based pipeline using the `en_ner_bc5cdr_md` model. It is enhanced with `negspacy` to filter out negated symptoms (e.g., "no fever") and custom logic to handle family history mentions (see the sketch after this list).
- BERT: A transformer-based approach utilizing the `nlpie/clinical-distilbert-i2b2-2010` model to extract problem entities based on a confidence score threshold (sketched further below).
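A minimal sketch of the negation-aware scispacy path, assuming `negspacy`'s `negex` component and the `DISEASE` label produced by `en_ner_bc5cdr_md`; the notebook's custom family-history logic is not reproduced here.

```python
import spacy
from negspacy.negation import Negex  # noqa: F401 -- importing registers the "negex" factory

# en_ner_bc5cdr_md tags DISEASE and CHEMICAL entities; it is installed separately via scispacy.
nlp = spacy.load("en_ner_bc5cdr_md")
nlp.add_pipe("negex", config={"ent_types": ["DISEASE"]})

def extract_symptoms(text: str) -> list[str]:
    """Return non-negated DISEASE mentions from one discharge note, lower-cased."""
    doc = nlp(text)
    return sorted({
        ent.text.lower()
        for ent in doc.ents
        if ent.label_ == "DISEASE" and not ent._.negex  # drop negated mentions like "no fever"
    })

print(extract_symptoms("Patient reports chest pain and palpitations but denies fever."))
```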
Extracted features are cached to disk to allow for efficient experimentation without re-running the expensive extraction process.
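For the BERT-based path, a comparable sketch using the Hugging Face `transformers` token-classification pipeline is shown below; the entity label names and the 0.80 confidence threshold are assumptions for illustration, not necessarily the values used in `src/bert.py`.

```python
from transformers import pipeline

# Token-classification pipeline over the clinical DistilBERT model fine-tuned on i2b2-2010;
# aggregation merges word pieces back into full entity spans.
ner = pipeline(
    "token-classification",
    model="nlpie/clinical-distilbert-i2b2-2010",
    aggregation_strategy="simple",
)

CONF_THRESHOLD = 0.80  # illustrative cut-off

def extract_problems(text: str) -> list[str]:
    """Keep aggregated 'problem' entities whose confidence clears the threshold."""
    return sorted({
        ent["word"].lower()
        for ent in ner(text)
        # label names assumed to follow the i2b2-2010 scheme (problem / treatment / test)
        if ent["entity_group"].lower().endswith("problem") and ent["score"] >= CONF_THRESHOLD
    })

print(extract_problems("Patient admitted with shortness of breath and intermittent chest pain."))
```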
- Files: `src/scispacy.ipynb`, `src/bert.py`
Once symptoms are extracted, they are vectorized using Multi-Hot Encoding to create a feature set for classification.
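A minimal sketch of the multi-hot step; whether the notebook uses scikit-learn's `MultiLabelBinarizer` is an assumption, but it matches the encoding described here.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows: extracted symptom sets per admission and their R00-R09 label sets.
symptom_sets = [["chest pain", "palpitations"], ["cough", "dyspnea"], ["chest pain"]]
code_sets = [["R00", "R07"], ["R05", "R06"], ["R07"]]

symptom_mlb = MultiLabelBinarizer()
X = symptom_mlb.fit_transform(symptom_sets)   # one column per distinct symptom (multi-hot)

label_mlb = MultiLabelBinarizer()
Y = label_mlb.fit_transform(code_sets)        # one column per ICD-10 category

print(symptom_mlb.classes_)                   # vocabulary learned from the symptom sets
print(X)
```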
- Classifiers: Logistic Regression and Random Forest models are trained to predict diagnosis codes.
- Optimization: Decision thresholds are optimized per label for multi-label classification to maximize the F1-score (see the sketch below this list).
- Evaluation: Comprehensive reports are generated, including Precision, Recall, F1-score, and Hamming Loss.
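The sketch below illustrates this setup on synthetic data: an L2-regularised logistic regression per label followed by a per-label threshold sweep. The synthetic features, validation split, and threshold grid are illustrative, not the notebook's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the multi-hot symptom features and R00-R09 labels.
X, Y = make_multilabel_classification(n_samples=400, n_features=50, n_classes=10, random_state=0)
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.25, random_state=0)

# L2-regularised logistic regression, one binary classifier per ICD-10 category.
clf = OneVsRestClassifier(LogisticRegression(penalty="l2", C=0.1, max_iter=1000))
clf.fit(X_train, Y_train)
proba = clf.predict_proba(X_val)               # one probability column per label

# Per-label threshold sweep: keep the cut-off that maximises that label's F1 on validation data.
candidates = np.linspace(0.05, 0.95, 19)
thresholds = np.array([
    candidates[np.argmax([f1_score(Y_val[:, j], (proba[:, j] >= t).astype(int),
                                   zero_division=0) for t in candidates])]
    for j in range(Y_val.shape[1])
])
Y_pred = (proba >= thresholds).astype(int)     # apply the tuned thresholds label-wise
```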
- Files: `src/scispacy.ipynb`
Tools are provided to inspect the intermediate states of the pipeline, such as the cached feature sets, and to create smaller data subsets for rapid development.
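A hypothetical example of what such inspection could look like; the cache file names, column names, and subset size below are made up for illustration and do not necessarily match the repository's helpers.

```python
import pandas as pd

# Hypothetical cache layout: a pickled DataFrame with one row per note and a
# "symptoms" column holding the extracted symptom list.
features = pd.read_pickle("cache/scispacy_features.pkl")
print(features.head())
print(f"{len(features)} notes, "
      f"{features['symptoms'].map(len).mean():.1f} extracted symptoms per note on average")

# Small random subset for quick iteration on the downstream classification step.
features.sample(n=500, random_state=42).to_pickle("cache/scispacy_features_subset.pkl")
```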
The models were evaluated based on Macro F1-score, Weighted F1-score, and Hamming Loss.
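These metrics map directly onto scikit-learn calls; the toy label matrices below are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Toy multi-hot label matrices standing in for the true / predicted R00-R09 codes.
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print("Macro F1:    ", f1_score(Y_true, Y_pred, average="macro", zero_division=0))
print("Weighted F1: ", f1_score(Y_true, Y_pred, average="weighted", zero_division=0))
print("Hamming loss:", hamming_loss(Y_true, Y_pred))  # share of individual label assignments that are wrong
```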
The scispacy pipeline with negation detection generally outperformed the BERT-based extraction and the baseline scispacy (without negation) for this specific task.
- Best Performing Configuration: Logistic Regression (L2, C=0.1) using `scispacy` with negation.
  - Macro F1: 0.48
  - Weighted F1: 0.62
- BERT Performance: The BERT-based models achieved a Macro F1 of 0.46 and Weighted F1 of 0.59, performing slightly worse than the negation-aware scispacy pipeline.
- Impact of Negation: Adding negation detection improved the Macro F1 score from ~0.42 (baseline) to ~0.48.
- Classifier Comparison: Logistic Regression consistently outperformed Random Forest in this high-dimensional, sparse feature space (RF Macro F1 ~0.36).
Performance varies significantly across different diagnosis codes. Common codes like R00 (Abnormalities of heart beat) and R07 (Pain in throat and chest) generally have better prediction performance.
Further analyses examine the trade-off between precision and recall for different labels and compare the different model configurations and their resulting metrics.
- Setup Environment: Install dependencies listed in `requirements.in` (or the notebook cells).
- Preprocess Data: Run `src/mimic_preprocessing.ipynb` to generate the linked dataset.
- Run Pipeline: Open `src/scispacy.ipynb`.
  - Configure `PIPELINE_CONFIG` to choose between `'scispacy'` or `'bert'`.
  - Run the notebook to extract symptoms, train models, and view results.


