This project explores the extraction of symptoms from clinical discharge notes (MIMIC-IV dataset) and uses them to predict ICD diagnosis codes. It compares the performance of rule-based/pipeline approaches (scispacy) against purely transformer-based approaches (BERT) for symptom extraction, followed by multi-label classification using machine learning models.
The goal of this project is to automate the identification of symptoms from unstructured clinical text and predict associated diagnosis codes (specifically ICD-10 codes R00-R09: Symptoms and signs involving the circulatory and respiratory systems).
The pipeline consists of:
- Preprocessing: Linking MIMIC-IV discharge notes with diagnosis codes.
- Symptom Extraction: Extracting clinical entities using two methods:
  - scispacy: A spaCy-based pipeline with biomedical models, enhanced with negation and family history detection.
  - BERT: A transformer-based approach using `clinical-distilbert`.
- Classification: Predicting diagnosis codes using Logistic Regression and Random Forest classifiers based on the extracted symptoms.
The first stage involves preparing the MIMIC-IV dataset for analysis. This includes linking discharge notes (`discharge.csv`) with diagnosis codes (`diagnoses_icd.csv`) and prescriptions. The data is filtered to focus specifically on ICD-10 codes R00-R09 (Symptoms and signs involving the circulatory and respiratory systems).
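As a rough illustration of this linking step, the sketch below assumes the standard MIMIC-IV column names (`hadm_id`, `text`, `icd_code`, `icd_version`); the actual file paths and output format used in `src/mimic_preprocessing.ipynb` may differ, and the prescriptions join is omitted.

```python
import pandas as pd

# Illustrative paths and columns; adjust to the local MIMIC-IV extract.
notes = pd.read_csv("discharge.csv", usecols=["hadm_id", "text"])
dx = pd.read_csv("diagnoses_icd.csv", usecols=["hadm_id", "icd_code", "icd_version"])

# Keep ICD-10 diagnoses in the R00-R09 block (symptoms and signs involving the
# circulatory and respiratory systems).
dx = dx[dx["icd_version"] == 10].copy()
dx["category"] = dx["icd_code"].str[:3]                      # e.g. "R079" -> "R07"
dx = dx[dx["category"].isin([f"R0{i}" for i in range(10)])]

# One row per admission with its set of R00-R09 categories as labels.
labels = (
    dx.groupby("hadm_id")["category"]
      .agg(lambda s: sorted(set(s)))
      .reset_index()
      .rename(columns={"category": "codes"})
)
linked = notes.merge(labels, on="hadm_id", how="inner")
linked.to_pickle("linked_notes.pkl")                         # cached for the extraction stage
```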
Clinical entities are extracted from the unstructured text using two distinct approaches:
- scispacy: A rule-based pipeline using the `en_ner_bc5cdr_md` model. It is enhanced with `negspacy` to filter out negated symptoms (e.g., "no fever") and custom logic to handle family history mentions (see the sketch after this list).
- BERT: A transformer-based approach utilizing the `nlpie/clinical-distilbert-i2b2-2010` model to extract problem entities based on a confidence score threshold (sketched further below).
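A minimal sketch of the negation-aware scispacy path, assuming `negspacy`'s `negex` component and the `DISEASE` label produced by `en_ner_bc5cdr_md`; the notebook's custom family-history logic is not reproduced here.

```python
import spacy
from negspacy.negation import Negex  # noqa: F401 -- importing registers the "negex" factory

# en_ner_bc5cdr_md tags DISEASE and CHEMICAL entities; it is installed separately via scispacy.
nlp = spacy.load("en_ner_bc5cdr_md")
nlp.add_pipe("negex", config={"ent_types": ["DISEASE"]})

def extract_symptoms(text: str) -> list[str]:
    """Return non-negated DISEASE mentions from one discharge note, lower-cased."""
    doc = nlp(text)
    return sorted({
        ent.text.lower()
        for ent in doc.ents
        if ent.label_ == "DISEASE" and not ent._.negex  # drop negated mentions like "no fever"
    })

print(extract_symptoms("Patient reports chest pain and palpitations but denies fever."))
```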
Extracted features are cached to disk to allow for efficient experimentation without re-running the expensive extraction process.
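For the BERT-based path, a comparable sketch using the Hugging Face `transformers` token-classification pipeline is shown below; the entity label names and the 0.80 confidence threshold are assumptions for illustration, not necessarily the values used in `src/bert.py`.

```python
from transformers import pipeline

# Token-classification pipeline over the clinical DistilBERT model fine-tuned on i2b2-2010;
# aggregation merges word pieces back into full entity spans.
ner = pipeline(
    "token-classification",
    model="nlpie/clinical-distilbert-i2b2-2010",
    aggregation_strategy="simple",
)

CONF_THRESHOLD = 0.80  # illustrative cut-off

def extract_problems(text: str) -> list[str]:
    """Keep aggregated 'problem' entities whose confidence clears the threshold."""
    return sorted({
        ent["word"].lower()
        for ent in ner(text)
        # label names assumed to follow the i2b2-2010 scheme (problem / treatment / test)
        if ent["entity_group"].lower().endswith("problem") and ent["score"] >= CONF_THRESHOLD
    })

print(extract_problems("Patient admitted with shortness of breath and intermittent chest pain."))
```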
- Files: `src/scispacy.ipynb`, `src/bert.py`
Once symptoms are extracted, they are vectorized using Multi-Hot Encoding to create a feature set for classification.
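A minimal sketch of the multi-hot step; whether the notebook uses scikit-learn's `MultiLabelBinarizer` is an assumption, but it matches the encoding described here.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows: extracted symptom sets per admission and their R00-R09 label sets.
symptom_sets = [["chest pain", "palpitations"], ["cough", "dyspnea"], ["chest pain"]]
code_sets = [["R00", "R07"], ["R05", "R06"], ["R07"]]

symptom_mlb = MultiLabelBinarizer()
X = symptom_mlb.fit_transform(symptom_sets)   # one column per distinct symptom (multi-hot)

label_mlb = MultiLabelBinarizer()
Y = label_mlb.fit_transform(code_sets)        # one column per ICD-10 category

print(symptom_mlb.classes_)                   # vocabulary learned from the symptom sets
print(X)
```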
- Classifiers: Logistic Regression and Random Forest models are trained to predict diagnosis codes.
- Optimization: Decision thresholds are optimized per label for multi-label classification to maximize the F1-score (see the sketch below this list).
- Evaluation: Comprehensive reports are generated, including Precision, Recall, F1-score, and Hamming Loss.
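The sketch below illustrates this setup on synthetic data: an L2-regularised logistic regression per label followed by a per-label threshold sweep. The synthetic features, validation split, and threshold grid are illustrative, not the notebook's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the multi-hot symptom features and R00-R09 labels.
X, Y = make_multilabel_classification(n_samples=400, n_features=50, n_classes=10, random_state=0)
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.25, random_state=0)

# L2-regularised logistic regression, one binary classifier per ICD-10 category.
clf = OneVsRestClassifier(LogisticRegression(penalty="l2", C=0.1, max_iter=1000))
clf.fit(X_train, Y_train)
proba = clf.predict_proba(X_val)               # one probability column per label

# Per-label threshold sweep: keep the cut-off that maximises that label's F1 on validation data.
candidates = np.linspace(0.05, 0.95, 19)
thresholds = np.array([
    candidates[np.argmax([f1_score(Y_val[:, j], (proba[:, j] >= t).astype(int),
                                   zero_division=0) for t in candidates])]
    for j in range(Y_val.shape[1])
])
Y_pred = (proba >= thresholds).astype(int)     # apply the tuned thresholds label-wise
```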
- Files: `src/scispacy.ipynb`
Tools are provided to inspect the intermediate states of the pipeline, such as the cached feature sets, and to create smaller data subsets for rapid development.
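A hypothetical example of what such inspection could look like; the cache file names, column names, and subset size below are made up for illustration and do not necessarily match the repository's helpers.

```python
import pandas as pd

# Hypothetical cache layout: a pickled DataFrame with one row per note and a
# "symptoms" column holding the extracted symptom list.
features = pd.read_pickle("cache/scispacy_features.pkl")
print(features.head())
print(f"{len(features)} notes, "
      f"{features['symptoms'].map(len).mean():.1f} extracted symptoms per note on average")

# Small random subset for quick iteration on the downstream classification step.
features.sample(n=500, random_state=42).to_pickle("cache/scispacy_features_subset.pkl")
```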
The models were evaluated based on Macro F1-score, Weighted F1-score, and Hamming Loss.
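These metrics map directly onto scikit-learn calls; the toy label matrices below are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Toy multi-hot label matrices standing in for the true / predicted R00-R09 codes.
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print("Macro F1:    ", f1_score(Y_true, Y_pred, average="macro", zero_division=0))
print("Weighted F1: ", f1_score(Y_true, Y_pred, average="weighted", zero_division=0))
print("Hamming loss:", hamming_loss(Y_true, Y_pred))  # share of individual label assignments that are wrong
```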
The scispacy pipeline with negation detection generally outperformed the BERT-based extraction and the baseline scispacy (without negation) for this specific task.
- Best Performing Configuration: Logistic Regression (L2, C=0.1) using `scispacy` with negation.
  - Macro F1: 0.48
  - Weighted F1: 0.62
- BERT Performance: The BERT-based models achieved a Macro F1 of 0.46 and Weighted F1 of 0.59, performing slightly worse than the negation-aware scispacy pipeline.
- Impact of Negation: Adding negation detection improved the Macro F1 score from ~0.42 (baseline) to ~0.48.
- Classifier Comparison: Logistic Regression consistently outperformed Random Forest in this high-dimensional, sparse feature space (RF Macro F1 ~0.36).
Performance varies significantly across different diagnosis codes. Common codes like R00 (Abnormalities of heart beat) and R07 (Pain in throat and chest) generally have better prediction performance.
Further analyses examine the trade-off between precision and recall for different labels and compare the different model configurations and their resulting metrics.
- Setup Environment: Install dependencies listed in `requirements.in` (or the notebook cells).
- Preprocess Data: Run `src/mimic_preprocessing.ipynb` to generate the linked dataset.
- Run Pipeline: Open `src/scispacy.ipynb`.
  - Configure `PIPELINE_CONFIG` to choose between `'scispacy'` or `'bert'`.
  - Run the notebook to extract symptoms, train models, and view results.


