This repository contains the implementation of *Characterizing the Value of Information in Medical Notes*, published in Findings of EMNLP 2020.
- Environment
- Preprocessing MIMIC-III
- Train models with all information (structured/notes/structured+notes)
- Note Type Comparison
- Note Portion Comparison
- Note Portion Comparison Based on Length (Quartile analysis)
- Troubleshooting
- Contact
- Citation
## Environment

```bash
# create a new conda environment with Python 3.7
conda create -n notes python=3.7
conda activate notes
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
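Optionally, you can sanity-check the installation with a short snippet:

```python
# Optional sanity check: confirm the spaCy English model is available.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Patient admitted with chest pain.")
print([token.text for token in doc])
```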
## Preprocessing MIMIC-III

First, download the MIMIC-III dataset as CSV files to your machine. Then set the environment variables in `scripts/mimic3preprocess.sh`:

```bash
export NUM_WORKER=32 # number of workers for multiprocessing
export DATA_DIR=/data/joe/physician_notes/mimic-data/ # path to your MIMIC CSV files
export OUTPUT_DIR=/data/joe/physician_notes/mimic-data/preprocessed/
```

Then run `bash scripts/mimic3preprocess.sh`. This can take several hours (depending on the number of workers) and up to ~600GB of storage. Consider removing unnecessary operations (e.g., the 48-hour data) from `scripts/mimic3preprocess.sh` to save storage and computation time.
We briefly introduce the functionality of every script as follows:
- `mimic3preprocess.scripts.extract_subjects`: group records by patient ID.
- `mimic3preprocess.scripts.split_train_and_test`: move patients into the train/test split based on the list in `mimic3preprocess/resources/testset.csv`.
- `mimic3preprocess.scripts.extract_episodes_from_subjects_multiprocessing`: split each patient's admissions into separate episodes.
- `mimic3preprocess.scripts.feature_extraction_multiprocessing`: convert structured variables into a fixed-length vector using six statistical functions plus normalization (a sketch follows this list).
- `mimic3preprocess.scripts.merge_features`: merge the structured-variable vectors of all patients into one big dictionary (look-up table) to speed up training.
- `mimic3preprocess.scripts.timeseries_feature_extraction_multiprocessing`: preprocess the data into time series of structured variables and notes, and add the components needed to train the GRU-D model.
- `mimic3preprocess.scripts.create_in_hospital_mortality` and `mimic3preprocess.scripts.create_readmission`: create the in-hospital mortality prediction task or the readmission prediction task for a given period of records (24 hrs / 48 hrs / retrospective, i.e., all). To align with the main paper, only 24-hour mortality and retrospective readmission are processed by default.
- `mimic3preprocess.scripts.get_validation`: split the data from the previous step into train/val/test.
- `mimic3preprocess.scripts.create_in_hospital_mortality_note`: find admissions that contain a specific set of note types.
- `mimic3preprocess.scripts.get_data_with_notes`: filter out admissions that lack the specified set of notes from the previous step.
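To make the fixed-length featurization concrete, here is a minimal sketch of a statistics-based summary for one variable. The exact six functions and the normalization are defined in `feature_extraction_multiprocessing`; the choice of min/max/mean/std/skew/count below is an assumption for illustration:

```python
# Illustrative sketch only: summarize one structured variable's time series
# with six statistics. The actual statistics and normalization are defined in
# mimic3preprocess.scripts.feature_extraction_multiprocessing.
import numpy as np
from scipy.stats import skew

def summarize_variable(values):
    """Map a variable-length series of measurements to a fixed-length vector."""
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return np.zeros(6)  # placeholder for admissions with no measurements
    return np.array([
        values.min(),
        values.max(),
        values.mean(),
        values.std(),
        skew(values) if values.size > 2 else 0.0,
        values.size,  # number of measurements
    ])

print(summarize_variable([98.1, 99.4, 101.2, 100.0]))
```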
After preprocessing, you should get the train/valid/test sets as the following files:

```
# naming format: '{args.note}_note_{train/valid/test}_{args.period}.csv'
# {note} is the set of note types used and {period} is the time span of records
# for 24-hour mortality prediction
OUTPUT_DIR/mortality/all_but_discharge_note_{train/valid/test}_24.csv
# for retrospective readmission prediction
OUTPUT_DIR/readmission/all_note_{train/valid/test}_retro.csv
```
Note that we use all notes except discharge summaries for mortality prediction, and all notes for readmission prediction.
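To confirm preprocessing finished, a small hypothetical helper (not part of the repo) can verify that the expected files exist:

```python
# Hypothetical helper: check that preprocessing produced the expected splits.
from pathlib import Path

OUTPUT_DIR = Path("/data/joe/physician_notes/mimic-data/preprocessed")  # adjust

expected = [
    OUTPUT_DIR / "mortality" / f"all_but_discharge_note_{split}_24.csv"
    for split in ("train", "valid", "test")
] + [
    OUTPUT_DIR / "readmission" / f"all_note_{split}_retro.csv"
    for split in ("train", "valid", "test")
]

for path in expected:
    print("ok " if path.exists() else "MISSING", path)
```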
## Train models with all information (structured/notes/structured+notes)

After finishing the training below, check the performance in `notebooks/results_plots.ipynb`.
### Logistic regression

We first train logistic regression on the two tasks:

```bash
# 24-hour mortality prediction
bash scripts/logisitc_regression_mortality.sh
# readmission prediction
bash scripts/logisitc_regression_readmission.sh
```
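These scripts wrap the repository's logistic regression training. As a rough illustration of what such a bag-of-words baseline looks like (the file paths and column names below are hypothetical, not the repo's pipeline):

```python
# Generic sketch of a bag-of-words logistic regression baseline.
# Column names ("text", "label") and file paths are assumptions; the real
# pipeline lives in the scripts above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

train = pd.read_csv("train.csv")  # hypothetical paths
valid = pd.read_csv("valid.csv")

vectorizer = TfidfVectorizer(max_features=50000)
X_train = vectorizer.fit_transform(train["text"])
X_valid = vectorizer.transform(valid["text"])

clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(X_train, train["label"])
print("valid AUC:", roc_auc_score(valid["label"], clf.predict_proba(X_valid)[:, 1]))
```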
### Deep Averaging Network (DAN)

First, `cd models/DeepAverageNetwork` to move to the working directory.

- Build the vocabulary. Change the environment variables in `scripts/build_vocab.sh`, then run:

  ```bash
  # remember to run this command in the DeepAverageNetwork dir
  bash scripts/build_vocab.sh
  ```

- Train the models:

  ```bash
  bash scripts/run_text.sh
  bash scripts/run_feature.sh
  bash scripts/run_text_feature.sh
  ```
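For reference, a Deep Averaging Network averages a document's token embeddings and feeds the average through a feed-forward classifier. Below is a minimal PyTorch sketch with illustrative hyperparameters; the repository's actual model lives in `models/DeepAverageNetwork`:

```python
# Minimal Deep Averaging Network sketch (illustrative, not the repo's model).
import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); average the non-pad embeddings.
        mask = (token_ids != 0).unsqueeze(-1).float()
        emb = self.embedding(token_ids) * mask
        avg = emb.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(avg)

model = DAN(vocab_size=30000)
logits = model(torch.randint(1, 30000, (4, 50)))  # batch of 4 dummy docs
print(logits.shape)  # torch.Size([4, 2])
```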
## Note Type Comparison

We first need to build the patient-to-notes table:

```bash
# at the base dir (~10 mins)
python -m processing.find_patient_with_sameNotes -data_dir /data/test_mimic_output/ -period 24
python -m processing.find_patient_with_sameNotes -data_dir /data/test_mimic_output/ -period retro
```
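Conceptually, this step maps each admission to the set of note types it contains, so that later steps can find admissions sharing any two given types. A rough sketch of the idea, using the MIMIC-III `NOTEEVENTS` columns `HADM_ID` and `CATEGORY` (the real logic is in `processing.find_patient_with_sameNotes`):

```python
# Rough sketch: map each admission to the set of note categories it contains.
# Uses MIMIC-III NOTEEVENTS columns; the real logic is in
# processing.find_patient_with_sameNotes.
import pandas as pd

notes = pd.read_csv("NOTEEVENTS.csv", usecols=["HADM_ID", "CATEGORY"])
adm2types = notes.groupby("HADM_ID")["CATEGORY"].apply(set)

# Admissions that have both of two given note types, e.g. Nursing + Radiology:
both = adm2types[adm2types.apply(lambda s: {"Nursing", "Radiology"}.issubset(s))]
print(len(both))
```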
Change `DATA_DIR` in `scripts/logistic_regression_compare_notes_pairwise.sh` and run:

```bash
bash scripts/logistic_regression_compare_notes_pairwise.sh
```
In this step, we conduct a pairwise comparison between every two note types, evaluated on the admissions that contain both types. To make the comparison fair, we downsample the note type with more tokens so that it matches the token count of its counterpart. We also report the mean score over 10 experiments with different random seeds.
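A minimal sketch of the length matching described above (a hypothetical helper, not the repository's exact implementation):

```python
# Hypothetical sketch of length-matched downsampling: sample tokens from the
# longer note so that both notes in a pair have the same number of tokens.
import random

def downsample(tokens, target_len, seed=0):
    """Randomly keep target_len tokens, preserving their original order."""
    if len(tokens) <= target_len:
        return list(tokens)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(tokens)), target_len))
    return [tokens[i] for i in keep]

note_a = "pt stable overnight no acute events plan continue current meds".split()
note_b = "pt remains stable".split()
matched_a = downsample(note_a, len(note_b))
print(matched_a, len(matched_a) == len(note_b))
```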
Once the script finishes, you can use `notebooks/note_comparison_heatmap.ipynb` to visualize the note comparison as a heatmap.
## Note Portion Comparison

TODO: clean up code
Change `DATA_DIR` in `scripts/sentence_select_similarity.sh` and `scripts/sentence_select.sh`, then run:

```bash
bash scripts/sentence_select.sh
bash scripts/sentence_select_similarity.sh
```

After they finish, you can make the plots in `notebooks/heurisitics_group_notes_plot-new.ipynb`.
- Logistic regression: change `model` in `scripts/sentence_select_similarity.sh` and `scripts/sentence_select.sh` to `LR`.
- Deep Averaging Network: change `model` in the same two scripts to `DAN`.
## Note Portion Comparison Based on Length (Quartile analysis)

We first need to count the number of tokens in each admission:

```bash
python -m processing.count_token -data_dir PATH_TO_PROCESSED_DATA_DIR -n_worker 60
```

Then, we split the selected sentences into quartiles:

```bash
python -m processing.quartile_split -data_dir PATH_TO_PROCESSED_DATA_DIR
```
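For intuition, the quartile split amounts to bucketing admissions by token count; a minimal sketch with `pandas.qcut` (column names are assumptions):

```python
# Sketch of a quartile split by note length (column names are assumptions;
# the real logic is in processing.quartile_split).
import pandas as pd

df = pd.DataFrame({"hadm_id": [1, 2, 3, 4, 5, 6, 7, 8],
                   "n_tokens": [120, 450, 90, 3000, 800, 60, 1500, 300]})
df["quartile"] = pd.qcut(df["n_tokens"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(df.sort_values("n_tokens"))
```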
It is fine to see the error `FileNotFoundError: [Errno 2] No such file or directory: '/data/test_mimic_output//select_sentence/DAN/mortality'` when running `processing.quartile_split` if you have not run the previous step with the `DAN` model.

Finally, you can visualize the plots in `notebooks/heurisitics_group_notes_plot-new.ipynb`.
## Troubleshooting

The `pandas` version might affect `python -m mimic3preprocess.scripts.extract_subjects $DATA_DIR $OUTPUT_DIR`. If you run into this problem, match the `pandas` version pinned in `requirements.txt`.
## Contact

Chao-Chun Hsu, [email protected]
## Citation

```bibtex
@inproceedings{hsu2020characterizing,
  title={Characterizing the Value of Information in Medical Notes},
  author={Hsu, Chao-Chun and Karnwal, Shantanu and Mullainathan, Sendhil and Obermeyer, Ziad and Tan, Chenhao},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings},
  pages={2062--2072},
  year={2020}
}
```