This repository contains the implementation of *Characterizing the Value of Information in Medical Notes*, published in Findings of EMNLP 2020.
- Environment
- Preprocessing MIMIC-III
- Train models with all information (structured/notes/structured+notes)
- Note Type Comparison
- Note Portion Comparison
- Note Portion Comparison Based on Length (Quartile analysis)
- Troubleshooting
- Contact
- Citation
## Environment

```bash
# create a new conda environment with Python 3.7
conda create -n notes python=3.7
conda activate notes
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
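Optionally, you can sanity-check the installation with a short snippet:

```python
# Optional sanity check: confirm the spaCy English model is available.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Patient admitted with chest pain.")
print([token.text for token in doc])
```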
## Preprocessing MIMIC-III

First, download the MIMIC-III dataset as CSV files to your machine. Then set the environment variables in `scripts/mimic3preprocess.sh`:

```bash
export NUM_WORKER=32 # number of workers for multiprocessing
export DATA_DIR=/data/joe/physician_notes/mimic-data/ # path to your MIMIC CSV files
export OUTPUT_DIR=/data/joe/physician_notes/mimic-data/preprocessed/
```

Then run `bash scripts/mimic3preprocess.sh`. This can take several hours (depending on the number of workers) and up to ~600GB of storage. Consider removing unnecessary operations (e.g., the 48-hour data) from `scripts/mimic3preprocess.sh` to save storage and computation time.
We briefly introduce the functionality of every script as follows:
- `mimic3preprocess.scripts.extract_subjects`: group records by patient ID.
- `mimic3preprocess.scripts.split_train_and_test`: move patients into the train/test split based on the list in `mimic3preprocess/resources/testset.csv`.
- `mimic3preprocess.scripts.extract_episodes_from_subjects_multiprocessing`: split each patient's admissions into separate episodes.
- `mimic3preprocess.scripts.feature_extraction_multiprocessing`: convert structured variables into a fixed-length vector using six statistical functions plus normalization (a sketch follows this list).
- `mimic3preprocess.scripts.merge_features`: merge the structured-variable vectors of all patients into one big dictionary (look-up table) to speed up training.
- `mimic3preprocess.scripts.timeseries_feature_extraction_multiprocessing`: preprocess the data into time series of structured variables and notes, and add the components needed to train the GRU-D model.
- `mimic3preprocess.scripts.create_in_hospital_mortality` and `mimic3preprocess.scripts.create_readmission`: create the in-hospital mortality prediction task or the readmission prediction task for a given period of records (24 hrs / 48 hrs / retrospective, i.e., all). To align with the main paper, only 24-hour mortality and retrospective readmission are processed by default.
- `mimic3preprocess.scripts.get_validation`: split the data from the previous step into train/val/test.
- `mimic3preprocess.scripts.create_in_hospital_mortality_note`: find admissions that contain a specific set of note types.
- `mimic3preprocess.scripts.get_data_with_notes`: filter out admissions that lack the specified set of notes from the previous step.
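To make the fixed-length featurization concrete, here is a minimal sketch of a statistics-based summary for one variable. The exact six functions and the normalization are defined in `feature_extraction_multiprocessing`; the choice of min/max/mean/std/skew/count below is an assumption for illustration:

```python
# Illustrative sketch only: summarize one structured variable's time series
# with six statistics. The actual statistics and normalization are defined in
# mimic3preprocess.scripts.feature_extraction_multiprocessing.
import numpy as np
from scipy.stats import skew

def summarize_variable(values):
    """Map a variable-length series of measurements to a fixed-length vector."""
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return np.zeros(6)  # placeholder for admissions with no measurements
    return np.array([
        values.min(),
        values.max(),
        values.mean(),
        values.std(),
        skew(values) if values.size > 2 else 0.0,
        values.size,  # number of measurements
    ])

print(summarize_variable([98.1, 99.4, 101.2, 100.0]))
```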
After preprocessing, you should get the train/valid/test sets as the following files:

```
# naming format: '{args.note}_note_{train/valid/test}_{args.period}.csv'
# {note} is the set of note types used and {period} is the time span of records
# for 24-hour mortality prediction
OUTPUT_DIR/mortality/all_but_discharge_note_{train/valid/test}_24.csv
# for retrospective readmission prediction
OUTPUT_DIR/readmission/all_note_{train/valid/test}_retro.csv
```
Note that we use all notes except discharge summaries for mortality prediction, and all notes for readmission prediction.
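To confirm preprocessing finished, a small hypothetical helper (not part of the repo) can verify that the expected files exist:

```python
# Hypothetical helper: check that preprocessing produced the expected splits.
from pathlib import Path

OUTPUT_DIR = Path("/data/joe/physician_notes/mimic-data/preprocessed")  # adjust

expected = [
    OUTPUT_DIR / "mortality" / f"all_but_discharge_note_{split}_24.csv"
    for split in ("train", "valid", "test")
] + [
    OUTPUT_DIR / "readmission" / f"all_note_{split}_retro.csv"
    for split in ("train", "valid", "test")
]

for path in expected:
    print("ok " if path.exists() else "MISSING", path)
```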
## Train models with all information (structured/notes/structured+notes)

After finishing the training below, check the performance in `notebooks/results_plots.ipynb`.
### Logistic regression

We first train logistic regression on the two tasks:

```bash
# 24-hour mortality prediction
bash scripts/logisitc_regression_mortality.sh
# readmission prediction
bash scripts/logisitc_regression_readmission.sh
```
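These scripts wrap the repository's logistic regression training. As a rough illustration of what such a bag-of-words baseline looks like (the file paths and column names below are hypothetical, not the repo's pipeline):

```python
# Generic sketch of a bag-of-words logistic regression baseline.
# Column names ("text", "label") and file paths are assumptions; the real
# pipeline lives in the scripts above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

train = pd.read_csv("train.csv")  # hypothetical paths
valid = pd.read_csv("valid.csv")

vectorizer = TfidfVectorizer(max_features=50000)
X_train = vectorizer.fit_transform(train["text"])
X_valid = vectorizer.transform(valid["text"])

clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(X_train, train["label"])
print("valid AUC:", roc_auc_score(valid["label"], clf.predict_proba(X_valid)[:, 1]))
```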
### Deep Averaging Network (DAN)

First, `cd models/DeepAverageNetwork` to move to the working directory.

- Build the vocabulary. Change the environment variables in `scripts/build_vocab.sh`, then run:

  ```bash
  # remember to run this command in the DeepAverageNetwork dir
  bash scripts/build_vocab.sh
  ```

- Train the models:

  ```bash
  bash scripts/run_text.sh
  bash scripts/run_feature.sh
  bash scripts/run_text_feature.sh
  ```
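For reference, a Deep Averaging Network averages a document's token embeddings and feeds the average through a feed-forward classifier. Below is a minimal PyTorch sketch with illustrative hyperparameters; the repository's actual model lives in `models/DeepAverageNetwork`:

```python
# Minimal Deep Averaging Network sketch (illustrative, not the repo's model).
import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); average the non-pad embeddings.
        mask = (token_ids != 0).unsqueeze(-1).float()
        emb = self.embedding(token_ids) * mask
        avg = emb.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(avg)

model = DAN(vocab_size=30000)
logits = model(torch.randint(1, 30000, (4, 50)))  # batch of 4 dummy docs
print(logits.shape)  # torch.Size([4, 2])
```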
## Note Type Comparison

We first need to build the patient-to-notes table:

```bash
# at the base dir (~10 mins)
python -m processing.find_patient_with_sameNotes -data_dir /data/test_mimic_output/ -period 24
python -m processing.find_patient_with_sameNotes -data_dir /data/test_mimic_output/ -period retro
```
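Conceptually, this step maps each admission to the set of note types it contains, so that later steps can find admissions sharing any two given types. A rough sketch of the idea, using the MIMIC-III `NOTEEVENTS` columns `HADM_ID` and `CATEGORY` (the real logic is in `processing.find_patient_with_sameNotes`):

```python
# Rough sketch: map each admission to the set of note categories it contains.
# Uses MIMIC-III NOTEEVENTS columns; the real logic is in
# processing.find_patient_with_sameNotes.
import pandas as pd

notes = pd.read_csv("NOTEEVENTS.csv", usecols=["HADM_ID", "CATEGORY"])
adm2types = notes.groupby("HADM_ID")["CATEGORY"].apply(set)

# Admissions that have both of two given note types, e.g. Nursing + Radiology:
both = adm2types[adm2types.apply(lambda s: {"Nursing", "Radiology"}.issubset(s))]
print(len(both))
```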
Change `DATA_DIR` in `scripts/logistic_regression_compare_notes_pairwise.sh` and run:

```bash
bash scripts/logistic_regression_compare_notes_pairwise.sh
```
In this step, we conduct a pairwise comparison between every two note types, evaluated on the admissions that contain both types. To make the comparison fair, we downsample the note type with more tokens so that it matches the token count of its counterpart. We also report the mean score over 10 experiments with different random seeds.
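A minimal sketch of the length matching described above (a hypothetical helper, not the repository's exact implementation):

```python
# Hypothetical sketch of length-matched downsampling: sample tokens from the
# longer note so that both notes in a pair have the same number of tokens.
import random

def downsample(tokens, target_len, seed=0):
    """Randomly keep target_len tokens, preserving their original order."""
    if len(tokens) <= target_len:
        return list(tokens)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(tokens)), target_len))
    return [tokens[i] for i in keep]

note_a = "pt stable overnight no acute events plan continue current meds".split()
note_b = "pt remains stable".split()
matched_a = downsample(note_a, len(note_b))
print(matched_a, len(matched_a) == len(note_b))
```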
Once the script finishes, you can use `notebooks/note_comparison_heatmap.ipynb` to visualize the note comparison as a heatmap.
## Note Portion Comparison

TODO: clean up code
Change `DATA_DIR` in `scripts/sentence_select_similarity.sh` and `scripts/sentence_select.sh`, then run:

```bash
bash scripts/sentence_select.sh
bash scripts/sentence_select_similarity.sh
```

After they finish, you can make the plots in `notebooks/heurisitics_group_notes_plot-new.ipynb`.
- Logistic regression: change `model` in `scripts/sentence_select_similarity.sh` and `scripts/sentence_select.sh` to `LR`.
- Deep Averaging Network: change `model` in the same two scripts to `DAN`.
## Note Portion Comparison Based on Length (Quartile analysis)

We first need to count the number of tokens in each admission:

```bash
python -m processing.count_token -data_dir PATH_TO_PROCESSED_DATA_DIR -n_worker 60
```

Then, we split the selected sentences into quartiles:

```bash
python -m processing.quartile_split -data_dir PATH_TO_PROCESSED_DATA_DIR
```
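For intuition, the quartile split amounts to bucketing admissions by token count; a minimal sketch with `pandas.qcut` (column names are assumptions):

```python
# Sketch of a quartile split by note length (column names are assumptions;
# the real logic is in processing.quartile_split).
import pandas as pd

df = pd.DataFrame({"hadm_id": [1, 2, 3, 4, 5, 6, 7, 8],
                   "n_tokens": [120, 450, 90, 3000, 800, 60, 1500, 300]})
df["quartile"] = pd.qcut(df["n_tokens"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(df.sort_values("n_tokens"))
```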
It is fine to see the error `FileNotFoundError: [Errno 2] No such file or directory: '/data/test_mimic_output//select_sentence/DAN/mortality'` when running `processing.quartile_split` if you have not run the previous step with the `DAN` model.

Finally, you can visualize the plots in `notebooks/heurisitics_group_notes_plot-new.ipynb`.
## Troubleshooting

The `pandas` version might affect `python -m mimic3preprocess.scripts.extract_subjects $DATA_DIR $OUTPUT_DIR`. If you run into this problem, match the `pandas` version pinned in `requirements.txt`.
## Contact

Chao-Chun Hsu, [email protected]
## Citation

```bibtex
@inproceedings{hsu2020characterizing,
  title={Characterizing the Value of Information in Medical Notes},
  author={Hsu, Chao-Chun and Karnwal, Shantanu and Mullainathan, Sendhil and Obermeyer, Ziad and Tan, Chenhao},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings},
  pages={2062--2072},
  year={2020}
}
```