
Guidance in Radiology Report Summarization: An Empirical Evaluation and Error Analysis

This repository provides the code for the following paper:

Jan Trienes, Paul Youssef, Jörg Schlötterer, and Christin Seifert. 2023. Guidance in Radiology Report Summarization: An Empirical Evaluation and Error Analysis. In Proceedings of the 16th International Natural Language Generation Conference (INLG), Prague, Czech Republic. Association for Computational Linguistics.

Contents:

  1. Clone Repository
  2. Computational Environment
  3. Data
    1. Overview
    2. MIMIC
    3. OpenI
  4. Experiments
    1. Training Scripts
    2. Notebooks
    3. RadNLI
  5. Tests and Linting
  6. Citation
  7. Contact

Clone Repository

The code for the summarization methods is included as git submodules. Clone the repository as follows:

git clone --recurse-submodules [email protected]:mcmi-group/guided-summary.git
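
If you have already cloned the repository without --recurse-submodules, you can fetch the submodules afterwards:

git submodule update --init --recursive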

Computational Environment

conda env update -f environment.yml
conda activate guided-summary

pip install -r requirements-dev.txt
pip install -e .
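
As a quick sanity check, you can verify that the environment sees your GPUs. This assumes PyTorch is among the dependencies, since PreSumm and GSum are PyTorch-based:

python -c "import torch; print(torch.cuda.is_available())"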

For evaluation, please install ROUGE as per these instructions. Furthermore, build the CheXpert Docker image with this script: ./scripts/build_chexpert.sh.
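
To verify the evaluation setup, a minimal check could look as follows. This assumes the pyrouge Python bindings are used for ROUGE and that the build script tags the image with a name containing "chexpert"; adjust both to your setup:

docker images | grep -i chexpert                             # confirm the CheXpert image was built
python -c "from pyrouge import Rouge155; print('ROUGE OK')"  # assumes the pyrouge bindings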

Data

Overview

| Artifact | Description | Link | Where to extract |
|---|---|---|---|
| Datasets | Use the scripts below to download and pre-process the raw MIMIC-CXR and OpenI datasets. | see below | n/a |
| Error annotations | 1,200 expert annotations (100 reports × 4 candidates × 3 annotators) of MIMIC-CXR test reports. | TBA | error-analysis/data/ |
| Model outputs | Outputs generated by all summarization models. | MIMIC-CXR (tba) \| OpenI | outputs/ |
| Checkpoints | Pre-trained models. | MIMIC-CXR (tba) \| OpenI | outputs/ |
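
Once downloaded, the artifacts go into the directories listed in the last column. A hypothetical sketch (the archive names are illustrative, not the actual release file names):

mkdir -p error-analysis/data outputs
tar -xzf error-annotations.tar.gz -C error-analysis/data/   # hypothetical archive name
tar -xzf model-outputs.tar.gz -C outputs/                   # hypothetical archive name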

MIMIC

Source: https://physionet.org/content/mimic-cxr/2.0.0/

# Set your PhysioNet username; the script prompts for your password.
export PHYSIONET_USER=...
./scripts/preprocess_mimic.sh
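
# For reference, the PhysioNet download typically boils down to an authenticated
# wget call like the one below; preprocess_mimic.sh handles this for you, and the
# exact invocation inside the script may differ:
#
#   wget -r -N -c -np --user "$PHYSIONET_USER" --ask-password https://physionet.org/files/mimic-cxr/2.0.0/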

# Build PreSumm dataset with findings section
source scripts/config_mimic.sh
./scripts/ds_unguided.sh
./scripts/ds_oracle.sh

# Build PreSumm dataset with background + findings section
source scripts/config_mimic_bg.sh
./scripts/ds_unguided.sh
./scripts/ds_oracle.sh

# Build WGSum datasets
python scripts/convert_to_wgsum.py

CUDA_VISIBLE_DEVICES=0 ./scripts/ds_wgsum.sh data/processed/mimic-wgsum/
CUDA_VISIBLE_DEVICES=1 ./scripts/ds_wgsum.sh data/processed/mimic-bg-wgsum/
CUDA_VISIBLE_DEVICES=2 ./scripts/ds_wgsum.sh data/processed/mimic-official-wgsum/
CUDA_VISIBLE_DEVICES=3 ./scripts/ds_wgsum.sh data/processed/mimic-official-bg-wgsum/

# Build WGSum+CL dataset
CUDA_VISIBLE_DEVICES=0 ./scripts/ds_wgsum_cl.sh data/processed/mimic-wgsum/ data/processed/mimic-wgsum-cl/
CUDA_VISIBLE_DEVICES=1 ./scripts/ds_wgsum_cl.sh data/processed/mimic-bg-wgsum/ data/processed/mimic-bg-wgsum-cl/
CUDA_VISIBLE_DEVICES=2 ./scripts/ds_wgsum_cl.sh data/processed/mimic-official-wgsum/ data/processed/mimic-official-wgsum-cl/
CUDA_VISIBLE_DEVICES=3 ./scripts/ds_wgsum_cl.sh data/processed/mimic-official-bg-wgsum/ data/processed/mimic-official-bg-wgsum-cl/
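
The ds_wgsum.sh and ds_wgsum_cl.sh invocations above are pinned to separate GPUs so the dataset builds can run in parallel. On a single-GPU machine, the same builds can be run sequentially, for example:

# Single-GPU alternative: build the WGSum datasets one after another
for ds in mimic mimic-bg mimic-official mimic-official-bg; do
    CUDA_VISIBLE_DEVICES=0 ./scripts/ds_wgsum.sh "data/processed/${ds}-wgsum/"
done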

OpenI

Source: https://openi.nlm.nih.gov/faq#collection

# Pre-process OpenI and build PreSumm dataset with findings section
source scripts/config_openi.sh
./scripts/preprocess_openi.sh
./scripts/ds_unguided.sh
./scripts/ds_oracle.sh

# Build PreSumm dataset with background + findings section
source scripts/config_openi_bg.sh
./scripts/ds_unguided.sh
./scripts/ds_oracle.sh

# Build WGSum datasets
python scripts/convert_to_wgsum.py

CUDA_VISIBLE_DEVICES=2 ./scripts/ds_wgsum.sh data/processed/openi-wgsum/
CUDA_VISIBLE_DEVICES=3 ./scripts/ds_wgsum.sh data/processed/openi-bg-wgsum/

# Build WGSum+CL dataset
CUDA_VISIBLE_DEVICES=2 ./scripts/ds_wgsum_cl.sh data/processed/openi-wgsum/ data/processed/openi-wgsum-cl/
CUDA_VISIBLE_DEVICES=3 ./scripts/ds_wgsum_cl.sh data/processed/openi-bg-wgsum/ data/processed/openi-bg-wgsum-cl/

Experiments

The code is based on the original PreSumm and GSum implementations. When training for the first time, use only one GPU so that the pre-trained models can be downloaded without conflicting concurrent downloads; once they are cached, training can be restarted with multiple GPUs.

Configure training by sourcing one of the dataset configurations. Choices = {openi, mimic, mimic_bg, mimic_official, mimic_official_bg}.

source scripts/config_XXXXX.sh
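
For example, for the MIMIC findings-only configuration:

source scripts/config_mimic.sh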

Training Scripts

#### For Slurm, prepend the following (adapt --gpus accordingly)
# sbatch --partition GPUampere --gpus 5 --time 10:00:00 [script]

##### Base Models
# OracleExt
./scripts/train_extoracle.sh

# BertExt (fixed, k=1)
./scripts/train_bertext.sh

# BertAbs
./scripts/train_bertabs.sh

# WGSum + WGSum+CL
./scripts/train_wgsum.sh
./scripts/train_wgsum_cl.sh

# GSum w/ OracleExt
./scripts/train_gsum_oracle.sh

##### GSum w/ Fixed-Length and Variable-Length Guidance (ours):
# BertExt (fixed, k=[1,5], LR-Approx, BERT-Approx, Thresholding, k=|OracleExt|)
./scripts/train_bertext_allranks.sh
./scripts/train_bertext_thresholds.sh
./scripts/ds_variable.sh

# GSum (oracle-trained) w/ different BertExt strategies
./scripts/test_gsum.sh

# Abstain experiments
./scripts/ds_abstain.sh
./scripts/train_gsum_oracle_abstain.sh
./scripts/test_gsum_abstain.sh
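
For example, to queue BertAbs training on a Slurm cluster as per the comment at the top of this block (partition name, GPU count, and time limit are illustrative):

sbatch --partition GPUampere --gpus 5 --time 10:00:00 ./scripts/train_bertabs.sh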

Notebooks

| Notebook | Purpose | Paper Figures/Tables |
|---|---|---|
| 01-statistics.ipynb | Calculate descriptive statistics of the datasets. | Tables 1, 8 |
| 02-evaluation.ipynb | Evaluate all model runs. | Tables 2-6, Figures 3, 5, 6 |
| 03-example.ipynb | Example report with model outputs. | Figure 1 |
| 04-error-analysis-assignment.ipynb | Prepare reports for error analysis, and assign to annotators. | n/a |
| 05-error-analysis-results.ipynb | Analysis of manual error annotations. | Figure 4, Table 9 |
| 06-error-analysis-radnli.ipynb | Evaluating the factuality of addition spans with RadNLI (see below). | Table 7 |
| 07-dataset-inconsistency.ipynb | Measuring duplication in MIMIC-CXR and showing examples. | Table 11 |

RadNLI

To run the RadNLI experiment for evaluating the factuality of additions, set up the environment below:

mamba env update -f radnli_env.yml
conda activate radnli

You also need to download the pre-trained model:

cd ifcc/resources && ./download.sh

After that, you can run the experiment via notebooks/06-error-analysis-radnli.ipynb.
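
To execute the notebook non-interactively, one option is nbconvert (assuming jupyter is installed in the radnli environment):

jupyter nbconvert --to notebook --execute --inplace notebooks/06-error-analysis-radnli.ipynb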

Tests and Linting

To test, lint, and autoformat, use the following Make targets:

make test
make lint
make format

Citation

If you use the resources in this repository, please cite:

@InProceedings{Trienes:2023:INLG,
    title = "Guidance in Radiology Report Summarization: {A}n Empirical Evaluation and Error Analysis",
    author = {Trienes, Jan  and
      Youssef, Paul  and
      Schl{\"o}tterer, J{\"o}rg  and
      Seifert, Christin},
    booktitle = "Proceedings of the 16th International Natural Language Generation Conference (INLG)",
    year = "2023",
    doi = "10.18653/v1/2023.inlg-main.13",
    pages = "176--195",
}

Contact

If you have any questions, please contact Jan Trienes at jan.trienes [AT] gmail.com.
