[NeurIPS 2023] CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews

WojciechKusa/systematic-review-datasets

CSMeD: Citation Screening Meta-Dataset for systematic review automation evaluation

This package serves as the basis for the paper "CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews" by Wojciech Kusa, Oscar E. Mendoza, Matthias Samwald, Petr Knoth, and Allan Hanbury (2023).

https://proceedings.neurips.cc/paper_files/paper/2023/hash/4962a23916103301b27bde29a27642e8-Abstract-Datasets_and_Benchmarks.html


Table of Contents

  1. CSMeD: Title and abstract screening datasets
  2. CSMeD-FT: Full-text screening dataset
  3. Installation
  4. Examples
  5. Visualisations
  6. Experiments

Original datasets used to create CSMeD are described in the table below:

Introduced in # reviews Domain Avg. size Avg. ratio of included (TA) Avg. ratio of included (FT) Additional data Data URL Cochrane Publicly available Included in CSMeD
1 Cohen et al. (2006) 15 Drug 1,249 7.7% Web
2 Wallace et al. (2010) 3 Clinical 3,456 7.9% GitHub
3 Howard et al. (2015) 5 Mixed 19,271 4.6% Supplementary
4 Miwa et al. (2015) 4 Social science 8,933 6.4%
5 Scells et al. (2017) 93 Clinical 1,159 1.2% Search queries GitHub
6 CLEF TAR 2017 50 DTA 5,339 4.4% Review protocol GitHub
7 CLEF TAR 2018 30 DTA 7,283 4.7% Review protocol GitHub
8 CLEF TAR 2019 49 Mixed** 2,659 8.9% Review protocol GitHub
9 Alharbi et al. (2019) 25 Clinical 4,402 0.4% Review updates GitHub
10 Hannousse et al. (2022) 7 Computer Science 340 11.7% Review protocol GitHub

TA stands for the Title + Abstract screening phase, FT for the Full-text screening phase. Avg. size is the size of a review, measured as the number of records retrieved by the search query. Avg. ratio of included (TA) is the average ratio of records included in the TA phase. Avg. ratio of included (FT) is the average ratio of records included in the FT phase.

CSMeD datasets

Beyond offering unified access to the original datasets, CSMeD provides a unified meta-dataset that contains all of them. Statistics of the CSMeD datasets are presented in the tables below.

| Dataset name | #reviews | #docs | #included | Avg. #docs | Avg. %included | Avg. #words in document |
|---|---|---|---|---|---|---|
| CSMeD-basic | | | | | | |
| CSMeD-basic-train | 30 | 128,438 | 7,958 | 4,281 | 9.6% | 229 |
| CSMeD-cochrane | | | | | | |
| CSMeD-cochrane-train | 195 | 372,422 | 7,589 | 1,910 | 21.9% | 180 |
| CSMeD-cochrane-dev | 100 | 229,376 | 4,365 | 2,294 | 20.8% | 201 |
| CSMeD-all | 325 | 730,236 | 19,912 | 2,247 | 20.5% | 195 |

| Dataset name | #reviews | #docs | #included | %included | Avg. #words in document | Avg. #words in review |
|---|---|---|---|---|---|---|
| CSMeD-FT-train | 148 | 2,053 | 904 | 44.0% | 4,535 | 1,493 |
| CSMeD-FT-dev | 36 | 644 | 202 | 31.4% | 4,419 | 1,402 |
| CSMeD-FT-test | 29 | 636 | 278 | 43.7% | 4,957 | 2,318 |
| CSMeD-FT-test-small | 16 | 50 | 22 | 44.0% | 5,042 | 2,354 |

In the CSMeD-FT table, the '#docs' column is the total number of documents in the dataset and '#included' is the number of documents included at the full-text step. CSMeD-FT-test-small is a subset of CSMeD-FT-test.
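As a quick sanity check on how the percentage columns relate to the counts, the sketch below recomputes '%included' for the CSMeD-FT splits as '#included' divided by '#docs', using only the numbers printed in the table. It also computes the same pooled ratio for CSMeD-basic-train, which comes out lower than the printed 'Avg. %included'. One plausible reading, which is an assumption here and not something the README states, is that the 'Avg.' columns are averaged per review, not pooled over all documents. The helper name `pooled_percent` is made up for illustration.

```python
# (#docs, #included, %included as printed) copied from the CSMeD-FT table.
ft_splits = {
    "CSMeD-FT-train": (2053, 904, 44.0),
    "CSMeD-FT-dev": (644, 202, 31.4),
    "CSMeD-FT-test": (636, 278, 43.7),
    "CSMeD-FT-test-small": (50, 22, 44.0),
}


def pooled_percent(n_docs: int, n_included: int) -> float:
    """Percentage of included documents over the whole split (pooled ratio)."""
    return 100.0 * n_included / n_docs


for name, (n_docs, n_included, reported) in ft_splits.items():
    computed = pooled_percent(n_docs, n_included)
    print(f"{name}: computed {computed:.1f}%, reported {reported:.1f}%")

# CSMeD-basic-train: the pooled ratio differs from the printed 'Avg. %included'
# (9.6%), consistent with (but not proof of) a per-review average.
basic_train_pooled = pooled_percent(128438, 7958)
print(f"CSMeD-basic-train pooled: {basic_train_pooled:.1f}% (table: 9.6%)")
```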

Requirements

Assuming you have conda installed, create an environment for loading CSMeD by running:

$ conda create -n csmed python=3.10
$ conda activate csmed
(csmed)$ pip install -r requirements.txt

Data acquisition prerequisites

To obtain the metadata for CSMeD-Cochrane datasets, you need to configure the cookie for the Cochrane Library website.

Furthermore, to obtain full-text PDFs for CSMeD-FT, you need to configure the following:

  1. SemanticScholar API key: https://www.semanticscholar.org/product/api
  2. CORE API key: https://core.ac.uk/services/api
  3. GROBID: https://grobid.readthedocs.io/en/latest/Install-Grobid/
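Before configuring the GROBID path, it can help to confirm that a GROBID server is actually reachable. The following is a minimal sketch, not part of this repository: it assumes a server at GROBID's default address `http://localhost:8070` and queries its `/api/isalive` health endpoint. The function names are illustrative only.

```python
import urllib.error
import urllib.request


def isalive_url(base_url: str) -> str:
    """Build the health-check URL for a GROBID server base URL."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False
```

Calling `grobid_is_alive()` returns `False` when no server is listening; in that case, adjust the base URL to match your installation.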

If you have all the prerequisites, run:

(csmed)$ python configure.py

Then follow the prompts, providing the API keys, the cookie, an email address for the PubMed Entrez APIs, and the path to the GROBID server. You do not need to provide all of this information: the bare minimum to construct the datasets is the cookie from the Cochrane Library and the email address for PubMed Entrez.
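The split between required and optional settings can be sketched as a small check. This is purely illustrative: the key names below are hypothetical and do not reflect the format that the configuration script actually writes. Only the required/optional split is taken from the text above.

```python
# Hypothetical key names, chosen for illustration only.
REQUIRED_KEYS = ("cochrane_cookie", "entrez_email")
OPTIONAL_KEYS = ("semantic_scholar_api_key", "core_api_key", "grobid_url")


def check_config(config: dict) -> tuple[list[str], list[str]]:
    """Return (missing required keys, missing optional keys)."""
    missing_required = [k for k in REQUIRED_KEYS if not config.get(k)]
    missing_optional = [k for k in OPTIONAL_KEYS if not config.get(k)]
    return missing_required, missing_optional


# Example: only the bare minimum is provided.
example = {"cochrane_cookie": "<cookie>", "entrez_email": "me@example.org"}
missing_required, missing_optional = check_config(example)
if missing_required:
    print("Cannot build the datasets, missing:", ", ".join(missing_required))
else:
    print("Minimum configuration present.")
if missing_optional:
    print("Full-text download may be limited, missing:", ", ".join(missing_optional))
```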

Downloading raw full-text datasets

First install additional requirements:

(csmed)$ pip install -r dev-requirements.txt

To download the datasets, run:

(csmed)$ python scripts/prepare_full_texts.py

Examples presenting how to use the datasets are available in the notebooks/ directory.

To run the visualisations, first install the additional requirements:

(csmed)$ pip install -r vis-requirements.txt

Then you can run the visualisations using streamlit:

(csmed)$ streamlit run visualisation/_🏠_Home.py

Baseline experiments from the paper are described in the WojciechKusa/CSMeD-baselines repository.