Contact person: Tim Baumgärtner
Don't hesitate to send us an e-mail or open an issue if something is broken (and it shouldn't be) or if you have further questions.
To run the experiments, you need to install the following dependencies:
- GROBID 0.8
- Java 21 (for BM25 retrieval experiments with pyserini)
- uv
To set up the environment, you can use the following commands:
# download python version with uv
uv python install 3.10
# create a virtual environment
uv venv .venv
# activate the virtual environment
source .venv/bin/activate
# install the required python packages
uv pip install .
This section describes how to download the data from the different sources and how to preprocess it for the experiments.
- Create a new directory `data` and download and unzip the questions into it:
mkdir data && cd data && curl -LO 'https://tudatalib.ulb.tu-darmstadt.de/bitstream/handle/tudatalib/4467/peerqa-data-v1.0.zip?sequence=1&isAllowed=y' && unzip peerqa-data-v1.0.zip && mv peerqa-data-v1.0/* . && rm -rf peerqa-data-v1.0 && cd ..
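As a quick sanity check after unzipping, you can count the downloaded question records. A minimal sketch, assuming the archive provides `data/qa.jsonl` in JSON Lines format (see the file overview below):

```python
# Sanity check: count the downloaded question records.
# Assumes data/qa.jsonl exists and is JSON Lines (one JSON object per line).
import json
from pathlib import Path

with Path("data/qa.jsonl").open() as f:
    questions = [json.loads(line) for line in f]

print(f"{len(questions)} question records")
print("keys of first record:", sorted(questions[0].keys()))
```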
To adhere to the licenses of the papers, we cannot provide the papers directly. Instead, we provide the steps to download the papers from the respective sources and extract the text from them.
- Download NLPeer data from https://tudatalib.ulb.tu-darmstadt.de/bitstream/handle/tudatalib/3618/nlpeer_v0.zip?sequence=3&isAllowed=y. Unzip it and copy the path to the `data/nlpeer` directory.
- Download PDFs from OpenReview for ICLR 2022, ICLR 2023, and NeurIPS:
uv run download_openreview.py
- Download the EGU PDFs for ESurf and ESD:
uv run download_egu.py
- Download GROBID 0.8.0. Specifically, download the source code and run `./gradlew run` inside the `grobid-0.8.0` directory to start the server.
- Extract the text from the PDFs to create `data/papers.jsonl`:
uv run extract_text_from_pdf.py --nlpeer_path data/nlpeer
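Before running the extraction, you may want to confirm that the GROBID server is reachable. A minimal sketch, assuming GROBID runs on its default port 8070:

```python
# Check that the GROBID server started by ./gradlew run is reachable.
# Assumes the default host/port (localhost:8070); adjust if you changed it.
from urllib.request import urlopen

with urlopen("http://localhost:8070/api/isalive", timeout=5) as resp:
    print("GROBID alive:", resp.read().decode())
```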
Now the data is ready for the experiments.
Once the download and preprocessing steps are completed, the following files should be present in the `data` directory:
- papers.jsonl
- qa.jsonl
- qa-augmented-answers.jsonl
- qa-unlabeled.jsonl
Each line of `papers.jsonl` is a JSON object with the following keys:

Key | Type | Description |
---|---|---|
idx | int | The index of the paper in the dataset |
pidx | int | The index of the paragraph in the paper |
sidx | int | The index of the sentence in the paragraph |
type | str | The type of the content (e.g., title, heading, caption) |
content | str | The content of the paragraph |
last_heading | str | The last heading before the paragraph |
paper_id | str | The unique identifier of the paper: the first part is the source of the paper (e.g., openreview, egu, nlpeer), the second part is the venue (e.g., ICLR-2022-conf, ESurf, ESD), and the third part is a unique identifier for the paper |
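For illustration, a minimal sketch of loading `papers.jsonl` and inspecting one paper with the keys above (not part of the experiment scripts):

```python
# Load papers.jsonl and inspect the records of a single paper.
# Uses only the keys documented in the table above.
import json
from collections import defaultdict

papers = defaultdict(list)
with open("data/papers.jsonl") as f:
    for line in f:
        record = json.loads(line)
        papers[record["paper_id"]].append(record)

paper_id, records = next(iter(papers.items()))
print(paper_id, "->", len(records), "records")
for record in records[:5]:
    print(record["idx"], record["type"], record["content"][:60])
```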
Each line of `qa.jsonl` is a JSON object with the following keys:

Key | Type | Description |
---|---|---|
paper_id | str | The unique identifier of the paper; see above for composition |
question_id | str | The unique identifier of the question |
question | str | The question |
raw_answer_evidence | List[str] | The raw evidence that has been highlighted in the PDF by the authors |
answer_evidence_sent | List[str] | The evidence sentences that have been extracted from the raw evidence |
answer_evidence_mapped | List[Dict[str, Union[str, List[int]]]] | The evidence sentences with the corresponding indices in the paper. If a sentence corresponds to multiple sentences in the papers.jsonl file, multiple indices will be provided here. |
answer_free_form | str | The free-form answer provided by the authors |
answerable | bool | Whether the question is answerable according to the authors |
answerable_mapped | bool | Whether the question is answerable according to the authors and it has mapped evidence |
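Similarly, a small sketch of loading `qa.jsonl` and counting answerable questions per paper, using only the keys above:

```python
# Load qa.jsonl and count answerable questions per paper.
import json
from collections import Counter

with open("data/qa.jsonl") as f:
    qa = [json.loads(line) for line in f]

answerable = Counter(q["paper_id"] for q in qa if q.get("answerable"))
print(len(qa), "questions in total")
print("papers with most answerable questions:", answerable.most_common(3))
```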
This section describes how to run the retrieval experiments for the PeerQA dataset. We provide the scripts for the Dense & Cross-Encoder, BM25, and ColBERT retrieval models.
- Create the qrels file for sentence-level and paragraph-level retrieval
uv run retrieval_create_qrels.py
The following table provides an overview of the models used for the retrieval experiments along with their respective configurations.
To reproduce the decontextualization experiments, add a `--template` argument to the scripts. In the paper we used `--template="Title: {title} Paragraph: {content}"` for paragraph chunks (i.e. `--granularity=paragraphs`) and `--template="Title: {title} Sentence: {content}"` for sentence chunks (i.e. `--granularity=sentences`). A short sketch of how such a template is applied follows the table below.
Query Model | Document Model | Similarity Function | Pooling |
---|---|---|---|
facebook/contriever | - | dot | mean_pooling |
facebook/contriever-msmarco | - | dot | mean_pooling |
facebook/dragon-plus-query-encoder | facebook/dragon-plus-context-encoder | dot | first_token |
sentence-transformers/gtr-t5-xl | - | dot | mean_pooling |
naver/splade-v3 | - | dot | splade |
cross-encoder/ms-marco-MiniLM-L-12-v2 | - | cross | - |
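To make the template mechanics concrete, here is a small sketch of how such a template string can be applied to a chunk; the chunk dictionary is hypothetical, and the actual scripts may fill the `{title}` and `{content}` placeholders differently:

```python
# Hypothetical illustration of how a --template string is rendered for a chunk.
template = "Title: {title} Paragraph: {content}"
chunk = {
    "title": "PeerQA: A Scientific Question Answering Dataset from Peer Reviews",
    "content": "We compare dense, sparse and late-interaction retrievers on PeerQA.",
}
print(template.format(**chunk))
# Title: PeerQA: ... Paragraph: We compare dense, sparse ...
```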
- Run the retrieval
uv run retrieval_dense_cross_retrieval.py --query_model=facebook/contriever-msmarco --sim_fn=dot --pooling=mean_pooling --granularity=sentences
- Run the retrieval evaluation
uv run retrieval_evaluate.py --query_model=facebook/contriever-msmarco --sim_fn=dot --granularity=sentences
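For reference, a minimal sketch of the scoring configuration used above (mean pooling and dot-product similarity with facebook/contriever-msmarco); retrieval_dense_cross_retrieval.py remains the reference implementation:

```python
# Mean pooling over token embeddings + dot-product similarity,
# matching the facebook/contriever-msmarco row of the table above.
import torch
from transformers import AutoModel, AutoTokenizer

name = "facebook/contriever-msmarco"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    # Mean pooling over non-padding tokens.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query = embed(["How was the answer evidence annotated?"])
sentences = embed(["Authors highlight evidence in the PDF.", "We use GROBID for parsing."])
print(query @ sentences.T)  # dot-product scores
```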
- Make sure Java 21 is installed. This is required for pyserini.
- Run the data preprocessing to convert the data to pyserini format.
uv run retrieval_pyserini_preprocess.py --granularity=sentences
- Run the indexing
bash retrieval_pyserini_index.sh sentences
- Run the retrieval
uv run retrieval_pyserini_retrieval.py --granularity=sentences
- Run the retrieval evaluation
uv run retrieval_evaluate.py --query_model=bm25 --sim_fn=sparse --granularity=sentences
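As an illustration, the resulting BM25 index can also be queried directly with pyserini; the index path below is an assumption, so use whatever path retrieval_pyserini_index.sh writes to:

```python
# Query a Lucene BM25 index with pyserini; the index path is hypothetical.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/peerqa-sentences")  # adjust to your index path
hits = searcher.search("How was the answer evidence annotated?", k=10)
for hit in hits:
    print(hit.docid, round(hit.score, 3))
```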
- Download the ColBERTv2 checkpoint from https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz
- Preprocess the data to convert it to the ColBERT format
uv run retrieval_colbert_preprocess.py --granularity=sentences
- Run the indexing
uv run retrieval_colbert_index.py --granularity=sentences
- Run the search
uv run retrieval_colbert_retrieval.py --granularity=sentences
- Postprocess the search results
uv run retrieval_colbert_postprocess.py --granularity=sentences
- Run the retrieval evaluation
uv run retrieval_evaluate.py --query_model=colbert --sim_fn=maxsim --granularity=sentences
This section describes how to run the answerability experiments for the PeerQA dataset. We provide the scripts for the answerability prediction and evaluation.
- Run the answerability prediction:
1.1 For the full-text setting, use the following arguments:
uv run generate.py --model=llama-8B-instruct --prompt_selection=answerability-full-text
1.2 For the RAG setting, use the following arguments:
uv run generate.py --model=llama-8B-instruct --prompt_selection=answerability-rag --context_setting=10
1.3 For the gold setting, use the following arguments:
uv run generate.py --model=llama-8B-instruct --prompt_selection=answerability-rag --context_setting=gold
- Run the answerability evaluation, and set the `--generation_file` argument to the path of the generated answers from the previous step (here we use the RAG setup with gold paragraphs as an example):
uv run generations_evaluate_answerability.py --generation_file=out/generations-llama-8B-instruct-8k-answerability-rag-gold.jsonl
To run the answerability task with OpenAI models, use `generate_openai.py` instead of `generate.py`.
This section describes how to run the answer generation experiments for the PeerQA dataset. We provide the scripts for the answer generation and evaluation.
- Download the AlignScore model and NLTK data for evaluation
curl -L https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-large.ckpt?download=true -o AlignScore-large.ckpt
python -c "import nltk; nltk.download('punkt_tab')"
- Run the answer generation:
2.1 For the full-text setting, use the following arguments:
uv run generate.py --model=llama-8B-instruct --prompt_selection=full-text
2.2 For the RAG setting, use the following arguments:
uv run generate.py --model=llama-8B-instruct --prompt_selection=rag --context_setting=10
2.3 For the gold setting, use the following arguments:
uv run generate.py --model=llama-8B-instruct --prompt_selection=rag --context_setting=gold
- Run Rouge and AlignScore evaluation and set the `--generation_file` argument to the path of the generated answers from the previous step (here we use the full-text setup as an example; a short Rouge sketch follows this list):
uv run generations_evaluate_rouge_alignscore.py --generation_file=out/generations-llama-8B-instruct-8k-full-text.jsonl
- Run Prometheus evaluation
uv run generations_evaluate_prometheus.py --generation_file=out/generations-llama-8B-instruct-8k-full-text.jsonl
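For intuition on the Rouge part of the evaluation, here is a small sketch comparing a generated answer against a free-form reference answer with the rouge-score package; the exact metric configuration is an assumption, and generations_evaluate_rouge_alignscore.py remains the reference:

```python
# Illustrative ROUGE-L comparison between a generated and a reference answer.
# The repository's evaluation script may use a different ROUGE configuration.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "The authors highlighted the evidence sentences directly in the PDF."
generated = "Evidence was marked by the authors as highlights in the paper PDF."
print(scorer.score(reference, generated)["rougeL"].fmeasure)
```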
To run the answer generation task with OpenAI models, use `generate_openai.py` instead of `generate.py`.
Please use the following citation:
@article{peerqa,
title={PeerQA: A Scientific Question Answering Dataset from Peer Reviews},
author={Tim Baumgärtner and Ted Briscoe and Iryna Gurevych},
year={2025},
eprint={2502.13668},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.13668},
}
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.