Retrieved sentences for each (question, answer option) pair in three multiple-choice science question answering datasets (i.e., ARC-Easy, ARC-Challenge, and OpenBookQA), from the integrated reference corpus (IRC) plus the integrated external corpus (IEC) described in the paper "Improving Question Answering with External Knowledge".
This is a re-implementation. As of the release date of this repository, the Allen Institute for Artificial Intelligence (AI2) does not allow third parties to redistribute the ARC Corpus, so we cannot directly release a resource containing the retrieved sentences from the ARC Corpus. Instead, for all such sentences, we provide pointers into the ARC Corpus, along with a script that fetches the retrieved sentences from your local copy of the corpus based on those pointers.
If you find this resource useful, please cite the following paper:

```
@inproceedings{pan2019improving,
  title={Improving Question Answering with External Knowledge},
  author={Pan, Xiaoman and Sun, Kai and Yu, Dian and Chen, Jianshu and Ji, Heng and Cardie, Claire and Yu, Dong},
  booktitle={Proceedings of the Workshop on Machine Reading for Question Answering},
  address={Hong Kong, China},
  url={https://arxiv.org/abs/1902.00993v2},
  year={2019}
}
```
Below are the detailed instructions.
- Clone this repository.
- Download `ARC-V1-Feb2018.zip` from AI2, unzip it, and copy `ARC_Corpus.txt` (in the unzipped folder `ARC-V1-Feb2018-2`) to the `data` folder. The CRC of `ARC_Corpus.txt` should be `8CFE08C6`; a short checksum sketch is given at the end of these instructions.
- Run `python3 gen.py` to generate `arc_challenge.json`, `arc_easy.json`, and `openbookqa.json`, which are the input for the models IRC + IEC and IRC + IEC + MD in Table 5 of the paper. The format of these files is as follows.
```
{
  FileName-QuestionID: [
    retrieved sentences for the 1st option,
    retrieved sentences for the 2nd option,
    ...
  ],
  ...
}
```
File names and question IDs follow `ARC-V1-Feb2018.zip` and `OpenBookQA-V1-Sep2018.zip`. Within each option's entry, the retrieved sentences are separated by `"\n"`.
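For example, reading one of the generated files could look like the following minimal sketch (it assumes `gen.py` has already produced `openbookqa.json` in the repository root; the variable names are illustrative):

```python
import json

# Load one of the files produced by gen.py.
with open("openbookqa.json") as f:
    retrieved = json.load(f)

# Each key is "FileName-QuestionID"; each value holds one string per
# answer option, with that option's retrieved sentences joined by "\n".
key, options = next(iter(retrieved.items()))
print(key)
for i, option_text in enumerate(options, start=1):
    sentences = option_text.split("\n")
    print(f"option {i}: {len(sentences)} retrieved sentences")
```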
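To verify the checksum mentioned in the second step, a few lines of Python suffice, assuming the stated value is a standard CRC-32 (the kind reported by `unzip -v`) and that `ARC_Corpus.txt` is already in `data/`:

```python
import zlib

# Stream the corpus and accumulate a CRC-32 without loading it all into memory.
crc = 0
with open("data/ARC_Corpus.txt", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        crc = zlib.crc32(chunk, crc)

print(f"{crc:08X}")  # should print 8CFE08C6
```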