GrailQA (Strongly Generalizable Question Answering)[1] is a large-scale, high-quality dataset for question answering over knowledge bases (KBQA) on Freebase. It contains 64,331 questions annotated with both answers and corresponding logical forms in different syntaxes (e.g., SPARQL and S-expression), and it can be used to test three levels of generalization in KBQA: i.i.d., compositional, and zero-shot. Please see the original paper and check out the link for more details.
The dataset can be downloaded via the link.
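For a quick look at the data, here is a minimal sketch of loading and inspecting the released JSON. The file name (`grailqa_v1.0_train.json`) and the field names (`question`, `s_expression`, `sparql_query`, `level`) are assumptions based on the official v1.0 release; adjust them to match your copy.

```python
import json

# Minimal sketch: inspect a few GrailQA examples.
# File and field names assume the official v1.0 JSON release;
# adjust if your copy differs.
with open("grailqa_v1.0_train.json", encoding="utf-8") as f:
    examples = json.load(f)

print(f"{len(examples)} training questions")

ex = examples[0]
print(ex["question"])      # natural-language question
print(ex["s_expression"])  # logical form in S-expression syntax
print(ex["sparql_query"])  # equivalent SPARQL query over Freebase
# Dev/test examples also carry a "level" field marking the
# generalization split: i.i.d., compositional, or zero-shot.
```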
The SOTA evaluation results below are copied from the official GrailQA leaderboard; please check out the link if you want to see more details. All copyrights belong to the authors of this dataset. The tables report overall results followed by results on the compositional and zero-shot generalization splits.
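For context on the EM and F1 columns: the official evaluation scores exact match (EM) of the predicted logical form against the gold one, and F1 over predicted versus gold answer sets. Below is a minimal sketch of the answer-set F1 computation, assuming answers are compared as sets of Freebase entity IDs; `answer_f1` and the example MIDs are illustrative, not the official script.

```python
def answer_f1(pred: set[str], gold: set[str]) -> float:
    """Set-level F1 between predicted and gold answer entities.

    Illustrative only; the official GrailQA script also normalizes
    answers and scores logical-form exact match (EM) separately.
    """
    if not pred and not gold:
        return 1.0  # both empty: treat as a perfect match
    tp = len(pred & gold)
    if tp == 0:
        return 0.0  # no overlap (also covers one side being empty)
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: two of three gold answers found, plus one spurious prediction.
pred = {"m.0f1", "m.0f2", "m.0x9"}  # hypothetical Freebase MIDs
gold = {"m.0f1", "m.0f2", "m.0f3"}
print(round(answer_f1(pred, gold), 3))  # 0.667
```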
**Overall**

Model / System | Year | EM | F1 | Reported by |
---|---|---|---|---|
DECAF (BM25 + FiD-3B) | 2022 | - | 78.7 | Yu et al. |
TIARA | 2022 | 73.0 | 78.5 | Shu et al. |
DeCC | 2022 | - | 77.6 | Yu et al. |
DECAF (BM25 + FiD-large) | 2022 | - | 76.0 | Yu et al. |
UniParser | 2022 | - | 74.6 | Yu et al. |
RnG-KBQA (single model) | 2021 | 68.778 | 74.422 | Ye et al. |
ArcaneQA | 2022 | 63.8 | 73.7 | Shu et al. |
S2QL (single model) | 2021 | 57.456 | 66.186 | Anonymous |
ReTraCk (single model) | 2021 | 58.136 | 65.285 | Chen et al. |
ArcaneQA (single model) | 2021 | 57.872 | 64.924 | Anonymous |
BERT+Ranking | 2022 | 50.6 | 58.0 | Shu et al. |
BERT+Ranking (single model) | 2021 | 50.578 | 57.988 | Gu et al. |
ChatGPT | 2023 | 46.77 | - | Tan et al. |
GloVe+Ranking (single model) | 2021 | 39.521 | 45.136 | Gu et al. |
QGG | 2022 | - | 36.7 | Yu et al. |
GPT-3.5v3 | 2023 | 35.43 | - | Tan et al. |
BERT+Transduction (single model) | 2021 | 33.255 | 36.803 | Gu et al. |
GPT-3.5v2 | 2023 | 30.50 | - | Tan et al. |
FLAN-T5 | 2023 | 29.02 | - | Tan et al. |
GPT-3 | 2023 | 27.58 | - | Tan et al. |
GloVe+Transduction (single model) | 2021 | 17.587 | 18.432 | Gu et al. |
**Compositional generalization**

Model / System | Year | EM | F1 | Reported by |
---|---|---|---|---|
DECAF (BM25 + FiD-3B) | 2022 | - | 81.8 | Yu et al. |
DECAF (BM25 + FiD-large) | 2022 | - | 79.0 | Yu et al. |
TIARA | 2022 | 69.2 | 76.5 | Shu et al. |
DeCC (Anonymous) | 2022 | - | 75.8 | Yu et al. |
ArcaneQA | 2022 | 65.8 | 75.3 | Shu et al. |
RnG-KBQA | 2022 | - | 71.2 | Yu et al. |
RnG-KBQA (single model) | 2021 | 63.792 | 71.156 | Ye et al. |
ReTraCk (single model) | 2021 | 61.499 | 70.911 | Chen et al. |
S2QL (single model) | 2021 | 54.716 | 64.679 | Anonymous |
ArcaneQA (single model) | 2021 | 56.395 | 63.533 | Anonymous |
BERT+Ranking (single model) | 2021 | 45.510 | 53.890 | Gu et al. |
GloVe+Ranking (single model) | 2021 | 39.955 | 47.753 | Gu et al. |
BERT+Transduction (single model) | 2021 | 31.040 | 35.985 | Gu et al. |
QGG | 2022 | - | 33.0 | Yu et al. |
GloVe+Transduction (single model) | 2021 | 16.441 | 18.507 | Gu et al. |
**Zero-shot generalization**

Model / System | Year | EM | F1 | Reported by |
---|---|---|---|---|
TIARA | 2022 | 68.0 | 73.9 | Shu et al. |
DeCC (Anonymous) | 2022 | - | 72.5 | Yu et al. |
DECAF (BM25 + FiD-3B) | 2022 | - | 72.3 | Yu et al. |
UniParser (Anonymous) | 2022 | - | 69.8 | Yu et al. |
RnG-KBQA (single model) | 2021 | 62.988 | 69.182 | Ye et al. |
DECAF (BM25 + FiD-large) | 2022 | - | 68.0 | Yu et al. |
ArcaneQA | 2022 | 52.9 | 66.0 | Shu et al. |
S2QL (single model) | 2021 | 55.122 | 63.598 | Anonymous |
ArcaneQA (single model) | 2021 | 49.964 | 58.844 | Anonymous |
BERT+Ranking (single model) | 2021 | 48.566 | 55.660 | Gu et al. |
ReTraCk (single model) | 2021 | 44.561 | 52.539 | Chen et al. |
QGG | 2022 | - | 36.6 | Yu et al. |
GloVe+Ranking (single model) | 2021 | 28.886 | 33.792 | Gu et al. |
BERT+Transduction (single model) | 2021 | 25.702 | 29.300 | Gu et al. |
GloVe+Transduction (single model) | 2021 | 2.968 | 3.123 | Gu et al. |
[1] Gu, Yu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su. "Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases." In Proceedings of the Web Conference 2021, pp. 3477-3488. 2021.