Existing approaches to Question Answering over Knowledge Graphs (KGQA) generalize poorly, often because of the standard i.i.d. assumption on the underlying dataset. Recently, three levels of generalization for KGQA were defined, namely i.i.d., compositional, and zero-shot. We analyze 25 well-known KGQA datasets spanning 5 different Knowledge Graphs (KGs). We show that, according to this definition, many existing and openly available KGQA datasets are either not suited to train a generalizable KGQA system or rely on discontinued and outdated KGs. Generating new datasets is a costly process and, thus, not a viable alternative for smaller research groups and companies. In this work, we propose a mitigation method for re-splitting available KGQA datasets to enable their use in generalization evaluation, without any cost or manual effort. We test our hypothesis on three KGQA datasets, i.e., LC-QuAD 1.0, LC-QuAD 2.0, and QALD-9.
By analyzing 25 existing KGQA datasets, we identify a substantial gap in the generalization evaluation of KGQA systems in the Semantic Web community. The main goal of this work is to reuse existing datasets from nearly a decade of research and thus generate new datasets applicable to generalization evaluation. We propose a simple and novel method to achieve this goal, and evaluate both the effectiveness of our method and the quality of the new datasets it generates for training and evaluating generalizable KGQA systems.
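The core idea behind the three levels can be illustrated with a minimal sketch (illustrative names only, not the repository's API): assuming every question is annotated with the KG schema items (classes/relations) its query uses and a normalized query template, each test question falls into exactly one level, following the definitions of Gu et al. (2021).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Question:
    text: str
    schema_items: frozenset  # KG classes/relations used by the target query (assumed annotation)
    template: str            # normalized query structure (assumed annotation)

def classify(q: Question, seen_items: set, seen_templates: set) -> str:
    """Assign one of the three generalization levels of Gu et al. (2021)."""
    if not q.schema_items <= seen_items:
        return "zero-shot"      # uses at least one schema item never seen in training
    if q.template not in seen_templates:
        return "compositional"  # known schema items combined in a novel way
    return "i.i.d."             # both schema items and composition occur in training

def classify_test_set(train: list, test: list) -> dict:
    seen_items = set().union(*(q.schema_items for q in train))
    seen_templates = {q.template for q in train}
    return {q.text: classify(q, seen_items, seen_templates) for q in test}
```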
The table below shows the evaluation results w.r.t. the three levels of generalization defined in (Gu et al., 2021).
Dataset | KG | Year | I.I.D. | Compositional | Zero-Shot |
---|---|---|---|---|---|
WebQuestions | Freebase | 2013 | ☑ | ☒ | ☒ |
SimpleQuestions | Freebase | 2015 | ☑ | ☒ | ☒ |
ComplexQuestions | Freebase | 2016 | - | - | - |
GraphQuestions | Freebase | 2016 | ☑ | ☑ | ☒ |
WebQuestionsSP | Freebase | 2016 | ☑ | ☒ | ☒ |
The 30M Factoid QA | Freebase | 2016 | ☑ | ☒ | ☒ |
SimpleQuestionsWikidata | Wikidata | 2017 | ☑ | ☒ | ☒ |
LC-QuAD 1.0 | DBpedia | 2017 | ☑ | ☑ | ☑ |
ComplexWebQuestions | Freebase | 2018 | ☑ | ☒ | ☒ |
QALD-9 | DBpedia | 2018 | ☑ | ☑ | ☑ |
PathQuestion | Freebase | 2018 | - | - | - |
MetaQA | WikiMovies | 2018 | - | - | - |
SimpleDBpediaQA | DBpedia | 2018 | ☑ | ☒ | ☒ |
TempQuestions | Freebase | 2018 | - | - | - |
LC-QuAD 2.0 | Wikidata | 2019 | ☑ | ☑ | ☑ |
FreebaseQA | Freebase | 2019 | - | - | - |
Compositional Freebase Questions | Freebase | 2020 | ☑ | ☑ | ☒ |
RuBQ 1.0 | Wikidata | 2020 | - | - | - |
GrailQA | Freebase | 2020 | ☑ | ☑ | ☑ |
Event-QA | EventKG | 2020 | - | - | - |
RuBQ 2.0 | Wikidata | 2021 | - | - | - |
MLPQ | DBpedia | 2021 | - | - | - |
Compositional Wikidata Questions | Wikidata | 2021 | ☑ | ☑ | ☒ |
TimeQuestions | Wikidata | 2021 | - | - | - |
CronQuestions | Wikidata | 2021 | - | - | - |
The statistics of the original datasets and their counterparts (*) generated by our approach are shown below.
Dataset | Total | Train | Validation | Test | I.I.D. | Compositional | Zero-Shot |
---|---|---|---|---|---|---|---|
QALD-9 | 558 | 408 | - | 150 | 46 | 53 | 51 |
LC-QuAD 1.0 | 5000 | 4000 | - | 1000 | 434 | 559 | 7 |
LC-QuAD 2.0 | 30221 | 24177 | - | 6044 | 4624 | 948 | 472 |
QALD-9* | 558 | 385 | - | 173 | 14 | 41 | 118 |
LC-QuAD 1.0* | 5000 | 3420 | 521 | 1059 | 331 | 1021 | 228 |
LC-QuAD 2.0* | 30221 | 20321 | 3267 | 6633 | 4014 | 3235 | 2651 |
- The datasets are available in `json` format.
- All the datasets are stored in the `output_dir` directory, which contains three sub-directories for LC-QuAD 1.0, LC-QuAD 2.0, and QALD-9, respectively. Each dataset directory in turn contains two sub-directories, one for its original version and one for its new version.
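A minimal way to inspect the generated files, assuming each split is stored as a top-level JSON array of questions (the exact file names inside each version directory are not specified here and are an assumption):

```python
import json
from pathlib import Path

def load_split(path):
    """Load one KGQA split (a JSON file of questions)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

output_dir = Path("output_dir")
for dataset_dir in sorted(output_dir.iterdir()):      # e.g., lcquad, lcquad2, qald
    for version_dir in sorted(dataset_dir.iterdir()): # original and new version
        for split_file in sorted(version_dir.glob("*.json")):
            questions = load_split(split_file)
            print(split_file, len(questions))
```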
- rdflib==6.0.2
- datasets==1.16.1
- scikit-learn==1.0.1
- numpy==1.20.3
- pandas==1.3.5
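The pinned dependencies can be installed directly with `pip`, for example:

```shell
pip install rdflib==6.0.2 datasets==1.16.1 scikit-learn==1.0.1 numpy==1.20.3 pandas==1.3.5
```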
Because this project uses the `kgqa_datasets` repository (see link), you need to clone it into the root directory of this project.
To ensure reproducibility, we set `random_seed` to 42 for all KGQA datasets (i.e., LC-QuAD 1.0, LC-QuAD 2.0, and QALD-9).
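As a sketch of what the fixed seed buys (how the repository applies the seed internally may differ), a split with a fixed `random_state` is identical across runs:

```python
from sklearn.model_selection import train_test_split

questions = list(range(100))  # stand-in for a list of KGQA questions
# The same random_state always reproduces the same train/test partition.
train, test = train_test_split(questions, test_size=0.3, random_state=42)
```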
The configuration for QALD-9:
- `dataset_id`: dataset-qald
- `input_path`: data_dir/qald/data_sets.json
- `output_dir`: output_dir/qald
- `sampling_ratio_zero`: .4
- `sampling_ratio_compo`: .1
- `sampling_ratio_iid`: .1
- `n_splits_compo`: 1
- `n_splits_zero`: 1
- `validation_size`: 0.0
The configuration for LC-QuAD 1.0:
- `dataset_id`: dataset-lcquad
- `input_path`: data_dir/lcquad/data_sets.json
- `output_dir`: output_dir/lcquad
- `sampling_ratio_zero`: .6
- `sampling_ratio_compo`: .1
- `sampling_ratio_iid`: .2
- `n_splits_compo`: 1
- `n_splits_zero`: 1
The configuration for LC-QuAD 2.0:
- `dataset_id`: dataset-lcquad2
- `input_path`: data_dir/lcquad2/data_sets.json
- `output_dir`: output_dir/lcquad2
- `sampling_ratio_zero`: .6
- `sampling_ratio_compo`: .1
- `sampling_ratio_iid`: .2
- `n_splits_compo`: 1
- `n_splits_zero`: 1
- `validation_size`: 0.0
- Prior to re-splitting a given KGQA dataset, preprocess the raw datasets by running the following command:
```shell
python preprocess.py --tasks <dataset_name> --data_dir <data_dir> --shuffle True --random_seed 42
```
- Re-split the given dataset by running the following command:
```shell
python resplit.py --dataset_id <dataset_id> --input_path <data_dir> --output_dir <output_dir> --sampling_ratio_zero .4 --sampling_ratio_compo .1 --sampling_ratio_iid .1 --random_seed 42 --n_splits_compo 1 --n_splits_zero 1 --validation_size 0.0
```
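For instance, re-splitting LC-QuAD 2.0 with the configuration listed above would look like this:

```shell
python resplit.py --dataset_id dataset-lcquad2 --input_path data_dir/lcquad2/data_sets.json --output_dir output_dir/lcquad2 --sampling_ratio_zero .6 --sampling_ratio_compo .1 --sampling_ratio_iid .2 --random_seed 42 --n_splits_compo 1 --n_splits_zero 1 --validation_size 0.0
```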
Please cite our paper if you use any tool or dataset provided in this repository:
```bibtex
@article{jiang2022knowledge,
  title={Knowledge Graph Question Answering Datasets and Their Generalizability: Are They Enough for Future Research?},
  author={Jiang, Longquan and Usbeck, Ricardo},
  journal={arXiv preprint arXiv:2205.06573},
  year={2022}
}
```
This work is licensed under the Apache 2.0 License - see the LICENSE file for details.