MLPQ: A Dataset for Path Question Answering over Multilingual Knowledge Graphs

Knowledge Graph-based Multilingual Question Answering (KG-MLQA), one of the essential subtasks of Knowledge Graph-based Question Answering (KGQA), allows questions in the KGQA task to be expressed in different languages and aims to bridge the lexical gap between questions and knowledge graph(s). However, existing KG-MLQA work mainly focuses on the semantic parsing of multilingual questions and ignores questions that require integrating information from cross-lingual knowledge graphs (CLKG). This paper extends KG-MLQA to Cross-lingual KG-based multilingual Question Answering (CLKGQA) and constructs MLPQ, the first CLKGQA dataset over multilingual DBpedia, which contains 300K questions in English, Chinese, and French. We further propose a novel KG sampling algorithm based on subgraph structural features and use it to obtain the KGs for MLPQ, making the evaluated methods compatible with our datasets. To evaluate the dataset, we put forward a general question answering framework whose core idea is to transform CLKGQA into KG-MLQA. We first use a Cross-lingual Entity Alignment (CLEA) model to merge the CLKG into a single KG, and then obtain the answer to the question with a multi-hop QA model combined with a multilingual pre-trained model. We then establish two baselines for MLPQ: one uses Google translation to obtain aligned entities, and the other adopts a recent CLEA model. Experiments show that a simple combination of existing QA and CLEA methods fails to achieve ideal performance on CLKGQA. Moreover, the availability of our benchmark contributes to the question answering and entity alignment communities.
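The merging step of the framework can be illustrated with a minimal sketch. It assumes that the alignment is already available as a plain mapping from entities of one KG to entities of the other; in the baselines this mapping would come from a CLEA model or from translation. The function name `merge_kgs`, the prefixed entity names, and the toy triples are hypothetical and are not taken from MLPQ.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def merge_kgs(kg_a: List[Triple], kg_b: List[Triple],
              alignment: Dict[str, str]) -> Set[Triple]:
    """Merge two KGs into one by rewriting aligned entities of kg_b
    to their kg_a counterparts and taking the union of all triples."""
    def canonical(entity: str) -> str:
        # Entities without an alignment keep their original name.
        return alignment.get(entity, entity)

    merged = set(kg_a)
    for head, relation, tail in kg_b:
        merged.add((canonical(head), relation, canonical(tail)))
    return merged


# Toy data (hypothetical, for illustration only).
EN_KG = [("en:Victor_Hugo", "en:birthPlace", "en:Besançon")]
FR_KG = [("fr:Besançon", "fr:région", "fr:Bourgogne-Franche-Comté")]
ALIGNMENT = {"fr:Besançon": "en:Besançon"}

MERGED = merge_kgs(EN_KG, FR_KG, ALIGNMENT)
```

After merging, both triples share the node `en:Besançon`, so a monolingual multi-hop QA model can traverse them as a single graph.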

Datasets

Overview

MLPQ contains about 300K questions covering three language pairs (English-Chinese, English-French, and Chinese-French). Answering each question requires 2-hop or 3-hop cross-lingual path inference.
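To make "cross-lingual path inference" concrete, here is a minimal sketch of answering a 2-hop path question whose first hop lies in an English KG and whose second hop lies in a Chinese KG. The triples, the relation names, and the `same_as` alignment table are hypothetical toy data, and `answer_path` is an illustrative helper, not part of the MLPQ baselines.

```python
from collections import defaultdict


def build_index(triples):
    """Index triples by (head, relation) for fast hop lookups."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[(head, relation)].add(tail)
    return index


def answer_path(start, relations, index, same_as):
    """Follow a relation path from `start`, crossing between KGs
    through aligned entities whenever a hop requires it."""
    frontier = {start}
    for relation in relations:
        next_frontier = set()
        for entity in frontier:
            # An entity may be known under other names in other KGs.
            for name in {entity} | same_as.get(entity, set()):
                next_frontier |= index.get((name, relation), set())
        frontier = next_frontier
    return frontier


# Toy data (hypothetical): one English triple and one Chinese triple.
TRIPLES = [
    ("en:Marie_Curie", "en:birthPlace", "en:Warsaw"),
    ("zh:华沙", "zh:国家", "zh:波兰"),
]
SAME_AS = {"en:Warsaw": {"zh:华沙"}}

INDEX = build_index(TRIPLES)
# "In which country was Marie Curie born?" as a 2-hop path.
ANSWERS = answer_path("en:Marie_Curie", ["en:birthPlace", "zh:国家"],
                      INDEX, SAME_AS)
```

Without the `SAME_AS` entry the second hop finds nothing, which is why the quality of entity alignment directly affects question answering on this kind of data.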

Dataset creation

We establish MLPQ through a semi-automatic process shown in the following picture:

Statistics

Statistics of the generated questions are shown below. Each subset contains English, Chinese, and French versions, for a total of 314,479 questions:

KG pair	Language	Questions (2-hop)	Questions (3-hop)	Relation pairs in questions (2-hop)	Relation pairs in questions (3-hop)	Average length (2-hop)	Average length (3-hop)
en-zh	English	14,656	29,815	1,250	2,628	12.4	15.5
en-zh	Chinese	14,852	29,643	1,251	2,637	17.2	21.7
en-zh	French	15,169	30,360	1,251	2,626	11.3	16.1
en-fr	English	15,289	18,154	1,138	3,575	12.3	15.5
en-fr	Chinese	15,831	18,035	1,141	3,578	17.8	21.8
en-fr	French	15,867	17,993	1,144	3,580	11.7	14.7
zh-fr	English	8,373	17,800	759	1,674	11.6	16.0
zh-fr	Chinese	8,414	17,877	758	1,677	17.5	21.4
zh-fr	French	8,495	17,856	758	1,668	12.1	14.9
Sum	-	116,946	197,533	3,157	9,484	12.2/17.5/11.6	15.6/21.6/15.4

Average lengths in the Sum row are given as English/Chinese/French.
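As a quick sanity check, the question counts in the table can be summed to reproduce the totals in the Sum row. The snippet below only re-adds the figures listed above; the variable names are arbitrary.

```python
# (KG pair, language, 2-hop questions, 3-hop questions), copied from the table.
ROWS = [
    ("en-zh", "English", 14656, 29815),
    ("en-zh", "Chinese", 14852, 29643),
    ("en-zh", "French", 15169, 30360),
    ("en-fr", "English", 15289, 18154),
    ("en-fr", "Chinese", 15831, 18035),
    ("en-fr", "French", 15867, 17993),
    ("zh-fr", "English", 8373, 17800),
    ("zh-fr", "Chinese", 8414, 17877),
    ("zh-fr", "French", 8495, 17856),
]

two_hop_total = sum(row[2] for row in ROWS)
three_hop_total = sum(row[3] for row in ROWS)
total_questions = two_hop_total + three_hop_total

print(two_hop_total, three_hop_total, total_questions)
# prints: 116946 197533 314479
```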

Use of the datasets

The datasets are available in two formats. One is in RDF format, the other is in a custom format similar to the datasets used in IRN.
All the datasets are in the datasets directory. For an explanation of the file naming conventions and our custom format, please refer to that directory.

Baselines

We established 3 baseline models for MLPQ.
The latest baseline combines NMN and UHop and is evaluated on the latest version of our datasets, which integrates the bilingual KGs. It achieves the highest scores on our datasets.
The other 2 older models use MTransE and were tested on version 1.0 of our datasets:
- MIRN is based on the popular multi-hop reasoning model IRN.
- CL-MKQA is based on a multiple KGQA model.
The baseline code is in the baselines directory. To try these baselines, please refer to that directory for further information.

Versions and future work

Version 1.3 update

Using the KGT model (https://github.com/bisheng/KTG4KBQG), we generated more paraphrases for the questions in MLPQ. We used these paraphrases to randomly replace 50% of the original questions, which further enhances the diversity of MLPQ. Version 1.3 also provides a train/dev/test split.
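The random replacement step can be sketched as follows. This is only an illustration: it assumes exactly one paraphrase per question and a fixed random seed, neither of which is stated here, and the function name `replace_half` is hypothetical.

```python
import random


def replace_half(questions, paraphrases, seed=0):
    """Replace a random 50% of the questions with their paraphrases.

    Assumes paraphrases[i] is the paraphrase of questions[i].
    """
    if len(questions) != len(paraphrases):
        raise ValueError("each question needs exactly one paraphrase")
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(questions)
    chosen = set(rng.sample(range(n), n // 2))
    return [paraphrases[i] if i in chosen else questions[i]
            for i in range(n)]


# Toy data (hypothetical placeholders, not real MLPQ questions).
ORIGINAL = ["q0", "q1", "q2", "q3"]
PARAPHRASED = ["p0", "p1", "p2", "p3"]
MIXED = replace_half(ORIGINAL, PARAPHRASED)
```

Sampling indices without replacement guarantees that exactly half of the questions are swapped, rather than roughly half as an independent coin flip per question would give.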

Version 1.2 update

We recreated the datasets to address their diversity and redundancy problems. As a result, there are now fewer questions. We also added a new baseline framework combining NMN and UHop with m-BERT.

Version 1.1 update

In this slightly improved version, we corrected many grammatical errors and added the RDF version of all the datasets.

Current version

The current MLPQ version is 1.3. We expect to continue this work and provide datasets of higher quality and greater variety in the future.
Because the generation of MLPQ is semi-automatic and relies to some degree on manually crafted templates and machine translation, there may be some minor problems in the text. We have tried to improve the quality of MLPQ by post-editing, and very few problems should remain. If you find any errors in the dataset, please contact us.

Future work

For now, MLPQ mainly contains 2-hop and 3-hop path questions. In the future, we plan to adopt retelling generation based on web resources to create more varied question expressions. Path questions are only one subset of complex questions; we also plan to update and augment fact-oriented complex questions with property information and to explore aggregate-typed complex questions.

License

This project is licensed under the GPL3 License. See the LICENSE file for details.
