Machine Reading Comprehension (MRC) is a crucial and challenging task in natural language processing (NLP). In recent years, with the development of deep learning and the increasing availability of large-scale datasets, MRC has achieved remarkable advancements. Although a number of MRC models obtain human-parity performance on several datasets, we find that these models are still far from robust in real-world applications.
To comprehensively evaluate the robustness of MRC models, we create a Chinese dataset, namely DuReaderrobust. It is designed to challenge MRC models on three aspects: (1) over-sensitivity, (2) over-stability and (3) generalization. In addition, DuReaderrobust has another advantage over previous datasets: its questions and documents come from Baidu Search, so it better reflects the robustness issues that arise when MRC models are applied in real-world settings.
For more details about the dataset, please refer to this paper.
Note that we release two equivalent versions of the baseline for the DuReaderrobust dataset. One is the DuReaderrobust Baseline System and the other is the PaddleHub Baseline. The two baseline systems implement the same ERNIE 1.0-based MRC model; the main differences are:
- The DuReaderrobust Baseline System is more flexible and better suited to secondary development.
- The PaddleHub Baseline is easier to get started with, as PaddleHub provides a simplified fine-tune API packaged for ERNIE.
In this repository, we release a baseline system for the DuReaderrobust dataset. The baseline system is based on ERNIE 1.0 and implemented with the PaddlePaddle framework. To run the baseline system, please follow the instructions below.
The baseline system has been tested on
- CentOS 6.3
- PaddlePaddle 1.6.1
- Python 2.7.13 & Python 3.7.3
- CUDA 9.0
- cuDNN 7.0
- NCCL 2.2.13
To install PaddlePaddle, please see the PaddlePaddle Homepage for more information.
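After installing, you can sanity-check the setup with PaddlePaddle's built-in installation check (a minimal sketch against the 1.x API used by this baseline):

```python
import paddle
import paddle.fluid as fluid

# Run PaddlePaddle's built-in installation check (available in the 1.x API).
fluid.install_check.run_check()

# The baseline was tested against PaddlePaddle 1.6.1.
print(paddle.__version__)
```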
Before running the baseline system, please download the DuReaderrobust dataset and the pretrained model parameters (ERNIE 1.0 base):
sh download.sh
The dataset will be saved into `data/`, and the pretrained and fine-tuned model parameters will be saved into `pretrained_model/` and `finetuned_model/`, respectively. Note that the fine-tuned model was obtained by fine-tuning the pretrained model (ERNIE 1.0) on the training data. The description of the data structure can be found in `data/README.md`.
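To get a quick feel for the data, the following sketch peeks at one example, assuming the common SQuAD-style JSON layout; `data/README.md` remains the authoritative description of the schema:

```python
# -*- coding: utf-8 -*-
import io
import json

# Peek at the first example, assuming a SQuAD-style layout:
# {"data": [{"paragraphs": [{"context": ..., "qas": [...]}]}]}
with io.open("data/dev.json", encoding="utf-8") as f:
    dataset = json.load(f)

paragraph = dataset["data"][0]["paragraphs"][0]
qa = paragraph["qas"][0]
print(qa["question"])
print(qa["id"])
print(qa["answers"])
```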
To fine-tune a model (on the demo dataset), please run the following command:
sh train.sh --train_file data/demo/demo_train.json --predict_file data/demo/demo_dev.json
This will start the training process on the demo dataset. At the end of training, model parameters and predictions will be saved into `output/`.
To train the model with user-specified arguments (e.g. multiple GPUs, more epochs, or a larger batch size), please run the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3 sh train.sh --epoch 5 --batch_size 8 --train_file data/demo/demo_train.json --predict_file data/demo/demo_dev.json
You can also directly modify the arguments in `train.sh`. More arguments can be found in `src/run_mrc.py`.
To predict with fine-tuned parameters (e.g. on the development set), please run the following command:
CKPT=<your_model_dir> sh predict.sh --predict_file data/dev.json
The model parameters under `your_model_dir` (e.g. the `finetuned_model` we have provided) will be loaded for prediction. The predicted answers will be saved into `output/`.
F1-score and exact match (EM) are used as evaluation metrics. Here we provide a script, `evaluate.py`, for evaluation.
To evaluate, run
python evaluate.py <reference_file> <prediction_file>
where `<reference_file>` is the dataset file (e.g. `data/dev.json`), and `<prediction_file>` is the model prediction (e.g. `output/dev_predictions.json`), which should be a valid JSON file of (qid, answer) pairs, for example:
{
"question_id_1" : "answer for question 1",
...
"question_id_N": "answer for question N"
}
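Such a file can be produced from a plain Python dict, for example (a minimal sketch; the question IDs and answers below are placeholders):

```python
# -*- coding: utf-8 -*-
import io
import json

# Hypothetical predictions: question id -> answer string.
predictions = {
    "question_id_1": u"answer for question 1",
    "question_id_N": u"answer for question N",
}

# ensure_ascii=False keeps Chinese answers human-readable in the file.
with io.open("output/dev_predictions.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(predictions, ensure_ascii=False))
```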
After running the evaluation script, you will get the evaluation results in the following format:
{"F1": "80.842", "EM": "69.019", "TOTAL": 1417, "SKIP": 0}
The performance of our baseline model (i.e. the fine-tuned model provided above) is shown below:
| Dataset | F1 | EM |
| --- | --- | --- |
| basic dev | 80.84 | 69.02 |
To help you get started with the competition quickly, we also release a baseline based on PaddleHub (https://github.com/PaddlePaddle/PaddleHub), an application toolkit for PaddlePaddle pre-trained models.
All you need to do is install the latest PaddleHub toolkit with the following command:
pip install --upgrade paddlehub
Then, run
cd paddlehub_baseline
sh paddlehub_reading_comprehension.sh
This will fine-tune the pretrained ERNIE model on the competition dataset and produce the final prediction results.
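For reference, the simplified fine-tune API that the script builds on looks roughly like this (a sketch of the PaddleHub 1.x-era interface; `paddlehub_reading_comprehension.sh` wires up the full reading-comprehension task, and names may differ across PaddleHub versions):

```python
import paddlehub as hub

# Load the pretrained ERNIE module; PaddleHub downloads it on first use.
module = hub.Module(name="ernie")

# Obtain the model's inputs, outputs, and program for fine-tuning
# (PaddleHub 1.x-era interface).
inputs, outputs, program = module.context(trainable=True, max_seq_len=512)
```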
Now, you can submit the results to the competition.
For any questions about PaddleHub, please raise an issue at https://github.com/PaddlePaddle/PaddleHub/issues. We will reply as soon as possible.
The same code is also released in the PaddleHub demo.
Copyright 2020 Baidu.com, Inc. All Rights Reserved
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.