This is the PyTorch implementation of the FEVER pipeline baseline described in the NAACL2018 paper: FEVER: A large-scale dataset for Fact Extraction and VERification.
Unlike other tasks and despite recent interest, research in textual claim verification has been hindered by the lack of large-scale manually annotated datasets. In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,441 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo by annotators achieving 0.6841 in Fleiss κ. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach using both baseline and state-of-the-art components and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.
The baseline model consists of two components: Evidence Retrieval (DrQA) + Textual Entailment (Decomposable Attention).
- Visit http://fever.ai to find out more about the shared task and download the data.
- Docker Install
- Manual Install
- Download Data
- Data Preparation
- Train
- Evaluate
- Score and Upload to Codalab
This was tested and evaluated using the Python 3.6 version of Anaconda 5.0.1, which can be downloaded from anaconda.com.
Mac OS X users may have to install Xcode before running git or installing packages (gcc may fail). See this post on StackExchange.
Support for Windows operating systems is not provided.
To train the Decomposable Attention models, it is highly recommended to use a GPU. Training will take about 3 hours on a GTX 1080Ti whereas training on a CPU will take days. We offer a pre-trained model.tar.gz that can be downloaded. To use the pretrained model, simply replace any path to a model.tar.gz file with the path to the file you downloaded (e.g. logs/da_nn_sent/model.tar.gz could become ~/Downloads/model.tar.gz).
- v0.2 - updated the Information Retrieval component to use a modified version of DrQA that allows multi-threaded document/sentence retrieval. This yields a >10x speed-up in the IR stage of the pipeline, as I/O waits no longer block computation of TF-IDF vectors
- v0.1 - original implementation (tagged as naacl2018)
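The v0.2 note attributes the speed-up to retrieval threads overlapping their I/O waits. The general pattern can be sketched as follows. This is only an illustration and not the modified DrQA code: `fetch_document` and `retrieve_all` are hypothetical names, and the blocking read is simulated with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_document(doc_id):
    """Simulate a blocking read (for example a database lookup)."""
    time.sleep(0.05)  # stands in for I/O latency
    return f"text of document {doc_id}"


def retrieve_all(doc_ids, workers=8):
    """Fetch documents concurrently; results keep the order of doc_ids."""
    # A thread that is waiting on I/O releases the GIL, so the waits of
    # different threads overlap instead of adding up.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_document, doc_ids))


if __name__ == "__main__":
    start = time.perf_counter()
    docs = retrieve_all(range(8))
    print(len(docs), "documents in", round(time.perf_counter() - start, 2), "s")
```

With eight workers the eight simulated reads should take roughly the time of one read rather than eight, which is the effect the changelog entry describes.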
Download and run the latest FEVER image:
docker volume create fever-data
docker run -it -v fever-data:/fever/data sheffieldnlp/fever-baselines
To enable GPU acceleration, run with --runtime=nvidia once NVIDIA Docker has been installed.
Installation using Docker is preferred. If you are unable to do this, you can manually create the Python environment following the instructions here: Wiki/Manual-Install
Remember that if you installed manually, you should run source activate fever and cd to the repository directory before you run any commands.
To download a pre-processed Wikipedia dump (license):
bash scripts/download-processed-wiki.sh
Or download the raw data and process it yourself:
bash scripts/download-raw-wiki.sh
bash scripts/process-wiki.sh
Download the FEVER dataset from our website into the data directory:
bash scripts/download-data.sh
(Note that if you want to replicate the paper, run scripts/download-paper.sh instead of scripts/download-data.sh.)
Download pretrained GloVe Vectors
bash scripts/download-glove.sh
Sample training data for the NotEnoughInfo class. There are two sampling methods evaluated in the paper: using the nearest neighbour (similarity between TF-IDF vectors) and random sampling.
#Using nearest neighbor method
PYTHONPATH=src python src/scripts/retrieval/document/batch_ir_ns.py --model data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --count 1 --split train
PYTHONPATH=src python src/scripts/retrieval/document/batch_ir_ns.py --model data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --count 1 --split dev
Or random sampling
#Using random sampling method
PYTHONPATH=src python src/scripts/dataset/neg_sample_evidence.py data/fever/fever.db
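To make the difference between the two sampling strategies concrete, here is a minimal sketch in plain Python. It is not the code in batch_ir_ns.py or neg_sample_evidence.py: the toy pages and all function names are hypothetical, and the weighting is a simplified unigram TF-IDF rather than the hashed bigram index used by the commands above. In both cases the selected page would be the source of the sampled evidence for a NotEnoughInfo claim.

```python
import math
import random
from collections import Counter

# Toy stand-in for Wikipedia pages (hypothetical data, not taken from FEVER).
PAGES = {
    "Paris": "paris is the capital city of france",
    "Berlin": "berlin is the capital city of germany",
    "Banana": "the banana is an edible fruit",
}


def tfidf(tokens, doc_freq, n_docs):
    """Sparse TF-IDF vector (term -> weight) with a smoothed IDF."""
    counts = Counter(tokens)
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in counts.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_page(claim, pages):
    """Nearest-neighbour sampling: pick the page most similar to the claim."""
    doc_freq = Counter()
    for text in pages.values():
        doc_freq.update(set(text.split()))
    n_docs = len(pages)
    claim_vec = tfidf(claim.lower().split(), doc_freq, n_docs)
    return max(
        pages,
        key=lambda title: cosine(
            claim_vec, tfidf(pages[title].split(), doc_freq, n_docs)
        ),
    )


def random_page(pages, rng):
    """Random sampling: pick any page uniformly, ignoring the claim."""
    return rng.choice(sorted(pages))


if __name__ == "__main__":
    print(nearest_page("the capital of france", PAGES))
    print(random_page(PAGES, random.Random(0)))
```

Nearest-neighbour sampling tends to produce evidence that is topically close to the claim, so the resulting NotEnoughInfo examples are harder to tell apart from real evidence than randomly sampled ones.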
We offer a pretrained model that can be downloaded by running the following command:
bash scripts/download-model.sh
Skip to evaluation if you are using the pretrained model.
Train the Decomposable Attention model
#if using a CPU, set
export CUDA_DEVICE=-1
#if using a GPU, set
export CUDA_DEVICE=0 #or cuda device id
Then either train the model with Nearest-Page Sampling for the NEI class
# Using nearest neighbor sampling method for NotEnoughInfo class (better)
PYTHONPATH=src python src/scripts/rte/da/train_da.py data/fever/fever.db config/fever_nn_ora_sent.json logs/da_nn_sent --cuda-device $CUDA_DEVICE
mkdir -p data/models
cp logs/da_nn_sent/model.tar.gz data/models/decomposable_attention.tar.gz
Or with Random Sampling for the NEI class
# Using random sampled data for NotEnoughInfo (worse)
PYTHONPATH=src python src/scripts/rte/da/train_da.py data/fever/fever.db config/fever_rs_ora_sent.json logs/da_rs_sent --cuda-device $CUDA_DEVICE
mkdir -p data/models
cp logs/da_rs_sent/model.tar.gz data/models/decomposable_attention.tar.gz
The MLP model can be trained following instructions from the Wiki: Wiki/Train-MLP
These instructions are for the decomposable attention model. The MLP model can be evaluated following instructions from the Wiki: Wiki/Evaluate-MLP
Run the oracle evaluation for the Decomposable Attention model on the dev set (requires sampling the NEI class for the dev dataset - see Data Preparation)
PYTHONPATH=src python src/scripts/rte/da/eval_da.py data/fever/fever.db data/models/decomposable_attention.tar.gz data/fever/dev.ns.pages.p1.jsonl
First retrieve the evidence for the dev/test sets:
#Dev
PYTHONPATH=src python src/scripts/retrieval/ir.py --db data/fever/fever.db --model data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --in-file data/fever-data/dev.jsonl --out-file data/fever/dev.sentences.p5.s5.jsonl --max-page 5 --max-sent 5
#Test
PYTHONPATH=src python src/scripts/retrieval/ir.py --db data/fever/fever.db --model data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --in-file data/fever-data/test.jsonl --out-file data/fever/test.sentences.p5.s5.jsonl --max-page 5 --max-sent 5
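The --max-page and --max-sent options above suggest a two-stage selection: first a limited number of pages, then a limited number of sentences from those pages. The sketch below illustrates that idea under stated assumptions. It is not the implementation in ir.py: `overlap_score` and `retrieve` are hypothetical names, and plain token overlap stands in for the TF-IDF similarity used by the real index.

```python
import heapq


def overlap_score(query, text):
    """Count shared lowercase tokens; a crude stand-in for TF-IDF similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(claim, pages, max_page=5, max_sent=5):
    """Return up to max_sent (title, line, sentence) triples for a claim.

    `pages` maps a page title to its list of sentences.
    """
    # Stage 1: keep the max_page pages whose full text best matches the claim.
    top_pages = heapq.nlargest(
        max_page,
        pages,
        key=lambda title: overlap_score(claim, " ".join(pages[title])),
    )
    # Stage 2: score every sentence on those pages and keep the best max_sent.
    candidates = [
        (title, line, sentence)
        for title in top_pages
        for line, sentence in enumerate(pages[title])
    ]
    return heapq.nlargest(
        max_sent, candidates, key=lambda c: overlap_score(claim, c[2])
    )


if __name__ == "__main__":
    toy_pages = {
        "Paris": ["Paris is the capital of France.", "It lies on the Seine."],
        "Banana": ["A banana is a fruit."],
    }
    for hit in retrieve("Paris is the capital of Germany", toy_pages, 1, 1):
        print(hit)
```

Restricting the sentence search to a few candidate pages keeps the second stage cheap, at the cost of missing evidence on pages that the first stage did not select.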
Then run the model:
#Dev
PYTHONPATH=src python src/scripts/rte/da/eval_da.py data/fever/fever.db data/models/decomposable_attention.tar.gz data/fever/dev.sentences.p5.s5.jsonl --log data/decomposable_attention.dev.log
#Test
PYTHONPATH=src python src/scripts/rte/da/eval_da.py data/fever/fever.db data/models/decomposable_attention.tar.gz data/fever/test.sentences.p5.s5.jsonl --log logs/decomposable_attention.test.log
Score:
PYTHONPATH=src python src/scripts/score.py --predicted_labels data/decomposable_attention.dev.log --predicted_evidence data/fever/dev.sentences.p5.s5.jsonl --actual data/fever-data/dev.jsonl
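The abstract distinguishes label accuracy from accuracy that also requires correct evidence. The following is a minimal sketch of that distinction, not the logic of score.py itself. The record layout is an assumption made for illustration: `label`, `predicted_label`, gold `evidence` as a list of alternative evidence sets of (page, line) pairs, and `predicted_evidence` as a flat list of (page, line) pairs. The label strings are likewise assumed.

```python
NEI = "NOT ENOUGH INFO"  # assumed label string for the NotEnoughInfo class


def is_strictly_correct(instance):
    """True if the label is right and, where needed, the evidence is complete."""
    if instance["predicted_label"] != instance["label"]:
        return False
    if instance["label"] == NEI:
        return True  # no evidence is required for this class
    predicted = {tuple(e) for e in instance["predicted_evidence"]}
    # At least one gold evidence set must be fully contained in the prediction.
    return any(
        all(tuple(e) in predicted for e in group)
        for group in instance["evidence"]
    )


def score(instances):
    """Return (strict_score, label_accuracy) over a list of instances."""
    if not instances:
        return 0.0, 0.0
    n = len(instances)
    label_acc = sum(i["predicted_label"] == i["label"] for i in instances) / n
    strict = sum(is_strictly_correct(i) for i in instances) / n
    return strict, label_acc


EXAMPLES = [
    {"label": "SUPPORTS", "predicted_label": "SUPPORTS",
     "evidence": [[("Paris", 0)]],
     "predicted_evidence": [("Paris", 0), ("Paris", 1)]},
    {"label": "REFUTES", "predicted_label": "REFUTES",
     "evidence": [[("Berlin", 2)]],
     "predicted_evidence": [("Berlin", 0)]},
    {"label": NEI, "predicted_label": NEI,
     "evidence": [], "predicted_evidence": []},
    {"label": "SUPPORTS", "predicted_label": "REFUTES",
     "evidence": [[("Banana", 1)]],
     "predicted_evidence": [("Banana", 1)]},
]

if __name__ == "__main__":
    print(score(EXAMPLES))  # (0.5, 0.75)
```

In the toy data, three of four labels are right, but one of those relies on the wrong sentence, so the strict score is lower than the label accuracy. The same gap appears in the paper's 31.87% versus 50.91%.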
Prepare Submission for Codalab (dev):
PYTHONPATH=src python src/scripts/prepare_submission.py --predicted_labels data/decomposable_attention.dev.log --predicted_evidence data/fever/dev.sentences.p5.s5.jsonl --out_file predictions.jsonl
zip submission.zip predictions.jsonl
Prepare Submission for Codalab (test):
PYTHONPATH=src python src/scripts/prepare_submission.py --predicted_labels logs/decomposable_attention.test.log --predicted_evidence data/fever/test.sentences.p5.s5.jsonl --out_file predictions.jsonl
zip submission.zip predictions.jsonl