The Microsoft Research Paraphrase Corpus (MRPC) Dataset

MRPC is a dataset (and an NLP task) containing 5800 pairs of sentences extracted from online news sources, along with human annotations indicating whether each sentence pair captures a paraphrase or semantic equivalence relationship. The dataset was published on Jan 3rd, 2005.

The paper is included in this repo as MRPC_2005.pdf.

Note: According to the paper, the dataset should contain 5800 sentence pairs. However, the dataset files actually contain 5801 sentence pairs. We record the discrepancy here so that users who notice it are not confused.

MRPC is also included as a task in the General Language Understanding Evaluation (GLUE) benchmark. Because of copyright restrictions, GLUE can only provide a pre-processing routine that reorganizes the MRPC dataset; it cannot include the original dataset or publish the reorganized version directly. Therefore, users need to complete two steps to use the MRPC dataset in their DL models.

Download the raw dataset from Microsoft and unzip the file. This is what the code in this repo does.
Pre-process the dataset using the script from GLUE. Details of this step are given in GLUE's repo.

Download and Unzip MRPC Dataset

The steps to download and unzip the MRPC dataset are:

Run download_mrpc.zsh to download the dataset.
Wait! The unzipped files are actually committed in this repo, so you can skip the next step and continue.
Run process_mrpc.zsh to unzip the download and perform other minor operations.

The processing produces two txt files:

./MRPC/msr_paraphrase_test.txt
./MRPC/msr_paraphrase_train.txt

Both files are less than 1MB in size, so they are committed to GitHub as well. Users do not need to run the two steps above in order to use the MRPC dataset.
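
Since the two txt files can be used directly, a minimal sketch of reading one of them in Python may be helpful. It assumes (this README does not state the layout) that each file is tab-separated with one header line and five columns per row: a 0/1 label, two sentence IDs, and the two sentences. The names SentencePair, parse_mrpc_lines, and read_mrpc are hypothetical helpers for illustration and are not part of this repo or of GLUE.

```python
import csv
from pathlib import Path
from typing import Iterable, List, NamedTuple


class SentencePair(NamedTuple):
    label: int       # assumed: 1 = paraphrase, 0 = not a paraphrase
    id1: str
    id2: str
    sentence1: str
    sentence2: str


def parse_mrpc_lines(lines: Iterable[str]) -> List[SentencePair]:
    """Parse tab-separated rows, skipping the header and blank lines."""
    pairs = []
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumed layout: 5 columns, first column is the 0/1 label.
        if len(row) != 5 or row[0] not in ("0", "1"):
            continue
        label, id1, id2, sentence1, sentence2 = row
        pairs.append(SentencePair(int(label), id1, id2, sentence1, sentence2))
    return pairs


def read_mrpc(path: Path) -> List[SentencePair]:
    # utf-8-sig tolerates an optional byte-order mark at the start of the file.
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return parse_mrpc_lines(handle)


if __name__ == "__main__":
    train_path = Path("./MRPC/msr_paraphrase_train.txt")
    if train_path.exists():
        train = read_mrpc(train_path)
        print(len(train), "training pairs loaded")
```

Quoting is disabled because sentences taken from news text often contain quotation marks, which a default CSV reader could misinterpret as field delimiters. If the actual column layout differs from the assumption above, the row check in parse_mrpc_lines needs to be adjusted.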

msr_paraphrase_test.txt contains 1725 sentence pairs with labels and IDs.

msr_paraphrase_train.txt contains 4076 sentence pairs with labels and IDs.

Together, the test and train sets contain 5801 pairs.
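
The counts above can be checked with a short script. The following is a sketch only: it assumes each file starts with a single header line followed by one sentence pair per line, which this README does not state explicitly. The names EXPECTED_COUNTS, count_entries, and check_counts are hypothetical and do not exist in this repo.

```python
from pathlib import Path
from typing import Iterable

# Entry counts as reported in this README.
EXPECTED_COUNTS = {
    "msr_paraphrase_test.txt": 1725,
    "msr_paraphrase_train.txt": 4076,
}


def count_entries(lines: Iterable[str], has_header: bool = True) -> int:
    """Count non-blank lines, optionally ignoring one assumed header line."""
    non_blank = sum(1 for line in lines if line.strip())
    if has_header and non_blank > 0:
        non_blank -= 1
    return non_blank


def check_counts(data_dir: Path) -> int:
    """Print the per-file entry counts and return their total."""
    total = 0
    for name, expected in EXPECTED_COUNTS.items():
        with open(data_dir / name, encoding="utf-8-sig") as handle:
            found = count_entries(handle)
        status = "ok" if found == expected else f"expected {expected}"
        print(f"{name}: {found} entries ({status})")
        total += found
    return total


if __name__ == "__main__":
    data_dir = Path("./MRPC")
    # Only run the check when both files are present.
    if all((data_dir / name).exists() for name in EXPECTED_COUNTS):
        print("total:", check_counts(data_dir))
```

The expected values sum to 1725 + 4076 = 5801, which matches the total found in the dataset files rather than the 5800 stated in the paper.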
