PIE Corpus

This is the PIE Corpus, an evaluation corpus for the automatic detection of potentially idiomatic expressions (PIEs), based on the British National Corpus (BNC). Potentially idiomatic expressions are like idiomatic expressions, but the term also covers literal uses of idiomatic expressions, such as 'I leave work at the end of the day.' for the idiom 'at the end of the day'. Using a pre-defined set of 591 PIEs and a set of 23 documents from the BNC, we extracted all sentences possibly containing one of these PIEs in one form or another, thus including literal uses. The resulting 2239 PIE-sentence pairs were then annotated for whether they really contained the PIE in question, and if so, in what sense it was used.

Contents & Usage

This repository contains six json-files containing the annotations. PIE_annotations_all_no_sentences.json contains all 2239 PIEs with annotations. The PIE_annotations_doc_*_no_sentence.json files contain the development and test set split on the document level, which is useful for PIE detection experiments. The PIE_annotations_type_*_no_sentences.json contain the training, development, and test set split on the PIE type level. That is, PIE types occurring in one set do not occur in any of the other sets. This split can be used for PIE sense disambiguation experiments. Due to copyright restrictions on the BNC, the sentence containing the PIE and the 2-sentence context window around that have been removed from these files.

To add the sentence and context, download a copy of the BNC-XML from here, and run python add_sentences_v2.py /path/to/BNC/2554/download/Texts/. This will create six new PIE_annotations_*_v2.json files, which do contain the sentence containing the PIE and a 2-sentence context window. In addition, it corrects some of the character offsets in the original annotation, using manual_offset_corrections.json, and converts the context and offsets to an updated version, version 2.

Format

The json-files contain a list of Python dictionaries, with the following information:

idiom: the dictionary form of the PIE
PIE_label: label indicating whether this sentence contains the PIE in question ('y') or not ('n')
sense_label: label indicating the sense of the PIE, can be idiomatic ('y'), literal ('n'), other ('o'), or non-applicable, in the case of a non-PIE ('?')
document_id: BNC document id
sentence_number: BNC sentence number
context: a list of 5 sentences, with the PIE always in the third sentence
offsets: character offsets of the PIE's component words, on the sentence containing the PIE in the tokenized context
context_untokenized: the same as context, but without the pre-tokenization of the BNC

Previous Format

In the previous (v1) format, sentence was used instead of context and context_untokenized, and the offsets were in a different format. To get the old format, use add_sentences.py.

sentence: 5-sentence window containing the PIE, with the PIE always in the third sentence
offsets: character offsets of the PIE's component words in the 5-sentence context window

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.gitignore		.gitignore
LICENSE		LICENSE
PIE_annotations_all_no_sentences.json		PIE_annotations_all_no_sentences.json
PIE_annotations_doc_dev_no_sentences.json		PIE_annotations_doc_dev_no_sentences.json
PIE_annotations_doc_test_no_sentences.json		PIE_annotations_doc_test_no_sentences.json
PIE_annotations_type_dev_no_sentences.json		PIE_annotations_type_dev_no_sentences.json
PIE_annotations_type_test_no_sentences.json		PIE_annotations_type_test_no_sentences.json
PIE_annotations_type_train_no_sentences.json		PIE_annotations_type_train_no_sentences.json
README.md		README.md
add_sentences.py		add_sentences.py
add_sentences_v2.py		add_sentences_v2.py
annotation_guidelines.md		annotation_guidelines.md
manual_offset_corrections.json		manual_offset_corrections.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PIE Corpus

Contents & Usage

Format

Previous Format

About

Releases

Packages

Languages

License

hslh/pie-annotation

Folders and files

Latest commit

History

Repository files navigation

PIE Corpus

Contents & Usage

Format

Previous Format

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages