prompts.json
{
"query": "Main task, datasets and evaluation metrics.",
"tdm-extraction-system-prompt": "You will be given several parts of a research paper as input. Please extract different tuples including the name of the task addressed in the paper, utilized datasets and evaluation metrics and corresponding results. Extract these tuples for only the best results obtained by proposed methods of the paper not baselines. Please use json format for each different tuple. Example format: [{{\"Task\": \"Task name\", \"Dataset\": \"Dataset name\", \"Metric\": \"Metric name\", \"Result\": \"Result score\"}}]. Your answer will immediately start with the json object satisfying the given template and contain nothing else.",
"normalization-system-prompt": "You will be given a list of items. Then, an input will be provided. You will match the input with one of the items in the list. Your answer will ONLY consist of the matched item in the list, do not provide further explanations. If none of the items matches, say None.",
"masked-normalization-system-prompt": "You will be given a list of items. Then, an input entity will be provided. If the input entity matches one of the items in the list, your answer will be the matched item in the list. Else, output the entity without changing it. DO NOT make any other explanation.",
"leaderboard-normalization-system-prompt": "You will be given a list of tuples. Then, an input tuple will be provided. If the input tuple matches one of the items in the list, your answer will be the matched item in the list. Else, output the tuple without changing it. Only output answer, DO NOT make any other explanation.",
"few-shot-extraction-system-prompt": "You will be given several parts of a research paper as input. Please extract different tuples including the name of the task addressed in the paper, utilized datasets and evaluation metrics and corresponding results. Extract these tuples for only the best results obtained by proposed methods of the paper not baselines. Please use json format for each different tuple. Example format: [{{\"Task\": \"Task name\", \"Dataset\": \"Dataset name\", \"Metric\": \"Metric name\", \"Result\": \"Result score\"}}]. Your answer will immediately start with the json object satisfying the given template and contain nothing else. You will be given an example input-output pair. \n[Start of the Example]\nRetrieved document content:\nThis paper creates a paradigm shift with regard to the way we build neural extractive summarization systems. Instead of following the commonly used framework of extracting sentences individually and modeling the relationship between sentences, we formulate the extractive summarization task as a semantic text matching problem, in which a source document and candidate summaries will be (extracted from the original text) matched in a semantic space. Notably, this paradigm shift to semantic matching framework is well-grounded in our comprehensive analysis of the inherent gap between sentence-level and summary-level extractors based on the property of the dataset. Besides, even instantiating the framework with a simple form of a matching model, we have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1). Experiments on the other five datasets also show the effectiveness of the matching framework. In this paper, we propose a novel summary-level framework (MATCHSUM, Figure 1) and conceptualize extractive summarization as a semantic text matching problem. The principle idea is that a good summary should be more semantically similar as a whole to the source document than the unqualified summaries.\n5 Experiment 5.1 Datasets In order to verify the effectiveness of our framework and obtain more convincing explanations, we perform experiments on six divergent mainstream datasets as follows. CNN/DailyMail (Hermann et al., 2015) is a commonly used summarization dataset modified by Nallapati et al. (2016), which contains news articles and associated highlights as summaries. In this paper, we use the non-anonymized version. PubMed (Cohan et al., 2018) is collected from scientific papers and thus consists of long documents. We modify this dataset by using the introduction section as the document and the abstract section as the corresponding summary. WikiHow (Koupaee and Wang, 2018) is a diverse dataset extracted from an online knowledge base. Articles in it span a wide range of topics. XSum (Narayan et al., 2018a) is a one-sentence summary dataset to answer the question “What is the article about?”. All summaries are profession- ally written, typically by the authors of documents in this dataset. Multi-News (Fabbri et al., 2019) is a multi- document news summarization dataset with a relatively long summary, we use the truncated version and concatenate the source documents as a single input in all experiments. Reddit (Kim et al., 2019) is a highly abstractive dataset collected from social media platform. We only use the TIFU-long version of Reddit, which regards the body text of a post as the document and the TL;DR as the summary.\nTable 3: Results on CNN/DM test set. 
The model with ∗ indicates that the large version of BERT is used. BERTEXT† add an additional Pointer Network com- pared to other BERTEXT in this table. Model R-1 R-2 R-L LEAD ORACLE MATCH-ORACLE 40.43 17.62 36.67 52.59 31.23 48.87 51.08 26.94 47.22 BANDITSUM (Dong et al., 2018) NEUSUM (Zhou et al., 2018) JECS (Xu and Durrett, 2019) HIBERT (Zhang et al., 2019b) PNBERT (Zhong et al., 2019a) PNBERT + RL BERTEXT BERTEXT BERTEXT (Liu, 2019) BERTEXT + Tri-Blocking BERTSUM 41.50 18.70 37.60 41.59 19.01 37.98 41.70 18.50 37.90 42.37 19.95 38.83 42.39 19.51 38.69 42.69 19.60 38.85 42.29 19.38 38.63 42.76 19.87 39.11 42.57 19.96 39.04 43.23 20.22 39.60 ∗ (Liu and Lapata, 2019) 43.85 20.34 39.90 † (Bae et al., 2019) † + RL BERTEXT (Ours) BERTEXT + Tri-Blocking (Ours) MATCHSUM (BERT-base) MATCHSUM (RoBERTa-base) 42.73 20.13 39.20 43.18 20.16 39.56 44.22 20.62 40.38 44.41 20.86 40.55\nTable 4: Results on test sets of Reddit and XSum. N um indicates how many sentences BERTEXT ex- tracts as a summary and Sel indicates the number of sentences we choose to form a candidate summary. Model R-1 R-2 R-L Reddit BERTEXT (Num = 1) BERTEXT (Num = 2) MATCHSUM (Sel = 1) MATCHSUM (Sel = 2) MATCHSUM (Sel = 1, 2) 21.99 23.86 22.87 24.90 25.09 5.21 5.85 5.15 5.91 6.17 16.99 19.11 17.40 20.03 20.13 XSum BERTEXT (Num = 1) BERTEXT (Num = 2) MATCHSUM (Sel = 1) MATCHSUM (Sel = 2) MATCHSUM (Sel = 1, 2) 22.53 22.86 23.35 24.48 24.86 4.36 4.48 4.46 4.58 4.66 16.23 17.16 16.71 18.31 18.41\nTable 5: Results on test sets of WikiHow, PubMed and Multi-News. MATCHSUM beats the state-of-the-art BERT model with Ngram Blocking on all different domain datasets. Model R-1 WikiHow R-2 R-L R-1 PubMed R-2 R-L R-1 Multi-News R-2 LEAD ORACLE MATCH-ORACLE 24.97 35.59 35.22 5.83 12.98 10.55 23.24 32.68 32.87 37.58 45.12 42.21 12.22 20.33 15.42 33.44 40.19 37.67 43.08 49.06 47.45 14.27 21.54 17.41 BERTEXT + 3gram-Blocking + 4gram-Blocking MATCHSUM (BERT-base) 30.31 30.37 30.40 31.85 8.71 8.45 8.67 8.98 28.24 28.28 28.32 29.58 41.05 38.81 40.29 41.21 14.88 13.62 14.37 14.91 36.57 34.52 35.88 36.75 45.80 44.94 45.86 46.20 16.42 15.47 16.23 16.51 R-L 38.97 44.27 43.14 41.53 40.63 41.57 41.89\nOutput json: [{{\"Task\": \"Summarization\", \"Dataset\": \"CNN/DailyMail\", \"Metric\": \"ROGUE-1\", \"Result\": \"44.41\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"CNN/DailyMail\", \"Metric\": \"ROGUE-2\", \"Result\": \"20.86\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"CNN/DailyMail\", \"Metric\": \"ROGUE-L\", \"Result\": \"40.55\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"Reddit\", \"Metric\": \"ROGUE-1\", \"Result\": \"25.09\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"Reddit\", \"Metric\": \"ROGUE-2\", \"Result\": \"6.17\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"Reddit\", \"Metric\": \"ROGUE-L\", \"Result\": \"20.13\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"XSum\", \"Metric\": \"ROGUE-1\", \"Result\": \"24.86\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"XSum\", \"Metric\": \"ROGUE-2\", \"Result\": \"4.66\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"XSum\", \"Metric\": \"ROGUE-L\", \"Result\": \"18.41\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"WikiHow\", \"Metric\": \"ROGUE-1\", \"Result\": \"31.85\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"WikiHow\", \"Metric\": \"ROGUE-2\", \"Result\": \"8.98\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"WikiHow\", \"Metric\": \"ROGUE-L\", \"Result\": \"29.58\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"PubMed\", \"Metric\": \"ROGUE-1\", 
\"Result\": \"41.21\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"PubMed\", \"Metric\": \"ROGUE-2\", \"Result\": \"14.91\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"PubMed\", \"Metric\": \"ROGUE-L\", \"Result\": \"36.75\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"WikiHow\", \"Metric\": \"ROGUE-1\", \"Result\": \"46.20\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"WikiHow\", \"Metric\": \"ROGUE-2\", \"Result\": \"16.51\"}}, {{\"Task\": \"Summarization\", \"Dataset\": \"WikiHow\", \"Metric\": \"ROGUE-L\", \"Result\": \"41.89\"}}]\n[End of the Example]\n",
"few-shot-normalization-system-prompt": "You will be given a list of items. Then, an input will be provided. You will match the input with one of the items in the list. Your answer will ONLY consist of the matched item in the list, do not provide further explanations. If none of the items matches, say None. You will be given some example input-output pairs.\n[Start of the Example]\nItem List: {{'Combinatory Categorial Grammar (CCG) Supertagging', 'Constituency Parsing', 'Dependency Parsing', 'Dialogue Act Classification', 'Dialogue Generation', 'Entity Typing', 'Intent Detection and Slot Filling', 'Language Modeling', 'Linguistic Acceptability', 'Machine Translation', 'Named Entity Recognition (NER)', 'Natural Language Inference (NLI)', 'Paraphrase Detection', 'Part-of-Speech (POS) Tagging', 'Question Answering', 'Question Generation', 'Relation Classification', 'Response Generation', 'Sentiment Analysis', 'Summarization', 'Text Chunking', 'Text Similarity', 'Word Sense Induction'}}\nInput: Pos tagging\nAnswer: Part-of-Speech (POS) Tagging\n\nItem List: {{'AESLC', 'ATIS', 'AX', 'BC5CDR', 'BIGPATENT', 'BillSum', 'CCGBank', 'CNN/DailyMail', 'CoLA', 'CoNLL ', 'CoNLL-2000', 'CoNLL-2002 - Spanish', 'CoNLL-2002- Dutch', 'CoNLL-2003 - English', 'CoNLL-2003 - German', 'CoQA', 'ConvAI2', 'DSTC7', 'Dailydialog (DyDA)', 'E-commerce', 'ELI5', 'Gigaword', 'Google Billion Word', 'ICSI Meeting Recorder Dialog Act Corpus (MRDA)', 'IWSLT’14 EN-DE', 'IWSLT’15 EN-VI', 'MNLI-m', 'MNLI-mm', 'MRPC', 'MapTask', 'Multi-News', 'Multi-domain Food', 'Multi-domain Home', 'Multi-domain Movie', 'Multimodal', 'NCBI', 'New York Times (NYT)', 'Newsroom', 'Ontonotes v4 - Chinese', 'Ontonotes v5 - English', 'Open Entity', 'Penn Treebank (PTB)', 'PubMed', 'QNLI', 'QQP', 'RTE', 'ReCoRD', 'Reddit TIFU', 'Resume - Chinese', 'SNIPS', 'SQuAD 1.1', 'SQuAD 2.0', 'SST-2', 'STS-B', 'SemEval 2010 Task 14', 'SemEval 2013 Task 13', 'Switchboard Dialog Act Corpus (SWDA)', 'TACRED', 'WMT’14 EN-DE', 'WMT’14 EN-FR', 'WMT’16 RO-EN', 'WMT’17 EN-ZH', 'WNLI', 'WNUT-16 - English', 'WNUT-17 - English', 'Weibo - Chinese', 'WikiHow', 'WikiText-103', 'Wikitext-2', 'XSum', 'arXiv'}}\nInput: PTB\nAnswer: Penn Treebank (PTB)\n\nItem List: {{'AVG', 'Accuracy', 'BERTScore', 'BLEU', 'BLEU-4', 'Exact Match (EM)', 'F-Score (F-S)', 'F1', 'Fuzzy B-Cubed (FBC)', 'Fuzzy normalized mutual information (FNMI)', 'Labeled Attachment Score', 'METEOR', \"Matthew's Correlation Coefficient (MCC)\", 'NIST-4', 'Overall-Accuracy', 'Perplexity', 'Precision', 'ROGUE-1', 'ROGUE-2', 'ROGUE-L', 'Recall', 'Sent-Accuracy', 'Spearman Correlation', 'TER', 'Unlabeled Attachment Score', 'V-Measure (V-M)'}}\nInput: F1 score \nAnswer: F1\n[End of the Example]"
}
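
Usage note (illustrative, not part of the JSON above): the templates escape literal JSON braces as {{ and }}, which suggests they are rendered with Python's str.format before being sent to a model. Below is a minimal sketch of how prompts.json might be loaded and turned into chat messages; the file path, the paper_text placeholder, and the exact message layout are assumptions, not an API this repository documents.

import json

# Load the prompt templates; the "prompts.json" path is an assumption.
with open("prompts.json", encoding="utf-8") as f:
    prompts = json.load(f)

# The {{ }} escapes collapse to literal { } once str.format runs,
# leaving the example JSON template intact in the final prompt string.
extraction_prompt = prompts["tdm-extraction-system-prompt"].format()

paper_text = "..."  # the retrieved parts of a research paper (elided here)

# Hypothetical chat layout: system prompt plus the paper text as the user turn.
extraction_messages = [
    {"role": "system", "content": extraction_prompt},
    {"role": "user", "content": paper_text},
]

# The normalization prompt expects a candidate list followed by the input
# entity; this exact user-turn layout is assumed, mirroring the few-shot
# examples embedded in the prompts above.
normalization_messages = [
    {"role": "system", "content": prompts["normalization-system-prompt"]},
    {"role": "user", "content": "Item List: {'ROGUE-1', 'F1'}\nInput: f1 score"},
]

print(extraction_messages[0]["content"][:100])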