1 |
paramed |
MT |
Train:62,127; Dev:2,036; Test:2,102 |
[Dataset][paper] |
NEJM is a Chinese-English parallel corpus crawled from the New England Journal of Medicine website. English articles are distributed through https://www.nejm.org/ and Chinese articles are distributed through http://nejmqianyan.cn/. The corpus contains all article pairs (around 2000 pairs) since 2011. |
2 |
medal |
NER |
Train:300,000; Dev:100,000; Test:100,000 |
[Dataset][paper] |
The Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation and designed for natural language understanding pre-training in the medical domain.
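To make the disambiguation task concrete, here is a minimal hypothetical instance in the MeDAL style; the field names and values are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical MeDAL-style instance (illustrative field names): the model
# must recover the intended expansion of the abbreviation at `location`.
instance = {
    "text": "The patient's RA was well controlled on methotrexate.",
    "location": 2,                    # token index of the abbreviation "RA"
    "label": "rheumatoid arthritis",  # intended expansion
}
```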
3 |
anat_em |
NER |
Train:606; Dev:202; Test:404 |
[Dataset][paper] |
Anatomical entity mention recognition: the recognition of mentions of anatomical entities, i.e., organism parts at levels of organization between the molecular level and the whole organism.
4 |
chemdner |
NER |
Train:3,500; Dev:3,500; Test:3,000 |
[Dataset][paper] |
Chemical compound and drug name recognition task: detect mentions of chemical compounds and drugs from the text. |
5 |
scai_disease |
NER |
Train:400 |
[Dataset][paper] |
SCAI Disease is a dataset annotated in 2010 with mentions of diseases and adverse effects. It is a corpus containing 400 randomly selected MEDLINE abstracts generated using ‘Disease OR Adverse effect’ as a PubMed query. This evaluation corpus was annotated by two individuals who hold a Master’s degree in life sciences. |
6 |
tmvar_v1 |
NER |
Train:334; Test:166 |
[Dataset][paper] |
Extracting sequence variants in biomedical literature. |
7 |
scai_chemical |
NER |
Train:100 |
[Dataset][paper] |
SCAI Chemical is a corpus of MEDLINE abstracts that has been annotated to give an overview of the different chemical name classes found in MEDLINE text. |
8 |
nlmchem |
NER |
Train:80; Dev:20; Test:50 |
[Dataset][paper] |
The NLM-Chem corpus consists of 150 full-text articles from the PubMed Central Open Access dataset, drawn from 67 different chemical journals, aiming to cover a general distribution of usage of chemical names in the biomedical literature. Articles were selected so that human annotation would be most valuable, i.e., they were rich in bio-entities and current state-of-the-art named entity recognition systems disagreed on their recognition.
9 |
ask_a_patient |
NER |
Train:156,652; Dev:7,926; Test:8,662 |
[Dataset][paper] |
The AskAPatient dataset contains medical concepts written on social media mapped to how they are formally written in medical ontologies (SNOMED-CT and AMT). |
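A hedged sketch of what a single normalization pair might look like; the phrase is invented and the SNOMED-CT code is shown for illustration only:

```python
# Hypothetical AskAPatient-style pair: a colloquial social media phrase
# mapped to a formal ontology concept. Verify codes against SNOMED-CT.
mention = "felt like the room was spinning"
concept = {"id": "404640003", "preferred_term": "Dizziness"}  # illustrative
```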
10 |
citation_gia_test_collection |
NER |
Test:151 |
[Dataset][paper] |
The Citation GIA Test Collection was created for gene indexing at the NLM and includes 151 PubMed abstracts with both mention-level and document-level annotations. The abstracts were selected for their focus on human genes.
11 |
linnaeus |
NER |
Train:95 |
[Dataset][paper] |
Linnaeus is a novel corpus of full-text documents manually annotated for species mentions. |
12 |
mirna |
NER |
Train:201; Test:100 |
[Dataset][paper] |
The corpus consists of 301 Medline citations. The documents were screened for mentions of miRNA in the abstract text. Gene, disease and miRNA entities were manually annotated. The corpus comprises two separate files, a train and a test set, derived from 201 and 100 documents, respectively.
13 |
osiris |
NER |
Train:105 |
[Dataset][paper] |
The OSIRIS corpus is a set of MEDLINE abstracts manually annotated with human variation mentions. The corpus is distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (Furlong et al., BMC Bioinformatics 2008, 9:84).
14 |
tmvar_v2 |
NER |
Train:158 |
[Dataset][paper] |
This dataset contains 158 PubMed articles manually annotated with mutation mentions of various kinds and dbSNP normalizations for each of them. It can be used for NER and NED tasks. This dataset has a single split.
15 |
tmvar_v3 |
NER |
Test:500 |
[Dataset][paper] |
This dataset contains 500 PubMed articles manually annotated with mutation mentions of various kinds and dbSNP normalizations for each of them. In addition, it contains variant normalization options such as allele-specific identifiers from the ClinGen Allele Registry. It can be used for NER and NED tasks. This dataset does NOT have splits.
16 |
twadrl |
NER |
Train:48,200; Dev:1,300; Test:1,430 |
[Dataset][paper] |
The TwADR-L dataset contains medical concepts written on social media (Twitter) mapped to how they are formally written in medical ontologies (SIDER 4). |
17 |
cellfinder |
NER |
Train:5; Test:5 |
[Dataset][paper] |
The CellFinder project aims to create a stem cell data repository by linking information from existing public databases and by performing text mining on the research literature. The first version of the corpus is composed of 10 full-text documents containing more than 2,100 sentences, 65,000 tokens and 5,200 entity annotations. The corpus has been annotated with six types of entities (anatomical parts, cell components, cell lines, cell types, genes/proteins and species) with an overall inter-annotator agreement of around 80%.
18 |
ebm_pico |
NER |
Train:4,746; Test:187 |
[Dataset][paper] |
This corpus release contains 4,993 abstracts annotated with (P)articipants, (I)nterventions, and (O)utcomes. Training labels were sourced from Amazon Mechanical Turk (AMT) workers and aggregated to reduce noise; test labels were collected from medical professionals.
19 |
genetag |
NER |
Train:7,500; Dev:5,000; Test:2,500 |
[Dataset][paper] |
GENETAG is a corpus of 20K MEDLINE® sentences annotated for gene/protein NER; annotating such a corpus is a difficult process due to the complexity of gene/protein names. 15K GENETAG sentences were used for the BioCreAtIvE Task 1A competition.
20 |
pico_extraction |
NER |
Train:421 |
[Dataset][paper] |
This dataset contains annotations for Participants, Interventions, and Outcomes (the PICO task). Annotations from three medical experts are available for 423 sentences; the final annotations are obtained by majority voting, as sketched below.
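A minimal sketch of the majority-vote aggregation, assuming token-level PICO labels from three annotators; the tags and helper function are illustrative:

```python
from collections import Counter

def majority_vote(annotations):
    # `annotations`: one label sequence per annotator, all the same length
    # (one label per token). Returns the per-token majority label.
    aggregated = []
    for token_labels in zip(*annotations):
        label, _ = Counter(token_labels).most_common(1)[0]
        aggregated.append(label)
    return aggregated

# Three hypothetical annotators tagging a 5-token sentence with
# P(articipant) vs. O(utside) labels:
ann1 = ["P", "P", "O", "O", "O"]
ann2 = ["P", "O", "O", "O", "O"]
ann3 = ["P", "P", "O", "O", "P"]
print(majority_vote([ann1, ann2, ann3]))  # ['P', 'P', 'O', 'O', 'O']
```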
21 |
progene |
NER |
Train:309,267; Dev:16,769; Test:36,234 |
[Dataset][paper] |
The Protein/Gene corpus was developed at the JULIE Lab Jena under supervision of Prof. Udo Hahn. The executing scientist was Dr. Joachim Wermter. The main annotator was Dr. Rico Pusch who is an expert in biology. The corpus was developed in the context of the StemNet project (http://www.stemnet.de/). |
22 |
n2c2_2009_medication |
NER |
Train:10; Test:251 |
[Dataset][paper] |
The Third i2b2 Workshop on Natural Language Processing Challenges for Clinical Records focused on the identification of medications, their dosages, modes (routes) of administration, frequencies, durations, and reasons for administration in discharge summaries. |
23 |
n2c2_2006_deid |
NER |
Train:669; Test:220 |
[Dataset][paper] |
The data for the de-identification challenge came from Partners Healthcare and included solely medical discharge summaries. We prepared the data for the challenge by annotating the authentic PHI, marked in two stages, and replacing it with realistic surrogates.
24 |
n2c2_2014_deid |
NER |
Train:790; Test:514 |
[Dataset][paper] |
The de-identification track focused on identifying protected health information (PHI) in longitudinal clinical narratives. Track 1 (NER, PHI): HIPAA requires that patient medical records have all identifying information removed in order to protect patient privacy. There are 18 categories of Protected Health Information (PHI) identifiers of the patient, or of relatives, employers, or household members of the patient, that must be removed in order for a file to be considered de-identified.
25 |
gnormplus |
NER/NEN |
Train:432; Test:262 |
[Dataset][paper] |
Identify gene/protein names in biomedical literature, including gene/protein mentions, family names and domain names. |
26 |
ncbi_disease |
NER/NEN |
Train:592; Dev:100; Test:100 |
[Dataset][paper] |
Automatic disease recognition from free text. |
27 |
nlm_gene |
NER/NEN |
Train:450; Test:100 |
[Dataset][paper] |
Extract gene entities from the text. |
28 |
medmentions |
NER/NEN |
Train:2,635; Dev:878; Test:879 |
[Dataset][paper] |
MedMentions is a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. |
29 |
mediqa_qa |
QA-cqa |
Train:208; Dev:25; Test:150 |
[Dataset][paper] |
The MEDIQA challenge is an ACL-BioNLP 2019 shared task aiming to attract further research efforts in Natural Language Inference (NLI), Recognizing Question Entailment (RQE), and their applications in medical Question Answering (QA). |
30 |
emrQA |
QA-cqa |
Train:117,678 |
[Dataset][paper] |
emrQA is a semi-automatically generated large-scale question answering dataset on clinical notes available across several i2b2 challenges (until 2014).
31 |
DrugEHRQA |
QA-cqa |
query1:505; query2:8,218; query3:5,517; query4:7,291; query5:5,952; query6:809; query7:6,015; query8:5,302; query9:2,093 |
[Dataset][paper] |
DrugEHRQA is the first question answering (QA) dataset containing question-answer pairs drawn from both structured tables and discharge summaries of MIMIC-III.
32 |
med_qa |
QA-multiple_choice |
Train:10,178; Dev:1,272; Test:1,273 |
[Dataset][paper] |
MedQA is the first free-form multiple-choice OpenQA dataset for solving medical problems, collected from professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. Together with the question data, we also collect and release a large-scale corpus of medical textbooks from which reading comprehension models can obtain the knowledge needed to answer the questions.
33 |
biomrc |
QA-multiple_choice |
Train:700,000; Dev:50,000; Test:62,707 |
[Dataset][paper] |
A large-scale cloze-style biomedical MRC dataset. Care was taken to reduce noise compared to the previous BIOREAD dataset of Pappas et al. (2018).
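For readers unfamiliar with the cloze setup, here is a hypothetical instance in the BIOMRC spirit; the passage, field names, and placeholder convention are illustrative:

```python
# Hypothetical cloze-style instance: biomedical entities are pseudonymized
# as @entityN, one mention in the question is masked (here as XXXX), and
# the model picks the masked entity from the candidate list.
instance = {
    "passage": "@entity1 inhibits @entity2, reducing inflammation in mice.",
    "question": "Treatment with @entity1 suppresses XXXX signalling.",
    "candidates": ["@entity1", "@entity2", "@entity3"],
    "answer": "@entity2",
}
```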
34 |
evidence_inference |
QA-multiple_choice |
Train:10,056; Dev:1,233; Test:1,222 |
[Dataset][paper] |
The dataset consists of biomedical articles describing randomized control trials (RCTs) that compare multiple treatments. Each article has multiple questions, or 'prompts', associated with it. These prompts ask about the relationship between an intervention and a comparator with respect to an outcome, as reported in the trial. For example, a prompt may ask about the reported effects of aspirin as compared to placebo on the duration of headaches. For the sake of this task, we assume that a particular article will report that the intervention of interest either significantly increased, significantly decreased, or had no significant effect on the outcome, relative to the comparator.
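A hypothetical prompt in this format; the field names are assumptions rather than the dataset's actual schema:

```python
# Hypothetical evidence_inference-style prompt: classify the reported
# effect of the intervention vs. the comparator on the outcome.
prompt = {
    "intervention": "aspirin",
    "comparator": "placebo",
    "outcome": "duration of headaches",
    "label": "significantly decreased",  # or "significantly increased",
                                         # or "no significant effect"
}
```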
35 |
medhop |
QA-multiple_choice |
Train:1,620; Dev:342 |
[Dataset][paper] |
With the same format as WikiHop, this dataset is based on research paper abstracts from PubMed, and the queries are about interactions between pairs of drugs. The correct answer has to be inferred by combining information from a chain of reactions of drugs and proteins. |
36 |
sciq |
QA-multiple_choice |
Train:11,679; Dev:1,000; Test:1,000
[Dataset][paper] |
The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For most questions, an additional paragraph with supporting evidence for the correct answer is provided. |
37 |
MedMCQA |
QA-multiple_choice |
Train:70,735; Dev:4,183 |
[Dataset][paper] |
A large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions.
38 |
LiveQA |
QA-sqa |
Train:446; Test:104 |
[Dataset][paper] |
The LiveQA'17 medical task focuses on consumer health question answering. We use consumer health questions received by the U.S. National Library of Medicine (NLM). |
39 |
Medication_QA |
QA-sqa |
Train:690 |
[Dataset][paper] |
This task focuses on real consumer health question answering. The data consists of 674 question-answer pairs with annotations of the question focus and type and the answer source.
40 |
bionlp_st_2011_rel |
NER,RE,COREF |
Train:800; Dev:150; Test:260 |
[Dataset][paper] |
The Entity Relations (REL) task is a supporting task of the BioNLP Shared Task 2011. The task concerns the extraction of two types of part-of relations between a gene/protein and an associated entity. |
41 |
iCliniq-10k |
QA-sqa |
Train:21,963 |
[Dataset][paper] |
ChatDoctor data: real conversations between patients and doctors from iCliniq.com.
42 |
HealthCareMagic-100k |
QA-sqa |
Train:112,164 |
[Dataset][paper] |
ChatDoctor data: 100k real conversations between patients and doctors from HealthCareMagic.com. |
43 |
biology_how_why_corpus |
QA-sqa |
Train:1,270 |
[Dataset][paper] |
This dataset consists of 185 "how" and 193 "why" biology questions authored by a domain expert, with one or more gold answer passages identified in an undergraduate textbook. The expert was not constrained in any way during the annotation process, so gold answers might be smaller than a paragraph or span multiple paragraphs. This dataset was used for the question-answering system described in the paper “Discourse Complements Lexical Semantics for Non-factoid Answer Reranking” (ACL 2014). |
44 |
pubmed_qa |
QA-yesno |
Train:450; Dev:50; Test:500 |
[Dataset][paper] |
PubMedQA is a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer biomedical research questions with yes/no/maybe using the corresponding abstracts. PubMedQA has 1k expert-annotated (PQA-L), 61.2k unlabeled (PQA-U) and 211.3k artificially generated QA instances (PQA-A).
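A minimal loading sketch, assuming the Hugging Face `datasets` copy of PubMedQA whose config names mirror the paper's subsets (pqa_labeled = PQA-L, pqa_unlabeled = PQA-U, pqa_artificial = PQA-A); verify field names against the dataset card:

```python
from datasets import load_dataset

# Assumes the Hugging Face Hub copy of PubMedQA; configs mirror the
# paper's subsets: pqa_labeled (PQA-L), pqa_unlabeled (PQA-U),
# pqa_artificial (PQA-A).
pqa_l = load_dataset("pubmed_qa", "pqa_labeled")

example = pqa_l["train"][0]
print(example["question"])        # a research question from a PubMed title
print(example["final_decision"])  # "yes", "no", or "maybe"
```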
45 |
iepa |
RE |
Train:161; Test:41 |
[Dataset][paper] |
The IEPA benchmark PPI corpus is designed for relation extraction. It was created from 303 PubMed abstracts, each of which contains a specific pair of co-occurring chemicals. |
46 |
lll |
RE |
Train:77 |
[Dataset][paper] |
The LLL05 challenge task is to learn rules to extract protein/gene interactions from biology abstracts from the Medline bibliography database. The goal of the challenge is to test the ability of the participating IE systems to identify the interactions and the gene/proteins that interact. |
47 |
genia_relation_corpus |
RE |
Train:800; Dev:150; Test:260 |
[Dataset][paper] |
The extraction of various relations stated to hold between biomolecular entities is one of the most frequently addressed information extraction tasks in domain studies. Typical relation extraction targets involve protein-protein interactions or gene regulatory relations. However, in the GENIA corpus, such associations involving change in the state or properties of biomolecules are captured in the event annotation. The GENIA corpus relation annotation aims to complement the event annotation of the corpus by capturing (primarily) static relations, relations such as part-of that hold between entities without (necessarily) involving change. |
48 |
MIMICause |
RE |
Train:2,714 |
[Dataset][paper] |
The task is to identify the type and direction of causal relations between a pair of biomedical concepts in clinical notes, communicated implicitly or explicitly and identified either within a single sentence or across multiple sentences.
49 |
bc7_litcovid |
TC |
Train:24,960; Dev:2,500; Test:6,239 |
[Dataset][paper] |
The training and development datasets contain the publicly available text of over 30 thousand COVID-19-related articles and their metadata (e.g., title, abstract, journal). Articles in both datasets have been manually reviewed and annotated by in-house models.
50 |
geokhoj_v1 |
TC |
Train:25,000; Test:5,000 |
[Dataset][paper] |
GEOKhoj v1 is an annotated corpus of control/perturbation labels for 30,000 samples from microarray, transcriptomics, and single-cell experiments available in the GEO (Gene Expression Omnibus) database.
51 |
gad |
TC |
Train:4,261 |
[Dataset][paper] |
A corpus identifying associations between genes and diseases by a semi-automatic annotation procedure based on the Genetic Association Database. |
52 |
hallmarks_of_cancer |
TC |
Train:12,119; Dev:1,798; Test:3,547 |
[Dataset][paper] |
The Hallmarks of Cancer (HOC) Corpus consists of 1,852 PubMed publication abstracts manually annotated by experts according to a taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus. The labels are found under the "labels" directory, while the tokenized text can be found under the "text" directory. The filenames are the corresponding PubMed IDs (PMID).
53 |
meddialog |
TC |
Train:981; Dev:126; Test:122 |
[Dataset][paper] |
The MedDialog dataset (English) contains conversations (in English) between doctors and patients. It has 0.26 million dialogues. The data is continuously growing and more dialogues will be added. The raw dialogues are from healthcaremagic.com and icliniq.com. All copyrights of the data belong to healthcaremagic.com and icliniq.com.
54 |
pubhealth |
TC |
Train:9,804; Dev:1,223; Test:1,231 |
[Dataset][paper] |
A dataset of 11,832 claims for fact-checking, related to a range of health topics including biomedical subjects (e.g., infectious diseases, stem cell research), government healthcare policy (e.g., abortion, mental health, women's health), and other public health-related stories.
55 |
scicite |
TC |
Train:8,243; Dev:916; Test:1,861 |
[Dataset][paper] |
SciCite is a dataset of 11K manually annotated citation intents based on citation context in the computer science and biomedical domains. |
56 |
n2c2_2006_smokers |
TC |
Train:398; Test:104 |
[Dataset][paper] |
The data for the n2c2 2006 smoking challenge consisted of discharge summaries from Partners HealthCare, which were then de-identified, tokenized, broken into sentences, converted into XML format, and separated into training and test sets. Two pulmonologists annotated each record with the smoking status of patients based strictly on the explicitly stated smoking-related facts in the records. |
57 |
n2c2_2008_obesity |
TC |
Train:730; Test:507 |
[Dataset][paper] |
The data for the n2c2 2008 obesity challenge consisted of discharge summaries from the Partners HealthCare Research Patient Data Repository. These data were chosen from the discharge summaries of patients who were overweight or diabetic and had been hospitalized for obesity or diabetes sometime since 12/1/04. De-identification was performed semi-automatically. |
58 |
n2c2_2018_track1 |
TC |
Train:202; Test:86 |
[Dataset][paper] |
Track 1 of the 2018 National NLP Clinical Challenges shared tasks focused on identifying which patients in a corpus of longitudinal medical records meet, and which do not meet, a set of selection criteria taken from real clinical trials, testing whether NLP systems can be trained to make this determination.
59 |
scifact |
TC |
Train:809; Dev:300; Test:300 |
[Dataset][paper] |
SciFact is a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts and annotated with labels and rationales. One config provides the abstracts and document ids; another connects the claims to the evidence and doc ids. Two classification tasks are defined: (1) given a claim and a text span composed of one or more sentences from an abstract, predict a label from ("rationale", "not_rationale") indicating whether the span is evidence (supporting or refuting) for the claim, roughly the second task outlined in Section 5 of the paper; (2) given a claim and such a span, predict a label from ("SUPPORT", "NOINFO", "CONTRADICT") indicating whether the span supports, provides no information about, or contradicts the claim, roughly the third task outlined in Section 5 of the paper.
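A hypothetical instance of the second task under the label set above; the field names are assumptions:

```python
# Hypothetical SciFact-style instance for claim-span labeling.
instance = {
    "claim": "Drug X reduces systolic blood pressure in hypertensive adults.",
    "span": "Patients receiving Drug X showed a mean 10 mmHg reduction in "
            "systolic pressure relative to placebo.",
    "label": "SUPPORT",  # one of "SUPPORT", "NOINFO", "CONTRADICT"
}
```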
60 |
bio_sim_verb |
TP-ss |
Train:1,000 |
[Dataset][paper] |
This repository contains the evaluation datasets for the paper Bio-SimVerb and Bio-SimLex: Wide-coverage Evaluation Sets of Word Similarity in Biomedicine by Billy Chiu, Sampo Pyysalo and Anna Korhonen. |
61 |
bio_simlex |
TP-ss |
Train:988 |
[Dataset][paper] |
Bio-SimLex enables intrinsic evaluation of word representations. This evaluation can serve as a predictor of performance on various downstream tasks in the biomedical domain. The results on Bio-SimLex using standard word representation models highlight the importance of developing dedicated evaluation resources for NLP in biomedicine for particular word classes (e.g. verbs). |
62 |
biosses |
TP-ss |
Train:128; Dev:32; Test:40 |
[Dataset][paper] |
BIOSSES computes similarity of biomedical sentences by utilizing WordNet as the general domain ontology and UMLS as the biomedical domain specific ontology. |
63 |
ehr_rel |
TP-ss |
Train:3,741 |
[Dataset][paper] |
EHR-Rel is a novel open-source biomedical concept relatedness dataset consisting of 3,630 concept pairs, six times more than the largest existing dataset. Instead of manually selecting and pairing concepts as done in previous work, the dataset is sampled from EHRs to ensure concepts are relevant for the EHR concept retrieval task. A detailed analysis of the concepts in the dataset reveals a far larger coverage compared to existing datasets.
64 |
mayosrs |
TP-ss |
Train:101 |
[Dataset][paper] |
MayoSRS consists of 101 clinical term pairs whose relatedness was determined by nine medical coders and three physicians from the Mayo Clinic. |
65 |
minimayosrs |
TP-ss |
Train:29 |
[Dataset][paper] |
MiniMayoSRS is a subset of the MayoSRS and consists of 30 term pairs on which a higher inter-annotator agreement was achieved. The average correlation between physicians is 0.68. The average correlation between medical coders is 0.78. |
66 |
mqp |
TP-ss |
Train:3,048 |
[Dataset][paper] |
The Medical Question Pairs dataset by McCreery et al. (2020) contains pairs of medical questions and paraphrased versions of each question prepared by medical professionals. Paraphrased versions were labelled as similar (syntactically dissimilar but contextually similar) or dissimilar (syntactically similar-looking but contextually dissimilar). Labels: 1 = similar, 0 = dissimilar.
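A hypothetical MQP-style pair under the labeling scheme above; the field names are assumptions:

```python
# Hypothetical MQP-style pair: 1 = contextually similar, 0 = dissimilar.
pair = {
    "question_1": "Can I take ibuprofen together with my blood pressure pills?",
    "question_2": "Is it safe to combine ibuprofen with antihypertensive medication?",
    "label": 1,  # similar intent despite different wording
}
```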
67 |
umnsrs |
TP-ss |
Train:566 |
[Dataset][paper] |
UMNSRS, developed by Pakhomov et al., consists of 725 clinical term pairs rated for semantic similarity and relatedness. The similarity and relatedness of each term pair was annotated on a continuous scale by having a medical resident touch a bar on a touch-sensitive computer screen to indicate the degree of similarity or relatedness.
68 |
scitail |
TP-te |
Train:23,596; Dev:1,304; Test:2,126 |
[Dataset][paper] |
The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis. We use information retrieval to obtain relevant text from a large text corpus of web sentences, and use these sentences as a premise P. We crowdsource the annotation of such premise-hypothesis pair as supports (entails) or not (neutral), in order to create the SciTail dataset. The dataset contains 27,026 examples with 10,101 examples with entails label and 16,925 examples with neutral label. |
69 |
mediqa_rqe |
TP-te |
Train:8,588; Dev:302; Test:230 |
[Dataset][paper] |
The MEDIQA challenge is an ACL-BioNLP 2019 shared task aiming to attract further research efforts in Natural Language Inference (NLI), Recognizing Question Entailment (RQE), and their applications in medical Question Answering (QA). |
70 |
meqsum |
TT-ds |
Train:1,000 |
[Dataset][paper] |
Dataset for medical question summarization introduced in the ACL 2019 paper "On the Summarization of Consumer Health Questions". Question understanding is one of the main challenges in question answering. |
71 |
multi_xscience |
TT-ds |
Train:30,369; Dev:5,066; Test:5,093 |
[Dataset][paper] |
Multi-document summarization is a challenging task for which few large-scale datasets exist. We propose Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references. Our work is inspired by extreme summarization, a dataset construction protocol that favours abstractive modeling approaches. Descriptive statistics and empirical results, using several state-of-the-art models trained on the Multi-XScience dataset, reveal that Multi-XScience is well suited for abstractive models.
72 |
bionlp_st_2011_epi |
EE,NER,COREF |
Train:600; Dev:200 |
[Dataset][paper] |
The dataset of the Epigenetics and Post-translational Modifications (EPI) task of BioNLP Shared Task 2011. |
73 |
bionlp_st_2011_ge |
EE,NER,COREF |
Train:908; Test:347 |
[Dataset][paper] |
The BioNLP-ST GE task has been promoting development of fine-grained information extraction (IE) from biomedical documents, since 2009. Particularly, it has focused on the domain of NFkB as a model domain of Biomedical IE. The GENIA task aims at extracting events occurring upon genes or gene products, which are typed as "Protein" without differentiating genes from gene products. Other types of physical entities, e.g. cells, cell components, are not differentiated from each other, and their type is given as "Entity". |
74 |
bionlp_st_2013_cg |
EE,NER,COREF |
Train:300; Dev:100; Test:200 |
[Dataset][paper] |
The Cancer Genetics (CG) task is an event extraction task and a main task of the BioNLP Shared Task (ST) 2013. The CG task is an information extraction task targeting the recognition of events in text, represented as structured n-ary associations of given physical entities. In addition to addressing the cancer domain, the CG task is differentiated from previous event extraction tasks in the BioNLP ST series in addressing a wide range of pathological processes and multiple levels of biological organization, ranging from the molecular through the cellular and organ levels up to whole organisms. Final test set submissions were accepted from six teams.
75 |
bionlp_st_2013_pc |
EE,NER,COREF |
Train:260; Dev:90; Test:175 |
[Dataset][paper] |
The Pathway Curation (PC) task is a main event extraction task of the BioNLP Shared Task (ST) 2013. The PC task concerns the automatic extraction of biomolecular reactions from text. The task setting, representation and semantics are defined with respect to pathway model standards and ontologies (SBML, BioPAX, SBO) and documents selected by relevance to specific model reactions. Two BioNLP ST 2013 participants successfully completed the PC task. The highest achieved F-score, 52.8%, indicates that event extraction is a promising approach to supporting pathway curation efforts.
76 |
bionlp_st_2011_id |
EE,NER,COREF |
Train:152; Dev:46; Test:118 |
[Dataset][paper] |
The dataset of the Infectious Diseases (ID) task of BioNLP Shared Task 2011. |
77 |
genia_ptm_event_corpus |
EE,NER,COREF |
Train:112 |
[Dataset][paper] |
Post-translational modifications (PTMs), amino acid modifications of proteins after translation, are one of the posterior processes of protein biosynthesis for many proteins, and they are critical for determining protein function, such as activity state, localization, turnover and interactions with other biomolecules. While there have been many studies of information extraction targeting individual PTM types, until recently there was little effort to address the extraction of multiple PTM types at once in a unified framework.
78 |
bionlp_shared_task_2009 |
EE,NER,COREF |
Train:800; Dev:150; Test:260 |
[Dataset][paper] |
The BioNLP Shared Task 2009 was organized by GENIA Project and its corpora were curated based on the annotations of the publicly available GENIA Event corpus and an unreleased (blind) section of the GENIA Event corpus annotations, used for evaluation. |
79 |
bionlp_st_2013_ge |
EE,NER,RE,COREF |
Train:222; Dev:249; Test:305 |
[Dataset][paper] |
The BioNLP-ST GE task has been promoting development of fine-grained information extraction (IE) from biomedical documents, since 2009. Particularly, it has focused on the domain of NFkB as a model domain of Biomedical IE. |
80 |
mlee |
EE,NER,RE,COREF |
Train:131; Dev:44; Test:87 |
[Dataset][paper] |
MLEE is an event extraction corpus consisting of manually annotated abstracts of papers on angiogenesis. It contains annotations for entities, relations, events and coreferences. The annotations span molecular-, cellular-, tissue-, and organ-level processes.
81 |
bionlp_st_2013_gro |
EE,NER,RE,COREF |
Train:150; Dev:50; Test:100 |
[Dataset][paper] |
GRO Task: populating the Gene Regulation Ontology with events and relations. A dataset from the BioNLP Shared Task 2013 competition.
82 |
pdr |
EE,NER,RE,COREF |
Train:179 |
[Dataset][paper] |
This corpus annotates plant and disease mentions and their relations in PubMed abstracts. NCBI Taxonomy IDs and MEDIC IDs were annotated for plant and disease mentions, respectively. 1,307 relations were annotated across 199 abstracts, with 1,403 plant and 1,758 disease mentions.
83 |
MedDialog-en |
MRD,QA-sqa |
dialogue1:34,297; dialogue2:106,185; dialogue3:69,356; dialogue4:16,502 |
[Dataset][paper] |
The MedDialog dataset (English) contains conversations (in English) between doctors and patients. It has 0.26 million dialogues. The data is continuously growing and more dialogues will be added. The raw dialogues are from healthcaremagic.com and icliniq.com. |
84 |
bioinfer |
NER,RE |
Train:894; Test:206 |
[Dataset][paper] |
A corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. |
85 |
ddi_corpus |
NER,RE |
Train:714; Test:303 |
[Dataset][paper] |
The DDI corpus has been manually annotated with drugs and pharmacokinetic and pharmacodynamic interactions. It contains 1,025 documents from two different sources: the DrugBank database and MedLine.
86 |
hprd50 |
NER,RE |
Train:34; Test:9 |
[Dataset][paper] |
HPRD50 is a dataset of randomly selected, hand-annotated abstracts of biomedical papers referenced by the Human Protein Reference Database (HPRD). It is parsed in XML format, splitting each abstract into sentences, and in each sentence there may be entities and interactions between those entities. In this particular dataset, entities are all proteins and interactions are thus protein-protein interactions. Moreover, all entities are normalized to the HPRD database. These normalized terms are stored in each entity's 'type' attribute in the source XML. This means the dataset can determine e.g. that "Janus kinase 2" and "Jak2" are referencing the same normalized entity. Because the dataset contains entities and relations, it is suitable for Named Entity Recognition and Relation Extraction. |
87 |
chebi_nactem |
NER,RE |
Train:100 |
[Dataset][paper] |
The ChEBI corpus contains 199 annotated abstracts and 100 annotated full papers. All documents in the corpus have been annotated for named entities and relations between these. In total, our corpus provides over 15,000 named entity annotations and over 6,000 relations between entities.
88 |
chia |
NER,RE |
Train:2,000 |
[Dataset][paper] |
A large annotated corpus of patient eligibility criteria extracted from 1,000 interventional, Phase IV clinical trials registered in ClinicalTrials.gov. This dataset includes 12,409 annotated eligibility criteria, represented by 41,487 distinctive entities of 15 entity types and 25,017 relationships of 12 relationship types. |
89 |
euadr |
NER,RE |
Train:300 |
[Dataset][paper] |
The EU-ADR corpus has been annotated for drugs, disorders, genes and their inter-relationships. For each of the drug-disorder, drug-target, and target-disorder relations, three experts annotated a set of 100 abstracts. A named-entity recognition system produced a first annotation, which annotators then revised using a web-based interface; the inter-annotator agreement achieved was much better than the agreement with the system-provided annotations. These annotated relationships are intended to train and evaluate text-mining software that captures such relationships in texts.
90 |
seth_corpus |
NER,RE |
Train:630 |
[Dataset][paper] |
SNP named entity recognition corpus consisting of 630 PubMed citations. |
91 |
verspoor_2013 |
NER,RE |
Train:120 |
[Dataset][paper] |
This dataset contains annotations for a small corpus of full text journal publications on the subject of inherited colorectal cancer. It is suitable for Named Entity Recognition and Relation Extraction tasks. It uses the Variome Annotation Schema, a schema that aims to capture the core concepts and relations relevant to cataloguing and interpreting human genetic variation and its relationship to disease, as described in the published literature. The schema was inspired by the needs of the database curators of the International Society for Gastrointestinal Hereditary Tumours (InSiGHT) database, but is intended to have application to genetic variation information in a range of diseases. |
92 |
bionlp_st_2019_bb |
NER,RE |
Train:133; Dev:66; Test:96 |
[Dataset][paper] |
The task focuses on the extraction of the locations and phenotypes of microorganisms from PubMed abstracts and full-text excerpts, and the characterization of these entities with respect to reference knowledge sources (NCBI taxonomy, OntoBiotope ontology). The task is motivated by the importance of the knowledge on biodiversity for fundamental research and applications in microbiology. |
93 |
biored |
NER,RE |
Train:400; Dev:100; Test:100 |
[Dataset][paper] |
Relation Extraction corpus with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene-disease; chemical-chemical), on a set of 600 PubMed articles. |
94 |
cpi |
NER,RE |
Train:1,808 |
[Dataset][paper] |
The compound-protein relationship (CPI) dataset consists of 2,613 sentences from abstracts containing annotations of proteins, small molecules, and their relationships. |
95 |
spl_adr_200db |
NER,RE |
Train:101 |
[Dataset][paper] |
The United States Food and Drug Administration (FDA) partnered with the National Library of Medicine to create a pilot dataset containing standardised information about known adverse reactions for 200 FDA-approved drugs. The Structured Product Labels (SPLs), the documents FDA uses to exchange information about drugs and other products, were manually annotated for adverse reactions at the mention level to facilitate development and evaluation of text mining tools for extraction of ADRs from all SPLs. The ADRs were then normalised to the Unified Medical Language System (UMLS) and to the Medical Dictionary for Regulatory Activities (MedDRA). |
96 |
jnlpba |
NER |
Train:18,546; Dev:3,856 |
[Dataset][paper] |
NER for bio-entities.
97 |
n2c2_2010_relation |
NER,RE |
Train:97; Test:256 |
[Dataset][paper] |
The i2b2/VA corpus contained de-identified discharge summaries from Beth Israel Deaconess Medical Center, Partners Healthcare, and University of Pittsburgh Medical Center (UPMC). The 2010 i2b2/VA Workshop on Natural Language Processing Challenges for Clinical Records comprises three tasks: 1) a concept extraction task focused on the extraction of medical concepts from patient reports; 2) an assertion classification task focused on assigning assertion types for medical problem concepts; 3) a relation classification task focused on assigning relation types that hold between medical problems, tests, and treatments. |
98 |
n2c2_2018_track2 |
NER,RE |
Train:303; Test:202 |
[Dataset][paper] |
The National NLP Clinical Challenges (n2c2), organized in 2018, continued the legacy of i2b2 (Informatics for Biology and the Bedside), adding 2 new tracks and 2 new sets of data to the shared tasks organized since 2006. Track 2 of 2018 n2c2 shared tasks focused on the extraction of medications, with their signature information, and adverse drug events (ADEs) from clinical narratives. This track built on our previous medication challenge, but added a special focus on ADEs. |
99 |
drugprot |
NER,RE |
Train:3,500; Dev:750 |
[Dataset][paper] |
The DrugProt corpus consists of (a) expert-labelled chemical and gene mentions, and (b) all binary relationships between them corresponding to a specific set of biologically relevant relation types.
100 |
bc5cdr |
NER/NEN,RE |
Train:500; Dev:500; Test:500 |
[Dataset][paper] |
chemical-induced disease relation extraction: automatic extraction of mechanistic and biomarker chemical-induced disease relations from the biomedical literature. |
101 |
biorelex |
NER,RE,COREF |
Train:1,405; Dev:201 |
[Dataset][paper] |
BioRelEx is a biological relation extraction dataset. Version 1.0 contains 2,010 annotated sentences that describe binding interactions between various biological entities (proteins, chemicals, etc.); 1,405 sentences are for training and another 201 for validation. All sentences contain the words "bind", "bound" or "binding". For every sentence we provide: (1) complete annotations of all biological entities that appear in the sentence; (2) entity types (32 types) and grounding information for most of the proteins and families (links to UniProt, InterPro and other databases); (3) coreference between entities in the same sentence (e.g., abbreviations and synonyms); (4) binding interactions between the annotated entities; and (5) binding interaction types: positive, negative (A does not bind B) and neutral (A may bind to B).
102 |
an_em |
NER,RE,COREF |
Train:250; Dev:50; Test:200 |
[Dataset][paper] |
AnEM corpus is a domain- and species-independent resource manually annotated for anatomical entity mentions using a fine-grained classification system. The corpus consists of 500 documents (over 90,000 words) selected randomly from citation abstracts and full-text papers with the aim of making the corpus representative of the entire available biomedical scientific literature. The corpus annotation covers mentions of both healthy and pathological anatomical entities and contains over 3,000 annotated mentions. |