1 |
paramed |
MT |
Train:62,127; Dev:2,036; Test:2,102 |
[Dataset][paper] |
NEJM is a Chinese-English parallel corpus crawled from the New England Journal of Medicine website. English articles are distributed through https://www.nejm.org/ and Chinese articles are distributed through http://nejmqianyan.cn/. The corpus contains all article pairs (around 2000 pairs) since 2011. |
2 |
medal |
NER |
Train:300,000; Dev:100,000; Test:100,000 |
[Dataset][paper] |
The Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation and designed for natural language understanding pre-training in the medical domain.
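To make the disambiguation task concrete, here is a minimal hypothetical instance in the MeDAL style; the field names and values are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical MeDAL-style instance (illustrative field names): the model
# must recover the intended expansion of the abbreviation at `location`.
instance = {
    "text": "The patient's RA was well controlled on methotrexate.",
    "location": 2,                    # token index of the abbreviation "RA"
    "label": "rheumatoid arthritis",  # intended expansion
}
```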
3 |
anat_em |
NER |
Train:606; Dev:202; Test:404 |
[Dataset][paper] |
Anatomical entity mention recognition: the recognition of mentions of anatomical entities, i.e., organism parts at levels of organization between the molecular level and the whole organism.
4 |
chemdner |
NER |
Train:3,500; Dev:3,500; Test:3,000 |
[Dataset][paper] |
Chemical compound and drug name recognition task: detect mentions of chemical compounds and drugs from the text. |
5 |
scai_disease |
NER |
Train:400 |
[Dataset][paper] |
SCAI Disease is a dataset annotated in 2010 with mentions of diseases and adverse effects. It is a corpus containing 400 randomly selected MEDLINE abstracts generated using ‘Disease OR Adverse effect’ as a PubMed query. This evaluation corpus was annotated by two individuals who hold a Master’s degree in life sciences. |
6 |
tmvar_v1 |
NER |
Train:334; Test:166 |
[Dataset][paper] |
Extracting sequence variants in biomedical literature. |
7 |
scai_chemical |
NER |
Train:100 |
[Dataset][paper] |
SCAI Chemical is a corpus of MEDLINE abstracts that has been annotated to give an overview of the different chemical name classes found in MEDLINE text. |
8 |
nlmchem |
NER |
Train:80; Dev:20; Test:50 |
[Dataset][paper] |
The NLM-Chem corpus consists of 150 full-text articles from the PubMed Central Open Access dataset, drawn from 67 different chemical journals, aiming to cover a general distribution of usage of chemical names in the biomedical literature. Articles were selected so that human annotation would be most valuable, i.e., they were rich in bio-entities and current state-of-the-art named entity recognition systems disagreed on their recognition.
9 |
ask_a_patient |
NER |
Train:156,652; Dev:7,926; Test:8,662 |
[Dataset][paper] |
The AskAPatient dataset contains medical concepts written on social media mapped to how they are formally written in medical ontologies (SNOMED-CT and AMT). |
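A hedged sketch of what a single normalization pair might look like; the phrase is invented and the SNOMED-CT code is shown for illustration only:

```python
# Hypothetical AskAPatient-style pair: a colloquial social media phrase
# mapped to a formal ontology concept. Verify codes against SNOMED-CT.
mention = "felt like the room was spinning"
concept = {"id": "404640003", "preferred_term": "Dizziness"}  # illustrative
```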
10 |
citation_gia_test_collection |
NER |
Test:151 |
[Dataset][paper] |
The Citation GIA Test Collection was created for gene indexing at the NLM and includes 151 PubMed abstracts with both mention-level and document-level annotations. The abstracts were selected for their focus on human genes.
11 |
linnaeus |
NER |
Train:95 |
[Dataset][paper] |
Linnaeus is a novel corpus of full-text documents manually annotated for species mentions. |
12 |
mirna |
NER |
Train:201; Test:100 |
[Dataset][paper] |
The corpus consists of 301 Medline citations. The documents were screened for mentions of miRNA in the abstract text. Gene, disease and miRNA entities were manually annotated. The corpus comprises two separate files, a train and a test set, derived from 201 and 100 documents, respectively.
13 |
osiris |
NER |
Train:105 |
[Dataset][paper] |
The OSIRIS corpus is a set of MEDLINE abstracts manually annotated with human variation mentions. The corpus is distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (Furlong et al., BMC Bioinformatics 2008, 9:84).
14 |
tmvar_v2 |
NER |
Train:158 |
[Dataset][paper] |
This dataset contains 158 PubMed articles manually annotated with mutation mentions of various kinds and dbSNP normalizations for each of them. It can be used for NER and NED tasks. This dataset has a single split.
15 |
tmvar_v3 |
NER |
Test:500 |
[Dataset][paper] |
This dataset contains 500 PubMed articles manually annotated with mutation mentions of various kinds and dbSNP normalizations for each of them. In addition, it contains variant normalization options such as allele-specific identifiers from the ClinGen Allele Registry. It can be used for NER and NED tasks. This dataset does NOT have splits.
16 |
twadrl |
NER |
Train:48,200; Dev:1,300; Test:1,430 |
[Dataset][paper] |
The TwADR-L dataset contains medical concepts written on social media (Twitter) mapped to how they are formally written in medical ontologies (SIDER 4). |
17 |
cellfinder |
NER |
Train:5; Test:5 |
[Dataset][paper] |
The CellFinder project aims to create a stem cell data repository by linking information from existing public databases and by performing text mining on the research literature. The first version of the corpus is composed of 10 full-text documents containing more than 2,100 sentences, 65,000 tokens and 5,200 entity annotations. The corpus has been annotated with six types of entities (anatomical parts, cell components, cell lines, cell types, genes/proteins and species) with an overall inter-annotator agreement of around 80%.
18 |
ebm_pico |
NER |
Train:4,746; Test:187 |
[Dataset][paper] |
This corpus release contains 4,993 abstracts annotated with (P)articipants, (I)nterventions, and (O)utcomes. Training labels were sourced from Amazon Mechanical Turk (AMT) workers and aggregated to reduce noise; test labels were collected from medical professionals.
19 |
genetag |
NER |
Train:7,500; Dev:5,000; Test:2,500 |
[Dataset][paper] |
GENETAG is a corpus of 20K MEDLINE® sentences annotated for gene/protein NER; annotating such a corpus is a difficult process due to the complexity of gene/protein names. 15K GENETAG sentences were used for the BioCreAtIvE Task 1A competition.
20 |
pico_extraction |
NER |
Train:421 |
[Dataset][paper] |
This dataset contains annotations for Participants, Interventions, and Outcomes (the PICO task). Annotations from three medical experts are available for 423 sentences; the final annotations are obtained by majority voting, as sketched below.
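A minimal sketch of the majority-vote aggregation, assuming token-level PICO labels from three annotators; the tags and helper function are illustrative:

```python
from collections import Counter

def majority_vote(annotations):
    # `annotations`: one label sequence per annotator, all the same length
    # (one label per token). Returns the per-token majority label.
    aggregated = []
    for token_labels in zip(*annotations):
        label, _ = Counter(token_labels).most_common(1)[0]
        aggregated.append(label)
    return aggregated

# Three hypothetical annotators tagging a 5-token sentence with
# P(articipant) vs. O(utside) labels:
ann1 = ["P", "P", "O", "O", "O"]
ann2 = ["P", "O", "O", "O", "O"]
ann3 = ["P", "P", "O", "O", "P"]
print(majority_vote([ann1, ann2, ann3]))  # ['P', 'P', 'O', 'O', 'O']
```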
21 |
progene |
NER |
Train:309,267; Dev:16,769; Test:36,234 |
[Dataset][paper] |
The Protein/Gene corpus was developed at the JULIE Lab Jena under supervision of Prof. Udo Hahn. The executing scientist was Dr. Joachim Wermter. The main annotator was Dr. Rico Pusch who is an expert in biology. The corpus was developed in the context of the StemNet project (http://www.stemnet.de/). |
22 |
n2c2_2009_medication |
NER |
Train:10; Test:251 |
[Dataset][paper] |
The Third i2b2 Workshop on Natural Language Processing Challenges for Clinical Records focused on the identification of medications, their dosages, modes (routes) of administration, frequencies, durations, and reasons for administration in discharge summaries. |
23 |
n2c2_2006_deid |
NER |
Train:669; Test:220 |
[Dataset][paper] |
The data for the de-identification challenge came from Partners Healthcare and included solely medical discharge summaries. We prepared the data for the challenge by annotating the authentic PHI, marked in two stages, and replacing it with realistic surrogates.
24 |
n2c2_2014_deid |
NER |
Train:790; Test:514 |
[Dataset][paper] |
The de-identification track focused on identifying protected health information (PHI) in longitudinal clinical narratives. Track 1 (NER, PHI): HIPAA requires that patient medical records have all identifying information removed in order to protect patient privacy. There are 18 categories of Protected Health Information (PHI) identifiers of the patient, or of relatives, employers, or household members of the patient, that must be removed in order for a file to be considered de-identified.
25 |
gnormplus |
NER/NEN |
Train:432; Test:262 |
[Dataset][paper] |
Identify gene/protein names in biomedical literature, including gene/protein mentions, family names and domain names. |
26 |
ncbi_disease |
NER/NEN |
Train:592; Dev:100; Test:100 |
[Dataset][paper] |
Automatic disease recognition from free text. |
27 |
nlm_gene |
NER/NEN |
Train:450; Test:100 |
[Dataset][paper] |
Extract gene entities from the text. |
28 |
medmentions |
NER/NEN |
Train:2,635; Dev:878; Test:879 |
[Dataset][paper] |
MedMentions is a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. |
29 |
mediqa_qa |
QA-cqa |
Train:208; Dev:25; Test:150 |
[Dataset][paper] |
The MEDIQA challenge is an ACL-BioNLP 2019 shared task aiming to attract further research efforts in Natural Language Inference (NLI), Recognizing Question Entailment (RQE), and their applications in medical Question Answering (QA). |
30 |
emrQA |
QA-cqa |
Train:117,678 |
[Dataset][paper] |
emrQA is a semi-automatically generated large-scale question answering dataset on clinical notes available across several i2b2 challenges (until 2014).
31 |
DrugEHRQA |
QA-cqa |
query1:505; query2:8,218; query3:5,517; query4:7,291; query5:5,952; query6:809; query7:6,015; query8:5,302; query9:2,093 |
[Dataset][paper] |
DrugEHRQA is the first question answering (QA) dataset containing question-answer pairs drawn from both structured tables and discharge summaries of MIMIC-III.
32 |
med_qa |
QA-multiple_choice |
Train:10,178; Dev:1,272; Test:1,273 |
[Dataset][paper] |
MedQA is the first free-form multiple-choice OpenQA dataset for solving medical problems, collected from professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. Together with the question data, we also collect and release a large-scale corpus of medical textbooks from which reading comprehension models can obtain the knowledge needed to answer the questions.
33 |
biomrc |
QA-multiple_choice |
Train:700,000; Dev:50,000; Test:62,707 |
[Dataset][paper] |
A large-scale cloze-style biomedical MRC dataset. Care was taken to reduce noise compared to the previous BIOREAD dataset of Pappas et al. (2018).
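For readers unfamiliar with the cloze setup, here is a hypothetical instance in the BIOMRC spirit; the passage, field names, and placeholder convention are illustrative:

```python
# Hypothetical cloze-style instance: biomedical entities are pseudonymized
# as @entityN, one mention in the question is masked (here as XXXX), and
# the model picks the masked entity from the candidate list.
instance = {
    "passage": "@entity1 inhibits @entity2, reducing inflammation in mice.",
    "question": "Treatment with @entity1 suppresses XXXX signalling.",
    "candidates": ["@entity1", "@entity2", "@entity3"],
    "answer": "@entity2",
}
```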
34 |
evidence_inference |
QA-multiple_choice |
Train:10,056; Dev:1,233; Test:1,222 |
[Dataset][paper] |
The dataset consists of biomedical articles describing randomized control trials (RCTs) that compare multiple treatments. Each article has multiple questions, or 'prompts', associated with it. These prompts ask about the relationship between an intervention and a comparator with respect to an outcome, as reported in the trial. For example, a prompt may ask about the reported effects of aspirin as compared to placebo on the duration of headaches. For the sake of this task, we assume that a particular article will report that the intervention of interest either significantly increased, significantly decreased, or had no significant effect on the outcome, relative to the comparator.
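A hypothetical prompt in this format; the field names are assumptions rather than the dataset's actual schema:

```python
# Hypothetical evidence_inference-style prompt: classify the reported
# effect of the intervention vs. the comparator on the outcome.
prompt = {
    "intervention": "aspirin",
    "comparator": "placebo",
    "outcome": "duration of headaches",
    "label": "significantly decreased",  # or "significantly increased",
                                         # or "no significant effect"
}
```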
35 |
medhop |
QA-multiple_choice |
Train:1,620; Dev:342 |
[Dataset][paper] |
With the same format as WikiHop, this dataset is based on research paper abstracts from PubMed, and the queries are about interactions between pairs of drugs. The correct answer has to be inferred by combining information from a chain of reactions of drugs and proteins. |
36 |
sciq |
QA-multiple_choice |
Train:11,679; Dev:1,000; Test:1,000
[Dataset][paper] |
The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For most questions, an additional paragraph with supporting evidence for the correct answer is provided. |
37 |
MedMCQA |
QA-multiple_choice |
Train:70,735; Dev:4,183 |
[Dataset][paper] |
A large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions.
38 |
LiveQA |
QA-sqa |
Train:446; Test:104 |
[Dataset][paper] |
The LiveQA'17 medical task focuses on consumer health question answering. We use consumer health questions received by the U.S. National Library of Medicine (NLM). |
39 |
Medication_QA |
QA-sqa |
Train:690 |
[Dataset][paper] |
This task focuses on real consumer health question answering. The data consists of 674 question-answer pairs with annotations of the question focus and type and the answer source.
40 |
bionlp_st_2011_rel |
NER,RE,COREF |
Train:800; Dev:150; Test:260 |
[Dataset][paper] |
The Entity Relations (REL) task is a supporting task of the BioNLP Shared Task 2011. The task concerns the extraction of two types of part-of relations between a gene/protein and an associated entity. |
41 |
iCliniq-10k |
QA-sqa |
Train:21,963 |
[Dataset][paper] |
ChatDoctor data: real conversations between patients and doctors from iCliniq.com.
42 |
HealthCareMagic-100k |
QA-sqa |
Train:112,164 |
[Dataset][paper] |
ChatDoctor data: 100k real conversations between patients and doctors from HealthCareMagic.com. |
43 |
biology_how_why_corpus |
QA-sqa |
Train:1,270 |
[Dataset][paper] |
This dataset consists of 185 "how" and 193 "why" biology questions authored by a domain expert, with one or more gold answer passages identified in an undergraduate textbook. The expert was not constrained in any way during the annotation process, so gold answers might be smaller than a paragraph or span multiple paragraphs. This dataset was used for the question-answering system described in the paper “Discourse Complements Lexical Semantics for Non-factoid Answer Reranking” (ACL 2014). |
44 |
pubmed_qa |
QA-yesno |
Train:450; Dev:50; Test:500 |
[Dataset][paper] |
PubMedQA is a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer biomedical research questions with yes/no/maybe using the corresponding abstracts. PubMedQA has 1k expert-annotated (PQA-L), 61.2k unlabeled (PQA-U) and 211.3k artificially generated QA instances (PQA-A).
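A minimal loading sketch, assuming the Hugging Face `datasets` copy of PubMedQA whose config names mirror the paper's subsets (pqa_labeled = PQA-L, pqa_unlabeled = PQA-U, pqa_artificial = PQA-A); verify field names against the dataset card:

```python
from datasets import load_dataset

# Assumes the Hugging Face Hub copy of PubMedQA; configs mirror the
# paper's subsets: pqa_labeled (PQA-L), pqa_unlabeled (PQA-U),
# pqa_artificial (PQA-A).
pqa_l = load_dataset("pubmed_qa", "pqa_labeled")

example = pqa_l["train"][0]
print(example["question"])        # a research question from a PubMed title
print(example["final_decision"])  # "yes", "no", or "maybe"
```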
45 |
iepa |
RE |
Train:161; Test:41 |
[Dataset][paper] |
The IEPA benchmark PPI corpus is designed for relation extraction. It was created from 303 PubMed abstracts, each of which contains a specific pair of co-occurring chemicals. |
46 |
lll |
RE |
Train:77 |
[Dataset][paper] |
The LLL05 challenge task is to learn rules to extract protein/gene interactions from biology abstracts from the Medline bibliography database. The goal of the challenge is to test the ability of the participating IE systems to identify the interactions and the gene/proteins that interact. |
47 |
genia_relation_corpus |
RE |
Train:800; Dev:150; Test:260 |
[Dataset][paper] |
The extraction of various relations stated to hold between biomolecular entities is one of the most frequently addressed information extraction tasks in domain studies. Typical relation extraction targets involve protein-protein interactions or gene regulatory relations. However, in the GENIA corpus, such associations involving change in the state or properties of biomolecules are captured in the event annotation. The GENIA corpus relation annotation aims to complement the event annotation of the corpus by capturing (primarily) static relations, relations such as part-of that hold between entities without (necessarily) involving change. |
48 |
MIMICause |
RE |
Train:2,714 |
[Dataset][paper] |
The task is to identify the type and direction of causal relations between a pair of biomedical concepts in clinical notes, communicated implicitly or explicitly and identified either within a single sentence or across multiple sentences.
49 |
bc7_litcovid |
TC |
Train:24,960; Dev:2,500; Test:6,239 |
[Dataset][paper] |
The training and development datasets contain the publicly available text of over 30 thousand COVID-19-related articles and their metadata (e.g., title, abstract, journal). Articles in both datasets have been manually reviewed and annotated by in-house models.
50 |
geokhoj_v1 |
TC |
Train:25,000; Test:5,000 |
[Dataset][paper] |
GEOKhoj v1 is an annotated corpus of control/perturbation labels for 30,000 samples from microarray, transcriptomics, and single-cell experiments available in the GEO (Gene Expression Omnibus) database.
51 |
gad |
TC |
Train:4,261 |
[Dataset][paper] |
A corpus identifying associations between genes and diseases by a semi-automatic annotation procedure based on the Genetic Association Database. |
52 |
hallmarks_of_cancer |
TC |
Train:12,119; Dev:1,798; Test:3,547 |
[Dataset][paper] |
The Hallmarks of Cancer (HOC) Corpus consists of 1,852 PubMed publication abstracts manually annotated by experts according to a taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus. The labels are found under the "labels" directory, while the tokenized text can be found under the "text" directory. The filenames are the corresponding PubMed IDs (PMID).
53 |
meddialog |
TC |
Train:981; Dev:126; Test:122 |
[Dataset][paper] |
The MedDialog dataset (English) contains conversations (in English) between doctors and patients. It has 0.26 million dialogues. The data is continuously growing and more dialogues will be added. The raw dialogues are from healthcaremagic.com and icliniq.com. All copyrights of the data belong to healthcaremagic.com and icliniq.com.
54 |
pubhealth |
TC |
Train:9,804; Dev:1,223; Test:1,231 |
[Dataset][paper] |
A dataset of 11,832 claims for fact-checking, related to a range of health topics including biomedical subjects (e.g., infectious diseases, stem cell research), government healthcare policy (e.g., abortion, mental health, women's health), and other public health-related stories.
55 |
scicite |
TC |
Train:8,243; Dev:916; Test:1,861 |
[Dataset][paper] |
SciCite is a dataset of 11K manually annotated citation intents based on citation context in the computer science and biomedical domains. |
56 |
n2c2_2006_smokers |
TC |
Train:398; Test:104 |
[Dataset][paper] |
The data for the n2c2 2006 smoking challenge consisted of discharge summaries from Partners HealthCare, which were then de-identified, tokenized, broken into sentences, converted into XML format, and separated into training and test sets. Two pulmonologists annotated each record with the smoking status of patients based strictly on the explicitly stated smoking-related facts in the records. |
57 |
n2c2_2008_obesity |
TC |
Train:730; Test:507 |
[Dataset][paper] |
The data for the n2c2 2008 obesity challenge consisted of discharge summaries from the Partners HealthCare Research Patient Data Repository. These data were chosen from the discharge summaries of patients who were overweight or diabetic and had been hospitalized for obesity or diabetes sometime since 12/1/04. De-identification was performed semi-automatically. |
58 |
n2c2_2018_track1 |
TC |
Train:202; Test:86 |
[Dataset][paper] |
Track 1 of the 2018 National NLP Clinical Challenges shared tasks focused on identifying which patients in a corpus of longitudinal medical records meet, and which do not meet, a set of selection criteria taken from real clinical trials, testing whether NLP systems can be trained to make this determination.
59 |
scifact |
TC |
Train:809; Dev:300; Test:300 |
[Dataset][paper] |
SciFact is a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts and annotated with labels and rationales. One config provides the abstracts and document ids; another connects the claims to the evidence and doc ids. Two classification tasks are defined: (1) given a claim and a text span composed of one or more sentences from an abstract, predict a label from ("rationale", "not_rationale") indicating whether the span is evidence (supporting or refuting) for the claim, roughly the second task outlined in Section 5 of the paper; (2) given a claim and such a span, predict a label from ("SUPPORT", "NOINFO", "CONTRADICT") indicating whether the span supports, provides no information about, or contradicts the claim, roughly the third task outlined in Section 5 of the paper.
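A hypothetical instance of the second task under the label set above; the field names are assumptions:

```python
# Hypothetical SciFact-style instance for claim-span labeling.
instance = {
    "claim": "Drug X reduces systolic blood pressure in hypertensive adults.",
    "span": "Patients receiving Drug X showed a mean 10 mmHg reduction in "
            "systolic pressure relative to placebo.",
    "label": "SUPPORT",  # one of "SUPPORT", "NOINFO", "CONTRADICT"
}
```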
60 |
bio_sim_verb |
TP-ss |
Train:1,000 |
[Dataset][paper] |
This repository contains the evaluation datasets for the paper Bio-SimVerb and Bio-SimLex: Wide-coverage Evaluation Sets of Word Similarity in Biomedicine by Billy Chiu, Sampo Pyysalo and Anna Korhonen. |
61 |
bio_simlex |
TP-ss |
Train:988 |
[Dataset][paper] |
Bio-SimLex enables intrinsic evaluation of word representations. This evaluation can serve as a predictor of performance on various downstream tasks in the biomedical domain. The results on Bio-SimLex using standard word representation models highlight the importance of developing dedicated evaluation resources for NLP in biomedicine for particular word classes (e.g. verbs). |
62 |
biosses |
TP-ss |
Train:128; Dev:32; Test:40 |
[Dataset][paper] |
BIOSSES computes similarity of biomedical sentences by utilizing WordNet as the general domain ontology and UMLS as the biomedical domain specific ontology. |
63 |
ehr_rel |
TP-ss |
Train:3,741 |
[Dataset][paper] |
EHR-Rel is a novel open-source biomedical concept relatedness dataset consisting of 3,630 concept pairs, six times more than the largest existing dataset. Instead of manually selecting and pairing concepts as done in previous work, the dataset is sampled from EHRs to ensure concepts are relevant for the EHR concept retrieval task. A detailed analysis of the concepts in the dataset reveals a far larger coverage compared to existing datasets.
64 |
mayosrs |
TP-ss |
Train:101 |
[Dataset][paper] |
MayoSRS consists of 101 clinical term pairs whose relatedness was determined by nine medical coders and three physicians from the Mayo Clinic. |
65 |
minimayosrs |
TP-ss |
Train:29 |
[Dataset][paper] |
MiniMayoSRS is a subset of the MayoSRS and consists of 30 term pairs on which a higher inter-annotator agreement was achieved. The average correlation between physicians is 0.68. The average correlation between medical coders is 0.78. |
66 |
mqp |
TP-ss |
Train:3,048 |
[Dataset][paper] |
The Medical Question Pairs dataset by McCreery et al. (2020) contains pairs of medical questions and paraphrased versions of each question prepared by medical professionals. Paraphrased versions were labelled as similar (syntactically dissimilar but contextually similar) or dissimilar (syntactically similar-looking but contextually dissimilar). Labels: 1 = similar, 0 = dissimilar.
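A hypothetical MQP-style pair under the labeling scheme above; the field names are assumptions:

```python
# Hypothetical MQP-style pair: 1 = contextually similar, 0 = dissimilar.
pair = {
    "question_1": "Can I take ibuprofen together with my blood pressure pills?",
    "question_2": "Is it safe to combine ibuprofen with antihypertensive medication?",
    "label": 1,  # similar intent despite different wording
}
```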
67 |
umnsrs |
TP-ss |
Train:566 |
[Dataset][paper] |
UMNSRS, developed by Pakhomov et al., consists of 725 clinical term pairs rated for semantic similarity and relatedness. The similarity and relatedness of each term pair was annotated on a continuous scale by having a medical resident touch a bar on a touch-sensitive computer screen to indicate the degree of similarity or relatedness.
68 |
scitail |
TP-te |
Train:23,596; Dev:1,304; Test:2,126 |
[Dataset][paper] |
The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis. We use information retrieval to obtain relevant text from a large text corpus of web sentences, and use these sentences as a premise P. We crowdsource the annotation of such premise-hypothesis pair as supports (entails) or not (neutral), in order to create the SciTail dataset. The dataset contains 27,026 examples with 10,101 examples with entails label and 16,925 examples with neutral label. |
69 |
mediqa_rqe |
TP-te |
Train:8,588; Dev:302; Test:230 |
[Dataset][paper] |
The MEDIQA challenge is an ACL-BioNLP 2019 shared task aiming to attract further research efforts in Natural Language Inference (NLI), Recognizing Question Entailment (RQE), and their applications in medical Question Answering (QA). |
70 |
meqsum |
TT-ds |
Train:1,000 |
[Dataset][paper] |
Dataset for medical question summarization introduced in the ACL 2019 paper "On the Summarization of Consumer Health Questions". Question understanding is one of the main challenges in question answering. |
71 |
multi_xscience |
TT-ds |
Train:30,369; Dev:5,066; Test:5,093 |
[Dataset][paper] |
Multi-document summarization is a challenging task for which few large-scale datasets exist. We propose Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references. Our work is inspired by extreme summarization, a dataset construction protocol that favours abstractive modeling approaches. Descriptive statistics and empirical results, using several state-of-the-art models trained on the Multi-XScience dataset, reveal that Multi-XScience is well suited for abstractive models.
72 |
bionlp_st_2011_epi |
EE,NER,COREF |
Train:600; Dev:200 |
[Dataset][paper] |
The dataset of the Epigenetics and Post-translational Modifications (EPI) task of BioNLP Shared Task 2011. |
73 |
bionlp_st_2011_ge |
EE,NER,COREF |
Train:908; Test:347 |
[Dataset][paper] |
The BioNLP-ST GE task has been promoting development of fine-grained information extraction (IE) from biomedical documents, since 2009. Particularly, it has focused on the domain of NFkB as a model domain of Biomedical IE. The GENIA task aims at extracting events occurring upon genes or gene products, which are typed as "Protein" without differentiating genes from gene products. Other types of physical entities, e.g. cells, cell components, are not differentiated from each other, and their type is given as "Entity". |
74 |
bionlp_st_2013_cg |
EE,NER,COREF |
Train:300; Dev:100; Test:200 |
[Dataset][paper] |
The Cancer Genetics (CG) task is an event extraction task and a main task of the BioNLP Shared Task (ST) 2013. The CG task is an information extraction task targeting the recognition of events in text, represented as structured n-ary associations of given physical entities. In addition to addressing the cancer domain, the CG task is differentiated from previous event extraction tasks in the BioNLP ST series in addressing a wide range of pathological processes and multiple levels of biological organization, ranging from the molecular through the cellular and organ levels up to whole organisms. Final test set submissions were accepted from six teams.
75 |
bionlp_st_2013_pc |
EE,NER,COREF |
Train:260; Dev:90; Test:175 |
[Dataset][paper] |
The Pathway Curation (PC) task is a main event extraction task of the BioNLP Shared Task (ST) 2013. The PC task concerns the automatic extraction of biomolecular reactions from text. The task setting, representation and semantics are defined with respect to pathway model standards and ontologies (SBML, BioPAX, SBO) and documents selected by relevance to specific model reactions. Two BioNLP ST 2013 participants successfully completed the PC task. The highest achieved F-score, 52.8%, indicates that event extraction is a promising approach to supporting pathway curation efforts.
76 |
bionlp_st_2011_id |
EE,NER,COREF |
Train:152; Dev:46; Test:118 |
[Dataset][paper] |
The dataset of the Infectious Diseases (ID) task of BioNLP Shared Task 2011. |
77 |
genia_ptm_event_corpus |
EE,NER,COREF |
Train:112 |
[Dataset][paper] |
Post-translational modifications (PTMs), amino acid modifications of proteins after translation, are one of the posterior processes of protein biosynthesis for many proteins, and they are critical for determining protein function, such as activity state, localization, turnover and interactions with other biomolecules. While there have been many studies of information extraction targeting individual PTM types, until recently there was little effort to address the extraction of multiple PTM types at once in a unified framework.
78 |
bionlp_shared_task_2009 |
EE,NER,COREF |
Train:800; Dev:150; Test:260 |
[Dataset][paper] |
The BioNLP Shared Task 2009 was organized by GENIA Project and its corpora were curated based on the annotations of the publicly available GENIA Event corpus and an unreleased (blind) section of the GENIA Event corpus annotations, used for evaluation. |
79 |
bionlp_st_2013_ge |
EE,NER,RE,COREF |
Train:222; Dev:249; Test:305 |
[Dataset][paper] |
The BioNLP-ST GE task has been promoting development of fine-grained information extraction (IE) from biomedical documents, since 2009. Particularly, it has focused on the domain of NFkB as a model domain of Biomedical IE. |
80 |
mlee |
EE,NER,RE,COREF |
Train:131; Dev:44; Test:87 |
[Dataset][paper] |
MLEE is an event extraction corpus consisting of manually annotated abstracts of papers on angiogenesis. It contains annotations for entities, relations, events and coreferences. The annotations span molecular-, cellular-, tissue-, and organ-level processes.
81 |
bionlp_st_2013_gro |
EE,NER,RE,COREF |
Train:150; Dev:50; Test:100 |
[Dataset][paper] |
GRO Task: populating the Gene Regulation Ontology with events and relations. A dataset from the BioNLP Shared Task 2013 competition.
82 |
pdr |
EE,NER,RE,COREF |
Train:179 |
[Dataset][paper] |
This corpus annotates plant and disease mentions and their relations in PubMed abstracts. NCBI Taxonomy IDs and MEDIC IDs were annotated for plant and disease mentions, respectively. 1,307 relations were annotated across 199 abstracts, with 1,403 plant and 1,758 disease mentions.
83 |
MedDialog-en |
MRD,QA-sqa |
dialogue1:34,297; dialogue2:106,185; dialogue3:69,356; dialogue4:16,502 |
[Dataset][paper] |
The MedDialog dataset (English) contains conversations (in English) between doctors and patients. It has 0.26 million dialogues. The data is continuously growing and more dialogues will be added. The raw dialogues are from healthcaremagic.com and icliniq.com. |
84 |
bioinfer |
NER,RE |
Train:894; Test:206 |
[Dataset][paper] |
A corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. |
85 |
ddi_corpus |
NER,RE |
Train:714; Test:303 |
[Dataset][paper] |
The DDI corpus has been manually annotated with drugs and pharmacokinetic and pharmacodynamic interactions. It contains 1,025 documents from two different sources: the DrugBank database and MedLine.
86 |
hprd50 |
NER,RE |
Train:34; Test:9 |
[Dataset][paper] |
HPRD50 is a dataset of randomly selected, hand-annotated abstracts of biomedical papers referenced by the Human Protein Reference Database (HPRD). It is parsed in XML format, splitting each abstract into sentences, and in each sentence there may be entities and interactions between those entities. In this particular dataset, entities are all proteins and interactions are thus protein-protein interactions. Moreover, all entities are normalized to the HPRD database. These normalized terms are stored in each entity's 'type' attribute in the source XML. This means the dataset can determine e.g. that "Janus kinase 2" and "Jak2" are referencing the same normalized entity. Because the dataset contains entities and relations, it is suitable for Named Entity Recognition and Relation Extraction. |
87 |
chebi_nactem |
NER,RE |
Train:100 |
[Dataset][paper] |
The ChEBI corpus contains 199 annotated abstracts and 100 annotated full papers. All documents in the corpus have been annotated for named entities and relations between these. In total, our corpus provides over 15,000 named entity annotations and over 6,000 relations between entities.
88 |
chia |
NER,RE |
Train:2,000 |
[Dataset][paper] |
A large annotated corpus of patient eligibility criteria extracted from 1,000 interventional, Phase IV clinical trials registered in ClinicalTrials.gov. This dataset includes 12,409 annotated eligibility criteria, represented by 41,487 distinctive entities of 15 entity types and 25,017 relationships of 12 relationship types. |
89 |
euadr |
NER,RE |
Train:300 |
[Dataset][paper] |
The EU-ADR corpus has been annotated for drugs, disorders, genes and their inter-relationships. For each of the drug-disorder, drug-target, and target-disorder relations, three experts annotated a set of 100 abstracts. A named-entity recognition system produced a first annotation, which annotators then revised using a web-based interface; the inter-annotator agreement achieved was much better than the agreement with the system-provided annotations. These annotated relationships are intended to train and evaluate text-mining software that captures such relationships in texts.
90 |
seth_corpus |
NER,RE |
Train:630 |
[Dataset][paper] |
SNP named entity recognition corpus consisting of 630 PubMed citations. |
91 |
verspoor_2013 |
NER,RE |
Train:120 |
[Dataset][paper] |
This dataset contains annotations for a small corpus of full text journal publications on the subject of inherited colorectal cancer. It is suitable for Named Entity Recognition and Relation Extraction tasks. It uses the Variome Annotation Schema, a schema that aims to capture the core concepts and relations relevant to cataloguing and interpreting human genetic variation and its relationship to disease, as described in the published literature. The schema was inspired by the needs of the database curators of the International Society for Gastrointestinal Hereditary Tumours (InSiGHT) database, but is intended to have application to genetic variation information in a range of diseases. |
92 |
bionlp_st_2019_bb |
NER,RE |
Train:133; Dev:66; Test:96 |
[Dataset][paper] |
The task focuses on the extraction of the locations and phenotypes of microorganisms from PubMed abstracts and full-text excerpts, and the characterization of these entities with respect to reference knowledge sources (NCBI taxonomy, OntoBiotope ontology). The task is motivated by the importance of the knowledge on biodiversity for fundamental research and applications in microbiology. |
93 |
biored |
NER,RE |
Train:400; Dev:100; Test:100 |
[Dataset][paper] |
Relation Extraction corpus with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene-disease; chemical-chemical), on a set of 600 PubMed articles. |
94 |
cpi |
NER,RE |
Train:1,808 |
[Dataset][paper] |
The compound-protein relationship (CPI) dataset consists of 2,613 sentences from abstracts containing annotations of proteins, small molecules, and their relationships. |
95 |
spl_adr_200db |
NER,RE |
Train:101 |
[Dataset][paper] |
The United States Food and Drug Administration (FDA) partnered with the National Library of Medicine to create a pilot dataset containing standardised information about known adverse reactions for 200 FDA-approved drugs. The Structured Product Labels (SPLs), the documents FDA uses to exchange information about drugs and other products, were manually annotated for adverse reactions at the mention level to facilitate development and evaluation of text mining tools for extraction of ADRs from all SPLs. The ADRs were then normalised to the Unified Medical Language System (UMLS) and to the Medical Dictionary for Regulatory Activities (MedDRA). |
96 |
jnlpba |
NER |
Train:18,546; Dev:3,856 |
[Dataset][paper] |
NER for bio-entities.
97 |
n2c2_2010_relation |
NER,RE |
Train:97; Test:256 |
[Dataset][paper] |
The i2b2/VA corpus contained de-identified discharge summaries from Beth Israel Deaconess Medical Center, Partners Healthcare, and University of Pittsburgh Medical Center (UPMC). The 2010 i2b2/VA Workshop on Natural Language Processing Challenges for Clinical Records comprises three tasks: 1) a concept extraction task focused on the extraction of medical concepts from patient reports; 2) an assertion classification task focused on assigning assertion types for medical problem concepts; 3) a relation classification task focused on assigning relation types that hold between medical problems, tests, and treatments. |
98 |
n2c2_2018_track2 |
NER,RE |
Train:303; Test:202 |
[Dataset][paper] |
The National NLP Clinical Challenges (n2c2), organized in 2018, continued the legacy of i2b2 (Informatics for Biology and the Bedside), adding 2 new tracks and 2 new sets of data to the shared tasks organized since 2006. Track 2 of 2018 n2c2 shared tasks focused on the extraction of medications, with their signature information, and adverse drug events (ADEs) from clinical narratives. This track built on our previous medication challenge, but added a special focus on ADEs. |
99 |
drugprot |
NER,RE |
Train:3,500; Dev:750 |
[Dataset][paper] |
The DrugProt corpus consists of (a) expert-labelled chemical and gene mentions, and (b) all binary relationships between them corresponding to a specific set of biologically relevant relation types.
100 |
bc5cdr |
NER/NEN,RE |
Train:500; Dev:500; Test:500 |
[Dataset][paper] |
chemical-induced disease relation extraction: automatic extraction of mechanistic and biomarker chemical-induced disease relations from the biomedical literature. |
101 |
biorelex |
NER,RE,COREF |
Train:1,405; Dev:201 |
[Dataset][paper] |
BioRelEx is a biological relation extraction dataset. Version 1.0 contains 2,010 annotated sentences that describe binding interactions between various biological entities (proteins, chemicals, etc.); 1,405 sentences are for training and another 201 for validation. All sentences contain the words "bind", "bound" or "binding". For every sentence we provide: (1) complete annotations of all biological entities that appear in the sentence; (2) entity types (32 types) and grounding information for most of the proteins and families (links to UniProt, InterPro and other databases); (3) coreference between entities in the same sentence (e.g., abbreviations and synonyms); (4) binding interactions between the annotated entities; and (5) binding interaction types: positive, negative (A does not bind B) and neutral (A may bind to B).
102 |
an_em |
NER,RE,COREF |
Train:250; Dev:50; Test:200 |
[Dataset][paper] |
AnEM corpus is a domain- and species-independent resource manually annotated for anatomical entity mentions using a fine-grained classification system. The corpus consists of 500 documents (over 90,000 words) selected randomly from citation abstracts and full-text papers with the aim of making the corpus representative of the entire available biomedical scientific literature. The corpus annotation covers mentions of both healthy and pathological anatomical entities and contains over 3,000 annotated mentions. |