Description
- uid: s2orc_the_semantic_scholar_open_research_corpus
- type: primary
- description:
- name: S2ORC: The Semantic Scholar Open Research Corpus
- description: The largest collection of machine-readable, English-language, open-access scientific literature, formatted to support NLP research. It contains 136M papers with titles and abstracts, of which 12.7M include full text. It unifies popular resources such as PubMed Central (biomedicine) and arXiv (physics, math, CS) with papers sourced from many other academic disciplines. Maintained by the Semantic Scholar Research team at AI2. Paper: https://aclanthology.org/2020.acl-main.447/
- homepage: https://github.com/allenai/s2orc
- validated: True
- languages:
- language_names:
- English
- language_comments:
- language_locations:
- World-Wide
- validated: False
- language_names:
- custodian:
- name: Semantic Scholar / Allen Institute for AI
- in_catalogue:
- type: A nonprofit/NGO (other)
- location: United States of America
- contact_name: Kyle Lo
- contact_email: [email protected]
- contact_submitter: True
- additional: http://allenai.org/
- validated: False
- availability:
- procurement:
- for_download: Yes - after signing a user agreement
- download_url: https://docs.google.com/forms/d/1fUqUw68dDMnzFt58WgMi-FI33MPcVFpflN2G3Yjfn9c/edit
- download_email:
- licensing:
- has_licenses: Yes
- license_text:
- license_properties:
- non-commercial use
- license_list:
- cc-by-nc-2.0: Creative Commons Attribution Non Commercial 2.0 Generic
- pii:
- has_pii: Yes
- generic_pii_likely: very likely
- generic_pii_list:
- names
- email addresses
- physical addresses
- URLs
- website account name or handle
- numeric_pii_likely: somewhat likely
- numeric_pii_list:
- telephone numbers
- sensitive_pii_likely: unlikely
- sensitive_pii_list:
- no_pii_justification_class:
- no_pii_justification_text:
- validated: False
- procurement:
- source_category:
- category_type: collection
- category_web:
- category_media: scientific articles/journal
- validated: False
- media:
- category:
- text
- text_format:
- .XHTML
- .TXT
- .CSV
- .TEX
- other
- .JSON
- audiovisual_format:
- image_format:
- other
- database_format:
- .TAR
- .JSON
- .GZIP
- .TGZ
- text_is_transcribed: Yes - image
- instance_type: article
- instance_count: 1M<n<1B
- instance_size: 100<n<10,000
- validated: False
- category:
- fname: s2orc_the_semantic_scholar_open_research_corpus.json