
Create dataset s2orc_the_semantic_scholar_open_research_corpus #127

Open
@albertvillanova

  • uid: s2orc_the_semantic_scholar_open_research_corpus
  • type: primary
  • description:
    • name: S2ORC: The Semantic Scholar Open Research Corpus
    • description: Largest collection of machine-readable English-language open-access scientific literature formatted to support NLP research. 136M papers with titles and abstracts, including 12.7M papers with full text. Unifies popular resources like PubMed Central (Biomedicine) and arXiv (Physics, Math, CS) with papers sourced across many different academic disciplines. Maintained by the Semantic Scholar Research team at AI2. https://aclanthology.org/2020.acl-main.447/
    • homepage: https://github.com/allenai/s2orc
    • validated: True
  • languages:
    • language_names:
      • English
    • language_comments:
    • language_locations:
      • World-Wide
    • validated: False
  • custodian:
    • name: Semantic Scholar / Allen Institute for AI
    • in_catalogue:
    • type: A nonprofit/NGO (other)
    • location: United States of America
    • contact_name: Kyle Lo
    • contact_email: [email protected]
    • contact_submitter: True
    • additional: http://allenai.org/
    • validated: False
  • availability:
    • procurement:
    • licensing:
      • has_licenses: Yes
      • license_text:
      • license_properties:
        • non-commercial use
      • license_list:
        • cc-by-nc-2.0: Creative Commons Attribution Non Commercial 2.0 Generic
    • pii:
      • has_pii: Yes
      • generic_pii_likely: very likely
      • generic_pii_list:
        • names
        • email addresses
        • physical addresses
        • URLs
        • website account name or handle
      • numeric_pii_likely: somewhat likely
      • numeric_pii_list:
        • telephone numbers
      • sensitive_pii_likely: unlikely
      • sensitive_pii_list:
      • no_pii_justification_class:
      • no_pii_justification_text:
    • validated: False
  • source_category:
    • category_type: collection
    • category_web:
    • category_media: scientific articles/journal
    • validated: False
  • media:
    • category:
      • text
    • text_format:
      • .XHTML
      • .TXT
      • .CSV
      • .TEX
      • other
      • .JSON
    • audiovisual_format:
    • image_format:
      • other
      • .PDF
    • database_format:
      • .TAR
      • .JSON
      • .GZIP
      • .TGZ
    • text_is_transcribed: Yes - image
    • instance_type: article
    • instance_count: 1M<n<1B
    • instance_size: 100<n<10,000
    • validated: False
  • fname: s2orc_the_semantic_scholar_open_research_corpus.json
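Most sections of this entry still carry validated: False. As a minimal sketch, assuming the JSON file named in fname nests its fields the same way as the bullet list above (the file's actual layout is not shown in this issue), the sections still awaiting validation could be listed like this. The helper name unvalidated_sections and the abridged inline record are illustrative only.

```python
import json

# Abridged copy of the entry above. ASSUMPTION: the real JSON file nests its
# fields the same way as the bullet list; only a few fields are reproduced.
ENTRY_JSON = """
{
  "uid": "s2orc_the_semantic_scholar_open_research_corpus",
  "type": "primary",
  "description": {
    "name": "S2ORC: The Semantic Scholar Open Research Corpus",
    "homepage": "https://github.com/allenai/s2orc",
    "validated": true
  },
  "languages": {"language_names": ["English"], "validated": false},
  "custodian": {
    "name": "Semantic Scholar / Allen Institute for AI",
    "validated": false
  },
  "availability": {
    "licensing": {"has_licenses": "Yes"},
    "pii": {"has_pii": "Yes"},
    "validated": false
  },
  "source_category": {"category_type": "collection", "validated": false},
  "media": {"category": ["text"], "validated": false}
}
"""


def unvalidated_sections(entry):
    """Return the top-level section names whose 'validated' flag is False."""
    return [
        name
        for name, section in entry.items()
        # Scalar fields such as 'uid' and 'type' have no validation flag.
        if isinstance(section, dict) and section.get("validated") is False
    ]


entry = json.loads(ENTRY_JSON)
pending = unvalidated_sections(entry)
print(pending)
```

To run the same check against the real file, the inline string could be replaced with the contents of the file named in fname, provided that file follows the assumed layout.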
