A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.

gkiril/oie-resources

Open Information Extraction (OIE) Resources

A curated list of Open Information Extraction (OIE) resources: research papers, code, data, applications, etc. The list is not limited to Open Information Extraction systems exclusively. It also includes work highly related to OIE, such as taxonomizing open relations and using OIE in downstream applications.

Table of contents

Introduction to OIE

Open Information Extraction (OIE) systems aim to extract unseen relations and their arguments from unstructured text in an unsupervised manner. In its simplest form, an OIE system takes a natural language sentence and extracts information as a triple consisting of a subject (S), a relation (R) and an object (O).

Suppose we have the following input sentence:

AMD, which is based in U.S., is a technology company.

An OIE system aims to make the following extractions:

("AMD"; "is based in"; "U.S.")
("AMD"; "is"; "technology company")
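
The extractions above can be held in a small data structure. The following is a minimal sketch in Python; the `Triple` class and its field names are hypothetical and chosen only for illustration, and the triples are written by hand rather than produced by an OIE system.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A (subject; relation; object) extraction. Hypothetical helper type."""
    subject: str
    relation: str
    obj: str

    def __str__(self) -> str:
        # Render in the same notation used in the example above.
        return f'("{self.subject}"; "{self.relation}"; "{self.obj}")'


# Hand-written extractions for the example sentence; no OIE system is run here.
extractions = [
    Triple("AMD", "is based in", "U.S."),
    Triple("AMD", "is", "technology company"),
]

for triple in extractions:
    print(triple)
```

Real OIE systems usually attach more to each triple (confidence score, provenance, annotations), so a structure like this would typically be extended with further fields.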

Papers sorted in chronological order

2006

2007

2008

2009

2010

2011

2012

2013

2014

2015

2016

2017

2018

2019

2020

2021

2022

Papers grouped by category

Surveys

Evaluation

OIE for downstream applications

OIE output has been shown to be useful input for many downstream tasks. This section lists several downstream tasks that have benefited from it.

Question Answering

Slot Filling

Event Extraction

Text Summarization

Knowledge Base Population

Knowledge Base Construction

Entity Linking

Relation Linking

Open Link Prediction

Relation Extraction

Relating Entities

Story Comprehension

Text Generation

Video Grounding

OIE in Different Languages

Most OIE systems focus on extractions from text written in English. However, some OIE systems either focus on a language other than English or are multilingual. This section lists such systems.

Multilingual OIE Systems

OIE Systems for German Language

OIE Systems for Portuguese Language

OIE Systems for Spanish Language

OIE Systems for Chinese Language

OIE Systems for Persian Language

OIE Systems for Italian Language

OIE Systems for Indonesian Language

OIE Systems for Greek Language

Supervised OIE

Canonicalization of OIE

Slides

Talks

Code

  • MinIE: Open Information Extraction System
    • MinIE: originally written in Java
    • Python wrapper for MinIE
    • MinScIE - an Open Information Extraction system which provides structured knowledge enriched with semantic information about citations (based on MinIE).
    • SalIE - Salient Open Information Extraction (based on MinIE)
  • ClausIE: Clause-based OIE
  • OpenIE at IIT Delhi:
  • OpenIE at UW:
  • Stanford's OpenIE:
  • Graphene: OpenIE system containing coreference resolution, simplification and open relation extraction pipeline
  • EXEMPLAR
  • DefIE: Open information extraction from textual definitions
  • ReMine: Integrating Local and Global Cohesiveness for Open Information Extraction
  • OIE systems for languages other than English or cross-lingual systems:
    • Zhopenie - Chinese OIE: OIE system for Chinese language written in Python.
    • Open Relation Extraction for Chinese: Knowledge triples extraction (entities and relations extraction) and knowledge base construction based on dependency syntax for open domain text (for Chinese)
    • Baaz: Open information extraction from Persian web (Python)
    • MT/IE: Cross-lingual Open IE. Attention-based sequence-to-sequence model for cross-lingual open IE. Written in Python
    • Relation Extraction on German Websites: This repository holds a collection of three Open Information Extraction approaches for the German language
    • DptOIE: A Portuguese Open Information Extraction system based on Dependency Analysis
    • PragmaticOIE: a rule-based approach to extract facts in Portuguese in a first pragmatic level
  • CORE: Context-Aware Open Relation Extraction with Factorization Machines
  • CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information
  • IMPLIE: IMPLIE (IMPLicit relation Information Extraction) is a program that extracts binary relations from English sentences where the relationship between the two entities is not explicitly stated in the text.
  • Ranking: Iterative Rank-Aware Open IE (confidence score).

Data

OIE output serves as input to many downstream tasks, such as question answering, event schema induction or generating inference rules. Moreover, OIE output can be used as "fuel" to derive further resources. Here, the data is organized into two major categories: 1) OIE corpora; 2) Resources derived from OIE output.

OIE corpora

  • OPIEC: An Open Information Extraction Corpus: the largest OIE corpus to date, containing more than 341M triples extracted from the entire English Wikipedia. Each triple in the corpus comes with rich metadata: each token from the subj / obj / rel along with NLP annotations (POS tag, NER tag, ...), the provenance sentence along with its dependency parse, the original (golden) links from Wikipedia, sentence order, space / time, etc.
  • [.gz] ReVerb extractions: 15 million high-precision OIE extractions (826MB compressed) from the OIE system ReVerb. The extractions were made from the ClueWeb09 corpus. The data contains (subject, relation, object) triples, each accompanied by a confidence score (estimating the likelihood that the triple was correctly extracted) and provenance information (the link to the web page from which the triple was extracted).
  • ReVerb extractions (linked): 3 million triples with linked arguments (a subset of the 15M high-precision ReVerb extractions). The links (to Freebase) are provided by an entity linker. The data fields are: argument 1, relation phrase, argument 2, Freebase ID for the argument 1 link, corresponding Freebase entity name, link score, link ambiguity score.
  • PATTY: PATTY is a system that takes open relations between two arguments, structures them into relational synsets and then organizes the synsets into a taxonomy. This resource contains over 15M triples with disambiguated arguments (links to Wikipedia articles) and the relation synset ID between them. Additionally, the resource contains: 1) relation pattern synsets with type signatures; 2) relation pattern subsumptions; 3) relation paraphrases; 4) evaluation data.
  • WiseNet (1.0 and 2.0): similar to PATTY, WiseNet 1.0/2.0 is a resource containing OIE triples, where the arguments are disambiguated and the open relations are organized into relation synsets and then taxonomized. One of the main differences between PATTY and WiseNet is that WiseNet contains "golden links" for the arguments (annotated by humans), obtained by keeping the original links from the Wikipedia articles.
  • KB-Unify: KB-Unify takes several OIE corpora as input and unifies them into a single disambiguated OIE repository. The open relations are organized into relational synsets and the arguments are disambiguated with Babelfy.
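
The seven data fields listed for the linked ReVerb extractions could be read into a record as sketched below. This is only an illustration under stated assumptions: the tab separator, the one-record-per-line layout, the `LinkedExtraction` class, the `parse_line` function and the sample row are all hypothetical and not taken from the dataset; check the dataset's own documentation for the actual file layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkedExtraction:
    """One linked extraction; field order follows the list above (hypothetical type)."""
    arg1: str
    relation: str
    arg2: str
    freebase_id: str
    entity_name: str
    link_score: float
    link_ambiguity: float


def parse_line(line: str, sep: str = "\t") -> LinkedExtraction:
    """Parse one line, assuming seven separator-delimited fields (an assumption)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    arg1, relation, arg2, freebase_id, entity_name, score, ambiguity = fields
    return LinkedExtraction(
        arg1, relation, arg2, freebase_id, entity_name,
        float(score), float(ambiguity),
    )


# Made-up row for illustration only; the ID and scores are placeholders, not real data.
sample = "AMD\tis based in\tU.S.\tEXAMPLE_ID\tAMD\t0.9\t0.1\n"
record = parse_line(sample)
print(record.relation, record.link_score)
```

Keeping the scores as floats makes it straightforward to filter records by a threshold on the link score, if a downstream task needs only the more reliable links.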

Resources derived from OIE output

PhD theses

Demos