
Welcome to the List-Extractor wiki!

Abstract:

Wikipedia, the world's largest encyclopedia, holds a huge amount of information in the form of text. Key facts and figures are encapsulated in a resource's infobox, and some detailed statistics appear as tables, but a lot of data is also present in the form of lists, which are quite unstructured and therefore difficult to turn into semantic relationships. This project focuses on extracting relevant but hidden data that lies inside lists in Wikipedia pages. The main objective is to create a tool that can extract information from Wikipedia lists and form appropriate RDF triples that can be inserted into the DBpedia dataset.

The inception of this project was in Google Summer of Code 2016, and its detailed progress report can be found here.

The final work of GSoC'16 can be found here.

The detailed progress report of my work and contributions for GSoC'17, along with goals and results, can be found here.

GSoC'17 final results and challenges are available here.

The final work of GSoC'17 can be found here.

For a detailed explanation of List-Extractor, refer to the documentation in the docs folder. Sample generated datasets can be found here.

Have Questions? Post your queries on the DBpedia support page here.

Architecture

The Extractor has three main parts:

  • Request Handler: Selects the resource(s) depending on the user's options and makes corresponding resource requests to the JSONpedia service for list data.

  • JSONpedia Service: JSONpedia provides the resource's information in a well-structured JSON format, which the mapping functions use to form appropriate triples from the list data. Currently, JSONpedia Live is used; since it is a web service, it is susceptible to being overloaded by a large volume of requests. An objective of this year's project is to overcome this bottleneck by using the JSONpedia Library instead of the Live service.

  • Mapper: The set of modules that use the JSON received from the JSONpedia service to produce appropriate triples, which can then be serialized. The first step is cleaning the JSON dictionary so that only meaningful list data remains. This data is passed to a mapping selector module which, using the mapping rules (formed in accordance with the DBpedia ontology), selects the mapping functions to apply to the list elements. The mapping functions then form the appropriate triples, which are serialized into an RDF graph, as sketched below.
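
The following minimal sketch illustrates this pipeline with the requests and rdflib libraries. The JSONpedia Live URL pattern is an assumption for illustration, as is the map_bibliography mapper; the extractor's real mappers and request parameters may differ.

    import requests
    from rdflib import Graph, Namespace, URIRef

    # Assumed JSONpedia Live URL pattern; the real path and filter
    # parameters used by the extractor may differ.
    JSONPEDIA = "http://jsonpedia.org/annotate/resource/json/{lang}:{title}"
    DBR = Namespace("http://dbpedia.org/resource/")
    DBO = Namespace("http://dbpedia.org/ontology/")

    def fetch_page_json(title, lang="en"):
        """Request the structured JSON representation of a wiki page."""
        resp = requests.get(JSONPEDIA.format(lang=lang, title=title))
        resp.raise_for_status()
        return resp.json()

    def map_bibliography(resource, list_items):
        """Hypothetical mapper: items found under a 'Bibliography' list
        section become dbo:author triples pointing back at the resource."""
        g = Graph()
        for item in list_items:
            work = URIRef(DBR[item.replace(" ", "_")])
            g.add((work, DBO.author, URIRef(DBR[resource])))
        return g

    g = map_bibliography("William_Gibson", ["Neuromancer", "Count Zero"])
    print(g.serialize(format="turtle"))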

How to run the tools

This project contains two different tools: List-Extractor and Rules-Generator. Use rulesGenerator.py first to generate the desired rules, and then use listExtractor.py to extract triples for wiki resources. Alternatively, you can use only listExtractor.py and extract with the existing default settings.

List-Extractor:

python listExtractor.py [collect_mode] [source] [language] [-c class_name]

  • collect_mode : s or a

    • use s to specify a single resource or a for a class of resources in the next parameter.
  • source: a string representing a class of resources from the DBpedia ontology (find the supported domains below; how such a class can be enumerated is sketched after this list), or a single Wikipedia page of an actor/writer.

  • language: en, it, de etc. (for now, only some languages are available, and only for selected domains)

    • a two-letter prefix corresponding to the desired language of Wikipedia pages and SPARQL endpoint to be queried.
  • -c --classname: a string representing classnames you want to associate your resource with. Applicable only for collect_mode="s".
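
In collect mode a, the extractor needs the set of resources that belong to the given ontology class. A minimal sketch, assuming the public DBpedia SPARQL endpoint and the SPARQLWrapper library, of how such a class can be enumerated; the query the extractor actually issues may differ.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Illustrative query: fetch a few resources typed as dbo:Writer.
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?res WHERE { ?res a dbo:Writer } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["res"]["value"])  # e.g. http://dbpedia.org/resource/...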

Examples:

  • python listExtractor.py a Writer it
  • python listExtractor.py s William_Gibson en : Uses the default built-in mapper functions
  • python listExtractor.py s William_Gibson en -c CUSTOM_WRITER : Uses only the CUSTOM_WRITER mapping to extract list elements.

If successful, a .ttl file containing RDF statements about the specified source is created inside a subdirectory called extracted.
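
Once a file is generated, it can be inspected with any RDF toolkit. A quick sketch using rdflib; the file name shown is an assumption for illustration, since the actual naming scheme is decided by the extractor.

    from rdflib import Graph

    g = Graph()
    # Assumed output path for illustration only.
    g.parse("extracted/William_Gibson.ttl", format="turtle")
    print(len(g), "triples extracted")
    for s, p, o in list(g)[:5]:  # peek at the first few statements
        print(s, p, o)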

NOTE: While extracting triples from multiple resources in a domain (collect_mode = a), using Ctrl + C will skip the current resource and move on to the next resource. To quit the extractor, use Ctrl + \.
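
On POSIX systems, Ctrl + C delivers SIGINT, which Python surfaces as KeyboardInterrupt, while Ctrl + \ delivers SIGQUIT, which Python does not catch by default. A sketch of one way this skip-versus-quit behavior can be implemented, using a hypothetical extract_resource() helper; the extractor's actual code may differ.

    import time

    def extract_resource(resource):
        """Hypothetical stand-in for the real per-resource extraction."""
        time.sleep(2)  # simulate work on one resource

    for resource in ["William_Gibson", "Isaac_Asimov"]:
        try:
            extract_resource(resource)
            print("Done:", resource)
        except KeyboardInterrupt:  # Ctrl + C (SIGINT) skips this resource
            print("Skipped:", resource)
        # Ctrl + \ (SIGQUIT) is never caught, so it still ends the process.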

Rules-Generator:

python rulesGenerator.py

  • This is an interactive tool; select the options presented in the menu to use the rules generator.
  • While creating new mapping rules or mapper functions, make sure to follow the required format as suggested by the tool.
  • Upon successful addition/modification, it updates settings.json and custom_mapper.json so that the new user-defined rules and mapper functions can run with the extractor (a hypothetical illustration of such an entry follows).
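
For illustration only, a guess at the shape of a user-defined rule entry; the actual schema is whatever rulesGenerator.py writes, so always follow the format the tool itself suggests.

    import json

    # Hypothetical rule shape: custom class name -> list-section header ->
    # mapper function name. The real keys and values are defined by
    # rulesGenerator.py, not by this sketch.
    custom_rule = {
        "CUSTOM_WRITER": {
            "Bibliography": "BIBLIOGRAPHY",
            "Awards": "AWARD",
        }
    }
    print(json.dumps(custom_rule, indent=4))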

Default Mapped Domains:

  • English (en):

    • Person: Writer, Actor, MusicalArtist, Athlete, Politician, Manager, Coach, Celebrity etc.
    • EducationalInstitution: University, School, College, Library
    • PeriodicalLiterature: Magazines, Newspapers, AcademicJournals
    • Group: Band
  • Other (it, de, es):

    • Writer, Actor, MusicalArtist
  • More Domains can be added using the rulesGenerator.py tool.

Attributions

This project uses two other existing open-source projects.

  • JSONpedia, a framework designed to simplify access to MediaWiki content by transforming everything into JSON. The framework provides a library, a REST service and CLI tools to parse, convert, enrich and store WikiText documents.

The software is copyright of Michele Mostarda ([email protected]) and released under Apache v2.0 License. Link : JSONpedia

  • JCommander, a very small Java framework that makes it trivial to parse command line parameters.

Contact Cédric Beust ([email protected]) for more information. Released under Apache v2.0 License. Link : JCommander

Requirements