
GSoC2020_Progress_Mykola_Medynskyi


Extending DIEF

Description

DBpedia is a crowd-sourced community effort to extract structured content from the various Wikimedia projects and make it publicly available to everyone on the Web. This project improves the DBpedia extraction framework (https://github.com/dbpedia/extraction-framework), which is continuously developed by the community, by adding citation, commons and lexeme information. https://summerofcode.withgoogle.com/projects/#5192294916947968

Mentors

  • Beyza Yaman
  • Sebastian Hellmann
  • Julio Hernandez

Proposal

my proposal

My working branch

branch

Pull request with my code

https://github.com/dbpedia/extraction-framework/pull/648

Additional information about Wikidata Lexeme

Wikidata Lexemes structure

Lexeme:

  • Lemma
  • Language
  • Lexical category
  • Statements
  • Forms:
    • Representation
    • Grammatical Features
    • Statements
  • Senses:
    • Gloss
    • Statements

Progress

Bonding period

During the bonding period I rewrote the Wikidata extractors using the latest version of the Wikidata Toolkit library and opened a pull request with those changes: https://github.com/dbpedia/extraction-framework/pull/628 . I also added some configuration for the Wikidata Lexeme Extractor, which would be developed during the coding period.

Week 1 (June 1 - June 8)

While testing my update of the Wikidata extractors I noticed that I had made a mistake in the rewrite of the Wikidata Reference Extractor. The problem was that the Wikidata Reference Extractor had to extract data from both Wikidata Items and Properties, but after my update it only extracted data from Items. This week I fixed the problem by splitting the extraction process into two methods. In the extract method I check the type of the page: if the page is an Item, I deserialise it as an Item document and call the extractStatements method, which extracts data from its statements. The same is implemented for Property entities. I also added some more pages to the Wikidata minidump.
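A simplified sketch of that split, assuming the Wikidata Toolkit interfaces EntityDocument, ItemDocument, PropertyDocument and StatementDocument; the quad-building body and the surrounding extractor class are left out:

  import scala.collection.JavaConverters._
  import org.wikidata.wdtk.datamodel.interfaces.{EntityDocument, ItemDocument, PropertyDocument, StatementDocument}

  // Check the type of the deserialised page and reuse one statement-extraction
  // routine for both Items and Properties.
  def extract(doc: EntityDocument): Unit = doc match {
    case item: ItemDocument         => extractStatements(item)
    case property: PropertyDocument => extractStatements(property)
    case _                          => // other entity types are skipped
  }

  def extractStatements(doc: StatementDocument): Unit =
    for {
      group     <- doc.getStatementGroups.asScala
      statement <- group.getStatements.asScala
    } {
      // build DBpedia quads from each statement here
    }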

Week 2 (June 8 - June 15)

I have implemented the first version of the Lexeme Extractor, which extracts data from lemmas and forms.

Week 3 (June 15 - June 21)

I have reimplemented the Lexeme Extractor with a new structure. I added sense extraction, and now the Lexeme Extractor produces data of the form:

  <http://lex.dbpedia.org/wikidata/L222072> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/L222072> .
  <http://lex.dbpedia.org/wikidata/L222072> <http://www.w3.org/ns/lemon/ontolex#lexicalForm> <http://lex.dbpedia.org/wikidata/L222072-F1> .
  <http://lex.dbpedia.org/resource/cykelsadel> <http://lex.dbpedia.org/property/lexeme> <http://lex.dbpedia.org/wikidata/L222072> .
  <http://lex.dbpedia.org/resource/cykelsadel> <http://lex.dbpedia.org/property/form> <http://lex.dbpedia.org/wikidata/L222072-F1> .
  <http://lex.dbpedia.org/wikidata/L222072-F1> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/L222072-F1> .
  <http://lex.dbpedia.org/resource/sæde_på_en_cykel> <http://lex.dbpedia.org/property/lexicalSense> <http://www.wikidata.org/entity/L222072-S1> .
  <http://lex.dbpedia.org/resource/sæde_på_en_cykel> <http://lex.dbpedia.org/property/P5137> <http://www.wikidata.org/entity/Q1076532> .
  <http://lex.dbpedia.org/resource/sæde_på_en_cykel> <http://lex.dbpedia.org/property/P18> <http://commons.wikimedia.org/wiki/File:Bicycle_saddle.jpg> .

I have also fixed a problem with extracting strings that contain whitespace, by replacing the whitespace with "_".

In addition, I added a transformation of Wikimedia Commons file names into URLs, because they were represented as plain strings and I considered it better to represent them as URLs. For example, Lexeme L222074 has a sense whose statement contains the Wikimedia Commons file "Hufeisen mit Aufzuegen DSC 3900.jpg"; this file is now represented as the URL <http://commons.wikimedia.org/wiki/File:Hufeisen_mit_Aufzuegen_DSC_3900.jpg> .
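A minimal sketch of that transformation (the helper name is mine, not the extractor's actual method):

  // Hypothetical helper: turn a Commons file name from a statement value into a URL,
  // replacing whitespace with "_" as described above.
  def commonsFileUrl(fileName: String): String =
    "http://commons.wikimedia.org/wiki/File:" + fileName.trim.replaceAll("\\s+", "_")

  // commonsFileUrl("Hufeisen mit Aufzuegen DSC 3900.jpg")
  // => http://commons.wikimedia.org/wiki/File:Hufeisen_mit_Aufzuegen_DSC_3900.jpg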

Week 4 (June 21 - June 28)

This week I resolved a problem with the extraction of statements from Wikidata Lexemes. Some statements can contain files from Wikimedia Commons, and the question is how to distinguish ordinary string values (for example transcriptions) from those files. So I created a list of regexes covering the file types used on Wikimedia Commons:

private val listOfWikiCommonsFileTypes = Set(".*\\.jpg\\b".r, ".*\\.svg\\b".r, ".*\\.png\\b".r, ".*\\.gif\\b".r,
    ".*\\.webp\\b".r, ".*\\.tiff\\b".r, ".*\\.xcf\\b".r, ".*\\.oga\\b".r, ".*\\.wav\\b".r, ".*\\.ogg\\b".r, ".*\\.ogx\\b".r,
    ".*\\.ogv\\b".r, ".*\\.mp3\\b".r, ".*\\.opus\\b".r, ".*\\.flac\\b".r, ".*\\.webm\\b".r, ".*\\.pdf\\b".r,
    ".*\\.mid\\b".r, ".*\\.djvu\\b".r, ".*\\.map\\b".r, ".*\\.tab\\b".r, ".*\\.stl\\b".r)

In the new implementation it traverses all the regexes in the list, and if any regex matches the string value of a statement, it creates a URL for that file. For example, the string Books_HD_(8314929977).jpg has the extension .jpg, so it matches the regex ".*\\.jpg\\b".r, and the string is transformed by prepending http://commons.wikimedia.org/wiki/File: to it. As a result we get http://commons.wikimedia.org/wiki/File:Books_HD_(8314929977).jpg (a small sketch of this matching is shown after the example triples below). After fixing the problem with statements, I added extraction of the lexical category and language of a lexeme, and I also implemented extraction of additional information from senses and forms. Triples from forms (without statements):

<http://lex.dbpedia.org/resource/book> <http://lex.dbpedia.org/property/form> <http://lex.dbpedia.org/wikidata/L536-F1> .
<http://lex.dbpedia.org/wikidata/L536-F1> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/L536-F1> .
<http://lex.dbpedia.org/resource/book> <http://lex.dbpedia.org/property/grammaticalFeature> <http://www.wikidata.org/entity/Q110786> .

Triples from senses (without statements):

<http://lex.dbpedia.org/resource/document> <http://lex.dbpedia.org/property/lexicalSense> <http://lex.dbpedia.org/wikidata/L536-S1> .
<http://lex.dbpedia.org/wikidata/L536-S1> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/L536-S1> .

Example of triples from statements of senses and forms:

<http://lex.dbpedia.org/resource/document> <http://lex.dbpedia.org/property/P5137> <http://www.wikidata.org/entity/Q571> .
<http://lex.dbpedia.org/resource/document> <http://lex.dbpedia.org/property/P18> <http://commons.wikimedia.org/wiki/File:Books_HD_(8314929977).jpg> .
<http://lex.dbpedia.org/resource/book> <http://lex.dbpedia.org/property/P898> "/bʊk/" .
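A minimal sketch of the matching mentioned above, using a small subset of the file-type regexes (the helper name is mine, not the extractor's actual method):

  import scala.util.matching.Regex

  val listOfWikiCommonsFileTypes: Set[Regex] =
    Set(".*\\.jpg\\b".r, ".*\\.svg\\b".r, ".*\\.png\\b".r, ".*\\.ogg\\b".r)

  // If any pattern matches, build the Commons file URL; otherwise keep the plain string value.
  def statementValueToObject(value: String): String =
    if (listOfWikiCommonsFileTypes.exists(_.findFirstIn(value).isDefined))
      "http://commons.wikimedia.org/wiki/File:" + value.replaceAll("\\s+", "_")
    else
      value

  // statementValueToObject("Books_HD_(8314929977).jpg")
  // => http://commons.wikimedia.org/wiki/File:Books_HD_(8314929977).jpg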

Week 5 (June 28 - July 5)

This week I proposed a solution for turning entity IDs into words. The problem was that we needed to represent lexical categories and languages as words, but we only had their IDs. A possible solution: for example, we need to represent the lexical category noun as a word, but we only have its ID, Q1084. So we can create a map with the ID as a key and the word as a value:

val codeToWordMap = Map ("Q1084" -> "noun")

Below you can see an example of the map with some other lexical categories:

val codeToWordMap = Map(
  "Q1084" -> "noun",
  "Q24905" -> "verb",
  "Q36224" -> "pronoun",
  "Q34698" -> "adjective",
  "Q380057" -> "adverb",
  "Q4833830" -> "preposition")

I used the same solution for resolving languages from their IDs.
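A small sketch of how the map could then be used to build a lex.dbpedia.org URI (the helper name and the fallback to the Wikidata entity URI are my assumptions):

  // Resolve an entity ID to a readable URI; IDs missing from the map fall back
  // to the plain Wikidata entity URI (the fallback behaviour is an assumption).
  def lexicalCategoryUri(id: String): String =
    codeToWordMap.get(id) match {
      case Some(word) => "http://lex.dbpedia.org/" + word
      case None       => "http://www.wikidata.org/entity/" + id
    }

  // lexicalCategoryUri("Q1084") => http://lex.dbpedia.org/noun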

Triples after implementing it:

<http://lex.dbpedia.org/resource/poffertje> <http://lex.dbpedia.org/property/lexicalcategory> <http://lex.dbpedia.org/noun> .
<http://lex.dbpedia.org/resource/poffertje> <http://dbpedia.org/ontology/language> <http://lex.dbpedia.org/Dutch> .

Week 6 (July 5 - July 13)

This week I implemented source type information extraction in the Citation Extractor. I used a map data structure to match citation templates to source types, e.g. the map for English-language templates:

  private val typeInformation: Map[String, String] = Map(
        "cite book" -> "book",
        "cite journal" -> "journal",
        "cite web" -> "website",
        "cite comic" -> "comic",
        "comic strip reference" -> "comic_strip",
        "cite conference" -> "conference_report",
        "cite case" -> "court_case",
        "cite encyclopedia" -> "encyclopedia",
        "cite episode" -> "episode",
        "cite mailing list" -> "mailing_list",
        "cite map" -> "map",
        "cite news" -> "news_article",
        "cite newsgroup" -> "newsgroup",
        "cite patent" -> "patent",
        "cite press release" -> "press_release",
        "cite AV media" -> "video",
        "cite video game" -> "video_game"
    )

During the extraction process the extractor matches the template name and returns the source type (e.g. for the "cite web" template we get the "website" source type; a small sketch of this lookup follows the example triples). Below you can see some extracted triples:

<http://www.aiga.org/is-archers-use-on-target/%7Cpublisher=aigi> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/website> .
<http://www.britannica.com/eb/article-9049786/mackenzie-river,> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/encyclopedia> .
<http://worldblog.msnbc.msn.com/archive/2008/05/06/984755.aspx> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/news_article> .
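A small sketch of the lookup (the helper name is mine; typeInformation is the map shown above, and case normalisation is left out of this sketch):

  // Map a template name to the rdf:type object URI, if the template is known.
  def sourceTypeUri(templateName: String): Option[String] =
    typeInformation.get(templateName.trim)
      .map(t => "http://dbpedia.org/ontology/" + t)

  // sourceTypeUri("cite web") => Some(http://dbpedia.org/ontology/website)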

I also ran some statistics queries on the triples extracted from the minidump lexemes and got the following results: 20 nouns, 164 resources, and 23 properties.

Week 7 (July 13 - July 20)

This week I added citation templates for different languages. I added the following templates for Russian, Romanian, Spanish, Danish and Polish:

 "статья.*".r, "книга.*".r, "публикация.*".r, "cita.*".r, "cytuj.*".r, "citare.*".r, "citat.*".r, "kilde.*".r

The full list of templates is:

private val citationTemplatesRegex = List("cite.*".r, "citation.*".r, "literatur.*".r, "internetquelle.*".r, "bib.*".r,
     "статья.*".r, "книга.*".r, "публикация.*".r, "cita.*".r, "cytuj.*".r, "citare.*".r, "citat.*".r, "kilde.*".r)

I also found out that the Citation Extractor should not be used for some languages, because the templates in the list above can match quotes instead of citations. For example, the "citation.*" regex matches citations in English but quotes in French, so I think we should not use the Citation Extractor for the French language. On the other hand, we could try to build a map of templates per language. For example:

val mapCitationTemplatesRegexes = Map(
  "en" -> List("cite.*".r, "citation.*".r),
  "de" -> List("literatur.*".r, "internetquelle.*".r, "bib.*".r)
)

A possible problem with this solution is that a template from one language can also be used in other languages (e.g. "cite.*" is used in English, Ukrainian, Russian and other languages). We could of course add such a template to multiple languages, but how do we know it is not used in yet others? So this problem needs more exploration.
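A minimal sketch of such a per-language lookup, with a fallback to the shared regex list for languages that have no dedicated entry (the fallback and the helper name are my assumptions, not part of the current implementation):

  import scala.util.matching.Regex

  val citationTemplatesRegex: List[Regex] = List("cite.*".r, "citation.*".r, "literatur.*".r)

  val mapCitationTemplatesRegexes: Map[String, List[Regex]] = Map(
    "en" -> List("cite.*".r, "citation.*".r),
    "de" -> List("literatur.*".r, "internetquelle.*".r, "bib.*".r)
  )

  // Pick the regexes for a language; unknown languages fall back to the shared list.
  def templatesFor(language: String): List[Regex] =
    mapCitationTemplatesRegexes.getOrElse(language, citationTemplatesRegex)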

This week I also implemented the first version of the Commons Extractor. I reused the Infobox Extractor and extracted data from some pages. It does not extract all the information yet, but in the coming weeks I will try to fix that. Some extracted triples:

<http://commons.dbpedia.org/resource/File:Pitkämöjärvellä_Kurikassa_e_-_panoramio.jpg> <http://commons.dbpedia.org/property/date> "2013-07-06"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://commons.dbpedia.org/resource/File:Pitkämöjärvellä_Kurikassa_e_-_panoramio.jpg> <http://commons.dbpedia.org/property/source> <https://web.archive.org/web/20161029073702/http:/www.panoramio.com/photo/92852127> .
<http://commons.dbpedia.org/resource/File:Pitkämöjärvellä_Kurikassa_e_-_panoramio.jpg> <http://commons.dbpedia.org/property/author> <https://web.archive.org/web/20161029073703/http:/www.panoramio.com/user/2072459%3Fwith_photo_id=92852127> .

Week 8-9 (July 20 - August 3)

During these two weeks I tested the Citation Extractor. I wrote SHACL tests for some triples extracted from Wikipedia pages in different languages. Below you can see an example of such a SHACL test:

<#Citation_type>
	a sh:NodeShape ;
	sh:targetNode <http://johnkstuff.blogspot.com/2007/11/bio-in-progress.html> ;

	sh:property [
		sh:path rdf:type ;
		sh:hasValue <http://dbpedia.org/ontology/website> ;
	] .

<#Citation_polish_language>
	a sh:NodeShape ;
	sh:targetNode <http://books.google.com/books%3Fvid=ISBN978-83-04-04985-7>  ;

	sh:property [
		sh:path <http://pl.dbpedia.org/property/tytuł> ;
		sh:hasValue "Polska-Niemcy. Stosunki polityczne od zarania po czasy najnowsze" ;
	] .

During minidump testing I got some errors related to georss.org. This website is used in the "IRI Coverage Tests" and, as I understand it, the HTTP request to the site fails because georss.org is currently not responding. So I created an issue describing this problem: https://github.com/dbpedia/extraction-framework/issues/643 . I also continued working on the Commons Extractor and debugging other extractors to find out how to extract data from blocks like permission, location and some others. Below you can see an example of a Wikimedia Commons template from which the data must be extracted:

  {{Information
  |description=Spielbichler, Ötscher
  |date={{Taken on|2013-07-20|location=Austria}}
  |source=https://web.archive.org/web/20161029204523/http://www.panoramio.com/photo/95370653
  |author=[https://web.archive.org/web/20161029204525/http://www.panoramio.com/user/7863635?with_photo_id=95370653 Martin Cígler]
  |permission={{cc-by-sa-3.0|Martin Cígler}}
  {{Panoramioreview|Panoramio upload bot|2017-02-27}}
  |other_versions=
  |other_fields={{Information field|Name=Tags<br />(from Panoramio photo page)|Value=<code>Mitterbach am Erlaufsee</code>, <code>Austria</code>}}
  }}
  {{Location|47.842385|15.18377|source:Panoramio}}

Week 10 (August 3 - August 10)

This week I continued testing the Citation Extractor with SHACL tests. I learned more about how SHACL works and created some generic tests for checking the extracted triples. For example:

<#Citation_english_language_date_datatype_validation>
	a sh:NodeShape  ;
	sh:targetSubjectsOf <http://dbpedia.org/property/date>   ;
	sh:property [
		sh:path <http://dbpedia.org/property/date>  ;
		sh:or (
		    [
		    	sh:datatype xsd:string;
		    ]
		    [
		        sh:datatype xsd:date;
		    ]
		)

	] .

A date can only have a string or date datatype, so this test checks whether a triple with the property http://dbpedia.org/property/date has the type string or date. I made similar tests for the properties last, last1, title, accessdate, work, page and isbn. I also continued working on the Commons Extractor. While debugging the Infobox Extractor code this week I discovered that for some template nodes it is necessary to implement new parsers for specific fields, like permission or location, so I keep trying to implement them. This week my branch was also merged into the master branch.

Week 11 (August 10 - August 17)

This week I implemented the first version of permission extraction in the Commons Extractor. It is based on checking whether the property node's key field (the field that identifies the name of the property from which we need to get the data) contains the word permission. If the property is a permission, the method checks the type of the first child node, and if it is a template node, it returns a ParseResult object with the permission value. Below you can see the code of the method that parses the permission:

  private def extractPermission(node: PropertyNode): Option[ParseResult[String]] = {
    // The key field identifies the property name; only permission fields are handled here.
    if (node.key.contains("permission") && node.children.nonEmpty) {
      node.children.head match {
        // Permissions are usually given as a license template, e.g. {{cc-by-sa-3.0|...}}.
        case item: TemplateNode =>
          return Some(ParseResult(item.title.decoded, None, Some(xsdStringDt)))
        case _ => // not a template node, fall through
      }
    }
    None
  }

For example, we have:

permission={{cc-by-3.0|Harri Hedman}}

the property here is permission and the value is "Cc-by-3.0", so the extracted triple will be:

<http://commons.dbpedia.org/resource/File:Pitkämöjärvellä_Kurikassa_-_panoramio.jpg> <http://commons.dbpedia.org/property/permission> "Cc-by-3.0" .

I also found out that data from Location templates can be extracted with the Geo Extractor, so I used it and got the following triples:

<http://commons.dbpedia.org/resource/File:Pitkämöjärvellä_Kurikassa_-_panoramio.jpg> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing> .
<http://commons.dbpedia.org/resource/File:Pitkämöjärvellä_Kurikassa_-_panoramio.jpg> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "62.580103"^^<http://www.w3.org/2001/XMLSchema#float> .
<http://commons.dbpedia.org/resource/File:Pitkämöjärvellä_Kurikassa_-_panoramio.jpg> <http://www.w3.org/2003/01/geo/wgs84_pos#long> "22.408805"^^<http://www.w3.org/2001/XMLSchema#float> .
<http://commons.dbpedia.org/resource/File:Pitkämöjärvellä_Kurikassa_-_panoramio.jpg> <http://www.georss.org/georss/point> "62.580103 22.408805" .

Week 12 (August 17 - August 24)

This week I found out that Categories from Wikimedia Commons pages can be extracted with the Article Categories Extractor. So the data from Wikimedia Commons can now be extracted with the Infobox, Geo and Article Categories extractors.

What I have done

  • rewrote the Wikidata extractors from the old version of the Wikidata Toolkit library to the latest one
  • implemented the Lexeme Extractor, which extracts data from Wikidata Lexemes
  • extended the Citation Extractor with citation source type extraction and added new regexes for matching templates in different languages
  • explored the existing extractors and used the Infobox, Geo and Article Categories extractors to extract data from Wikimedia Commons pages
  • used the Infobox Extractor to extract data from Wikimedia Commons infoboxes and implemented permission extraction

Future work and list of issues

Wikimedia Commons Files Extractor:

  • Wikimedia Commons needs more exploration. I also think the current implementation of permission extraction could be extended with regexes for parsing permissions: right now it only checks for the word permission in the property key, but permissions can also appear in other properties, so the mapping of permissions should be reworked and extended with regex matching (see the sketch after this list).
  • implement license extraction from pages
  • implement data extraction from nested templates in infoboxes (e.g. from the other_fields parameter)
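A rough sketch of what regex-based permission matching might look like (the patterns and the helper are purely illustrative, not the current implementation):

  import scala.util.matching.Regex

  // Illustrative license patterns; a real list would have to be curated.
  val permissionRegexes: List[Regex] = List(
    "(?i)cc-by(-sa)?-\\d(\\.\\d)?".r,
    "(?i)gfdl".r,
    "(?i)pd-self".r
  )

  // Look for a known license marker anywhere in a property value, not only under the permission key.
  def findPermission(value: String): Option[String] =
    permissionRegexes.flatMap(_.findFirstIn(value)).headOption

  // findPermission("{{cc-by-sa-3.0|Martin Cígler}}") => Some("cc-by-sa-3.0")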

Citation Extractor:

  • can be extended with more templates from other languages
  • the mapping of citations could also be changed to use a map of languages (as keys) and template lists (as values) instead of the flat list of template regexes (see Week 7 in this blog)
  • change some regexes for better parsing of data from citation infoboxes. For example, we have this citation:
{{cite news |last=Soper |first=Taylor |date= February 11, 2015  
|title=Silicon Desert: How Phoenix is quickly—and quietly—becoming a hub for innovation 
|url= http://www.geekwire.com/2015/silicon-desert-phoenix-quickly-quietly-becoming-hub-innovation/|newspaper=[[GeekWire]] 
|location=
|access-date=May 15, 2016 }}</ref>

And the URL extracted from it:

<http://www.geekwire.com/2015/silicon-desert-phoenix-quickly-quietly-becoming-hub-innovation/%7Cnewspaper=geekwire> <http://dbpedia.org/property/url> "http://www.geekwire.com/2015/silicon-desert-phoenix-quickly-quietly-becoming-hub-innovation/|newspaper=GeekWire" .

As you can see, newspaper=GeekWire was appended to the extracted value. As I understand it, the problem lies in some of the regexes, so they may need to be changed (a possible post-processing workaround is sketched after the expected output). The output for the example above should be:

<http://www.geekwire.com/2015/silicon-desert-phoenix-quickly-quietly-becoming-hub-innovation/%7Cnewspaper=geekwire> <http://dbpedia.org/property/url> "http://www.geekwire.com/2015/silicon-desert-phoenix-quickly-quietly-becoming-hub-innovation/" .

<http://www.geekwire.com/2015/silicon-desert-phoenix-quickly-quietly-becoming-hub-innovation/%7Cnewspaper=geekwire> <http://dbpedia.org/property/newspaper> "GeekWire" .
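Until the regexes themselves are fixed, one possible post-processing workaround would be to split the extracted value at the first "|" and treat the remainder as a separate parameter; a hedged sketch (everything here is illustrative, not the extractor's current behaviour):

  // Split "url|param=value" into the clean URL and an optional trailing parameter.
  def splitUrlValue(raw: String): (String, Option[(String, String)]) =
    raw.split("\\|", 2) match {
      case Array(url, rest) if rest.contains("=") =>
        val Array(key, value) = rest.split("=", 2)
        (url, Some(key -> value))
      case Array(url, _*) => (url, None)
    }

  // splitUrlValue("http://www.geekwire.com/...|newspaper=GeekWire")
  // => ("http://www.geekwire.com/...", Some(("newspaper", "GeekWire")))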

Wikidata Lexeme Extractor

Here is the link with notes: https://docs.google.com/document/d/1x13LRTSPzLpxn4E0imuq6zTjsPMEg-B-mwJr1esHNxo/edit?usp=sharing
