
GSOC2013_Progress_Hady Elsahar

hady elsahar edited this page Mar 16, 2014 · 36 revisions

Integrating Wikidata into DBpedia

proposal

full project proposal

Students

  • Hady Elsahar

Mentors

  • Sebastian Hellmann
  • Dimitris Kontokostas

# Project Progress:

Week 1:

  • creating a public clone of the Extraction Framework
  • preparing the development environment
  • compiling the Extraction Framework
  • getting to know the main class structure of the DBpedia Extraction Framework

readings

important discussions :


Week 2 [17-6-2013]:

  • exploring the PubSubHubbub protocol
  • installing a local hub and subscribing to an RSS feed

Overview about the PubSubHubbub protocol
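For orientation, the subscription handshake of the protocol can be sketched as follows. This is a hedged illustration of the PubSubHubbub 0.3 flow, not code from this project; the object and method names are made up for the example.

```scala
// Sketch of the PubSubHubbub subscription handshake (per the 0.3 spec).
// Not project code: all names here are illustrative only.
object PushSubscription {

  // parameters the subscriber POSTs to the hub to subscribe to a topic feed
  def subscriptionParams(callback: String, topic: String): Map[String, String] =
    Map(
      "hub.mode"     -> "subscribe",
      "hub.callback" -> callback,   // URL the hub will push updates to
      "hub.topic"    -> topic,      // the feed being subscribed to
      "hub.verify"   -> "sync")     // hub verifies intent before subscribing

  // the callback endpoint proves subscription intent by echoing hub.challenge
  // back with a 200 status; None here stands for refusing the verification
  def verificationResponse(query: Map[String, String]): Option[String] =
    if (query.get("hub.mode").contains("subscribe")) query.get("hub.challenge")
    else None
}
```

The hub only activates the subscription once the challenge round-trip succeeds, which is what makes a local hub plus an RSS feed a convenient end-to-end test.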

readings

important discussions :


Week 3 [24-6-2013]:

  • Create an RDF dump out of 1-2K Wikidata entities
  • work on the language links from the API:
    1. process Wikidata info and generate the master IL links file.
    2. produce language-specific same_as files from the master IL links file.
  • Create a few mappings in the mappings wiki (as owl:equivalentProperty): the most common ones in the dumps
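As a sketch of step 2, each language-specific same_as file pairs a page with its counterparts in the other language editions. Assuming one item's sitelinks are available as a language-to-title map, the triples could be generated like this (the URI scheme matches the dumps shown later on this page; the object and method names are made up):

```scala
// Illustrative sketch: turn one Wikidata item's sitelinks (language -> title)
// into owl:sameAs triples pairing every two language editions.
// Not project code; names are invented for this example.
object LanguageLinksSketch {

  private def uri(lang: String, title: String): String =
    s"http://$lang.dbpedia.org/resource/${title.replace(' ', '_')}"

  def sameAsTriples(sitelinks: Map[String, String]): Seq[String] =
    for {
      (l1, t1) <- sitelinks.toSeq
      (l2, t2) <- sitelinks.toSeq
      if l1 != l2
    } yield s"<${uri(l1, t1)}> <http://www.w3.org/2002/07/owl#sameAs> <${uri(l2, t2)}> ."
}
```

Cutting these triples by the language of the subject URI then yields the per-language same_as files.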

important discussions :


Weeks 4-7: Language Links Extraction [1-7-2013] -> [1-8-2013]:

  • Step 1: creating the master LLinks file (replacing the old bash commands with Scala code)
  • Step 2: creating language-specific LLinks files in folders (after a few code iterations we agreed that we can rely on the links coming in consecutive blocks); implemented the algorithm
  • updating the code to use some Extraction Framework utilities instead of rewriting them
  • code reviews 1, 2, 3
  • more code reviews; resolving some code conflicts
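The "links come in blocks" assumption from step 2 is what makes a streaming algorithm possible: because all interlanguage links of one item sit on consecutive lines of the dump, the file can be cut into per-item blocks whenever the subject changes, with no sorting or indexing of the whole dump. A minimal sketch of that idea (not the project's implementation; the triple shape is reduced to a pair for brevity):

```scala
// Sketch of the block-grouping idea: split a stream of (subject, object)
// links into blocks of equal consecutive subjects in one pass.
// Not project code; names and types are simplified for illustration.
object BlockGrouping {

  // a triple reduced to (subject, object) for this sketch
  type Link = (String, String)

  // cut the list into maximal runs sharing the same subject
  def blocks(links: List[Link]): List[List[Link]] = links match {
    case Nil => Nil
    case (subject, _) :: _ =>
      val (block, rest) = links.span(_._1 == subject)
      block :: blocks(rest)
  }
}
```

Each emitted block holds every language link of one item, so the per-language output files can be appended to in a single pass over the dump.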

important links/Discussions :


--- off to Leipzig 2-8 > 6-8


Week 8 [5-8-2013] - [11-8-2013]:

  • updating pom.xml (adding a Scala launcher for the LL Scala scripts)
  • setting up the lgd.aksw server (cloning repos, managing conflicted files, running maven install)
  • running the wda-export-data.py script on the lgd server

important discussions/Links :


Week 9 [12-8-2013] - [18-8-2013]:

Language links extraction process:

  • running the wda script with the option 'turtle-links'
  • unzipping the extracts and converting them to N-Triples format using rapper: `rapper -i turtle turtle-20130808-links.ttl`
  • generating the master LLfile: `sudo mvn scala:run -Dlauncher=GenerateLLMasterFile`
  • generating the specific language links files: `sudo mvn scala:run -Dlauncher=GenerateLLSpecificFiles`

PS: for steps 3 and 4, update the arguments of each script (the locations of the input/output dumps) in the pom.xml file inside the scripts folder.
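For reference, such a launcher is configured roughly as follows in the scripts pom.xml. This is a hedged sketch only: the plugin coordinates, main class package, and paths are assumptions, not copied from the repository.

```xml
<!-- hypothetical launcher configuration for the scala-maven-plugin -->
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <configuration>
    <launchers>
      <launcher>
        <id>GenerateLLMasterFile</id>
        <!-- package name is illustrative -->
        <mainClass>org.dbpedia.extraction.scripts.GenerateLLMasterFile</mainClass>
        <args>
          <arg>/path/to/input/turtle-links.nt</arg>
          <arg>/path/to/output/MasterLLfile.nt</arg>
        </args>
      </launcher>
    </launchers>
  </configuration>
</plugin>
```

`mvn scala:run -Dlauncher=<id>` then picks the launcher by its `<id>`, which is why changing the dump locations means editing the `<args>` here.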

What's done so far:

  • 7M triples passed the rapper phase without encountering a bug
  • ran the master LLfile extraction (the output dump is in /root/hady_wikidata_extraction/Datasets/languagelinks/MasterLLfile.nt)
  • ran the specific LLfile extraction (the output is now in /root/hady_wikidata_extraction/Datasets/languagelinks/LLfiles/)

Benchmark (for the 7 million triples on the lgd server):

  • generating the master LLfile: 28 seconds
  • generating the specific files: 3 minutes, 10 seconds

Updates #2:

  • running the new version of the wda Python script
  • running rapper on the resulting dump (/Datasets/turtle-20130811-links.ttl)
  • [Bugs found] only 7.5M triples extracted (500K more) in (/Datasets/turtle-20130811-links.nt)

important links :


Week 10 [19-8-2013] - [25-8-2013]:

  • setting up the Extraction Framework environment and running the initial code added for WikidataJsonParser language links extraction (locally and on the lgd server)
  • updating the Extraction Framework to download Wikidata dumps and daily Wikidata dumps
  • writing a WikidataLLExtractor for extraction of DBpedia language links in the format:
<http://oc.dbpedia.org/resource/Betta_splendens> <http://www.w3.org/2002/07/owl#sameAs> <http://ceb.dbpedia.org/resource/Betta_splendens> .
<http://oc.dbpedia.org/resource/Betta_splendens> <http://www.w3.org/2002/07/owl#sameAs> <http://war.dbpedia.org/resource/Betta_splendens> .
<http://oc.dbpedia.org/resource/Betta_splendens> <http://www.w3.org/2002/07/owl#sameAs> <http://bn.dbpedia.org/resource/সিয়ামিজ_লড়াকু_মাছ> .
  • updating the WikidataJsonParser for extraction of Wikidata labels
  • writing an extractor for Wikidata labels in the format:
<http://wikidata.dbpedia.org/resource/Q549> <http://www.w3.org/2000/01/rdf-schema#label> "Bojovnica pestrá"@sk .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.w3.org/2000/01/rdf-schema#label> "Beta"@tr .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.w3.org/2000/01/rdf-schema#label> "Жауынгер балық"@kk .
  • creating a WikidataSameasExtractor to extract owl:sameAs mapping links between Wikidata entities and DBpedia URIs:
<http://wikidata.dbpedia.org/resource/Q1934> <http://www.w3.org/2002/07/owl#sameAs> <http://da.dbpedia.org/resource/Sidney_Govou> .
<http://wikidata.dbpedia.org/resource/Q1934> <http://www.w3.org/2002/07/owl#sameAs> <http://bg.dbpedia.org/resource/Сидни_Гову> .
<http://wikidata.dbpedia.org/resource/Q1934> <http://www.w3.org/2002/07/owl#sameAs> <http://ar.dbpedia.org/resource/سيدني_غوفو> .
  • updating the WikidataJsonParser to allow Wikidata facts extraction
  • creating a WikidataFactsExtractor to extract Wikidata fact triples in the form:
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P473> "040"@en .
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P625> "53 10"@en .
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P281> "20537"@en .
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P131> <http://wikidata.dbpedia.org/resource/Q1626> .
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P107> <http://wikidata.dbpedia.org/resource/Q618123> .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.wikidata.org/entity/P574> "+00000001910-01-01T00:00:00Z"@en .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.wikidata.org/entity/P18> "http://commons.wikimedia.org/wiki/File:HM_Orange_M_Sarawut.jpg"@en .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.wikidata.org/entity/P373> "Betta splendens"@en .
  • WikiData-DBpedia-Dump-Release-v.0.1

important links :


Week 11 [26-8-2013] - [1-9-2013]:

  • adding the Wikidata namespace to the mappings wiki to allow using wikidata:xx to indicate Wikidata entities
  • writing some command lines to get updated mapping properties from the live owl file
  • writing documentation to allow community contribution of mappings between Wikidata and DBpedia properties
  • adding mappings for 21 Wikidata properties

important links :


Week 12 [2-9-2013] - [8-9-2013]:

  • re-implementing the Quad method to accept Wikidata (String) properties with an unknown language:
quads += new Quad(null, DBpediaDatasets.WikidataFacts, subjectUri, property, fact, page.sourceUri, context.ontology.datatypes("xsd:string"))
  • updated the OntologyReader class to get property/class mappings between Wikidata and DBpedia
  • produced a Wikidata mapped dump with mapped properties for URI triples only
  • added a NodeType to SimpleNode so each extractor knows the type of data returned from the parser (LL, Labels, MappedFacts, Facts)
  • updating the JsonParser to return data for the mapped extractor in nodes with their NodeType
  • updating the WikidataMappedFactsExtractor to generate triples for Wikidata properties of type globecoordinate in the form:
<http://wikidata.dbpedia.org/resource/Q5689> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "60"^^<http://www.w3.org/2001/XMLSchema#float> .
<http://wikidata.dbpedia.org/resource/Q5689> <http://www.georss.org/georss/point> "POINT(60 20)" .
<http://wikidata.dbpedia.org/resource/Q5689> <http://www.w3.org/2003/01/geo/wgs84_pos#long> "20"^^<http://www.w3.org/2001/XMLSchema#float> .
  • adding a regex to the DateTime parser to parse Wikidata time values in the ISO 8601 format
  • updating the WikidataMappedFactsExtractor to generate triples with Wikidata time facts, mapped to DBpedia properties and DBpedia datatypes:
<http://wikidata.dbpedia.org/resource/Q41380> <http://dbpedia.org/ontology/deathDate> "1035-07-09"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://wikidata.dbpedia.org/resource/Q41380> <http://dbpedia.org/ontology/birthDate> "1000-06-28"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://wikidata.dbpedia.org/resource/Q40512> <http://dbpedia.org/ontology/deathDate> "1986-09-07"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://wikidata.dbpedia.org/resource/Q40512> <http://dbpedia.org/ontology/birthDate> "1914-09-23"^^<http://www.w3.org/2001/XMLSchema#date> .
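The time normalization behind these triples can be sketched with a small regex. This is an illustration under stated assumptions, not the framework's DateTime parser: Wikidata serializes times with a sign and a zero-padded year (e.g. "+00000001910-01-01T00:00:00Z", as seen in the week 10 facts dump), and the regex strips the padding down to an xsd:date lexical form.

```scala
// Illustrative sketch of the regex idea (not the project's DateTime parser).
// Strips Wikidata's "+" sign and zero-padded year to an xsd:date value.
// Negative (BCE) years and years above 9999 are deliberately out of scope.
object WikidataTimeSketch {
  private val TimePattern = """\+0*(\d{1,4})-(\d{2})-(\d{2})T\d{2}:\d{2}:\d{2}Z""".r

  def toXsdDate(value: String): Option[String] = value match {
    // Scala's Regex extractor anchors to the full string
    case TimePattern(year, month, day) => Some(f"${year.toInt}%04d-$month-$day")
    case _ => None
  }
}
```

Re-padding the year to four digits keeps dates such as 1035-07-09 (from the example above) valid xsd:date literals.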

important links :


Week 13 [9-9-2013] - [15-9-2013]:

  • updated the WikidataMappedFactsExtractor to generate mapped facts for Wikidata properties of the datatypes Commons media file and String in the form:
<http://wikidata.dbpedia.org/resource/Q7194> <http://dbpedia.org/ontology/imageFlag> <http://commons.wikimedia.org/wiki/File:Flag_of_Girona_province_(unofficial).svg> .
<http://wikidata.dbpedia.org/resource/Q5772> <http://dbpedia.org/ontology/imageFlag> <http://commons.wikimedia.org/wiki/File:Flag_of_the_Region_of_Murcia.svg> .
<http://wikidata.dbpedia.org/resource/Q9465> <http://dbpedia.org/ontology/individualisedGnd> "4015602-3" .
<http://wikidata.dbpedia.org/resource/Q9957> <http://dbpedia.org/ontology/individualisedGnd> "118998935" .

important links :


End of the GSoC 2013 period


Refactoring the core to accept new formats:

  • change the Extractor trait to accept a [T] type argument [see commit]
  • change all existing extractors to accept type PageNode
  • change functions in config.scala to load extractors of type 'any'
  • check CompositeExtractor.scala to check for the extractor type
  • run and check that the update works fine
  • change the CompositeExtractor class to load any type of class, not only PageNode [see commit]
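The steps above can be sketched as follows, with simplified stand-in types: the real framework's traits and signatures differ, and this only shows the type-parameterization idea that lets JSON-based Wikidata pages share the pipeline with wikitext PageNodes.

```scala
// Hedged sketch of parameterizing the extractor hierarchy over its input type.
// All types here are simplified stand-ins, not the framework's real classes.
trait Node
class PageNode extends Node   // parsed wikitext page
class JsonNode extends Node   // parsed Wikidata JSON page (illustrative)

// the trait now takes its input node type as a parameter
trait Extractor[T <: Node] {
  def extract(input: T, subjectUri: String): Seq[String] // N-Triples lines
}

// a composite extractor holds extractors of one concrete input type
// and concatenates their output, instead of being fixed to PageNode
class CompositeExtractor[T <: Node](extractors: Seq[Extractor[T]])
    extends Extractor[T] {
  def extract(input: T, subjectUri: String): Seq[String] =
    extractors.flatMap(_.extract(input, subjectUri))
}
```

With this shape, config.scala can keep a single loading path and only the concrete `T` differs per input format.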

Merging the pull request and updating for Wikidata changes:
