# GSOC2013_Progress_Hady Elsahar
Mentors:
- Sebastian Hellmann
- Dimitris Kontokostas
# Project Progress:
- public clone of the Extraction Framework
- preparing the development environment
- compiling the Extraction Framework
- getting to know the main class structure of the DBpedia Extraction Framework

Readings:
- papers #1, #2 and #4 in the DBpedia publications
- http://wiki.dbpedia.org/Documentation
Important discussions:
- exploring the PubSubHubbub protocol
- installing a local hub and subscribing to an RSS feed (see the sketch after the readings below)
- overview of the PubSubHubbub protocol

Readings:
- PubSubHubbub home page: https://code.google.com/p/pubsubhubbub/
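As a rough illustration of the subscription step, here is a minimal Scala sketch of the request a PubSubHubbub subscriber sends to a hub. The hub, topic, and callback URLs are hypothetical placeholders, and a real subscriber must also answer the hub's verification GET on the callback URL; this is a sketch of the protocol, not code from the project.

```scala
import java.net.{URI, URLEncoder}
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.charset.StandardCharsets.UTF_8

object SubscribeSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical endpoints: replace with a real local hub, feed, and callback.
    val hub      = "http://localhost:8080/subscribe"
    val topic    = "http://example.org/feed.rss"
    val callback = "http://localhost:9090/callback"

    def enc(s: String) = URLEncoder.encode(s, UTF_8)

    // A subscription is a form-encoded POST carrying hub.* parameters.
    val form = Seq(
      "hub.mode"     -> "subscribe",
      "hub.topic"    -> topic,
      "hub.callback" -> callback,
      "hub.verify"   -> "async"
    ).map { case (k, v) => s"${enc(k)}=${enc(v)}" }.mkString("&")

    val request = HttpRequest.newBuilder(URI.create(hub))
      .header("Content-Type", "application/x-www-form-urlencoded")
      .POST(HttpRequest.BodyPublishers.ofString(form))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    // 202 Accepted means the hub will verify the callback asynchronously.
    println(s"hub answered: ${response.statusCode()}")
  }
}
```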
Important discussions:
- Create an RDF dump out of 1-2K Wikidata entities
- work on the language links from the API:
  - process Wikidata info, generate the master IL (interlanguage) links file
  - produce language-specific same_as files from the master IL links file
- create a few mappings in the mappings wiki (as owl:equivalentProperty), starting with the most common ones in the dumps
Important discussions:
- Step 1: creating the master LLinks file (replacing the old bash commands with Scala code)
- Step 2: creating the language-specific LLinks extraction in folders (after some code iterations we agreed that we can rely on the links arriving in contiguous blocks per subject); the implemented algorithm is sketched after this list
- updating the code to use Extraction Framework utilities instead of rewriting them
- code reviews 1, 2, 3
- more code reviews, resolving some code conflicts
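The point of the block assumption is that one streaming pass over the sorted master file is enough, holding only one subject's links in memory at a time. A minimal sketch of that pass, with illustrative names rather than the actual implementation:

```scala
import scala.io.Source
import scala.collection.mutable.ArrayBuffer

object BlockPassSketch {
  // Subject URI of an N-Triples line: "<s> <p> <o> ." yields "s".
  def subjectOf(line: String): String =
    line.substring(1, line.indexOf('>'))

  // Stream the master file once, handing each contiguous same-subject
  // block of lines to `handle`; only one block is ever held in memory.
  def foreachBlock(path: String)(handle: Seq[String] => Unit): Unit = {
    val buf = ArrayBuffer[String]()
    for (line <- Source.fromFile(path, "UTF-8").getLines() if line.startsWith("<")) {
      if (buf.nonEmpty && subjectOf(line) != subjectOf(buf.head)) {
        handle(buf.toSeq)
        buf.clear()
      }
      buf += line
    }
    if (buf.nonEmpty) handle(buf.toSeq)
  }

  def main(args: Array[String]): Unit =
    foreachBlock(args(0)) { block =>
      println(s"${subjectOf(block.head)}: block of ${block.size} links")
    }
}
```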
Important links/discussions:
- implicit conversions in Scala (see the sketch below)
- the master branch uses Scala 2.9, the dump branch uses Scala 2.10
- updating RichReader.foreach to support end-of-line detection
- recent commits: https://github.com/hadyelsahar/extraction-framework/commits/lang-link-extract
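For reference, the implicit-conversion idiom mentioned above, applied to a line-wise foreach over java.io.Reader. This is a simplified illustration of the pattern, not the framework's actual RichReader:

```scala
import java.io.{BufferedReader, Reader, StringReader}
import scala.language.implicitConversions

// Wrapper class that enriches java.io.Reader with a line-oriented foreach.
class RichReaderSketch(reader: Reader) {
  def foreach(proc: String => Unit): Unit = {
    val br = new BufferedReader(reader)
    var line = br.readLine()   // readLine() consumes the terminator, so each
    while (line != null) {     // iteration sees exactly one complete line
      proc(line)
      line = br.readLine()
    }
  }
}

object RichReaderSketch {
  // The implicit conversion: any Reader gains foreach where this is in scope.
  implicit def wrapReader(reader: Reader): RichReaderSketch =
    new RichReaderSketch(reader)

  def main(args: Array[String]): Unit = {
    val reader: Reader = new StringReader("first line\nsecond line")
    reader.foreach(println)    // the conversion kicks in here
  }
}
```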
--- off to Leipzig (2-8 to 6-8)
- updating pom.xml (adding a Scala launcher for the LL Scala scripts)
- setting up the lgd.aksw server (cloning repos, resolving conflicted files, running mvn install)
- running the wda-export-data.py script on the lgd server

Important discussions/links:
Language links extraction process:
1. run the wda script with the option `turtle-links`
2. unzip the extract and convert it to N-Triples format using rapper:
```
rapper -i turtle turtle-20130808-links.ttl
```
3. generate the master LL file:
```
sudo mvn scala:run -Dlauncher=GenerateLLMasterFile
```
4. generate the language-specific links files:
```
sudo mvn scala:run -Dlauncher=GenerateLLSpecificFiles
```
PS: in steps 3 and 4, update the arguments of each script (the locations of the input/output dumps) in the pom.xml file inside the scripts folder.
What's done so far:
- 7M triples passed the rapper phase without encountering a bug
- running the master LL file extraction (output dump in `/root/hady_wikidata_extraction/Datasets/languagelinks/MasterLLfile.nt`)
- running the specific LL files extraction (output now in `/root/hady_wikidata_extraction/Datasets/languagelinks/LLfiles/`)

Benchmark (for the 7 million triples on the lgd server):
- generating the master LL file: 28 seconds
- generating the specific files: 3 minutes, 10 seconds
Updates #2:
- running the new version of the wda Python script
- running rapper on the resulting dump (`/Datasets/turtle-20130811-links.ttl`)
- [bugs found] only 7.5M triples extracted (500K more) in `/Datasets/turtle-20130811-links.nt`
- setting up the Extraction Framework environment and running the initial code added for WikidataJsonParser language links extraction (locally and on the lgd server)
- updating the Extraction Framework to download Wikidata dumps and daily Wikidata dumps
- writing a WikidataLLExtractor for extraction of DBpedia language links in the format:
```
<http://oc.dbpedia.org/resource/Betta_splendens> <http://www.w3.org/2002/07/owl#sameAs> <http://ceb.dbpedia.org/resource/Betta_splendens> .
<http://oc.dbpedia.org/resource/Betta_splendens> <http://www.w3.org/2002/07/owl#sameAs> <http://war.dbpedia.org/resource/Betta_splendens> .
<http://oc.dbpedia.org/resource/Betta_splendens> <http://www.w3.org/2002/07/owl#sameAs> <http://bn.dbpedia.org/resource/সিয়ামিজ_লড়াকু_মাছ> .
```
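In condensed form, the shape of this extraction: one owl:sameAs triple per ordered pair of language versions of the same entity. The SiteLink type and the function names are illustrative stand-ins, not the framework's API:

```scala
// Illustrative stand-in for the per-language titles the parser yields.
case class SiteLink(lang: String, title: String)

object LLExtractorSketch {
  private val sameAs = "http://www.w3.org/2002/07/owl#sameAs"

  private def resourceUri(l: SiteLink): String =
    s"http://${l.lang}.dbpedia.org/resource/${l.title.replace(' ', '_')}"

  // One owl:sameAs triple per ordered pair of distinct language versions.
  def extract(links: Seq[SiteLink]): Seq[String] =
    for {
      a <- links
      b <- links if a.lang != b.lang
    } yield s"<${resourceUri(a)}> <$sameAs> <${resourceUri(b)}> ."

  def main(args: Array[String]): Unit =
    extract(Seq(SiteLink("oc", "Betta splendens"),
                SiteLink("ceb", "Betta splendens"),
                SiteLink("war", "Betta splendens"))).foreach(println)
}
```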
- updating WikidataJsonParser for extraction of Wikidata labels
- writing a labels extractor that emits the Wikidata labels in the format:
```
<http://wikidata.dbpedia.org/resource/Q549> <http://www.w3.org/2000/01/rdf-schema#label> "Bojovnica pestrá"@sk .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.w3.org/2000/01/rdf-schema#label> "Beta"@tr .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.w3.org/2000/01/rdf-schema#label> "Жауынгер балық"@kk .
```
- creating wikidataSameasExtractor to extract sameAs mapping links between Wikidata entities and DBpedia URIs:
```
<http://wikidata.dbpedia.org/resource/Q1934> <http://www.w3.org/2002/07/owl#sameAs> <http://da.dbpedia.org/resource/Sidney_Govou> .
<http://wikidata.dbpedia.org/resource/Q1934> <http://www.w3.org/2002/07/owl#sameAs> <http://bg.dbpedia.org/resource/Сидни_Гову> .
<http://wikidata.dbpedia.org/resource/Q1934> <http://www.w3.org/2002/07/owl#sameAs> <http://ar.dbpedia.org/resource/سيدني_غوفو> .
```
- updating WikidataJsonParser to allow Wikidata facts extraction
- creating wikidataFactsExtractor to extract Wikidata fact triples in the form:
```
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P473> "040"@en .
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P625> "53 10"@en .
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P281> "20537"@en .
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P131> <http://wikidata.dbpedia.org/resource/Q1626> .
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P107> <http://wikidata.dbpedia.org/resource/Q618123> .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.wikidata.org/entity/P574> "+00000001910-01-01T00:00:00Z"@en .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.wikidata.org/entity/P18> "http://commons.wikimedia.org/wiki/File:HM_Orange_M_Sarawut.jpg"@en .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.wikidata.org/entity/P373> "Betta splendens"@en .
```
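The gist of the facts extraction, sketched with illustrative stand-in types for the parser output: each claim becomes one triple whose predicate is the www.wikidata.org property URI, item values become wikidata.dbpedia.org resource URIs, and everything else stays a literal (hence the "040"@en style strings above):

```scala
sealed trait WdValue
case class ItemId(q: String)      extends WdValue  // e.g. Q1626
case class StringValue(s: String) extends WdValue  // times, ids, coordinates...
case class Claim(property: String, value: WdValue) // property id, e.g. "P131"

object FactsExtractorSketch {
  def entityUri(q: String)   = s"http://wikidata.dbpedia.org/resource/$q"
  def propertyUri(p: String) = s"http://www.wikidata.org/entity/$p"

  def extract(subject: String, claims: Seq[Claim]): Seq[String] =
    claims.map { c =>
      val obj = c.value match {
        case ItemId(q)      => s"<${entityUri(q)}>"   // item values become URIs
        case StringValue(s) => "\"" + s + "\"@en"     // everything else, a literal
      }
      s"<${entityUri(subject)}> <${propertyUri(c.property)}> $obj ."
    }

  def main(args: Array[String]): Unit =
    extract("Q1569", Seq(
      Claim("P473", StringValue("040")),
      Claim("P131", ItemId("Q1626"))
    )).foreach(println)
}
```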
- WikiData-DBpedia-Dump-Release-v.0.1
- adding a Wikidata namespace to the mappings wiki to allow using wikidata:xx to indicate Wikidata entities
- writing some command lines to get the updated mapped properties from the live OWL file
- building documents to allow community contribution of mappings between Wikidata and DBpedia properties
- adding mappings for 21 Wikidata properties
- manual for adding mappings between DBpedia and Wikidata properties
- GDoc for already mapped properties
- the mappings wiki renders links for equivalent Wikidata properties
- re-implementing the Quad method to accept Wikidata (string) properties with unknown language:
```scala
quads += new Quad(null, DBpediaDatasets.WikidataFacts, subjectUri, property, fact, page.sourceUri, context.ontology.datatypes("xsd:string"))
```
- updated the OntologyReader class to get property/class mappings between Wikidata and DBpedia
- Wikidata mapped dump produced, with mapped properties for URI triples only
- added a NodeType to SimpleNode so that each extractor knows the type of data returned from the parser (LL, Labels, MappedFacts, Facts)
- updating the JsonParser to return data for the mapped extractor in nodes tagged with their NodeType
- updating WikidataMappedFactsExtractor to generate triples for Wikidata properties of type globecoordinate, in the form:
```
<http://wikidata.dbpedia.org/resource/Q5689> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "60"^^<http://www.w3.org/2001/XMLSchema#float> .
<http://wikidata.dbpedia.org/resource/Q5689> <http://www.georss.org/georss/point> "POINT(60 20)" .
<http://wikidata.dbpedia.org/resource/Q5689> <http://www.w3.org/2003/01/geo/wgs84_pos#long> "20"^^<http://www.w3.org/2001/XMLSchema#float> .
```
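Roughly, a globecoordinate value fans out into three triples, as in the sample above. A sketch with an illustrative GlobeCoordinate stand-in type:

```scala
case class GlobeCoordinate(lat: Double, long: Double)

object GeoTriplesSketch {
  val latProp   = "http://www.w3.org/2003/01/geo/wgs84_pos#lat"
  val longProp  = "http://www.w3.org/2003/01/geo/wgs84_pos#long"
  val pointProp = "http://www.georss.org/georss/point"
  val xsdFloat  = "http://www.w3.org/2001/XMLSchema#float"

  // Print whole numbers without a trailing ".0", matching the dump output.
  private def fmt(d: Double) =
    if (d == d.toLong) d.toLong.toString else d.toString

  def extract(subject: String, c: GlobeCoordinate): Seq[String] = Seq(
    s"""<$subject> <$latProp> "${fmt(c.lat)}"^^<$xsdFloat> .""",
    s"""<$subject> <$pointProp> "POINT(${fmt(c.lat)} ${fmt(c.long)})" .""",
    s"""<$subject> <$longProp> "${fmt(c.long)}"^^<$xsdFloat> ."""
  )

  def main(args: Array[String]): Unit =
    extract("http://wikidata.dbpedia.org/resource/Q5689", GlobeCoordinate(60, 20))
      .foreach(println)
}
```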
- adding a regex to the DateTime parser to parse Wikidata time in the ISO8601 format (see the sketch below)
- updating WikidataMappedFactsExtractor to generate triples with Wikidata time facts, mapped to DBpedia properties and DBpedia datatypes:
```
<http://wikidata.dbpedia.org/resource/Q41380> <http://dbpedia.org/ontology/deathDate> "1035-07-09"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://wikidata.dbpedia.org/resource/Q41380> <http://dbpedia.org/ontology/birthDate> "1000-06-28"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://wikidata.dbpedia.org/resource/Q40512> <http://dbpedia.org/ontology/deathDate> "1986-09-07"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://wikidata.dbpedia.org/resource/Q40512> <http://dbpedia.org/ontology/birthDate> "1914-09-23"^^<http://www.w3.org/2001/XMLSchema#date> .
```
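A minimal sketch of such a regex, assuming the padded, signed time format shown in the facts dump above (e.g. "+00000001910-01-01T00:00:00Z") has to be reduced to an xsd:date lexical form; the helper is illustrative, not the parser's actual code:

```scala
object WikidataTimeSketch {
  // Wikidata times carry a sign and a zero-padded year; capture year, month, day.
  private val WikidataTime =
    """^([+-])0*(\d{1,4})-(\d{2})-(\d{2})T\d{2}:\d{2}:\d{2}Z$""".r

  /** "+00000001910-01-01T00:00:00Z" becomes Some("1910-01-01"). */
  def toXsdDate(value: String): Option[String] = value match {
    case WikidataTime("+", year, month, day) =>
      Some(f"${year.toInt}%04d-$month-$day")
    case _ => None  // BCE dates and malformed values left to other code paths
  }

  def main(args: Array[String]): Unit =
    println(toXsdDate("+00000001910-01-01T00:00:00Z"))  // Some(1910-01-01)
}
```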
- updating the OntologyReader class, commits: 1, 2
- discussion about using either Wikidata URIs or a new DBpedia namespace for them
- DateTime format for Wikidata properties
- updating the DateTime parser, commit
- updated WikidataMappedFactsExtractor to generate mapped facts for Wikidata properties of the CommonsMedia and String datatypes, in the form:
```
<http://wikidata.dbpedia.org/resource/Q7194> <http://dbpedia.org/ontology/imageFlag> <http://commons.wikimedia.org/wiki/File:Flag_of_Girona_province_(unofficial).svg> .
<http://wikidata.dbpedia.org/resource/Q5772> <http://dbpedia.org/ontology/imageFlag> <http://commons.wikimedia.org/wiki/File:Flag_of_the_Region_of_Murcia.svg> .
<http://wikidata.dbpedia.org/resource/Q9465> <http://dbpedia.org/ontology/individualisedGnd> "4015602-3" .
<http://wikidata.dbpedia.org/resource/Q9957> <http://dbpedia.org/ontology/individualisedGnd> "118998935" .
```
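Condensed, the per-datatype handling this adds: Commons media file names become commons.wikimedia.org URIs, while strings stay plain literals, with predicates taken from the Wikidata-to-DBpedia property mappings. The types and the two property-map entries below are plausible illustrations, not the project's actual mapping table:

```scala
sealed trait MappedValue
case class CommonsMedia(fileName: String) extends MappedValue
case class PlainString(value: String)     extends MappedValue

object MappedFactsSketch {
  // Illustrative Wikidata-to-DBpedia property map (P41: flag image,
  // P227: GND identifier); the real mappings come from the mappings wiki.
  val propertyMap = Map(
    "P41"  -> "http://dbpedia.org/ontology/imageFlag",
    "P227" -> "http://dbpedia.org/ontology/individualisedGnd"
  )

  def extract(subject: String, prop: String, value: MappedValue): Option[String] =
    propertyMap.get(prop).map { dboProp =>
      value match {
        // Commons media file names become URIs on commons.wikimedia.org...
        case CommonsMedia(f) =>
          s"<$subject> <$dboProp> <http://commons.wikimedia.org/wiki/File:${f.replace(' ', '_')}> ."
        // ...while plain strings stay untyped literals.
        case PlainString(s) =>
          s"""<$subject> <$dboProp> "$s" ."""
      }
    }

  def main(args: Array[String]): Unit =
    extract("http://wikidata.dbpedia.org/resource/Q9465", "P227", PlainString("4015602-3"))
      .foreach(println)
}
```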
- change the Extractor trait to accept a [T] type argument [see commit]; a condensed sketch of the new shape follows this list
- change all existing extractors to accept type PageNode
- change the functions in config.scala to load extractors of type 'any'
- check CompositeExtractor.scala to check for the extractor type
- run and check that the update works fine
- change the CompositeExtractor class to load any type of class, not only PageNode [see commit]
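A heavily condensed sketch of what this refactoring amounts to: the extractor input type becomes a parameter, so Wikidata extractors can consume parsed JSON nodes while the classic extractors keep consuming PageNodes. The names are simplified stand-ins for the framework's classes:

```scala
// Simplified stand-ins for the framework's node and quad types.
class Node
class PageNode   extends Node  // parsed wiki markup
class SimpleNode extends Node  // parsed Wikidata JSON
case class Quad(subject: String, predicate: String, value: String)

// Before the change the input type was fixed to PageNode; now it is a parameter.
trait Extractor[T <: Node] {
  def extract(input: T, subjectUri: String): Seq[Quad]
}

// Wikidata extractors consume SimpleNodes...
class WikidataLLExtractorSketch extends Extractor[SimpleNode] {
  def extract(input: SimpleNode, subjectUri: String): Seq[Quad] =
    Seq(Quad(subjectUri, "http://www.w3.org/2002/07/owl#sameAs", "...")) // sketch
}

// ...and a composite only requires that all its members agree on T.
class CompositeExtractor[T <: Node](extractors: Seq[Extractor[T]]) extends Extractor[T] {
  def extract(input: T, subjectUri: String): Seq[Quad] =
    extractors.flatMap(_.extract(input, subjectUri))
}
```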
- pull request #155, "Wikidata integration + Refactoring Core to accept new formats": merged
- pull request #188, "adapted wikidata extraction to wikidata dump changes": merged
- run extraction on the latest Wikidata dump (20140226): v3.0 extracted Wikidata output triples