frbr:lebo2011twed2

Timothy Lebo edited this page Feb 14, 2012 · 129 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

This page has materials for my second TWed talk, which continues the introduction to csv2rdf4lod that began in last month's TWed talk.

For this talk (slides), I promised to cover the converter's "advanced features". Before we do (some of) that, let's review the features we covered last time:

  • Integration stages: name, retrieve, (adjust), convert, (enhance), publish, (enhance).
  • The "Big Three" identifiers used to name a dataset: Source, Dataset, and Version.
  • Smart, naive bootstrap to contextualize the names of entities, predicates, and classes.
    • incrementally "peel away" context to increase interoperability.
  • Raw versus Enhanced: two Layers describing the same entities.
    • incremental and backward compatible (i.e., monotonic).
  • Enhancement parameters are declarative, RDF, lightweight, domain-independent, and re-applicable. We covered seven of them:

By the end of the tutorial that night, we had taken the enhancement parameters from the default:

      #conversion:interpret [
      #   conversion:symbol        "";
      #   conversion:interpretation conversion:null; 
      #];
      #conversion:enhance [
      #   conversion:domain_template "tool_[r]";
      #   conversion:domain_name     "Tool";
      #];
      #conversion:enhance [
      #   conversion:class_name "Tool";
      #   conversion:subclass_of <http://purl.org/...>;
      #];
      conversion:enhance [
         ov:csvCol          1;
         ov:csvHeader       "Geographic Coordinates for U.S. Farmers Markets";
         #conversion:label   "Geographic Coordinates for U.S. Farmers Markets";
         conversion:comment "";
         conversion:range   todo:Literal;
      ];

to enhancement parameters that produce slightly better RDF:

      conversion:enhance [
         ov:csvRow 5;
         a conversion:HeaderRow;
      ];
      conversion:interpret [
         conversion:symbol        "";
         conversion:interpretation conversion:null;
      ];
      conversion:enhance [
      #  conversion:domain_template "tool_[r]";
         conversion:domain_name     "FarmersMarket";
      ];
      #conversion:enhance [
      #   conversion:class_name "Tool";
      #   conversion:subclass_of <http://purl.org/...>;
      #];
      conversion:enhance [
         ov:csvCol          1;
         ov:csvHeader       "locaddstate";
         conversion:comment "State that the farmers' market is in.";
         conversion:range   rdfs:Resource; # was rdfs:Literal
         conversion:range_name "State"; # was rdfs:Literal
         # Lod-linking:(owl:sameAs)
         conversion:links_via <http://www.rpi.edu/~lebot/lod-links/state-fips-dbpedia.ttl>,
                              <http://www.rpi.edu/~lebot/lod-links/state-fips-geonames.ttl>,
                              <http://www.rpi.edu/~lebot/lod-links/state-fips-govtrack.ttl>;
         conversion:subject_of dcterms:identifier;
      ];
      ...
      conversion:enhance [
         ov:csvCol          6;
         conversion:equivalent_property wgs:long;
         conversion:range   xsd:decimal;
      ];
      conversion:enhance [
         ov:csvCol          7;
         conversion:equivalent_property wgs:lat;
         conversion:range   xsd:decimal;
      ];

The enhancement parameters above got us from raw RDF that looked like:

@prefix ds4383: <http://localhost/source/data-gov/dataset/4383/version/2011-Sep-28/> .
@prefix raw:    <http://localhost/source/data-gov/dataset/4383/vocab/raw/> .

ds4383:thing_1367 
   dcterms:isReferencedBy <http://localhost/source/data-gov/dataset/4383/version/2011-Sep-28> ;
   void:inDataset         <http://localhost/source/data-gov/dataset/4383/version/2011-Sep-28> ;
   raw:column_1 "Hawaii" ;
   raw:column_2 "Alii Garden Market Place" ;
   raw:column_3 "75-6129 Alii Drive" ;
   raw:column_4 "Kailua-Kona" ;
   raw:column_5 "96740" ;
   raw:column_6 "-155.9819183" ;
   raw:column_7 "19.61436844" ;
   raw:column_8 "" ;
   ov:csvRow "1367"^^xsd:integer .

to enhanced RDF that looks like:

@prefix ds4383:       <http://localhost/source/data-gov/dataset/4383/version/2011-Sep-28/> .
@prefix ds4383_vocab: <http://localhost/source/data-gov/dataset/4383/vocab/> .
@prefix e1:           <http://localhost/source/data-gov/dataset/4383/vocab/enhancement/1/> .

ds4383:farmersMarket_1367 
   dcterms:isReferencedBy <http://localhost/source/data-gov/dataset/4383/version/2011-Sep-28> ;
   void:inDataset         <http://localhost/source/data-gov/dataset/4383/version/2011-Sep-28> ;
   a ds4383_vocab:FarmersMarket ;
   e1:locaddstate typed_state:Hawaii ;
   e1:mktname    "Alii Garden Market Place" ;
   e1:locaddst   "75-6129 Alii Drive" ;
   e1:locaddcity "Kailua-Kona" ;
   e1:locaddzip  "96740" ;
   wgs:long      "-155.9819183"^^xsd:decimal ;
   wgs:lat       "19.61436844"^^xsd:decimal ;
   ov:csvRow     "1367"^^xsd:integer .

@prefix govtrackusgov: <http://www.rdfabout.com/rdf/usgov/geo/us/> .
@prefix dbpedia:       <http://dbpedia.org/resource/> .

typed_state:Hawaii 
   dcterms:identifier "Hawaii" ;
   a ds4383_vocab:State ;
   rdfs:label "Hawaii" ;
   owl:sameAs <http://sws.geonames.org/5855797/> , govtrackusgov:HI , dbpedia:Hawaii .
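
The step from the raw row to the enhanced row can be modeled in a few lines of Python. This is only a toy sketch, not the converter's implementation: `version_uri` and `enhance_row` are hypothetical names, the subject naming pattern (`FarmersMarket` becoming `farmersMarket_1367`) is inferred from the examples above, and the promotion of the state to a linked resource is left out.

```python
from decimal import Decimal

def version_uri(base, source, dataset, version):
    """Name a versioned dataset from the "Big Three": source, dataset, version."""
    return f"{base}/source/{source}/dataset/{dataset}/version/{version}"

def enhance_row(dataset_uri, row_number, raw, headers, domain_name, decimal_columns):
    """Toy model of one enhanced row: name the subject, rename columns, drop nulls, cast."""
    # "FarmersMarket" -> "farmersMarket_1367" (pattern inferred from the examples above).
    local_name = domain_name[0].lower() + domain_name[1:]
    subject = f"{dataset_uri}/{local_name}_{row_number}"
    enhanced = {}
    for column, value in raw.items():
        if value == "":  # conversion:symbol "" is interpreted as conversion:null
            continue
        predicate = headers[column]
        enhanced[predicate] = Decimal(value) if column in decimal_columns else value
    return subject, enhanced

# Row 1367 as it appears in the raw layer.
raw = {1: "Hawaii", 2: "Alii Garden Market Place", 3: "75-6129 Alii Drive",
       4: "Kailua-Kona", 5: "96740", 6: "-155.9819183", 7: "19.61436844", 8: ""}
# Predicates taken from the enhanced layer shown above.
headers = {1: "e1:locaddstate", 2: "e1:mktname", 3: "e1:locaddst", 4: "e1:locaddcity",
           5: "e1:locaddzip", 6: "wgs:long", 7: "wgs:lat"}

dataset_uri = version_uri("http://localhost", "data-gov", "4383", "2011-Sep-28")
subject, enhanced = enhance_row(dataset_uri, 1367, raw, headers, "FarmersMarket", {6, 7})
print(subject)
for predicate, value in enhanced.items():
    print("  ", predicate, repr(value))
```

The empty eighth column disappears, which mirrors what the `conversion:null` interpretation does to `raw:column_8`.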

Although the above enhanced RDF that we got during the tutorial is better than the raw, the following is even better because it reuses existing vocabulary that is recognized by existing systems:

@prefix con: <http://www.w3.org/2000/10/swap/pim/contact#> .
@prefix implicit_address: <http://localhost/source/data-gov/dataset/4383/version/2011-Sep-27/http_www_w3_org_2000_10_swap_pim_contact_address/> .

ds4383:farmersMarket_1367 
   dcterms:isReferencedBy <http://localhost/source/data-gov/dataset/4383/version/2011-Sep-27> ;
   void:inDataset         <http://localhost/source/data-gov/dataset/4383/version/2011-Sep-27> ;
   a ds4383_vocab:FarmersMarket ;
   con:address   implicit_address:address_1367 ;
   dcterms:title "Alii Garden Market Place" ;
   wgs:long "-155.9819183"^^xsd:decimal ;
   wgs:lat "19.61436844"^^xsd:decimal ;
   ov:csvRow "1367"^^xsd:integer .

implicit_address:address_1367 
   a con:Address ;
   con:stateOrProvince typed_state:Hawaii , 
                       <http://sws.geonames.org/5855797/> , 
                       govtrackusgov:HI , dbpedia:Hawaii ;
   con:street "75-6129 Alii Drive" ;
   con:city   "Kailua-Kona" ;
   con:zip    "96740" .

typed_state:Hawaii 
   dcterms:identifier "Hawaii" ;
   a ds4383_vocab:State ;
   rdfs:label "Hawaii" ;
   owl:sameAs <http://sws.geonames.org/5855797/> , govtrackusgov:HI , dbpedia:Hawaii .

On the agenda for tonight:

  • Finish up Farmers Market example
    • Reconstruct (and verify) the RDF
  • Start and Finish RPI Research Center example
    • Reconstruct (and verify) the RDF
    • Layer 1 versus Layer 2 (cell-based qb)
    • Subclass enhancement
    • Templates to consolidate People
    • LOD-Linking from SPARQL query
    • SPARQL Named graph organization
    • Publish/bin/*

Finishing up Farmers Market

Since someone (me) already went through the effort to specify the enhancement parameters, others (us) can reapply them to reconstruct the same RDF. We can reconstruct it using the following commands.

mkdir ~/Desktop/reproduce; cd ~/Desktop/reproduce
svn checkout https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov/4383 source/data-gov/4383
cd source/data-gov/4383/version/
./retrieve.sh
cd 2011-Nov-02

The above commands will retrieve and enhance the original tabular data (skipping the raw layer, because it is not useful here). But how do we know we "got it right"? We can use cr-test-conversion.sh to run SPARQL query unit tests that are version controlled in the data skeleton, along with retrieve.sh and the enhancement parameters. Testing is done within the conversion cockpit, just like conversion.

# run from within the conversion cockpit (source/data-gov/4383/version/2011-Nov-02)
cr-test-conversion.sh --setup -v

will populate a TDB triple store in a local directory and run the unit tests against it:

../../rq/test/ask/present/alabama-lod-linked-directly-referenced.rq (Ask => Yes)

      ?address 
         con:stateOrProvince typed_state:Alabama, dbpedia:Alabama , govtrackus:AL , 
                             <http://sws.geonames.org/4829764/> ;
         con:zip "36420";
      .
      typed_state:Alabama 
         dcterms:identifier "Alabama";
         rdfs:label         "Alabama";
         owl:sameAs dbpedia:Alabama , govtrackus:AL , <http://sws.geonames.org/4829764/> 
      .

................................................................................
../../rq/test/ask/present/alabama-lod-linked-indirectly-referenced.rq (Ask => Yes)

      ?address 
         con:stateOrProvince typed_state:Alabama;
         con:zip "36420";
      .
      typed_state:Alabama 
         dcterms:identifier "Alabama";
         rdfs:label         "Alabama";
         owl:sameAs dbpedia:Alabama , govtrackus:AL , <http://sws.geonames.org/4829764/> 
      .

--------------------------------------------------------------------------------
2 of 2 passed
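
The pass/fail bookkeeping of these "present" tests can be sketched in Python. This is a simplified model under stated assumptions, not what cr-test-conversion.sh does internally: the real tests are SPARQL ASK queries run against the TDB store, whereas `ask` here only checks ground triples (no variables), and the three test names are hypothetical.

```python
def ask(graph, pattern):
    """Ground ASK: true only when every triple of the pattern is in the graph."""
    return all(triple in graph for triple in pattern)

def run_present_tests(graph, tests):
    """Run tests whose patterns are expected to be present; return (passed, total)."""
    passed = 0
    for name, pattern in tests.items():
        result = ask(graph, pattern)
        print(f"{name} (Ask => {'Yes' if result else 'No'})")
        passed += 1 if result else 0
    print(f"{passed} of {len(tests)} passed")
    return passed, len(tests)

# A few triples about Alabama, taken from the test output above.
graph = {
    ("typed_state:Alabama", "dcterms:identifier", "Alabama"),
    ("typed_state:Alabama", "rdfs:label", "Alabama"),
    ("typed_state:Alabama", "owl:sameAs", "dbpedia:Alabama"),
}

# Hypothetical test names; the last pattern is deliberately absent from the graph.
tests = {
    "state-has-identifier": [("typed_state:Alabama", "dcterms:identifier", "Alabama")],
    "state-linked-to-dbpedia": [("typed_state:Alabama", "owl:sameAs", "dbpedia:Alabama")],
    "state-linked-to-freebase": [("typed_state:Alabama", "owl:sameAs", "freebase:Alabama")],
}

passed, total = run_present_tests(graph, tests)
```

A test under `present/` counts as passed only when the answer is Yes, so the third (made-up) test is reported as a failure.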

This enhanced layer is used in Alvaro's Farmers Markets demo:

farmers markets in 12180

New dataset: RPI administration handed us a spreadsheet a couple of weeks ago and said, "show it to us!"

Following the steps in Automated creation of a new Versioned Dataset again, we can run through the retrieve, convert, test cycle:

mkdir -p ~/Desktop/reproduce; cd ~/Desktop/reproduce
svn checkout https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-rpi-edu/research-centers/ source/data-rpi-edu/research-centers
cd source/data-rpi-edu/research-centers/version/
./retrieve.sh
cd 2011-Nov-02
cr-test-conversion.sh --setup --verbose

will end up with:

................................................................................
../../rq/test/ask/present/people-lod-link.rq (Ask => Yes)

      <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/typed/person/James_Hendler> 
         a research-centers_vocab:Faculty , foaf:Person ;
         dcterms:identifier "James Hendler" ;
         owl:sameAs <http://dbpedia.org/resource/James_Hendler> .

-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/           \ \ \ FAIL / / /
../../rq/test/ask/present/person-not-a-person.rq (Ask => No)

      <http://logd.tw.rpi.edu/demo/rpidemo/typed/person/Lucy_T_Zhang> a foaf:Person .

--------------------------------------------------------------------------------
5 of 7 passed

Enhancement Layer 1 looks like:

@prefix research-centers: 
  <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02/> .
@prefix typed_person:     
  <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/typed/person/> .

research-centers:researchCenter_5 
  dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
  void:inDataset <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
   a foaf:Organization , research-centers_vocab:ResearchCenter ;
   e1:fund_home_organization_description 
      value_of_fund_home_organization_description:Center_for_Advanced_Interconnect_Systems_Technologies ;
   e1:fund_home_portfolio_description "Vice President of Research" ;
   e1:expenditures "113670"^^xsd:integer ;
   foaf:member typed_person:Toh-Ming_Lu , 
               typed_person:James_Lu ;
   e1:core_facilities typed_facility:Clean_Room ;
   e1:signature_thrust typed_thrust:Nanotech , 
                       typed_thrust:Energy_Envt ;
   e1:school typed_school:School_of_Science_School_of_Engineering ;
   foaf:member typed_person:David_Duquette , 
               typed_person:Daniel_Gall ;
   e1:funding_source_distribution "Corp 16.4%, Other 2.72%,  State 80.88% " ;
                                 e1:corporation "16.4"^^xsd:decimal ;
                                 e1:state "80.88"^^xsd:decimal ;
                                 e1:other "2.72"^^xsd:decimal ;
   e1:average_expenditures "113,670" ;
   e1:average_oh "8,500" ;
   e1:acronym "CAIST" ;
   ov:csvRow "5"^^xsd:integer .

Running ./convert-research-centers.sh -e 2 will produce Enhancement Layer 2, which [converts with cell based subjects](Converting with cell based subjects) to create RDF Data Cube-friendly RDF:

research-centers:expenditureProportion_5_14 
  dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
   void:inDataset        <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
   a research-centers_vocab:ExpenditureProportion ;

   e2:research_center typed_researchcenter:Center_for_Advanced_Interconnect_Systems_Technologies ;
   e2:funding_type <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/funding-type/Corporation> ;
   rdf:value "0.16399999999999998"^^xsd:decimal ;

   ov:csvRow "5"^^xsd:integer ;
   ov:csvCol "14"^^xsd:integer ;
   e2:acronym "CAIST" .

research-centers:expenditureProportion_5_16 
   dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
   void:inDataset        <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
   a research-centers_vocab:ExpenditureProportion ;

   e2:research_center typed_researchcenter:Center_for_Advanced_Interconnect_Systems_Technologies ;
   e2:funding_type <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/funding-type/State> ;
   rdf:value "0.8088"^^xsd:decimal ;

   ov:csvRow "5"^^xsd:integer ;
   ov:csvCol "16"^^xsd:integer ;
   e2:acronym "CAIST" .

research-centers:expenditureProportion_5_18 
   dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
   void:inDataset         <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
   a research-centers_vocab:ExpenditureProportion ;

   e2:research_center typed_researchcenter:Center_for_Advanced_Interconnect_Systems_Technologies ;
   e2:funding_type <http://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/funding-type/Other> ;
   rdf:value "0.027200000000000002"^^xsd:decimal ;

   ov:csvRow "5"^^xsd:integer ;
   ov:csvCol "18"^^xsd:integer ;
   e2:acronym "CAIST" .
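
The cell-based idea behind Layer 2 can be sketched in Python: each funding cell of row 5 becomes its own subject, named by row and column. This is a sketch under assumptions, not the converter's code: `cell_subjects` is a hypothetical helper, and the column-to-funding-type mapping is read off the Layer 1 and Layer 2 examples above. The rdf:value literals above, such as 0.16399999999999998, look like binary floating-point artifacts; the sketch uses `decimal.Decimal` so that 16.4 percent becomes exactly 0.164.

```python
from decimal import Decimal

def cell_subjects(row, cells, local_name="expenditureProportion"):
    """Yield one record per (row, column) cell, as in a cell-based conversion."""
    for col, (funding_type, percent) in sorted(cells.items()):
        yield {
            "subject": f"{local_name}_{row}_{col}",
            "funding_type": funding_type,
            "value": Decimal(percent) / 100,  # percentage -> proportion
            "csvRow": row,
            "csvCol": col,
        }

# Columns 14, 16, and 18 of row 5, with the percentages from Layer 1.
cells = {14: ("Corporation", "16.4"), 16: ("State", "80.88"), 18: ("Other", "2.72")}
records = list(cell_subjects(5, cells))

for record in records:
    print(record["subject"], record["funding_type"], record["value"])
```

Because every cell carries its own row, column, and funding type, the records map naturally onto data-cube style observations.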

  • conversion:subclass_of is an enhancement that connects local vocabulary to popular vocabulary, making our data more interoperable.
      conversion:enhance [
         conversion:domain_name "ResearchCenter";
      ];
      conversion:enhance [
         conversion:class_name "ResearchCenter";
         conversion:subclass_of foaf:Organization;
      ];

results in:

 typed_researchcenter:Center_for_Advanced_Interconnect_Systems_Technologies 
    a research-centers_vocab:ResearchCenter, # <- A local class just created.
      foaf:Organization;                     # <- A class that "everyone" recognizes.
    dcterms:identifier "Center for Advanced Interconnect Systems Technologies";
    rdfs:label         "Center for Advanced Interconnect Systems Technologies" .

  • More LOD-Linking with conversion:links_via, this time with a SPARQL query against our Abstract Person Instance Hub named graph.
      conversion:enhance [
         ov:csvCol          4, 9, 10, 11;
         conversion:equivalent_property foaf:member;
         conversion:range_template "[/sd]typed/person/[.]";
         rdfs:comment "lod-links from <http://logd.tw.rpi.edu/source/twc-rpi-edu/dataset/instance-hub-people>";
         a conversion:CaseInsensitiveLODLink;
         conversion:links_via # Sesame doesn't like redirects?: <http://purl.org/twc/query/instance-hub/intranet/people>;
<http://logd.tw.rpi.edu:8890/sparql?default-graph-uri=&query=PREFIX+foaf%3A++++%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0APREFIX+dcterms%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0APREFIX+owl%3A+++++%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E%0D%0ACONSTRUCT+{+%3Fperson+dcterms%3Aidentifier+%3Fid+}%0D%0AWHERE+{%0D%0A++GRAPH+%3Chttp%3A%2F%2Flogd.tw.rpi.edu%2Fsource%2Ftwc-rpi-edu%2Fdataset%2Finstance-hub-people%3E++{%0D%0A++++%3Fp+a+foaf%3APerson%3B+owl%3AsameAs+%3Fperson+%3B+dcterms%3Aidentifier+%3Fid%0D%0A++}%0D%0A}&debug=on&timeout=&format=application%2Frdf%2Bxml>;
         conversion:subject_of dcterms:identifier;
         conversion:range rdfs:Resource;
      ];

which uses this SPARQL query:

PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?person dcterms:identifier ?id }
WHERE {
  GRAPH <http://logd.tw.rpi.edu/source/twc-rpi-edu/dataset/instance-hub-people>  {
    ?p a foaf:Person; owl:sameAs ?person ; dcterms:identifier ?id
  }
}
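
The long conversion:links_via URL above is just this query percent-encoded into a GET request, which can be reproduced with Python's standard library. The second helper is a guess at the spirit of conversion:CaseInsensitiveLODLink, assuming (as the name suggests) that cell values are matched to identifiers without regard to case; both function names are hypothetical and nothing here contacts the endpoint.

```python
from urllib.parse import urlencode

ENDPOINT = "http://logd.tw.rpi.edu:8890/sparql"

QUERY = """PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?person dcterms:identifier ?id }
WHERE {
  GRAPH <http://logd.tw.rpi.edu/source/twc-rpi-edu/dataset/instance-hub-people> {
    ?p a foaf:Person; owl:sameAs ?person ; dcterms:identifier ?id
  }
}"""

def links_via_url(endpoint, query, result_format="application/rdf+xml"):
    """Percent-encode a SPARQL query into a GET URL for the endpoint."""
    return endpoint + "?" + urlencode({"query": query, "format": result_format})

def link_case_insensitively(cell_values, identified):
    """Map each cell value to the URI whose identifier matches it, ignoring case."""
    by_id = {identifier.lower(): uri for uri, identifier in identified}
    return {value: by_id[value.lower()] for value in cell_values if value.lower() in by_id}

url = links_via_url(ENDPOINT, QUERY)
links = link_case_insensitively(
    ["JAMES HENDLER"],  # hypothetical spelling of a cell value
    [("http://dbpedia.org/resource/James_Hendler", "James Hendler")],
)
print(url)
print(links)
```

Note that `urlencode` also escapes the braces of the query, which the hand-built URL above leaves as-is.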

  • void:dataDumps are loaded into the SPARQL endpoint.

Questions:

  • For each research center, how many foaf:members does it have?
  • Number of members versus funding amount for each researcher?
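
On the endpoint, the first question is a job for a SPARQL query that groups by research center and counts foaf:member values. As an offline sketch of the same grouping, here is a small Python version over the Layer 1 triples for row 5. The second center and its values are hypothetical, added only so that there is more than one group, and the second question is read here as "per research center".

```python
from collections import Counter

triples = [
    # From the Layer 1 example above.
    ("research-centers:researchCenter_5", "foaf:member", "typed_person:Toh-Ming_Lu"),
    ("research-centers:researchCenter_5", "foaf:member", "typed_person:James_Lu"),
    ("research-centers:researchCenter_5", "foaf:member", "typed_person:David_Duquette"),
    ("research-centers:researchCenter_5", "foaf:member", "typed_person:Daniel_Gall"),
    ("research-centers:researchCenter_5", "e1:expenditures", 113670),
    # Hypothetical second center, not from the dataset.
    ("example:center_X", "foaf:member", "example:person_A"),
    ("example:center_X", "e1:expenditures", 1000),
]

def members_per_center(triples):
    """Count foaf:member values for each subject."""
    return Counter(s for s, p, o in triples if p == "foaf:member")

def members_versus_expenditures(triples):
    """Pair each center's member count with its expenditures, when known."""
    counts = members_per_center(triples)
    spent = {s: o for s, p, o in triples if p == "e1:expenditures"}
    return {center: (counts[center], spent.get(center)) for center in counts}

for center, (members, expenditures) in members_versus_expenditures(triples).items():
    print(center, members, expenditures)
```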

Research Center Web Applications

Using the enhanced RDF created above, two web applications were created and connected by their common topic:

[Figure: pie chart and node-link diagram of RPI research centers' funding and interconnection by shared members]

Clicking on "Center for Flow Physics and Control" will resolve its URI and redirect to HTML:

[Figure: pie chart and node-link diagram of RPI research centers' funding and interconnection by shared members]

Enhancement Layer 3

csv2rdf4lod is designed to allow third parties to incrementally improve the RDF representation of tabular data. Its layering design preserves backward compatibility by ensuring that subsequent layers only add assertions (i.e., that they are monotonic). Since the demonstrations above already use layers 1 and 2, we don't want to prematurely change that structure.

But since those layers were created, we have realized that there is a better way to model the structure. We can press on and start a layer 3:

cd source/data-rpi-edu/research-centers/version/2011-Nov-02
./convert-research-centers.sh -e 3
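
The monotonic layering idea can be stated as a small check in Python: what is published is the union of all layers, so adding layer 3 must leave every layer 1 and layer 2 triple in place. This is a conceptual sketch only; `publish` and `is_backward_compatible` are hypothetical names, and the layer 3 triple is made up for illustration.

```python
def publish(layers):
    """Union the triples of all layers; later layers add assertions and never retract."""
    published = set()
    for layer in layers:
        published |= layer
    return published

def is_backward_compatible(old_layers, new_layers):
    """True when everything published before is still published afterwards."""
    return publish(old_layers) <= publish(new_layers)

layer1 = {("research-centers:researchCenter_5", "e1:acronym", "CAIST")}
layer2 = {("research-centers:expenditureProportion_5_14", "e2:acronym", "CAIST")}
# Hypothetical layer 3 assertion, not taken from the dataset.
layer3 = {("research-centers:researchCenter_5", "e3:example_property", "example value")}

print(is_backward_compatible([layer1, layer2], [layer1, layer2, layer3]))
print(is_backward_compatible([layer1, layer2], [layer1, layer3]))
```

The second call drops layer 2, which is exactly the kind of change that would break the demonstrations built on it.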
