Converting with cell based subjects

timrdf edited this page Apr 4, 2011 · 21 revisions

Figure: diagram comparing row-based and cell-based interpretations.

Introduction

csv2rdf4lod is about parameterized specifications for how to interpret a tabular structure to produce well-formed RDF that closely corresponds to the domain we are trying to model. Beyond providing a more natural structure for the resulting RDF, it provides very nice defaults for the URI naming scheme for entities mentioned within datasets and offers the flexibility to change the naming conventions used. This allows a bottom-up, incremental, and backward-compatible approach to integrating datasets from multiple sources.

Background: binary vs. n-ary

The default interpretation is row-based: a URI is minted for each row in the table, predicates are minted from the column headers, and each cell yields a triple from the row URI to the cell value using the (binary) predicate (a.k.a. property or attribute) derived from the cell's column. The special sauce that csv2rdf4lod provides is a declarative way to express how this relatively trivial interpretation should be tweaked to make more natural representations (e.g., datatyping the string in the cell, promoting it to a (good) URI, restructuring the triple so it describes a different subject, or drawing out the implicit entities being described, i.e., "normalization"). All of that happens with the enhancement parameters, which loosely correspond to the axioms in RDFS (and OWL) where appropriate. The distinction is that RDFS and OWL assume RDF data, whereas csv2rdf4lod takes arbitrary literals and gets them to the RDF level.

The default interpretation just described creates binary relations from the row to the cell value, but some tabular structures are used to express n-ary relations where n is more than two. To interpret these correctly, we can switch from a row-based interpretation to a cell-based interpretation. To summarize:

Row-based interpretation: the table expresses binary relations
Cell-based interpretation: the table expresses n-ary relations (n greater than two)
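
To make the distinction concrete, here is a minimal Python sketch of the two interpretations. It is not csv2rdf4lod's implementation: the subject names (`thing_1`, `cell_1_2`), the `key_columns` parameter, and the plain-string predicates are simplifications assumed for illustration. The one data row reuses the Alabama / 2000 / 16.5 values from the example further down this page.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def row_based(header: List[str], rows: List[List[str]]) -> List[Triple]:
    """One subject per row; every cell becomes a binary (row, column, value) triple."""
    triples = []
    for r, row in enumerate(rows, start=1):
        subject = f"thing_{r}"  # hypothetical naming, not the real URI scheme
        for column, value in zip(header, row):
            triples.append((subject, column, value))
    return triples


def cell_based(header: List[str], rows: List[List[str]],
               key_columns: int, header_label: str) -> List[Triple]:
    """One subject per data cell.

    Each cell subject is linked to the row's key values, to its column
    header (via header_label), and to the cell value: an n-ary relation.
    """
    triples = []
    for r, row in enumerate(rows, start=1):
        for c in range(key_columns, len(header)):
            subject = f"cell_{r}_{c + 1}"  # hypothetical naming
            for k in range(key_columns):
                triples.append((subject, header[k], row[k]))
            triples.append((subject, header_label, header[c]))
            triples.append((subject, "value", row[c]))
    return triples


header = ["State", "2000"]
rows = [["Alabama", "16.5"]]

for triple in row_based(header, rows):
    print("row-based: ", triple)
for triple in cell_based(header, rows, key_columns=1, header_label="Year"):
    print("cell-based:", triple)
```

In the row-based output the year "2000" only appears as a predicate name. In the cell-based output it becomes a value that the cell subject points to, alongside the state and the tax amount.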

Writing the enhancement parameters

  • Type the conversion:Enhancement as scovo:Item.
  • The conversion:label will now name the predicate of the triple from the cell to the "up" value (the column header above the cell), instead of naming the predicate of the triple from the row to the cell value.
  • If the conversion:object predicate is omitted, the object will be a Resource named using the original column header (see hhs chsi for an example).
    • Note that this does not automatically keep the header as a literal, but one shouldn't go out of one's way to keep something a literal, especially something important enough to be listed in the header.
  • A conversion:object value of "[/sd]/value-of/[@]/[.]" will omit the subject discriminator when naming the Resource.
  • The conversion:object can be a template; e.g., conversion:object "[/sd]typed/council/[H]" will type-promote the header outside of the subject discriminator.

When facing many cell-based columns, the script described at Script: cell ify params.awk can help automate modifying the enhancement parameters.
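
The stanzas for cell-based columns are highly repetitive, which is why generating them is attractive. The Python sketch below is not the awk script mentioned above, and it makes no claim about how that script works. It only shows the idea of emitting one enhancement stanza per column, following the pattern of the column 3 stanza in the Examples section. The function names, the column-to-header mapping, and the "2001" header in the usage line are hypothetical, and it assumes every header is a year.

```python
from typing import Dict


def cell_based_stanza(col: int, header: str, label: str,
                      value_range: str = "xsd:decimal",
                      indent: str = "      ") -> str:
    """Return one conversion:enhance stanza (as Turtle text) for a cell-based column."""
    lines = [
        "conversion:enhance [",
        f"   ov:csvCol         {col};",
        f'   ov:csvHeader     "{header}";',
        "   a scovo:Item;",
        f'   conversion:label "{label}";',
        f'   conversion:object "{header}"^^xsd:gYear;',  # assumes the header is a year
        f"   conversion:range  {value_range};",
        "];",
    ]
    return "\n".join(indent + line for line in lines)


def cellify(headers_by_col: Dict[int, str], label: str) -> str:
    """Emit stanzas for all given columns, ordered by column number."""
    return "\n".join(
        cell_based_stanza(col, header, label)
        for col, header in sorted(headers_by_col.items())
    )


# Column 3 / "2000" comes from the example on this page; column 4 / "2001" is made up.
print(cellify({3: "2000", 4: "2001"}, label="Year"))
```

The printed text is meant to be pasted inside the conversion:conversion_process block and reviewed by hand, since labels and ranges often differ between columns.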

Examples

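@prefix rdfs:       <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:        <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms:    <http://purl.org/dc/terms/> .
@prefix void:       <http://rdfs.org/ns/void#> .
@prefix ov:         <http://open.vocab.org/terms/> .
@prefix scovo:      <http://purl.org/NET/scovo#> .
@prefix conversion: <http://purl.org/twc/vocab/conversion/> .
@prefix :  <http://logd.tw.rpi.edu/source/nci-nih-gov/dataset/state-tobacco-tax/version/2001-Jan-01/params/enhancement/1/> .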

:dataset a void:Dataset;
   conversion:base_uri           "http://logd.tw.rpi.edu"^^xsd:anyURI;
   conversion:source_identifier  "nci-nih-gov";
   conversion:dataset_identifier "state-tobacco-tax";
   conversion:dataset_version    "2010-Mar-29";
   conversion:conversion_process [
      a conversion:RawConversionProcess;
      conversion:enhancement_identifier "1";
      conversion:enhance [
         ov:csvRow 2;
         a conversion:HeaderRow;
      ];
      conversion:enhance [
         ov:csvRow 53;
         a conversion:DataEndRow;
      ];
      conversion:enhance [
         ov:csvCol         1;
         ov:csvHeader     "";
         conversion:label "State Order";
         conversion:range  xsd:integer;
         conversion:bundled_by [ ov:csvCol 2 ];
      ];
      conversion:enhance [
         ov:csvCol         2;
         ov:csvHeader     "";
         conversion:label "State"; 

         conversion:range  rdfs:Resource;

         conversion:range_name "State";

         conversion:links_via <http://www.rpi.edu/~lebot/lod-links/state-fips-dbpedia.ttl>,
                              <http://www.rpi.edu/~lebot/lod-links/state-fips-geonames.ttl>,
                              <http://www.rpi.edu/~lebot/lod-links/state-fips-govtrack.ttl>;
         conversion:subject_of dcterms:identifier;

         conversion:domain_name "Annual tax average";
      ];
      conversion:enhance [
         ov:csvCol         3;
         ov:csvHeader     "2000"; 

         a scovo:Item;
         conversion:label "Year";            # Property from cell URI to "2000"^^xsd:gYear
         conversion:object "2000"^^xsd:gYear;

         conversion:range  xsd:decimal; # Range of property "out of page"
      ];
   ];
.

The raw, row-based description of row 3

@prefix state-tobacco-tax_vocab: <http://logd.tw.rpi.edu/source/nci-nih-gov/dataset/state-tobacco-tax/vocab/> .
@prefix raw:                     <http://logd.tw.rpi.edu/source/nci-nih-gov/dataset/state-tobacco-tax/vocab/raw/> .

state-tobacco-tax:thing_3 
   raw:column_1 "1" ;
   raw:column_2 "Alabama" ;
   <http://logd.tw.rpi.edu/source/nci-nih-gov/dataset/state-tobacco-tax/vocab/raw/2000> "16.5¢" ;
   ov:csvRow 3 .

becomes

@prefix e1: <http://logd.tw.rpi.edu/source/nci-nih-gov/dataset/state-tobacco-tax/vocab/enhancement/1/> .

state-tobacco-tax:annual_tax_average_3_3 
  a state-tobacco-tax_vocab:Annual_tax_average ;
  e1:state typed_state:Alabama ;
  e1:year   "2000"^^xsd:gYear ;
  rdf:value "16.5"^^xsd:decimal; # TODO: should be in e1.
  ov:csvRow "3"^^xsd:integer ;
  ov:csvCol "3"^^xsd:integer .

Candidates for cell-based conversion: Dataset 1612, Dataset 10030, Dataset 1554, Dataset 401, Dataset 402

(SEC company financial reports - http://viewerprototype1.com/viewer choose a company, and "export to Excel")

RDF Data Cube's example

The source for an example that reproduces RDF Data Cube's example is at https://github.com/timrdf/csv2rdf4lod-automation/tree/master/doc/examples/source/publishing-statistical-data-googlecode-com/example-statwales-003311/version/2011-Feb-10.

Historical note
