DCAT partitioning ideas

Concepts:

  • dct:hasPart for non-overlapping hierarchical data

Principles:

  • make all available data of a dataset machine-accessible via at least one path (machines are smart and are assumed to be able to navigate through APIs)

Bad practices:

  • attach several subsetted files directly to the main DCAT dataset as distributions (each distribution must represent the whole dataset, not a subset)

Problems:

  • what if partitioning into subdatasets happens on a dimension that cannot be expressed with standard spatial/temporal metadata
    • how would clients / catalogs be able to organize that?

Scenario 1: Daily gridded global temperature "glotemp" (daily static netcdf files)

Data

  • 1 WMS endpoint with 1 layer with time dimension: /glotemp/wms
  • netcdf files: /glotemp/data/YYYY-MM-DD.nc, e.g. 2016-01-01.nc
  • summary website (/glotemp) linking to netcdf files with preview images from WMS

Recommendation

DCAT structure

  • have one root DCAT dataset with one child DCAT dataset per day (see the JSON-LD sketch after this list)
  • root:
    • URI: /glotemp
    • link to summary website (foaf:homepage), here identical to dataset URI (not required though)
    • distributions: WMS only
    • child datasets linked via dct:hasPart
  • child:
    • URI: /glotemp/YYYY-MM-DD
    • distributions: netcdf; optional: preview image (link to WMS image)
    • link to root dataset via dct:isPartOf
    • correct temporal extent metadata
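
A minimal JSON-LD sketch of this structure; the title is made up, and the schema:startDate/schema:endDate encoding of the temporal extent follows the DCAT-AP convention (other encodings are possible):

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "schema": "http://schema.org/"
  },
  "@graph": [
    {
      "@id": "/glotemp",
      "@type": "dcat:Dataset",
      "dct:title": "Daily gridded global temperature",
      "foaf:homepage": { "@id": "/glotemp" },
      "dct:hasPart": { "@id": "/glotemp/2016-01-01" },
      "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dct:title": "WMS endpoint",
        "dcat:accessURL": { "@id": "/glotemp/wms" }
      }
    },
    {
      "@id": "/glotemp/2016-01-01",
      "@type": "dcat:Dataset",
      "dct:isPartOf": { "@id": "/glotemp" },
      "dct:temporal": {
        "@type": "dct:PeriodOfTime",
        "schema:startDate": "2016-01-01",
        "schema:endDate": "2016-01-01"
      },
      "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": { "@id": "/glotemp/data/2016-01-01.nc" },
        "dcat:mediaType": "application/x-netcdf"
      }
    }
  ]
}
```

In practice there is one such child entry per day; only the WMS distribution sits on the root.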

Notes

  • it is important to include temporal metadata (GeoDCAT-AP) so that machine clients can find the right subdataset
  • /glotemp/YYYY-MM-DD does not resolve to anything; it is just an identifier
  • this is fine for a basic implementation; more advanced implementations are not always achievable within time/money constraints
  • the RDF DCAT data should be stored in a file, e.g. /glotemp/dcat.jsonld
  • the RDF file should be exposed at /glotemp via:
    • content negotiation (302 redirect) and/or
    • Link (rel=alternate) header and/or
    • in HTML summary page via <link rel="alternate"..>
  • not exposing the RDF file at all is fine too, as long as it is directly ingested into some bigger catalog elsewhere that allows RDF access (in which case the RDF file URL is known anyway)
  • depending on the total data time span, a hierarchical child dataset structure may be used instead:
    • /glotemp/YYYY -> /glotemp/YYYY/MM -> /glotemp/YYYY/MM/DD
    • this makes it possible to attach statements to a particular year or month grouping and may be more future-proof
    • it would also allow attaching future "zipped" (or merged netcdf etc.) distributions to a year or month for easier access (see the sketch after this list)
  • either way, this yields (at least) one DCAT dataset per day
    • PROBLEM: for many-year datasets, this could easily yield several thousand DCAT datasets
    • current catalog websites (e.g. CKAN) couldn't handle that properly, because:
      • they don't support proper parent-child relationships between (sub)datasets
      • they don't allow convenient filtering along the time axis (available subset times are not shown directly in an overview)
    • Solution:
      • improve catalog software
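
Returning to the hierarchical alternative above, a minimal sketch of the year and month levels; the zipped yearly distribution (/glotemp/data/2016.zip) is hypothetical and only illustrates the idea:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/"
  },
  "@graph": [
    {
      "@id": "/glotemp/2016",
      "@type": "dcat:Dataset",
      "dct:isPartOf": { "@id": "/glotemp" },
      "dct:hasPart": { "@id": "/glotemp/2016/01" },
      "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": { "@id": "/glotemp/data/2016.zip" },
        "dcat:mediaType": "application/zip"
      }
    },
    {
      "@id": "/glotemp/2016/01",
      "@type": "dcat:Dataset",
      "dct:isPartOf": { "@id": "/glotemp/2016" },
      "dct:hasPart": { "@id": "/glotemp/2016/01/01" }
    }
  ]
}
```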

Scenario 2: Daily gridded global temperature "glotemp2" (data API, no static files)

Data

  • 1 WMS endpoint with 1 layer with time dimension: /glotemp2/wms
  • 1 WCS endpoint: /glotemp2/wcs
  • summary website (/glotemp2) with preview images from WMS

Recommendation

DCAT structure

  • at a minimum, have a single DCAT dataset without any children:
    • URI: /glotemp2
    • link to summary website (foaf:homepage), here identical to dataset URI (not required though)
    • distributions: WMS and WCS
  • for better accessibility, additionally have DCAT child datasets as in Scenario 1 (see the sketch after this list):
    • URI: /glotemp2/YYYY-MM-DD
    • distributions: netcdf (via WCS); other formats offered by WCS; optional: preview image (link to WMS image)
    • link to root dataset via dct:isPartOf
    • correct temporal extent metadata
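
A sketch of such a child dataset: with no static files, the netcdf distribution can point at a WCS GetCoverage URL instead. The coverage id and the exact subsetting syntax are assumptions and depend on the WCS server:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/"
  },
  "@id": "/glotemp2/2016-01-01",
  "@type": "dcat:Dataset",
  "dct:isPartOf": { "@id": "/glotemp2" },
  "dcat:distribution": {
    "@type": "dcat:Distribution",
    "dcat:accessURL": {
      "@id": "/glotemp2/wcs?service=WCS&version=2.0.1&request=GetCoverage&coverageId=glotemp2&subset=time(%222016-01-01%22)&format=application/x-netcdf"
    },
    "dcat:mediaType": "application/x-netcdf"
  }
}
```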

Notes

  • the same notes as for scenario 1 apply

Scenario 3: Daily gridded global temperature + country subsets "glotemp3" (daily static netcdf files)

Data

  • 1 WMS endpoint with 1 layer with time dimension: /glotemp3/wms
  • daily global netcdf files: /glotemp3/data/global/YYYY-MM-DD.nc
  • daily country-subsetted netcdf files: /glotemp3/data/COUNTRY/YYYY-MM-DD.nc, e.g. /glotemp3/data/de/2016-01-01.nc
    • Note that the daily country files taken together do not equal the full daily dataset
  • summary website (/glotemp3) linking to netcdf files with preview images from WMS

Recommendation

DCAT structure

  • have one root DCAT dataset with one child DCAT dataset per day, each of which has one child DCAT dataset per country
  • root:
    • URI: /glotemp3
    • link to summary website (foaf:homepage), here identical to dataset URI (not required though)
    • distributions: WMS only
    • child datasets (daily global) linked via dct:hasPart
  • daily global:
    • URI: /glotemp3/YYYY-MM-DD
    • distributions: netcdf; optional: preview image (link to WMS image)
    • link to root dataset via dct:isPartOf
    • correct spatial and temporal extent metadata
    • child datasets (daily country) linked via dct:hasPart
  • daily country:
    • URI: /glotemp3/YYYY-MM-DD/COUNTRY
    • distributions: netcdf; optional: preview image (link to WMS image)
    • link to daily global dataset (/glotemp3/YYYY-MM-DD) via dct:isPartOf
    • correct spatial and temporal extent metadata

Notes

  • the same notes as for scenario 1 apply
  • it is important to include spatial and temporal metadata (GeoDCAT-AP) so that machine clients can find the right subdataset
  • spatial metadata should be given in multiple types and formats to allow the greatest interoperability (see the sketch after this list)
    • bounding box / geometry (WKT, GeoJSON), but also canonical country URIs
  • PROBLEM: this would yield nearly 200 DCAT datasets per day
    • see notes for scenario 1
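
A sketch of a daily country dataset carrying spatial metadata in two redundant forms: a WKT bounding box via locn:geometry (GeoDCAT-AP style) and a canonical country URI. The exact coordinates and the choice of the EU Publications Office country URI are illustrative:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "locn": "http://www.w3.org/ns/locn#",
    "gsp": "http://www.opengis.net/ont/geosparql#"
  },
  "@id": "/glotemp3/2016-01-01/de",
  "@type": "dcat:Dataset",
  "dct:isPartOf": { "@id": "/glotemp3/2016-01-01" },
  "dct:spatial": [
    { "@id": "http://publications.europa.eu/resource/authority/country/DEU" },
    {
      "@type": "dct:Location",
      "locn:geometry": {
        "@value": "POLYGON((5.87 47.27, 15.04 47.27, 15.04 55.06, 5.87 55.06, 5.87 47.27))",
        "@type": "gsp:wktLiteral"
      }
    }
  ],
  "dcat:distribution": {
    "@type": "dcat:Distribution",
    "dcat:downloadURL": { "@id": "/glotemp3/data/de/2016-01-01.nc" },
    "dcat:mediaType": "application/x-netcdf"
  }
}
```

The temporal extent would be given as in the Scenario 1 sketch.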

Scenario 4: Hourly country + city-level historic weather "weather" (data API, websites)

Data

  • hourly historic city-level RDF data in a sparql endpoint: /weather/sparql
    • filtering by country is possible
  • daily html for countries and cities: /weather/YYYY-MM-DD/COUNTRY.html and /weather/YYYY-MM-DD/COUNTRY/CITY.html
    • each site has a date picker to go to the other dates
  • daily RDF for countries and cities: /weather/YYYY-MM-DD/COUNTRY.jsonld and /weather/YYYY-MM-DD/COUNTRY/CITY.jsonld
  • content negotiation at /weather/YYYY-MM-DD/COUNTRY and /weather/YYYY-MM-DD/COUNTRY/CITY (302 redirect)
  • redirect from /weather/latest/COUNTRY to latest available /weather/YYYY-MM-DD/COUNTRY
  • redirect from /weather/latest/COUNTRY/CITY to latest available /weather/YYYY-MM-DD/COUNTRY/CITY
  • summary website (/weather) providing a country and city search

Recommendation

DCAT structure

  • have one root DCAT dataset with one child per day, each day with one child per country, each country with one child per city
  • PROBLEM: this would create too many DCAT datasets per day (~2 million)
    • (there is some overlap between a data API supporting subsetting and a bunch of connected DCAT datasets)
    • could be solved with "parametric"/"dynamic" datasets, but such a thing doesn't exist yet
    • Solution: have only a root dataset without children and point only to the SPARQL endpoint and the summary website (see the sketch after this list)
    • TODO: Is that the best we can do?
    • how would the daily RDF dumps be referenced then?
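
A minimal sketch of that root-only fallback; it deliberately leaves the daily RDF dumps unreferenced, which is exactly the open question above:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/"
  },
  "@id": "/weather",
  "@type": "dcat:Dataset",
  "foaf:homepage": { "@id": "/weather" },
  "dcat:distribution": {
    "@type": "dcat:Distribution",
    "dct:title": "SPARQL endpoint",
    "dcat:accessURL": { "@id": "/weather/sparql" }
  }
}
```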

Scenario 5: Yearly land cover UK "lc" (static netcdf files + data API)

Data

  • 1 WCS endpoint with 1 layer with time dimension: /lc/wcs
  • netcdf files: /lc/data/YYYY.nc, e.g. 2016.nc
  • summary website (/lc) linking to netcdf files

Recommendation

DCAT structure

  • have one root DCAT dataset with one child DCAT dataset per year
  • root:
    • URI: /lc
    • link to summary website (foaf:homepage), here identical to dataset URI (not required though)
    • distributions: WCS only
    • child datasets linked via dct:hasPart
  • child:
    • URI: /lc/YYYY
    • distributions: netcdf
    • link to root dataset via dct:isPartOf
    • correct temporal extent metadata

Notes

  • the same notes as for scenario 1 apply

Scenario 6: Historical soil moisture profiles UK "soil" (stored in DB, data APIs)

Data

  • REST API (JSON + HTML):
    • /soil/sites
    • /soil/sites/SITE
    • /soil/sites/SITE/tubes
    • /soil/sites/SITE/tubes/TUBE
    • /soil/sites/SITE/tubes/TUBE/YYYY-MM-DD (measurement data, not every day)
  • SPARQL endpoint: /soil/sparql
  • summary website (/soil) that links to the REST API

Recommendation

DCAT Structure

  • have one DCAT dataset (sketched below):
    • URI: /soil
    • link to summary website (foaf:homepage), here identical to dataset URI (not required though)
    • distributions: SPARQL endpoint and the entry point to the REST API (/soil/sites)
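
A minimal sketch; the distribution titles are made up:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/"
  },
  "@id": "/soil",
  "@type": "dcat:Dataset",
  "foaf:homepage": { "@id": "/soil" },
  "dcat:distribution": [
    {
      "@type": "dcat:Distribution",
      "dct:title": "SPARQL endpoint",
      "dcat:accessURL": { "@id": "/soil/sparql" }
    },
    {
      "@type": "dcat:Distribution",
      "dct:title": "REST API entry point",
      "dcat:accessURL": { "@id": "/soil/sites" },
      "dcat:mediaType": "application/json"
    }
  ]
}
```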

Notes

  • it probably doesn't make much sense to expose all sites as subdatasets, since there is no additional data to reference (the REST API is already fully discoverable from its entry point)

Scenario 7: NL tree information (RDF/SPARQL, single-tree websites (by ID? +region?))

Data

  • TODO