DCAT partitioning ideas

Concepts:

  • dct:hasPart for non-overlapping hierarchical data

Principles:

  • make all available data of a dataset machine-accessible via at least one path (machines are smart and are assumed to be able to navigate through APIs)

Bad practices:

  • attach several subsetted files directly to the main DCAT dataset as distributions (each distribution must represent the whole dataset, not a subset)

Problems:

  • what if partitioning into subdatasets happens on a dimension that cannot be expressed with standard spatial/temporal metadata
    • how would clients / catalogs be able to organize that?

Scenario 1: Daily gridded global temperature "glotemp" (daily static netcdf files)

Data

  • 1 WMS endpoint with 1 layer with time dimension: /glotemp/wms
  • netcdf files: /glotemp/data/YYYY-MM-DD.nc, e.g. 2016-01-01.nc
  • summary website (/glotemp) linking to netcdf files with preview images from WMS

Recommendation

DCAT structure

  • have one root DCAT dataset with one child DCAT dataset per day (see the JSON-LD sketch after this list)
  • root:
    • URI: /glotemp
    • link to summary website (foaf:homepage), here identical to dataset URI (not required though)
    • distributions: WMS only
    • child datasets linked via dct:hasPart
  • child:
    • URI: /glotemp/YYYY-MM-DD
    • distributions: netcdf; optional: preview image (link to WMS image)
    • link to root dataset via dct:isPartOf
    • correct temporal extent metadata
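
A minimal JSON-LD sketch of this structure; the title is made up, and the schema:startDate/schema:endDate encoding of the temporal extent follows the DCAT-AP convention (other encodings are possible):

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "schema": "http://schema.org/"
  },
  "@graph": [
    {
      "@id": "/glotemp",
      "@type": "dcat:Dataset",
      "dct:title": "Daily gridded global temperature",
      "foaf:homepage": { "@id": "/glotemp" },
      "dct:hasPart": { "@id": "/glotemp/2016-01-01" },
      "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dct:title": "WMS endpoint",
        "dcat:accessURL": { "@id": "/glotemp/wms" }
      }
    },
    {
      "@id": "/glotemp/2016-01-01",
      "@type": "dcat:Dataset",
      "dct:isPartOf": { "@id": "/glotemp" },
      "dct:temporal": {
        "@type": "dct:PeriodOfTime",
        "schema:startDate": "2016-01-01",
        "schema:endDate": "2016-01-01"
      },
      "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": { "@id": "/glotemp/data/2016-01-01.nc" },
        "dcat:mediaType": "application/x-netcdf"
      }
    }
  ]
}
```

In practice there is one such child entry per day; only the WMS distribution sits on the root.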

Notes

  • it is important to include temporal metadata (GeoDCAT-AP) so that machine clients can find the right subdataset
  • /glotemp/YYYY-MM-DD does not resolve to anything; it is just an identifier
  • this is fine for a basic implementation; more advanced implementations are not always achievable within time/money constraints
  • the RDF DCAT data should be stored in a file, e.g. /glotemp/dcat.jsonld
  • the RDF file should be exposed at /glotemp via:
    • content negotiation (302 redirect) and/or
    • Link (rel=alternate) header and/or
    • in HTML summary page via <link rel="alternate"..>
  • not exposing the RDF file at all is fine too, as long as it is directly ingested into some bigger catalog elsewhere that allows RDF access (in which case the RDF file URL is known anyway)
  • depending on the total data time span, a hierarchical child dataset structure may be used instead:
    • /glotemp/YYYY -> /glotemp/YYYY/MM -> /glotemp/YYYY/MM/DD
    • this makes it possible to attach statements to a particular year or month grouping and may be more future-proof
    • it would also allow attaching future "zipped" (or merged netcdf etc.) distributions to a year or month for easier access (see the sketch after this list)
  • either way, this yields (at least) one DCAT dataset per day
    • PROBLEM: for many-year datasets, this could easily yield several thousand DCAT datasets
    • current catalog websites (e.g. CKAN) couldn't handle that properly, because:
      • they don't support proper parent-child relationships between (sub)datasets
      • they don't allow convenient filtering along the time axis (available subset times are not shown directly in an overview)
    • Solution:
      • improve catalog software
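
Returning to the hierarchical alternative above, a minimal sketch of the year and month levels; the zipped yearly distribution (/glotemp/data/2016.zip) is hypothetical and only illustrates the idea:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/"
  },
  "@graph": [
    {
      "@id": "/glotemp/2016",
      "@type": "dcat:Dataset",
      "dct:isPartOf": { "@id": "/glotemp" },
      "dct:hasPart": { "@id": "/glotemp/2016/01" },
      "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": { "@id": "/glotemp/data/2016.zip" },
        "dcat:mediaType": "application/zip"
      }
    },
    {
      "@id": "/glotemp/2016/01",
      "@type": "dcat:Dataset",
      "dct:isPartOf": { "@id": "/glotemp/2016" },
      "dct:hasPart": { "@id": "/glotemp/2016/01/01" }
    }
  ]
}
```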

Scenario 2: Daily gridded global temperature "glotemp2" (data API, no static files)

Data

  • 1 WMS endpoint with 1 layer with time dimension: /glotemp2/wms
  • 1 WCS endpoint: /glotemp2/wcs
  • summary website (/glotemp2) with preview images from WMS

Recommendation

DCAT structure

  • at a minimum, have a single DCAT dataset without any children:
    • URI: /glotemp2
    • link to summary website (foaf:homepage), here identical to dataset URI (not required though)
    • distributions: WMS and WCS
  • for better accessibility, additionally have DCAT child datasets as in Scenario 1 (see the sketch after this list):
    • URI: /glotemp2/YYYY-MM-DD
    • distributions: netcdf (via WCS); other formats offered by WCS; optional: preview image (link to WMS image)
    • link to root dataset via dct:isPartOf
    • correct temporal extent metadata
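
A sketch of such a child dataset: with no static files, the netcdf distribution can point at a WCS GetCoverage URL instead. The coverage id and the exact subsetting syntax are assumptions and depend on the WCS server:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/"
  },
  "@id": "/glotemp2/2016-01-01",
  "@type": "dcat:Dataset",
  "dct:isPartOf": { "@id": "/glotemp2" },
  "dcat:distribution": {
    "@type": "dcat:Distribution",
    "dcat:accessURL": {
      "@id": "/glotemp2/wcs?service=WCS&version=2.0.1&request=GetCoverage&coverageId=glotemp2&subset=time(%222016-01-01%22)&format=application/x-netcdf"
    },
    "dcat:mediaType": "application/x-netcdf"
  }
}
```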

Notes

  • the same notes as for scenario 1 apply

Scenario 3: Daily gridded global temperature + country subsets "glotemp3" (daily static netcdf files)

Data

  • 1 WMS endpoint with 1 layer with time dimension: /glotemp3/wms
  • daily global netcdf files: /glotemp3/data/global/YYYY-MM-DD.nc
  • daily country-subsetted netcdf files: /glotemp3/data/COUNTRY/YYYY-MM-DD.nc, e.g. /glotemp3/data/de/2016-01-01.nc
    • Note that the daily country files taken together do not equal the full daily dataset
  • summary website (/glotemp3) linking to netcdf files with preview images from WMS

Recommendation

DCAT structure

  • have one root DCAT dataset with one child DCAT dataset per day, each of which has one child DCAT dataset per country
  • root:
    • URI: /glotemp3
    • link to summary website (foaf:homepage), here identical to dataset URI (not required though)
    • distributions: WMS only
    • child datasets (daily global) linked via dct:hasPart
  • daily global:
    • URI: /glotemp3/YYYY-MM-DD
    • distributions: netcdf; optional: preview image (link to WMS image)
    • link to root dataset via dct:isPartOf
    • correct spatial and temporal extent metadata
    • child datasets (daily country) linked via dct:hasPart
  • daily country:
    • URI: /glotemp3/YYYY-MM-DD/COUNTRY
    • distributions: netcdf; optional: preview image (link to WMS image)
    • link to daily global dataset (/glotemp3/YYYY-MM-DD) via dct:isPartOf
    • correct spatial and temporal extent metadata

Notes

  • the same notes as for scenario 1 apply
  • it is important to include spatial and temporal metadata (GeoDCAT-AP) so that machine clients can find the right subdataset
  • spatial metadata should be given in multiple types and formats to allow the greatest interoperability (see the sketch after this list)
    • bounding box / geometry (WKT, GeoJSON), but also canonical country URIs
  • PROBLEM: this would yield nearly 200 DCAT datasets per day
    • see notes for scenario 1
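
A sketch of a daily country dataset carrying spatial metadata in two redundant forms: a WKT bounding box via locn:geometry (GeoDCAT-AP style) and a canonical country URI. The exact coordinates and the choice of the EU Publications Office country URI are illustrative:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "locn": "http://www.w3.org/ns/locn#",
    "gsp": "http://www.opengis.net/ont/geosparql#"
  },
  "@id": "/glotemp3/2016-01-01/de",
  "@type": "dcat:Dataset",
  "dct:isPartOf": { "@id": "/glotemp3/2016-01-01" },
  "dct:spatial": [
    { "@id": "http://publications.europa.eu/resource/authority/country/DEU" },
    {
      "@type": "dct:Location",
      "locn:geometry": {
        "@value": "POLYGON((5.87 47.27, 15.04 47.27, 15.04 55.06, 5.87 55.06, 5.87 47.27))",
        "@type": "gsp:wktLiteral"
      }
    }
  ],
  "dcat:distribution": {
    "@type": "dcat:Distribution",
    "dcat:downloadURL": { "@id": "/glotemp3/data/de/2016-01-01.nc" },
    "dcat:mediaType": "application/x-netcdf"
  }
}
```

The temporal extent would be given as in the Scenario 1 sketch.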

Scenario 4: Hourly country + city-level historic weather "weather" (data API, websites)

Data

  • hourly historic city-level RDF data in a sparql endpoint: /weather/sparql
    • filtering by country is possible
  • daily html for countries and cities: /weather/YYYY-MM-DD/COUNTRY.html and /weather/YYYY-MM-DD/COUNTRY/CITY.html
    • each site has a date picker to go to the other dates
  • daily RDF for countries and cities: /weather/YYYY-MM-DD/COUNTRY.jsonld and /weather/YYYY-MM-DD/COUNTRY/CITY.jsonld
  • content negotiation at /weather/YYYY-MM-DD/COUNTRY and /weather/YYYY-MM-DD/COUNTRY/CITY (302 redirect)
  • redirect from /weather/latest/COUNTRY to latest available /weather/YYYY-MM-DD/COUNTRY
  • redirect from /weather/latest/COUNTRY/CITY to latest available /weather/YYYY-MM-DD/COUNTRY/CITY
  • summary website (/weather) providing a country and city search

Recommendation

DCAT structure

  • have one root DCAT dataset with one child per day, each day with one child per country, each country with one child per city
  • PROBLEM: this would create too many DCAT datasets per day (~2 million)
    • (there is some overlap between a data API supporting subsetting and a bunch of connected DCAT datasets)
    • could be solved with "parametric"/"dynamic" datasets, but such a thing doesn't exist yet
    • Solution: have only a root dataset without children and point only to the SPARQL endpoint and the summary website (see the sketch after this list)
    • TODO: Is that the best we can do?
    • how would the daily RDF dumps be referenced then?
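
A minimal sketch of that root-only fallback; it deliberately leaves the daily RDF dumps unreferenced, which is exactly the open question above:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/"
  },
  "@id": "/weather",
  "@type": "dcat:Dataset",
  "foaf:homepage": { "@id": "/weather" },
  "dcat:distribution": {
    "@type": "dcat:Distribution",
    "dct:title": "SPARQL endpoint",
    "dcat:accessURL": { "@id": "/weather/sparql" }
  }
}
```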

Scenario 5: Yearly land cover UK "lc" (static netcdf files + data API)

Data

  • 1 WCS endpoint with 1 layer with time dimension: /lc/wcs
  • netcdf files: /lc/data/YYYY.nc, e.g. 2016.nc
  • summary website (/lc) linking to netcdf files

Recommendation

DCAT structure

  • have one root DCAT dataset with one child DCAT dataset per year
  • root:
    • URI: /lc
    • link to summary website (foaf:homepage), here identical to dataset URI (not required though)
    • distributions: WCS only
    • child datasets linked via dct:hasPart
  • child:
    • URI: /lc/YYYY
    • distributions: netcdf
    • link to root dataset via dct:isPartOf
    • correct temporal extent metadata

Notes

  • the same notes as for scenario 1 apply

Scenario 6: Historical soil moisture profiles UK "soil" (stored in DB, data APIs)

Data

  • REST API (JSON + HTML):
    • /soil/sites
    • /soil/sites/SITE
    • /soil/sites/SITE/tubes
    • /soil/sites/SITE/tubes/TUBE
    • /soil/sites/SITE/tubes/TUBE/YYYY-MM-DD (measurement data, not every day)
  • SPARQL endpoint: /soil/sparql
  • summary website (/soil) that links to the REST API

Recommendation

DCAT Structure

  • have one DCAT dataset (sketched below):
    • URI: /soil
    • link to summary website (foaf:homepage), here identical to dataset URI (not required though)
    • distributions: SPARQL endpoint and the entry point to the REST API (/soil/sites)
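
A minimal sketch; the distribution titles are made up:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/"
  },
  "@id": "/soil",
  "@type": "dcat:Dataset",
  "foaf:homepage": { "@id": "/soil" },
  "dcat:distribution": [
    {
      "@type": "dcat:Distribution",
      "dct:title": "SPARQL endpoint",
      "dcat:accessURL": { "@id": "/soil/sparql" }
    },
    {
      "@type": "dcat:Distribution",
      "dct:title": "REST API entry point",
      "dcat:accessURL": { "@id": "/soil/sites" },
      "dcat:mediaType": "application/json"
    }
  ]
}
```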

Notes

  • it probably doesn't make much sense to expose all sites as subdatasets, since there is no additional data to reference (the REST API is already fully discoverable from its entry point)

Scenario 7: NL tree information (RDF/SPARQL, single-tree websites (by ID? +region?))

Data

  • TODO