DCAT partitioning ideas
Maik Riechert edited this page Feb 3, 2016
Concepts:
- dct:hasPart for non-overlapping hierarchical data
Principles:
- make all available data of a dataset machine-accessible via at least one path (machines are smart and are assumed to be able to navigate through APIs)
Bad practices:
- attaching several subsetted files as distributions of the main DCAT dataset (a distribution is supposed to represent the whole dataset)
Problems:
- what if partitioning into subdatasets happens on a dimension that cannot be expressed with standard spatial/temporal metadata
- how would clients / catalogs be able to organize that?
Scenario 1: Daily gridded global temperature "glotemp" (daily static netcdf files)
- 1 WMS endpoint with 1 layer with time dimension: /glotemp/wms
- netcdf files: /glotemp/data/YYYY-MM-DD.nc, e.g. 2016-01-01.nc
- summary website (/glotemp) linking to netcdf files with preview images from WMS
- have one root DCAT dataset with one child DCAT dataset per day
- root:
- URI: /glotemp
- link to summary website (foaf:homepage), here identical to dataset URI (not required though)
- distributions: WMS only
- child datasets linked via dct:hasPart
- child:
- URI: /glotemp/YYYY-MM-DD
- distributions: netcdf; optional: preview image (link to WMS image)
- link to root dataset via dct:isPartOf
- correct temporal extent metadata
- it is important to include temporal metadata (GeoDCAT-AP) so that machine clients can find the right subdataset
- /glotemp/YYYY-MM-DD does not resolve to anything, it is just an identifier
- this is fine for a basic implementation
- advanced implementations are not always achievable within time/money constraints
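A minimal JSON-LD sketch of this root/child layout (property choices follow plain DCAT/GeoDCAT-AP conventions; the schema.org start/end date properties for the period and the netcdf media type string are assumptions, and only one child day is shown):

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "schema": "http://schema.org/"
  },
  "@graph": [
    {
      "@id": "/glotemp",
      "@type": "dcat:Dataset",
      "foaf:homepage": { "@id": "/glotemp" },
      "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:accessURL": { "@id": "/glotemp/wms" }
      },
      "dct:hasPart": { "@id": "/glotemp/2016-01-01" }
    },
    {
      "@id": "/glotemp/2016-01-01",
      "@type": "dcat:Dataset",
      "dct:isPartOf": { "@id": "/glotemp" },
      "dct:temporal": {
        "@type": "dct:PeriodOfTime",
        "schema:startDate": "2016-01-01",
        "schema:endDate": "2016-01-01"
      },
      "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": { "@id": "/glotemp/data/2016-01-01.nc" },
        "dcat:mediaType": "application/x-netcdf"
      }
    }
  ]
}
```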
- the RDF DCAT data should be stored in a file, e.g. /glotemp/dcat.jsonld
- the RDF file should be exposed at /glotemp via:
- content negotiation (302 redirect) and/or
- Link (rel=alternate) header and/or
- in the HTML summary page via a `<link rel="alternate" ...>` element
- if the RDF file is not exposed at the dataset URI, that is fine too, as long as it is directly ingested into some bigger catalog elsewhere that allows RDF access (in which case the RDF file URL is known anyway)
- depending on the total data time span, a hierarchical child dataset structure may be used instead:
- /glotemp/YYYY -> /glotemp/YYYY/MM -> /glotemp/YYYY/MM/DD
- this allows making statements about a certain year or month grouping and may be more future-proof
- it would also allow attaching future "zipped" (or merged netcdf etc.) distributions to a year or month for easier access
- this would yield one DCAT dataset per day
- PROBLEM: for many-year datasets, this could easily yield several thousand DCAT datasets
- current catalog websites (e.g. CKAN) couldn't handle that properly, because:
- they don't support proper parent-child relationships between (sub)datasets
- they don't allow convenient filtering along the time axis (available subset times are not shown directly in an overview)
- Solution:
- improve catalog software
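The hierarchical year/month grouping mentioned above could look roughly like this for one year node (prefixes omitted for brevity; the zipped per-year distribution URL and the two month children shown are hypothetical):

```json
{
  "@id": "/glotemp/2016",
  "@type": "dcat:Dataset",
  "dct:isPartOf": { "@id": "/glotemp" },
  "dct:hasPart": [
    { "@id": "/glotemp/2016/01" },
    { "@id": "/glotemp/2016/02" }
  ],
  "dcat:distribution": {
    "@type": "dcat:Distribution",
    "dcat:downloadURL": { "@id": "/glotemp/data/2016.zip" },
    "dcat:mediaType": "application/zip"
  }
}
```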
Scenario 2: Daily gridded global temperature "glotemp2" (WMS + WCS, no static files)
- 1 WMS endpoint with 1 layer with time dimension: /glotemp2/wms
- 1 WCS endpoint: /glotemp2/wcs
- summary website (/glotemp2) with preview images from WMS
- at a minimum a single DCAT dataset without any children:
- URI: /glotemp2
- link to summary website (foaf:homepage), here identical to dataset URI (not required though)
- distributions: WMS and WCS
- for better accessibility, additionally provide DCAT child datasets as in Scenario 1:
- URI: /glotemp2/YYYY-MM-DD
- distributions: netcdf (via WCS); other formats offered by WCS; optional: preview image (link to WMS image)
- link to root dataset via dct:isPartOf
- correct temporal extent metadata
- the same notes as for scenario 1 apply
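A sketch of such a WCS-backed child dataset (prefixes omitted for brevity; only the bare WCS endpoint is given as access URL — the concrete GetCoverage request parameters for the day are not spelled out here, and the media type string is an assumption):

```json
{
  "@id": "/glotemp2/2016-01-01",
  "@type": "dcat:Dataset",
  "dct:isPartOf": { "@id": "/glotemp2" },
  "dcat:distribution": {
    "@type": "dcat:Distribution",
    "dcat:accessURL": { "@id": "/glotemp2/wcs" },
    "dcat:mediaType": "application/x-netcdf"
  }
}
```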
Scenario 3: Daily gridded global temperature + country subsets "glotemp3" (daily static netcdf files)
- 1 WMS endpoint with 1 layer with time dimension: /glotemp3/wms
- daily global netcdf files: /glotemp3/data/global/YYYY-MM-DD.nc
- daily country-subsetted netcdf files: /glotemp3/data/COUNTRY/YYYY-MM-DD.nc, e.g. /glotemp3/data/de/2016-01-01.nc
- Note that the set of daily country files does not equal the full daily dataset
- summary website (/glotemp3) linking to netcdf files with preview images from WMS
- have one root DCAT dataset with one child DCAT dataset per day with one child DCAT dataset per country
- root:
- URI: /glotemp3
- link to summary website (foaf:homepage), here identical to dataset URI (not required though)
- distributions: WMS only
- child datasets (daily global) linked via dct:hasPart
- daily global:
- URI: /glotemp3/YYYY-MM-DD
- distributions: netcdf; optional: preview image (link to WMS image)
- link to root dataset via dct:isPartOf
- correct spatial and temporal extent metadata
- child datasets (daily country) linked via dct:hasPart
- daily country:
- URI: /glotemp3/YYYY-MM-DD/COUNTRY
- distributions: netcdf; optional: preview image (link to WMS image)
- link to daily global dataset (/glotemp3/YYYY-MM-DD) via dct:isPartOf
- correct spatial and temporal extent metadata
- the same notes as for scenario 1 apply
- it is important to include spatial and temporal metadata (GeoDCAT-AP) so that machine clients can find the right subdataset
- spatial metadata should be given in multiple types and formats to allow greatest interoperability
- bounding box / geometry (WKT, GeoJSON), but also canonical country URIs
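The multi-format spatial metadata for a country child could be expressed along these lines (using the GeoDCAT-AP `locn:geometry`/`gsp:wktLiteral` pattern; the `locn` and `gsp` prefixes stand for the W3C Location Core and OGC GeoSPARQL vocabularies, the German bounding box coordinates are approximate, and the EU country authority URI is one possible choice of canonical country URI):

```json
{
  "@id": "/glotemp3/2016-01-01/de",
  "@type": "dcat:Dataset",
  "dct:isPartOf": { "@id": "/glotemp3/2016-01-01" },
  "dct:spatial": [
    {
      "@type": "dct:Location",
      "locn:geometry": {
        "@value": "POLYGON((5.9 47.3, 15.0 47.3, 15.0 55.1, 5.9 55.1, 5.9 47.3))",
        "@type": "gsp:wktLiteral"
      }
    },
    { "@id": "http://publications.europa.eu/resource/authority/country/DEU" }
  ]
}
```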
- PROBLEM: this would yield nearly 200 DCAT datasets per day
- see notes for scenario 1
Scenario 4: Hourly historic city-level weather "weather" (SPARQL endpoint + daily RDF dumps)
- hourly historic city-level RDF data in a SPARQL endpoint: /weather/sparql
- filtering by country is possible
- daily html for countries and cities: /weather/YYYY-MM-DD/COUNTRY.html and /weather/YYYY-MM-DD/COUNTRY/CITY.html
- each page has a date picker to navigate to the other dates
- daily RDF for countries and cities: /weather/YYYY-MM-DD/COUNTRY.jsonld and /weather/YYYY-MM-DD/COUNTRY/CITY.jsonld
- content negotiation at /weather/YYYY-MM-DD/COUNTRY and /weather/YYYY-MM-DD/COUNTRY/CITY (302 redirect)
- redirect from /weather/latest/COUNTRY to latest available /weather/YYYY-MM-DD/COUNTRY
- redirect from /weather/latest/COUNTRY/CITY to latest available /weather/YYYY-MM-DD/COUNTRY/CITY
- summary website (/weather) providing a country and city search
- have one root DCAT dataset with one child per day with one child per country with one child per city
- PROBLEM: this would create too many DCAT datasets per day (~2 million)
- (there is some overlap between a data API supporting subsetting and a bunch of connected DCAT datasets)
- could be solved with "parametric"/"dynamic" datasets, but such a thing doesn't exist yet
- Solution: have only a root dataset without children and point only to the SPARQL endpoint and the summary website
- TODO: Is that the best we can do?
- how would the daily RDF dumps be referenced then?
Scenario 5: Yearly gridded data "lc" (yearly static netcdf files)
- 1 WCS endpoint with 1 layer with time dimension: /lc/wcs
- netcdf files: /lc/data/YYYY.nc, e.g. 2016.nc
- summary website (/lc) linking to netcdf files
- have one root DCAT dataset with one child DCAT dataset per year
- root:
- URI: /lc
- link to summary website (foaf:homepage), here identical to dataset URI (not required though)
- distributions: WCS only
- child datasets linked via dct:hasPart
- child:
- URI: /lc/YYYY
- distributions: netcdf
- link to root dataset via dct:isPartOf
- correct temporal extent metadata
- the same notes as for scenario 1 apply
Scenario 6: Soil measurement data "soil" (REST API + SPARQL endpoint)
- REST API (JSON + HTML):
- /soil/sites
- /soil/sites/SITE
- /soil/sites/SITE/tubes
- /soil/sites/SITE/tubes/TUBE
- /soil/sites/SITE/tubes/TUBE/YYYY-MM-DD (measurement data, not every day)
- SPARQL endpoint: /soil/sparql
- summary website (/soil) that links to the REST API
- have one DCAT dataset:
- URI: /soil
- link to summary website (foaf:homepage), here identical to dataset URI (not required though)
- distributions: SPARQL and entrypoint to REST API: /soil/sites
- it probably doesn't make much sense here to expose all sites as subdatasets, since there is no additional data that could be referenced (the REST API is already fully discoverable from its root)
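This single-dataset approach might be sketched as follows (prefixes omitted for brevity; marking the SPARQL distribution via `dct:conformsTo` pointing at the SPARQL 1.1 Protocol spec and the JSON media type for the REST entrypoint are assumptions):

```json
{
  "@id": "/soil",
  "@type": "dcat:Dataset",
  "foaf:homepage": { "@id": "/soil" },
  "dcat:distribution": [
    {
      "@type": "dcat:Distribution",
      "dcat:accessURL": { "@id": "/soil/sparql" },
      "dct:conformsTo": { "@id": "http://www.w3.org/TR/sparql11-protocol/" }
    },
    {
      "@type": "dcat:Distribution",
      "dcat:accessURL": { "@id": "/soil/sites" },
      "dcat:mediaType": "application/json"
    }
  ]
}
```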
- TODO