ODC EP 012 Standardising EO3 metadata format
The ODC originally supported an extremely open-ended and flexible family of metadata formats.
The "EO3" family of metadata formats was introduced around v1.8.0 to allow improved performance in indexing and loading, although many non-eo3 formats were still supported. Note that EO3 is still extensible in some ways and is more of a family of metadata formats than a single format.
However, the minimum requirements for a metadata format to be "EO3 compatible" have never been formally defined. In practice they are defined by Python code distributed across multiple repositories, most notably datacube-core and eodatasets.
This EP proposes the adoption of a formal standard for eo3 compatible metadata, and extensible tools for validating metadata against it.
Paul Haesler (@SpacemanPaul)
- In draft
- Under Discussion
- In Progress
- Completed
- Rejected
- Deferred
Support for non-EO3 datasets in datacube-core adds unnecessary complexity and makes it hard to introduce new features or modify existing features, impeding innovation. Without manually inspecting schemas and multiple functions across multiple repositories, it is not clear what constitutes an "eo3-compatible" dataset. There are no tools to validate whether a metadata type or product document is capable of working with an eo3-compatible dataset.
The most complete metadata validation toolset currently is eodatasets, which depends on datacube-core and so cannot be used by datacube-core for validating files.
Some validation code is duplicated across repositories, or sometimes within a repository, and the duplicated versions of a function sometimes behave inconsistently with each other.
There are undocumented differences between the external metadata documents indexed by the ODC and the metadata documents stored internally within the ODC index.
This all makes future changes or improvements to the ODC index layer and new features requiring new metadata much harder than they need to be.
I have forked the eodatasets repository to create a new eo3 repository (TODO: Maybe rename to odc-eo3 for naming consistency).
I have stripped from the new eo3 repo all validation of site/collection-specific metadata properties, and added validation for any elements that core was making assumptions about that were not being validated by eodatasets.
I have tightened the checks and validations in the eo3 repo to only pass eo3-compatible metadata. Legacy formats will fail validation.
I have tried to make the new validation code extensible so that eodatasets can be refactored to extend the core validation methods in eo3. It is also expected that datacube-core will have the eo3 repo as a dependency. This extensible validation API (along with several other portions of the repo) is still a work in progress.
Most importantly the eo3 repo includes formal definitions of the formats used by eo3-compatible metadata type, product and dataset documents, and it is these formal documents that are the main subject of this EP:
- Proposed EO3 Metadata-Type Document format standard
- Proposed EO3 Product Document format standard
- Proposed EO3 Dataset Document format standard
- Drop support for pre-EO3 non-geospatial (e.g. telemetry) metadata and datasets, with a pathway to potentially reintroduce as vector-only (i.e. non-raster) EO3 datasets at some point in the future.
- Document which parts of metadata type documents are either ignored by the ODC or enforced to have canonical values, and provide a pathway to removing these parts by v2.0
- Require that search fields defined in an EO3-compatible metadata type reference STAC-compatible property names (previously assumed/implied but optional in eodatasets validation).
- Migration pathway for unused fields: deprecate (and make optional where currently required) in v1.9, remove (i.e. forbid) in v2.0.
- Search fields are mostly restricted to flat entries under properties. Geotemporal search fields (lat, lon, crs, time) are grandfathered in as limited exceptions, with a path to their removal from the metadata type (with geotemporal search and metadata being handled at the model and index layer APIs).
- Dataset Types are formally renamed to Products.
- The storage section is officially deprecated in v1.9 and removed in v2.0 (In favour of load)
- The managed field is deprecated (as ingestion is deprecated) in v1.9 and will be removed in v2.0
- Documented the undocumented, then formally specified it (load, storage, flags definitions, etc)
- Resolve the ambiguous location/locations field, standardising the behaviour in core over the assumptions in eodatasets (locations can be either a single location or a list, location to be deprecated and removed.)
- Documented the undocumented, then formally specified it.
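The location/locations resolution described above can be sketched as a small normalisation step. This is a minimal illustration, not code from datacube-core or the eo3 repo: the function name is hypothetical, and rejecting documents that set both fields is an assumption made for the example.

```python
import warnings
from typing import Any, List, Mapping


def normalise_locations(doc: Mapping[str, Any]) -> List[str]:
    """Return the dataset's locations as a list of URI strings."""
    if "location" in doc and "locations" in doc:
        # Assumption for this sketch: setting both fields is an error.
        raise ValueError("dataset document sets both 'location' and 'locations'")

    if "location" in doc:
        # 'location' is the field slated for deprecation and removal.
        warnings.warn(
            "'location' is deprecated; use 'locations'",
            DeprecationWarning,
            stacklevel=2,
        )
        value = doc["location"]
    else:
        value = doc.get("locations", [])

    # 'locations' may be either a single location or a list of locations.
    if isinstance(value, str):
        return [value]
    return list(value)
```

Normalising to a list at the boundary means downstream code only ever has to handle one shape, whichever form the document used.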
- Paul Haesler (@SpacemanPaul)