ODC EP 013 Index Driver API cleanup
This EP is a proposal for a cleanup and rationalisation of the Index Driver API (i.e. the API that a new index driver is required to implement).
It also details how backwards incompatibility and migration will be handled from 1.8 through 1.9 to 2.0.
Paul Haesler (@SpacemanPaul)
- In draft
- Under Discussion
- In Progress
- Completed will be in 1.9.0
- Rejected
- Deferred
The index driver API has evolved organically over time, mostly in an environment where there was only one index driver implementing it.
Now that there are multiple index drivers (and vague plans for more), the technical debt accrued during this ad hoc growth and evolution is starting to present unnecessary obstacles to both the development of future index drivers and the maintenance of existing drivers.
The aim of this EP is to simplify and minimise the effort required to implement a new index driver, and to allow the codebases of existing index drivers to be cleaned up and simplified.
Wherever possible, new methods will be introduced and old methods deprecated in 1.9.x releases, with deprecated methods removed in 2.0.x releases. Backwards compatibility between 1.8.x and 1.9.x releases will be preserved where possible (apart from deprecation warnings).
In 1.8, `AbstractIndexDriver` defines two abstract methods:
- `connect_to_index`: Simply calls `from_config()` from the driver's AbstractIndex implementation.
- `metadata_type_from_doc`: Builds an unpersisted MetadataType model from an MDT document (i.e. a dictionary). Essentially a duplicate of the `from_doc()` method on the Metadata Resource (see below).
Proposal:
- `index_class`: New abstract method, returns the driver's AbstractIndex implementation. (1.9)
- `connect_to_index`: No longer abstract. Calls `self.index_class().from_config(...)` directly. (1.9)
- `metadata_type_from_doc`: Deprecate in 1.9, remove in 2.0 - recommend migration to `index.metadata_types.from_doc()`.
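The proposal above can be sketched roughly as follows. This is only an illustration: `ExampleIndex` and `ExampleIndexDriver` are hypothetical names, the base classes are simplified stand-ins for the real ones, and the single `config_env` argument is an assumption standing in for whatever `from_config(...)` actually takes.

```python
from abc import ABC, abstractmethod


class AbstractIndex(ABC):
    """Simplified stand-in: only from_config() is modelled."""

    @classmethod
    @abstractmethod
    def from_config(cls, config_env):
        """Build an index from a configuration environment."""


class AbstractIndexDriver(ABC):
    @abstractmethod
    def index_class(self):
        """Return the driver's AbstractIndex implementation (a class)."""

    def connect_to_index(self, config_env):
        # No longer abstract: delegate to whatever index_class() returns.
        return self.index_class().from_config(config_env)


class ExampleIndex(AbstractIndex):  # hypothetical index implementation
    def __init__(self, config_env):
        self.config_env = config_env

    @classmethod
    def from_config(cls, config_env):
        return cls(config_env)


class ExampleIndexDriver(AbstractIndexDriver):  # hypothetical driver
    def index_class(self):
        return ExampleIndex


index = ExampleIndexDriver().connect_to_index("example_env")
```

With this shape, a new driver only has to name its index class; the connection logic lives once in the base class.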
`AbstractIndex` defines a set of boolean flags which implementations can override to specify which parts of the API they support.
The supports flags are relatively recent (introduced in 1.8.8, October 2022) and are only relevant to users working with different index drivers and developers of new drivers. Strict backwards compatibility is therefore not a driving concern in this case, but backwards incompatible changes are noted.
The basic concept seems sound, but this is an opportunity to clean up and formalise.
In 1.8, some flags default to True and some to False, and implementing indexes only have to explicitly set the flags that differ from the default.
From 1.9, all flags will default to False. All index implementations must explicitly set flags for all features they support.
These flags indicate which metadata types the index supports. `supports_vector` is a new addition, the rest already exist in 1.8 - e.g. this is how the postgis driver advertises that it only supports EO3 compatible metadata.
- `supports_legacy`: supports legacy (non-eo3) ODC metadata types (e.g. eo, telemetry).
- `supports_eo3`: supports eo3 compatible metadata types.
- `supports_nongeo`: supports non-geospatial metadata types (e.g. telemetry). No dependency on `supports_legacy`, to allow for future non-geospatial metadata types with eo3 style flattened metadata.
- `supports_vector`: supports geospatial non-raster metadata types. Reserved for future use.
These flags indicate which database/storage capabilities the index supports:
- `supports_write`: Supports methods like add, remove and update. E.g. an index driver providing access to a STAC API would set this to False.
- `supports_persistence`: Supports persistent storage. Storage writes from previous instantiations will persist into future ones - e.g. the in-memory driver supports write but does not support persistence. Requires `supports_write`.
- `supports_transactions`: Supports database transactions - e.g. the in-memory driver does not support transactions.
- `supports_spatial_indexes`: Supports the creation of per-CRS spatial indexes - e.g. the postgis driver supports spatial indexes.
Note the backwards incompatible change from 1.8: from 1.9, the 1.8 `supports_persistence` flag is renamed `supports_write`, and a new `supports_persistence` flag with a slightly different interpretation is introduced.
This flag indicates whether the index supports the user management methods exposed by `index.users`.
- `supports_users`: Supports database user management, e.g. an SQLite index driver would not support users.
This flag is new in 1.9.
These flags indicate if and how the index driver supports dataset lineage.
- `supports_lineage`: Supports some kind of lineage storage - either legacy style (with `source_filter` option in queries), or external lineage, as per EP-08.
- `supports_external_lineage`: If true, supports EP-08 style external lineage API. Requires `supports_lineage`.
- `supports_external_home`: If true, supports external home lineage data, as per EP-08. Requires `supports_external_lineage`.
In 1.8, there is a `supports_source_filters` flag. This is removed in 1.9 as it is equivalent to `supports_lineage and not supports_external_lineage`.
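As a rough illustration of the "everything defaults to False" rule and the "Requires" relationships between flags, a sketch might look like the following. `SupportsFlags`, `check_flags`, `WritableOnlyIndex` and `BrokenIndex` are hypothetical names introduced here; this EP does not propose a validation helper, it only states the dependencies.

```python
class SupportsFlags:
    """Stand-in for the flag section of AbstractIndex: all default to False."""

    supports_legacy = False
    supports_eo3 = False
    supports_nongeo = False
    supports_vector = False
    supports_write = False
    supports_persistence = False
    supports_transactions = False
    supports_spatial_indexes = False
    supports_users = False
    supports_lineage = False
    supports_external_lineage = False
    supports_external_home = False


# (flag, flag it requires), taken from the "Requires" notes on the flags.
FLAG_REQUIREMENTS = [
    ("supports_persistence", "supports_write"),
    ("supports_external_lineage", "supports_lineage"),
    ("supports_external_home", "supports_external_lineage"),
]


def check_flags(index_cls):
    """Hypothetical helper: list every violated flag dependency."""
    return [
        f"{flag} requires {needed}"
        for flag, needed in FLAG_REQUIREMENTS
        if getattr(index_cls, flag) and not getattr(index_cls, needed)
    ]


class WritableOnlyIndex(SupportsFlags):
    # Writable but not persistent, like the in-memory case described above.
    supports_write = True


class BrokenIndex(SupportsFlags):
    supports_external_lineage = True  # supports_lineage is still False
```

Here `check_flags(WritableOnlyIndex)` reports no problems, while `check_flags(BrokenIndex)` reports the missing `supports_lineage`.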
- The type signature of the `from_config()` class method changes to take an `ODCEnvironment` instead of a `LocalConfig` in 1.9, as per the new config API (see EP-10).
- Spatial index management methods are added in 1.9 (`create_spatial_index`, `update_spatial_index`, `drop_spatial_index`).
No changes proposed for User Resource API, except to make implementation optional by setting `supports_users` to False, as discussed above.
Lineage Resource is new in 1.9, see EP-08.
No changes proposed.
Proposed new method:
- `get_with_fields(field_names: Iterable[str]) -> Iterable[MetadataType]`: Returns all metadata types that have all the named search fields.
Note that the existing method of the same name in the product resource becomes a wrapper around this.
No other proposed changes.
- `get_with_fields(field_names: Iterable[str]) -> Iterable[Product]`: Implement in base class as a wrapper around `metadata_types.get_with_fields` above and `get_with_types` below.
- `get_with_types(types: Iterable[MetadataType]) -> Iterable[Product]`: Proposed new method. Can be implemented in the base class via `get_all()`.
- `get_field_names(product: Product | str | None = None) -> Iterable[str]`: Replaces the method of the same name in the dataset resource. Signature expanded to take a Product or a product name. Can be implemented in the base class.
- New methods (see extent methods section under dataset resource below):
  - `spatial_extent(product: Product | str, crs: CRS = CRS(4326)) -> Geometry`
  - `temporal_extent(product: Product | str) -> tuple[datetime.datetime, datetime.datetime]`
No other proposed changes.
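The wrapper relationship between the product and metadata type resources described above could be sketched as below. The `MetadataType` and `Product` dataclasses are heavily simplified stand-ins, the in-memory lists replace a real backend, and the metadata type and product names are invented for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class MetadataType:  # simplified stand-in
    name: str
    search_fields: frozenset


@dataclass(frozen=True)
class Product:  # simplified stand-in
    name: str
    metadata_type: MetadataType


class MetadataTypeResource:
    def __init__(self, mdts):
        self._mdts = list(mdts)

    def get_all(self) -> Iterator[MetadataType]:
        return iter(self._mdts)

    def get_with_fields(self, field_names: Iterable[str]) -> Iterator[MetadataType]:
        wanted = set(field_names)
        # Keep only metadata types that define every requested search field.
        return (m for m in self.get_all() if wanted <= m.search_fields)


class ProductResource:
    def __init__(self, metadata_types, products):
        self.metadata_types = metadata_types
        self._products = list(products)

    def get_all(self) -> Iterator[Product]:
        return iter(self._products)

    def get_with_types(self, types: Iterable[MetadataType]) -> Iterator[Product]:
        names = {t.name for t in types}
        return (p for p in self.get_all() if p.metadata_type.name in names)

    def get_with_fields(self, field_names: Iterable[str]) -> Iterator[Product]:
        # Pure wrapper: resolve matching metadata types, then their products.
        return self.get_with_types(self.metadata_types.get_with_fields(field_names))


mdt_a = MetadataType("mdt_a", frozenset({"time", "lat", "lon"}))
mdt_b = MetadataType("mdt_b", frozenset({"time"}))
products = ProductResource(
    MetadataTypeResource([mdt_a, mdt_b]),
    [Product("prod_a", mdt_a), Product("prod_b", mdt_b)],
)
matches = [p.name for p in products.get_with_fields(["time", "lat"])]
```

Because both product methods only rely on `get_all()` and the metadata type resource, a driver would get them for free from the base class.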
- `get_unsafe(id_: UUID | str, include_sources: bool = False) -> Dataset`: New method for consistency with the other Resource APIs. Raises a `KeyError` if the supplied id does not exist.
- `get(id: UUID, include_sources: bool = False) -> Dataset`: Implement in base class via `get_unsafe` above.
NOTE: The behaviour of `get(id_, include_sources=True)` differs based on whether the driver `supports_external_lineage`, as per EP-08. This will be implemented from 1.9.
The existing `has` method is unchanged.
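A minimal sketch of implementing `get` on top of `get_unsafe` in a base class follows. `DatasetResourceSketch` and its `_store` dictionary are hypothetical, and the sketch assumes `get` keeps returning None for an unknown id; `include_sources` is accepted but ignored here.

```python
from uuid import UUID, uuid4


class DatasetResourceSketch:
    """Hypothetical base class; _store stands in for a real backend."""

    def __init__(self):
        self._store = {}

    def get_unsafe(self, id_, include_sources=False):
        # Accept either a UUID or its string form, as in the proposed signature.
        key = id_ if isinstance(id_, UUID) else UUID(id_)
        return self._store[key]  # raises KeyError for an unknown id

    def get(self, id_, include_sources=False):
        # Assumed behaviour: swallow the KeyError and return None instead.
        try:
            return self.get_unsafe(id_, include_sources=include_sources)
        except KeyError:
            return None


resource = DatasetResourceSketch()
known_id = uuid4()
resource._store[known_id] = {"id": str(known_id)}
```

A driver would then only need to implement `get_unsafe`.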
- Bulk add method used by clone: `_add_batch()` - no changes proposed.
- Very old (1.8) bulk read methods: `bulk_get`, `bulk_has`. (Take iterables of IDs, return Datasets (or bools for `has`).)
- New bulk read methods used by clone: `get_all_docs_for_product` (`get_all_docs` calls `get_all_docs_for_product`, returns tuples of: Product, document, uris - but does not assemble them into Datasets)
- Old bulk read method used to "archive all (active datasets)" and "restore all (archived datasets)" and "purge all (archived datasets)": `get_all_dataset_ids()` (Returns IDs only)
Propose:
- Deprecate `get_all_dataset_ids` from 1.9 and remove in 2.0 (recommend migrating to `search_returning`).
- `get_derived(id_)`: Deprecate in 1.9, remove in 2.0 (superseded by EP08 Lineage API).
- `get_locations()`, `get_archived_locations()`, `get_archived_location_times()`
- `add_location()`
- `get_datasets_for_location()`
- `remove_location()`, `archive_location()`, `restore_location()`
These methods are an obvious symptom of the complexity introduced by supporting multiple locations. I'm not aware of anyone actually using multiple locations (and I'm not 100% sure it would work correctly if you tried).
Propose deprecating most of these methods in 1.9 and removing them in 2.0, dropping support for multiple locations altogether. From 2.0, support a single location only, and only the following methods:
- `get_location(id_: str | UUID) -> str`: Replaces `get_locations`.
- `get_datasets_for_location(uri, mode=None)`: Keep around for now. Once multiple location support is dropped, we can move this functionality into the search methods.
Note that location can already be updated with the `datasets.update()` method.
The `DatasetTuple` will magically support both `uri: str` and `uris: Sequence[str]` for the final argument in 1.9, and revert to `uri` only in 2.0. The postgis driver may drop support for multiple locations before 2.0.
The `local_uri` and `local_path` will be kept. After support for multiple locations is dropped, their behaviour will naturally degrade to:
- `local_uri()` will return the uri if it is a local (`file:`) uri, or None if it is an external URI.
- `local_path()` will return the uri as a local file path, or None if the uri is an external URI.
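The single-location behaviour described above can be sketched with the standard library. These are free functions written for illustration only (in the real code they belong to the dataset model), and treating "local" as "scheme is `file`" is an assumption based on the description.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def local_uri(uri):
    """Return uri if it uses the file: scheme, otherwise None."""
    if uri is None:
        return None
    return uri if urlparse(uri).scheme == "file" else None


def local_path(uri):
    """Return a file: uri as a local Path, or None for an external uri."""
    if local_uri(uri) is None:
        return None
    # url2pathname converts the URL path into the platform's path syntax.
    return Path(url2pathname(urlparse(uri).path))
```

For example, `local_path("s3://bucket/ds.yaml")` gives None, while a `file:` uri gives a `Path` on the local filesystem.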
The behaviour of dataset.update() will change also. Previously an update with a new location always added the location, keeping the old one. From 1.9 dataset.update() with a new location replaces the existing location (unless there are already multiple locations, or the updated dataset is passed in with multiple locations, in which case the current merge behaviour will persist until multiple location support is dropped in 2.0).
- `spatial_extent(ids: Iterable[UUID | str], crs: CRS | None = None) -> Geometry`: Only supported by a driver that `supports_spatial_indexes` (i.e. not supported by legacy driver)
- `get_product_time_bounds(product: Product) -> Tuple[datetime, datetime]`
Propose:
- New `temporal_extent()` method that takes a list of dataset IDs.
- New `spatial_extent()` and `temporal_extent()` methods on ProductResource that take a product id.
- Deprecate `get_product_time_bounds()` in 1.9 and remove in 2.0 - recommend the new ProductResource or DatasetResource `temporal_extent()` method.
NB. For boring technical reasons, the dataset version of the `temporal_extent` method is difficult to implement cleanly and efficiently in the postgres driver. This method may be left unimplemented in the postgres driver.
This is where things get messy. I'll try to keep it as clear as possible.
- ALL search methods only return active (non-archived) datasets - no documented way to include archived datasets.
- `search_by_metadata()`: Current typehint signature is incomplete - does not allow for nested metadata chunks to be passed in. Unlike all other search methods, this does NOT exclude archived datasets, behaviour which is neither consistent nor documented.
- `search_eager()`: Misleadingly named and useless. Simply calls `search()` and returns the result as a list - so depending on how you interpret "eager", it's either the exact opposite of eager, or no more eager than a regular search.
- `search_returning_datasets_light()`: Has some cool and interesting features but is poorly documented, has a design that is tightly coupled to the postgres index driver, and a complex implementation that violates the modularity established by the rest of the API. Furthermore I can't find any code anywhere that uses it. Propose deprecating in 1.9 and removing in 2.0.
- `search`, `search_by_product`, `search_returning`, `search_summaries`:
  - In both the postgres and postgis drivers, these are all implemented as wrappers around a common private method `_do_search_by_product()`. This performs a product search first, then separate dataset searches for each matching or partially matching product. This makes some sense in the context of the postgres driver, but is less useful for the postgis driver. It makes "eager" searching impossible - there will always be a significant delay before returning the first matching dataset.
  - `search_returning()` and `search_summaries()` are functionally very closely related - `search_summaries()` is basically a special case of `search_returning()` with a different return format.
  - Despite all these methods being wrappers around the same function, special arguments are exposed inconsistently, being offered arbitrarily by some methods but not others.
  - `search()` nominally supports "source filters" (i.e. "find datasets derived from datasets that match these filters"). This is not supported by a driver that `supports_external_lineage` (like postgis), as per EP-08.
- Update typehints of `search_by_metadata()` method to reflect actual behaviour.
- Update documentation of `search()` method to say that results are not guaranteed to be sorted/grouped by product. This frees up the postgis driver to perform a more efficient direct (and eager) search in future.
- Make `field_names` argument to `search_returning()` optional - default is all search fields.
- Deprecate `search_eager()` in 1.9 and remove in 2.0 - suggest `search(..., fetch_all=True)` - or simply wrapping `list()` around the result.
- Deprecate `search_summaries()` in 1.9 and remove in 2.0 - suggest migration to `search_returning()`.
- Add `archived: bool | None = False` argument to ALL search methods. False = return active datasets only (default - on all methods), True = return archived datasets only, None = return both active and archived datasets.
- Add `custom_offsets` argument (as per `search_returning_datasets_light()`) to `search_returning()`.
- Add `order_by: str | Field | None = None` argument to `search_returning()`. None will mean unsorted. Postgres driver will leave unsupported. Postgis driver should be able to bypass the partial product search and start returning results immediately if `order_by` and `custom_offsets` are both None.
- Add `fetch_all: bool = False` argument to all search methods. True returns results as a list, False (default) returns a generator.
- Deprecate `search_returning_datasets_light()` in 1.9 and remove in 2.0 - suggest migration to `search_returning()`.
- Note that most other search methods can be trivially reimplemented as wrappers around the new expanded `search_returning()` method - the abstract base class will offer this as the default implementation (and the postgis driver will take advantage of it).
- Remove all internal usages in core of all deprecated methods, etc. This will have some backwards incompatible side-effects:
  - In 1.8 the CLI command `datacube dataset search` calls `search_summaries()`. From 1.9 it will call `search_returning()`. These behave identically if there is one active location per dataset, however the way datasets with multiple active locations (or no active locations) are returned will change from 1.9 (1.8: one row per active uri, 1.9: one row per dataset). Note that multiple locations are deprecated in 1.9.
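The intended meaning of the tri-state `archived` argument and the `fetch_all` argument can be illustrated with a small sketch. `Row`, `filter_archived` and `search_sketch` are hypothetical names; a real driver would apply the archived filter in its database query rather than in Python.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Row(NamedTuple):  # hypothetical stand-in for a search result
    id: str
    archived: bool


def filter_archived(rows: Iterable[Row], archived: Optional[bool] = False) -> Iterator[Row]:
    """False: active only (default). True: archived only. None: both."""
    for row in rows:
        if archived is None or row.archived == archived:
            yield row


def search_sketch(rows, archived=False, fetch_all=False):
    results = filter_archived(rows, archived)
    # fetch_all=True materialises a list; otherwise the generator is returned.
    return list(results) if fetch_all else results


rows = [Row("a", False), Row("b", True)]
active_only = [r.id for r in search_sketch(rows)]
archived_only = [r.id for r in search_sketch(rows, archived=True)]
everything = search_sketch(rows, archived=None, fetch_all=True)
```

Keeping the default at False preserves the 1.8 behaviour of returning active datasets only.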
Add new `archived` argument (as per search) to all count methods.
No changes proposed for `count()` or `count_by_product()`.
`count_product_through_time()` and `count_by_product_through_time()` are closely related (as their confusingly similar names suggest). The latter returns counts by time-range per product (`Iterable[Tuple[Product, Tuple[Range, int]]]`). The former dispenses with the product grouping (`Iterable[Tuple[Range, int]]`) AND enforces that the query only includes datasets for one product. Propose deprecating `count_product_through_time()` in 1.9 (and recommending migrating to `count_by_product_through_time()`) and removing in 2.0.
New method `count_by(fields: Iterable[str|Field], custom_offsets: Mapping[str, Offset] | None = None, **query: QueryField) -> Iterable[Tuple[Tuple, int]]`
The `Tuple[Tuple, int]` is a tuple containing a named tuple with the requested fields and/or custom-offset values, and the relevant counts. `count` and `count_by_product` can then be reimplemented as wrappers around `count_by` in the base class.
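To show how `count` and `count_by_product` might reduce to wrappers around `count_by`, here is a sketch over plain dictionaries. It omits `custom_offsets` and the `**query` filtering from the proposed signature, takes the datasets as an explicit argument, and uses invented record contents, so it illustrates the grouping idea rather than the proposed API.

```python
from collections import Counter, namedtuple


def count_by(datasets, fields):
    """Group dict-like dataset records by the given fields and count them."""
    fields = list(fields)
    Key = namedtuple("Key", fields)  # named tuple holding the grouped values
    counts = Counter(Key(*(ds[f] for f in fields)) for ds in datasets)
    return counts.items()


def count(datasets):
    # Grouping by no fields yields at most one group, keyed by an empty tuple.
    return sum(n for _, n in count_by(datasets, []))


def count_by_product(datasets):
    return [(key.product, n) for key, n in count_by(datasets, ["product"])]


datasets = [
    {"product": "prod_a", "platform": "x"},
    {"product": "prod_a", "platform": "y"},
    {"product": "prod_b", "platform": "x"},
]
per_product = count_by_product(datasets)
```

Each key returned by `count_by` is a named tuple, so callers can read grouped values by field name, as in `key.product` above.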
No changes are proposed to the following classes of methods:
- atomic write (`add`, `update`, `archive`, `restore`, `purge`);
- update support (`can_update`)
The following method will be deprecated in 1.9 and removed in 2.0 as it is replaced by a method of the same name on the product resource (see above):
- `get_field_names()`