Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task: unify odc.stac.transform with odc.stac._eo3 #355

Closed
Kirill888 opened this issue Sep 7, 2021 · 6 comments
Closed

Task: unify odc.stac.transform with odc.stac._eo3 #355

Kirill888 opened this issue Sep 7, 2021 · 6 comments

Comments

@Kirill888
Copy link
Member

Problem Description

Module odc.stac.transform (previously odc.index.stac) was developed before pystac was available and before STAC 1.0 was finalized. It's purpose is to translate a STAC document to EO3 compatible Dataset definition document suitable for indexing to datacube.

There are some issues with the current implementation that I would like to address

Module odc.stac._eo3 does similar thing, except the goal was to produce Dataset objects suitable for calling dc.load with, rather than a yaml document suitable for indexing. As such it's missing some of the capabilities required by odc.stac.transform, such as deterministic UUID generation, lineage extraction, region code and other metadata massaging.

Sub-tasks

  • Implement deterministic UUID generation in odc.stac._eo3 (currently using random)
    1. Use .id as is if it contains UUID (DEA datasets)
    2. Use .id + .collection_id + (optional other fields configured by user per collection) to compute deterministic UUID
  • Add support for remapping properties during dataset construction
  • Switch odc.stac.transform to use odc.stac.stac2ds possibly with some further metadata tweaking post conversion (region code, product href, lineage)

Note that deterministic UUIDs have potential to benefit stac_load as well when used with Dask. Non-remapping of properties is probably a bug, as time ranges are probably broken currently: EO3 metadata looks for old names for end_datetime,start_datetime.

CC: @alexgleith @gadomski

Kirill888 added a commit that referenced this issue Sep 7, 2021
Unless random UUIDs are requested generate the same UUID for STAC items with the
same id from the same collection.
Kirill888 added a commit that referenced this issue Sep 7, 2021
Unless random UUIDs are requested generate the same UUID for STAC items with the
same id from the same collection.
Kirill888 added a commit that referenced this issue Sep 7, 2021
EO3 uses some pre-STAC 1.0 properties to lookup some keys, this is particularly
important for star/end dates.
Kirill888 added a commit to Kirill888/odc-tools that referenced this issue Sep 7, 2021
Unless random UUIDs are requested generate the same UUID for STAC items with the
same id from the same collection.
Kirill888 added a commit to Kirill888/odc-tools that referenced this issue Sep 7, 2021
EO3 uses some pre-STAC 1.0 properties to lookup some keys, this is particularly
important for star/end dates.
Kirill888 added a commit that referenced this issue Sep 7, 2021
Unless random UUIDs are requested generate the same UUID for STAC items with the
same id from the same collection.
Kirill888 added a commit that referenced this issue Sep 7, 2021
EO3 uses some pre-STAC 1.0 properties to lookup some keys, this is particularly
important for star/end dates.
Kirill888 added a commit that referenced this issue Sep 7, 2021
Unless random UUIDs are requested generate the same UUID for STAC items with the
same id from the same collection.
Kirill888 added a commit that referenced this issue Sep 7, 2021
EO3 uses some pre-STAC 1.0 properties to lookup some keys, this is particularly
important for star/end dates.
@Kirill888
Copy link
Member Author

progress so far

UUID generation is now deterministic and sufficiently configurable, default UUID resolution goes like this

  1. Check if id is already in UUID format, in which case use that (this is for ODC generated datasets)
  2. Use Item id and collection_id fields from STAC to generate deterministic UUID
  3. [optional] user can configure extra fields that should be included as well, those must come from properties section of the Item

Property name remapping to match eo3 is done.

still to do

  • customizing product name (easy to add)
  • deal with lineage in ODC format only (should not be too complicated)
  • understand what difference are in documents produced by stac/_eo3.py vs stac/transform.py, and what impact that has
    • for example deterministic UUID produced by _eo3.py will be different from what is done in transform.py, does it matter (@alexgleith)?
    • Handling of "brokenness" in STAC items, for example E84 Sentinel-2 on AWS currently doesn't specify proj extension using STAC 1.0 syntax and so seen by _eo3.py as not having proj data, but transform.py looks for proj data inside STAC item dict without checking extension so it can access it.
  • Will probably need more customization hooks to support behaviour of transform.py

@Kirill888 Kirill888 changed the title Task: unfiy odc.stac.transform with odc.stac._eo3 Task: unify odc.stac.transform with odc.stac._eo3 Oct 4, 2021
@Kirill888
Copy link
Member Author

Another missing feature in _eo3.py is "common prefix extraction" for assets. For data loading purposes it's fine to just use "absolute path" to refer to files, but for indexing a "relocatable" representation is desired. This involves translating asset locations into a "relative" representation where possible, it's possible when all assets of interest reside under common prefix.

@alexgleith
Copy link
Contributor

Having different deterministic UUIDs will break things for me, yeah. If I've indexed Sentinel-2 or Landsat 8 data, which doesn't come with a UUID already, I'm relying on that being consistent to know if it's already in the DB.

What's wrong with the current deterministic ID?

@Kirill888
Copy link
Member Author

It's not generic enough, special rules for specific product names:

if _check_valid_uuid(input_stac["id"]):
deterministic_uuid = input_stac["id"]
else:
if product_name in ["s2_l2a"]:
deterministic_uuid = str(
odc_uuid("sentinel-2_stac_process", "1.0.0", [dataset_id])
)
else:
deterministic_uuid = str(
odc_uuid(f"{product_name}_stac_process", "1.0.0", [dataset_id])
)

Also I don't think using sources= is appropriate, sources are for lineage, stac.id is not lineage, it is the dataset.

I guess we can allow user-supplied uuid generation function, in here:

def _compute_uuid(
item: pystac.Item, mode: str = "auto", extras: Optional[Sequence[str]] = None
) -> uuid.UUID:
if mode == "native":
return uuid.UUID(item.id)
if mode == "random":
return uuid.uuid4()
assert mode == "auto"
# 1. see if .id is already a UUID
try:
return uuid.UUID(item.id)
except ValueError:
pass
# 2. .id, .collection_id, [extras]
_extras = (
{} if extras is None else {key: item.properties.get(key, "") for key in extras}
)
return odc_uuid(item.collection_id, "stac", [], stac_id=item.id, **_extras)

@alexgleith
Copy link
Contributor

I guess having a user-defined function for UUID generation is a good workaround. It'll need to be added as a parameter for the dc_tools suite.

That can happen later, though.

And the impact isn't important, because I can keep running old code from old docker images to index with.

@Kirill888
Copy link
Member Author

code has been moved into apps,
odc-stats was refactored to use stac2ds

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants