Task: unify `odc.stac.transform` with `odc.stac._eo3` #355

Kirill888 · 2021-09-07T01:06:18Z

Problem Description

Module odc.stac.transform (previously odc.index.stac) was developed before pystac was available and before STAC 1.0 was finalized. Its purpose is to translate a STAC document into an EO3-compatible Dataset definition document suitable for indexing into datacube.

There are some issues with the current implementation that I would like to address:

- hard-coded assumptions about some collections of interest
- direct access into the STAC document to extract information (using pystac would be more robust)
- doesn't work with non-proj STAC items (see "The STAC transform should allow items with no Proj information" #297)

Module odc.stac._eo3 does a similar thing, except that its goal is to produce Dataset objects suitable for passing to dc.load, rather than a YAML document suitable for indexing. As such, it is missing some of the capabilities required by odc.stac.transform, such as deterministic UUID generation, lineage extraction, region code handling and other metadata massaging.

Sub-tasks

- Implement deterministic UUID generation in odc.stac._eo3 (currently random)
  1. Use .id as is if it contains a UUID (DEA datasets)
  2. Use .id + .collection_id + (optionally, other fields configured by the user per collection) to compute a deterministic UUID
- Add support for remapping properties during dataset construction
- Switch odc.stac.transform to use odc.stac.stac2ds, possibly with some further metadata tweaking post conversion (region code, product href, lineage)

Note that deterministic UUIDs have the potential to benefit stac_load as well when used with Dask. The lack of property remapping is probably a bug, as time ranges are probably broken currently: EO3 metadata looks up end_datetime and start_datetime under their old names.

CC: @alexgleith @gadomski

Kirill888 · 2021-10-04T22:50:30Z

progress so far

UUID generation is now deterministic and sufficiently configurable. The default UUID resolution goes like this:

1. Check if id is already in UUID format, in which case use that (this is for ODC-generated datasets)
2. Use the Item id and collection_id fields from STAC to generate a deterministic UUID
3. [optional] The user can configure extra fields to include as well; those must come from the properties section of the Item
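
The resolution order above can be sketched roughly as follows. This is a minimal illustration and not the code in _eo3.py: it uses uuid.uuid5 from the standard library in place of odc_uuid, and the function name sketch_uuid and the namespace value are made up for the example, so the UUIDs it produces will not match the real ones.

```python
import uuid
from typing import Any, Mapping, Optional, Sequence

# Hypothetical namespace, used only by this sketch (odc_uuid has its own scheme).
_SKETCH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/stac-sketch")


def sketch_uuid(
    item_id: str,
    collection_id: str,
    properties: Optional[Mapping[str, Any]] = None,
    extras: Optional[Sequence[str]] = None,
) -> uuid.UUID:
    """Resolve a dataset UUID for a STAC item in a deterministic way."""
    # 1. If the id already parses as a UUID, use it unchanged.
    try:
        return uuid.UUID(item_id)
    except ValueError:
        pass

    # 2. Otherwise hash the collection id and the item id ...
    props = properties or {}
    parts = [f"collection={collection_id}", f"id={item_id}"]

    # 3. ... plus any user-configured extra fields taken from `properties`.
    #    Keys are sorted so that the configured order does not change the result.
    for key in sorted(extras or []):
        parts.append(f"{key}={props.get(key, '')}")

    return uuid.uuid5(_SKETCH_NAMESPACE, "\n".join(parts))
```

The same (id, collection, extras) input always yields the same UUID, which is the property that matters for checking whether an item has already been indexed.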

Property name remapping to match eo3 is done.
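
For illustration, the remapping amounts to a key rename over the Item properties. The sketch below is not the actual table used in _eo3.py: the dtr:-prefixed target names are my assumption about what EO3 looks for, and remap_properties is a hypothetical helper name.

```python
from typing import Any, Dict, Mapping

# Assumed rename table for this sketch only; the real one may differ or be longer.
SKETCH_RENAMES: Dict[str, str] = {
    "start_datetime": "dtr:start_datetime",
    "end_datetime": "dtr:end_datetime",
}


def remap_properties(
    props: Mapping[str, Any],
    renames: Mapping[str, str] = SKETCH_RENAMES,
) -> Dict[str, Any]:
    """Return a copy of `props` with keys renamed; unknown keys pass through."""
    return {renames.get(key, key): value for key, value in props.items()}
```

Keys without an entry in the table are left untouched, so the remapping is safe to apply to any Item.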

still to do

- customizing product name (easy to add)
- deal with lineage in ODC format only (should not be too complicated)
- understand what the differences are between documents produced by stac/_eo3.py and stac/transform.py, and what impact they have
  - for example, the deterministic UUID produced by _eo3.py will be different from the one produced by transform.py; does it matter (@alexgleith)?
  - handling of "brokenness" in STAC items: for example, E84 Sentinel-2 on AWS currently doesn't declare the proj extension using STAC 1.0 syntax, so _eo3.py sees it as not having proj data, whereas transform.py looks for proj data inside the STAC item dict without checking the extension, so it can still access it
- will probably need more customization hooks to support the behaviour of transform.py
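
To make the proj difference concrete, here is a small sketch of the two detection strategies on a plain dict. It is not code from either module: the function names are hypothetical, and the sample item is only shaped like the case described (an extension listed by short name, which is an assumption about how those items look).

```python
from typing import Any, Mapping

# STAC 1.0 style extensions are declared by schema URL.
PROJ_SCHEMA_PREFIX = "https://stac-extensions.github.io/projection/"


def declares_proj(item: Mapping[str, Any]) -> bool:
    """Strict check: proj counts only if declared via its schema URL."""
    extensions = item.get("stac_extensions", [])
    return any(str(ext).startswith(PROJ_SCHEMA_PREFIX) for ext in extensions)


def has_proj_fields(item: Mapping[str, Any]) -> bool:
    """Lenient check: any proj:* key in properties counts, declared or not."""
    return any(key.startswith("proj:") for key in item.get("properties", {}))


# Hypothetical item: proj fields are present, but the extension is not
# declared with a STAC 1.0 schema URL.
item = {
    "stac_extensions": ["eo", "proj"],
    "properties": {"proj:epsg": 32630},
}
```

For this item the strict check fails while the lenient one succeeds, which is the gap a customization hook would need to cover.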

Kirill888 · 2021-10-05T01:19:00Z

Another missing feature in _eo3.py is "common prefix extraction" for assets. For data loading purposes it's fine to just use an absolute path to refer to files, but for indexing a "relocatable" representation is desired. This involves translating asset locations into a relative representation, which is possible when all assets of interest reside under a common prefix.
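
A rough sketch of what such prefix extraction could look like, assuming asset locations are absolute URLs with a scheme and no query string. The name relocatable_assets is hypothetical, and this is not how transform.py does it.

```python
import os.path
from typing import Dict, Mapping, Optional, Tuple
from urllib.parse import urlsplit


def relocatable_assets(assets: Mapping[str, str]) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (base, locations) with locations relative to base where possible.

    When the assets do not share a scheme and host, base is None and the
    locations are returned unchanged (absolute).
    """
    if not assets:
        return None, {}

    urls = list(assets.values())
    roots = {(urlsplit(u).scheme, urlsplit(u).netloc) for u in urls}
    if len(roots) != 1:
        return None, dict(assets)

    scheme, netloc = next(iter(roots))
    root = f"{scheme}://{netloc}/"

    # commonprefix works per character, so cut back to the last "/" to get
    # a directory-like base rather than a partial file name.
    prefix = os.path.commonprefix(urls)
    base = prefix[: prefix.rfind("/") + 1]
    if not base.startswith(root):
        return None, dict(assets)

    return base, {name: url[len(base):] for name, url in assets.items()}
```

For two assets under s3://bucket/a/b/ this gives that directory as the base and bare file names as the locations, so the dataset document stays valid if the whole directory is moved.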

alexgleith · 2021-10-05T03:23:05Z

Having different deterministic UUIDs will break things for me, yeah. If I've indexed Sentinel-2 or Landsat 8 data, which doesn't come with a UUID already, I'm relying on that being consistent to know if it's already in the DB.

What's wrong with the current deterministic ID?

Kirill888 · 2021-10-05T03:48:32Z

It's not generic enough; it has special rules for specific product names:

odc-tools/libs/stac/odc/stac/transform.py

Lines 293 to 303 in 6c7a8bf

    
    if _check_valid_uuid(input_stac["id"]):
        deterministic_uuid = input_stac["id"]
    else:
        if product_name in ["s2_l2a"]:
            deterministic_uuid = str(
                odc_uuid("sentinel-2_stac_process", "1.0.0", [dataset_id])
            )
        else:
            deterministic_uuid = str(
                odc_uuid(f"{product_name}_stac_process", "1.0.0", [dataset_id])
            )

Also, I don't think using sources= is appropriate: sources are for lineage, and stac.id is not lineage, it is the dataset itself.

I guess we can allow a user-supplied UUID generation function, in here:

odc-tools/libs/stac/odc/stac/_eo3.py

Lines 478 to 497 in 6c7a8bf

    
    def _compute_uuid(
        item: pystac.Item, mode: str = "auto", extras: Optional[Sequence[str]] = None
    ) -> uuid.UUID:
        if mode == "native":
            return uuid.UUID(item.id)
        if mode == "random":
            return uuid.uuid4()
        assert mode == "auto"
        # 1. see if .id is already a UUID
        try:
            return uuid.UUID(item.id)
        except ValueError:
            pass
        # 2. .id, .collection_id, [extras]
        _extras = (
            {} if extras is None else {key: item.properties.get(key, "") for key in extras}
        )
        return odc_uuid(item.collection_id, "stac", [], stac_id=item.id, **_extras)
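
One possible shape for such a hook, sketched on plain dicts with standard library UUIDs only. All names here (UuidFn, resolve_uuid, default_uuid, legacy_style_uuid) are hypothetical and none of this exists in the codebase; in particular, legacy_style_uuid does not reproduce the UUIDs of transform.py, which would require calling odc_uuid the same way transform.py does.

```python
import uuid
from typing import Any, Callable, Mapping, Optional

# A user-supplied generator takes the item and returns its UUID.
UuidFn = Callable[[Mapping[str, Any]], uuid.UUID]

_NS = uuid.NAMESPACE_URL  # arbitrary namespace for this sketch


def default_uuid(item: Mapping[str, Any]) -> uuid.UUID:
    """Stand-in for the built-in behaviour: hash collection and id."""
    return uuid.uuid5(_NS, f"{item['collection']}/{item['id']}")


def legacy_style_uuid(item: Mapping[str, Any]) -> uuid.UUID:
    """Example user function that ignores the collection, as older code might."""
    return uuid.uuid5(_NS, f"legacy_stac_process/{item['id']}")


def resolve_uuid(item: Mapping[str, Any], uuid_fn: Optional[UuidFn] = None) -> uuid.UUID:
    """Use the user-supplied function when given, otherwise the default."""
    fn = default_uuid if uuid_fn is None else uuid_fn
    return fn(item)
```

With this shape, anyone who depends on previously generated UUIDs can pass in a function that keeps producing them, while everyone else gets the generic default.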

alexgleith · 2021-10-05T04:58:29Z

I guess having a user-defined function for UUID generation is a good workaround. It'll need to be added as a parameter for the dc_tools suite.

That can happen later, though.

And the impact isn't important, because I can keep running old code from old docker images to index with.

Kirill888 · 2021-10-12T05:17:16Z

code has been moved into apps,
odc-stats was refactored to use stac2ds

Kirill888 added a commit that referenced this issue Sep 7, 2021

Generate UUIDs in deterministic fashion (#355)

05efb94

Unless random UUIDs are requested generate the same UUID for STAC items with the same id from the same collection.

Kirill888 added a commit that referenced this issue Sep 7, 2021

Rename some properties to conform to EO3 expectations (#355)

719751b

EO3 uses some pre-STAC 1.0 properties to look up some keys; this is particularly important for start/end dates.

Kirill888 changed the title ~~Task: unfiy odc.stac.transform with odc.stac._eo3~~ Task: unify odc.stac.transform with odc.stac._eo3 Oct 4, 2021

Kirill888 mentioned this issue Oct 5, 2021

Task: extract odc-stac package into a separate repo #415

Closed

Kirill888 closed this as completed Oct 12, 2021
