
use cf-xarray heuristic to identify bound variables. Add test #54

Merged (4 commits) on Feb 29, 2024

Conversation

huard (Collaborator) commented on Feb 27, 2024

@dchandan Replace your inference logic with the same logic used in cf-xarray.

Added test
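For context, cf-xarray's heuristic for bounds variables follows the CF convention: a variable is treated as a bounds variable when another variable's `bounds` attribute names it. The sketch below is illustrative, not the actual cf-xarray or STACpopulator code; the function name and the attribute-dict input format are assumptions.

```python
# Hypothetical sketch of the CF-convention heuristic cf-xarray relies on:
# a variable counts as a bounds variable if some other variable's
# `bounds` attribute points at it.

def find_bounds_variables(variables: dict[str, dict]) -> set[str]:
    """Return names of variables referenced by another variable's `bounds` attribute.

    `variables` maps variable name -> attribute dict, e.g.
    {"time": {"bounds": "time_bnds"}, "time_bnds": {}}.
    """
    bounds = set()
    for name, attrs in variables.items():
        target = attrs.get("bounds")
        if target is not None and target in variables:
            bounds.add(target)
    return bounds


# Example: `time_bnds` is declared as the bounds of the `time` coordinate.
meta = {
    "time": {"standard_name": "time", "bounds": "time_bnds"},
    "time_bnds": {},
    "tas": {"standard_name": "air_temperature"},
}
print(find_bounds_variables(meta))  # {'time_bnds'}
```

This keys entirely off the declared `bounds` attribute rather than guessing from variable-name patterns, which is the substitution the PR description asks for.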

fmigneault (Collaborator) left a comment:

Code looks fine.
Just an extra test that could be relevant.

tests/test_cmip6_datacube.py (comment resolved)
huard linked an issue on Feb 28, 2024 that may be closed by this pull request.
dchandan (Collaborator) left a comment:

Looks good. Just a minor suggestion, but not blocking.

STACpopulator/extensions/datacube.py (comment resolved)

```python
variables[name] = Variable(
    properties=dict(
        dimensions=meta["shape"],
        type=VariableType.AUXILIARY.value if self.is_coordinate(attrs) else VariableType.DATA.value,
        description=attrs.get("description", attrs.get("long_name", "")),
    )
)
```
Collaborator commented:

I think we should add a description to the bounds variable too. Something like "bounds for the time coordinate" as in #51. We can construct an appropriate description string in the if-elif-else block before this and use that here.
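The suggestion above could look something like the following sketch: derive the description of a bounds variable from the coordinate whose `bounds` attribute references it. The helper name and input format are hypothetical, not the actual STACpopulator code.

```python
# Illustrative sketch: build a description string such as
# "bounds for the time coordinate" for a bounds variable, by finding
# the coordinate that declares it via its `bounds` attribute.

def bounds_description(var_name: str, variables: dict[str, dict]) -> str:
    """Return a description for `var_name` if it is a bounds variable, else ""."""
    for coord, attrs in variables.items():
        if attrs.get("bounds") == var_name:
            return f"bounds for the {coord} coordinate"
    return ""


meta = {"time": {"bounds": "time_bnds"}, "time_bnds": {}}
print(bounds_description("time_bnds", meta))  # bounds for the time coordinate
```

A string built this way could then feed the `description=` field shown in the snippet above, falling back to the existing `attrs.get(...)` lookup for non-bounds variables.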

huard (Author) replied:

So I wanted to hear your thoughts on this.

  1. What's the rationale for "augmenting" the metadata attributes when creating the STAC items?
  2. Is the datacube extension logic the appropriate place to do this?
  3. Will it scale well to other missing information?
  4. Are we worried about a "surprise" factor coming from having information in the STAC item that does not reflect what people will find in the actual dataset?

Collaborator replied:

Good questions, here's what I think:

  1. I think the rationale would be that the STAC item should be as informative as reasonably possible. If, at the STAC item generation phase adding information that doesn't exist in the file metadata achieves that goal, then I don't see why not. In general, there is no principle/constraint requiring one-to-one correspondence between metadata in a STAC representation and that in the underlying asset. Here, we just use metadata from the CMIP6 files to guide the generation of STAC items, but even if the CMIP6 files were poorly constructed with relevant metadata, we could still generate the STAC items by programmatically adding the relevant metadata at item generation time.
  2. I really don't know what the purpose of the datacube extension is. I have never used it, and I don't envision using it. But from looking at the data it represents, it seems it just describes the dimensions and variables available in an asset, and if that is the case, then we can definitely use the extension to add a description to a variable that is listed by the extension.
  3. I don't see why not. Going back to point 1, I think it's all about including whatever one feels is necessary for describing the data variables and dimensions.
  4. No, I am not. Going back to point 1, there is no constraint about a one-to-one correspondence between metadata described in the STAC representation and that in an underlying file. Consider the case of some obscure dataset where the underlying .nc files do not have any metadata whatsoever (at the variable level or globally). There is nothing preventing me from appropriately describing the metadata in the STAC layer and making the dataset available via the catalog. In this case, none of the metadata in the STAC representation would be available in the file itself, and that is perfectly OK.

huard (Author) replied:

Ok, I'm happy to adopt your perspective on 1 and 4. I'm less convinced about 2 and 3, and I believe we should spend some time thinking about the "metadata augmentation architecture" before writing too much code. I'm happy to add the description you mentioned, but I don't think this should serve as a template for adding other kinds of information.

My discomfort comes from mixing process logic (NcML -> STAC) with thematic knowledge (metadata fixes) in the same code base. I think we can get the process logic fairly robust and maintenance should be minimal. The thematic knowledge will always grow with new datasets. Mixing both has the potential to generate maintenance work in the future.

huard merged commit 3bd6a40 into crim-ca:master on Feb 29, 2024.
3 checks passed.
huard deleted the fix-52 branch on February 29, 2024.
Successfully merging this pull request may close these issues.

Potentially incorrect representation of variables in data cube extension
3 participants