
use cf-xarray heuristic to identify bound variables. Add test #54

Merged (4 commits) on Feb 29, 2024

Conversation

huard (Collaborator) commented on Feb 27, 2024

@dchandan Replace your inference logic with the same logic used in cf-xarray.

Added test
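For context, cf-xarray's heuristic for bounds variables follows the CF convention: a variable is treated as a bounds variable when another variable's `bounds` attribute names it. The sketch below is illustrative, not the actual cf-xarray or STACpopulator code; the function name and the attribute-dict input format are assumptions.

```python
# Hypothetical sketch of the CF-convention heuristic cf-xarray relies on:
# a variable counts as a bounds variable if some other variable's
# `bounds` attribute points at it.

def find_bounds_variables(variables: dict[str, dict]) -> set[str]:
    """Return names of variables referenced by another variable's `bounds` attribute.

    `variables` maps variable name -> attribute dict, e.g.
    {"time": {"bounds": "time_bnds"}, "time_bnds": {}}.
    """
    bounds = set()
    for name, attrs in variables.items():
        target = attrs.get("bounds")
        if target is not None and target in variables:
            bounds.add(target)
    return bounds


# Example: `time_bnds` is declared as the bounds of the `time` coordinate.
meta = {
    "time": {"standard_name": "time", "bounds": "time_bnds"},
    "time_bnds": {},
    "tas": {"standard_name": "air_temperature"},
}
print(find_bounds_variables(meta))  # {'time_bnds'}
```

This keys entirely off the declared `bounds` attribute rather than guessing from variable-name patterns, which is the substitution the PR description asks for.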

fmigneault (Collaborator) left a comment:

Code looks fine.
Just an extra test that could be relevant.

tests/test_cmip6_datacube.py (comment resolved)
huard linked an issue on Feb 28, 2024 that may be closed by this pull request.
dchandan (Collaborator) left a comment:

Looks good. Just a minor suggestion, but not blocking.

STACpopulator/extensions/datacube.py (comment resolved)

```python
variables[name] = Variable(
    properties=dict(
        dimensions=meta["shape"],
        type=VariableType.AUXILIARY.value if self.is_coordinate(attrs) else VariableType.DATA.value,
        description=attrs.get("description", attrs.get("long_name", "")),
    )
)
```
Collaborator commented:

I think we should add a description to the bounds variable too. Something like "bounds for the time coordinate" as in #51. We can construct an appropriate description string in the if-elif-else block before this and use that here.
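The suggestion above could look something like the following sketch: derive the description of a bounds variable from the coordinate whose `bounds` attribute references it. The helper name and input format are hypothetical, not the actual STACpopulator code.

```python
# Illustrative sketch: build a description string such as
# "bounds for the time coordinate" for a bounds variable, by finding
# the coordinate that declares it via its `bounds` attribute.

def bounds_description(var_name: str, variables: dict[str, dict]) -> str:
    """Return a description for `var_name` if it is a bounds variable, else ""."""
    for coord, attrs in variables.items():
        if attrs.get("bounds") == var_name:
            return f"bounds for the {coord} coordinate"
    return ""


meta = {"time": {"bounds": "time_bnds"}, "time_bnds": {}}
print(bounds_description("time_bnds", meta))  # bounds for the time coordinate
```

A string built this way could then feed the `description=` field shown in the snippet above, falling back to the existing `attrs.get(...)` lookup for non-bounds variables.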

huard (Author) replied:

So I wanted to hear your thoughts on this.

  1. What's the rationale for "augmenting" the metadata attributes when creating the STAC items?
  2. Is the datacube extension logic the appropriate place to do this?
  3. Will it scale well to other missing information?
  4. Are we worried about a "surprise" factor coming from having information in the STAC item that does not reflect what people will find in the actual dataset?

Collaborator replied:

Good questions, here's what I think:

  1. I think the rationale would be that the STAC item should be as informative as reasonably possible. If, at the STAC item generation phase adding information that doesn't exist in the file metadata achieves that goal, then I don't see why not. In general, there is no principle/constraint requiring one-to-one correspondence between metadata in a STAC representation and that in the underlying asset. Here, we just use metadata from the CMIP6 files to guide the generation of STAC items, but even if the CMIP6 files were poorly constructed with relevant metadata, we could still generate the STAC items by programmatically adding the relevant metadata at item generation time.
  2. I really don't know what the purpose of the datacube extension is. I have never used it, and I don't envision using it. But from looking at the data it represents, it seems it just describes the dimensions and variables available in an asset, and if that is the case, then we can definitely use the extension to add a description to a variable that is listed by the extension.
  3. I don't see why not. Going back to point 1, I think it's all about including whatever one feels is necessary for describing the data variables and dimensions.
  4. No, I am not. Going back to point 1, there is no constraint about a one-to-one correspondence between metadata described in the STAC representation and that in an underlying file. Consider the case of some obscure dataset where the underlying .nc files do not have any metadata whatsoever (at the variable level or globally). There is nothing preventing me from appropriately describing the metadata in the STAC layer and making the dataset available via the catalog. In this case, none of the metadata in the STAC representation would be available in the file itself, and that is perfectly OK.

huard (Author) replied:

Ok, I'm happy to adopt your perspective on 1 and 4. I'm less convinced about 2 and 3, and I believe we should spend some time thinking about the "metadata augmentation architecture" before writing too much code. I'm happy to add the description you mentioned, but I don't think this should serve as a template for adding other kinds of information.

My discomfort comes from mixing process logic (NcML -> STAC) with thematic knowledge (metadata fixes) in the same code base. I think we can get the process logic fairly robust and maintenance should be minimal. The thematic knowledge will always grow with new datasets. Mixing both has the potential to generate maintenance work in the future.

huard merged commit 3bd6a40 into crim-ca:master on Feb 29, 2024.
3 checks passed.
huard deleted the fix-52 branch on February 29, 2024.
Successfully merging this pull request may close these issues.

Potentially incorrect representation of variables in data cube extension
3 participants