use cf-xarray heuristic to identify bound variables. Add test #54
Conversation
Code looks fine. Just an extra test that could be relevant.
Looks good. Just a minor suggestion, but not blocking.
variables[name] = Variable(
    properties=dict(
        dimensions=meta["shape"],
        type=VariableType.AUXILIARY.value if self.is_coordinate(attrs) else VariableType.DATA.value,
        type=dtype,
        description=attrs.get("description", attrs.get("long_name", "")),
I think we should add a description to the bounds variable too. Something like "bounds for the time coordinate" as in #51. We can construct an appropriate description string in the if-elif-else block before this and use it here.
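A minimal sketch of what that could look like, reusing the names from the diff above; the `self.is_bounds` check and the `parent_coord` variable (the coordinate whose `bounds` attribute names this variable) are hypothetical and only illustrate the idea:

```python
if self.is_bounds(attrs):
    # Hypothetical branch: `name` was identified as a bounds variable and
    # `parent_coord` holds the name of the coordinate it provides bounds for.
    description = f"bounds for the {parent_coord} coordinate"
else:
    # Fall back to whatever the file metadata already provides.
    description = attrs.get("description", attrs.get("long_name", ""))

variables[name] = Variable(
    properties=dict(
        dimensions=meta["shape"],
        type=dtype,
        description=description,
    )
)
```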
So I wanted to hear your thoughts on this:
- What's the rationale for "augmenting" the metadata attributes when creating the STAC items?
- Is the datacube extension logic the appropriate place to do this?
- Will it scale well to other kinds of missing information?
- Are we worried about a "surprise" factor coming from having information in the STAC item that does not reflect what people will find in the actual dataset?
Good questions, here's what I think:
- I think the rationale would be that the STAC item should be as informative as reasonably possible. If, at the STAC item generation phase, adding information that doesn't exist in the file metadata achieves that goal, then I don't see why not. In general, there is no principle or constraint requiring one-to-one correspondence between the metadata in a STAC representation and that in the underlying asset. Here, we just use metadata from the CMIP6 files to guide the generation of STAC items, but even if the CMIP6 files were poorly populated with relevant metadata, we could still generate the STAC items by programmatically adding that metadata at item generation time.
- I really don't know what the purpose of the datacube extension is. I have never used it, and I don't envision using it. But from looking at the data it represents, it seems it just describes the dimensions and variables available in an asset, and if that is the case, then we can definitely use the extension to add a description to a variable that it lists.
- I don't see why not. Going back to point 1, I think it's all about including whatever one feels is necessary for describing the data variables and dimensions.
- No, I am not. Going back to point 1, there is no constraint requiring a one-to-one correspondence between the metadata described in the STAC representation and that in an underlying file. Consider the case of some obscure dataset whose underlying .nc files have no metadata whatsoever (at the variable level or globally). There is nothing preventing me from appropriately describing the metadata in the STAC layer and making the dataset available via the catalog. In this case, none of the metadata in the STAC representation would be available in the file itself, and that is perfectly OK.
OK, I'm happy to adopt your perspective on 1 and 4. I'm less convinced about 2 and 3, and I believe we should spend some time thinking about the "metadata augmentation architecture" before writing too much code. I'm happy to add the description you mentioned, but I don't think this should serve as a template for adding other kinds of information.
My discomfort comes from mixing process logic (NcML -> STAC) with thematic knowledge (metadata fixes) in the same code base. I think we can get the process logic fairly robust and maintenance should be minimal. The thematic knowledge will always grow with new datasets. Mixing both has the potential to generate maintenance work in the future.
@dchandan Replace your inference logic with the same logic used in cf-xarray.
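For reference, cf-xarray identifies bounds variables through the CF `bounds` attribute: a variable is treated as a bounds variable when some other variable's `bounds` attribute names it. A minimal sketch of that check under the assumption that we have a mapping from each variable name to its attribute dict (the function name and the `variables_attrs` argument are illustrative, not the code in this PR):

```python
def is_bounds_variable(name: str, variables_attrs: dict) -> bool:
    """Return True if any other variable's CF "bounds" attribute points at `name`."""
    return any(
        attrs.get("bounds") == name
        for other, attrs in variables_attrs.items()
        if other != name
    )
```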
Added test