support for updating cf aggregation files #475

Closed
bnlawrence opened this issue Oct 21, 2022 · 2 comments · Fixed by #630
Labels
aggregation Relating to metadata-based field and domain aggregation CFA Relating to CFA datasets enhancement New feature or request

@bnlawrence

Currently with CFA aggregation, if new fragment files are added to a directory, it is necessary to read all the fragments to (re)build a new aggregation file. Not only is that an expensive operation if there are a lot of fragments, but we might also want to add local files to an aggregation that has remote fragments.

So: if we add new files to a directory, how do we want to update the aggregation without running the whole aggregation again?

Use Case 1:

I ran a model which wrote output every month, and for every year of simulation I run an aggregation and write the data to tape. Then the next year of data arrives, and I want to have one aggregated field variable.

  • Hopefully that's relatively straightforward: I run an aggregation on the existing aggregation and the new files (the old files are now on tape), and create a new consolidated aggregation file with no need for the old aggregation file.

Use Case 2:

As above, but the data is updating on disk, so I keep getting more data in the directory. We want the new aggregation to only touch the new data and use the old aggregation.

  • I think that would be enabled by storing a "last aggregated" time in the aggregation file, and using it to avoid touching any fragment that was already included in the aggregation before that time?
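
The "last aggregated" idea in Use Case 2 could be sketched roughly as follows. This is only an illustration using the Python standard library: the function name `fragments_newer_than`, the `*.nc` pattern, and the assumption that the recorded time is a POSIX timestamp compared against file modification times are all hypothetical, not something cf-python provides.

```python
import os
import tempfile
from pathlib import Path


def fragments_newer_than(directory, last_aggregated, pattern="*.nc"):
    """Return fragment files modified after the `last_aggregated` time.

    `last_aggregated` is a POSIX timestamp (seconds since the epoch),
    assumed here to have been recorded when the aggregation file was
    last written. Older fragments are never opened.
    """
    return sorted(
        path
        for path in Path(directory).glob(pattern)
        if path.is_file() and path.stat().st_mtime > last_aggregated
    )


def demo():
    """Create one old and one new fragment and list what needs reading."""
    with tempfile.TemporaryDirectory() as tmp:
        old, new = Path(tmp, "old.nc"), Path(tmp, "new.nc")
        old.touch()
        new.touch()
        last_aggregated = 1_000_000_000.0  # hypothetical recorded time
        # Force the modification times to either side of the recorded time.
        os.utime(old, (last_aggregated - 10, last_aggregated - 10))
        os.utime(new, (last_aggregated + 10, last_aggregated + 10))
        return [p.name for p in fragments_newer_than(tmp, last_aggregated)]
```

Only the files returned by such a filter would then be read and aggregated together with the existing aggregation file. Modification time is a crude proxy; a fragment rewritten in place would also be picked up, which may or may not be what is wanted.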
@bnlawrence bnlawrence added the enhancement New feature or request label Oct 21, 2022
@davidhassell davidhassell added the CFA Relating to CFA datasets label Oct 29, 2022
@davidhassell
Collaborator

If all domain metadata (i.e. dimension and auxiliary coordinates, domain ancillaries, and cell measures) are stored as normal (non-aggregated) netCDF variables, then this will be no problem. You can just read the original aggregation and the new files, feed it all to cf.aggregate to do its work, and then write out a new CFA file. The aggregation only needs to know the domain metadata, so if that's all in the original aggregation file then it won't need to touch the original fragments.

If some domain metadata in the original CFA file are represented as aggregated variables, then the above would still work, but it would have to read the original fragments. In that case we could, as suggested above, look at some special "trust me I know what I'm doing" bypassing of checks in cf.aggregate. E.g. there is already the wonderfully named donotchecknonaggregatingaxes keyword to cf.aggregate, which ended up with that name mainly to discourage its use!

Aside (I want to capture this thought somewhere, that's all) - we would need to know how to analyse a concatenated dask graph to ascertain that all of its storage chunks correspond to whole, unaltered chunks that are suitable for CFA-ing. I'm wondering if we shouldn't keep our own record outside of dask of fragments along fragment dimensions, as perhaps there are just too many ways dask can reasonably represent CFA-able data.
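
To make the aside a little more concrete, here is a minimal sketch of what "our own record of fragments along fragment dimensions" might look like. The `FragmentRecord` class and its methods are entirely hypothetical and are not part of cf-python or dask; the only thing borrowed from dask is the convention that an array's `chunks` is a tuple containing one tuple of chunk lengths per dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentRecord:
    """Hypothetical record of fragment sizes along each array dimension."""

    sizes: tuple  # one tuple of fragment lengths per dimension

    def concatenate(self, other, axis):
        """Return the record for two fragment arrays joined along `axis`."""
        if len(self.sizes) != len(other.sizes):
            raise ValueError("records must have the same number of dimensions")
        for i, (a, b) in enumerate(zip(self.sizes, other.sizes)):
            if i != axis and a != b:
                raise ValueError(f"fragment sizes differ along dimension {i}")
        joined = self.sizes[axis] + other.sizes[axis]
        return FragmentRecord(
            self.sizes[:axis] + (joined,) + self.sizes[axis + 1:]
        )

    def matches_chunks(self, chunks):
        """True if dask-style `chunks` coincide exactly with whole fragments."""
        return tuple(tuple(c) for c in chunks) == self.sizes


# Two years of monthly fragments, then a third year appended along time.
old = FragmentRecord(((12, 12), (73,), (96,)))
new = FragmentRecord(((12,), (73,), (96,)))
combined = old.concatenate(new, axis=0)
```

With a record like this, deciding whether data is still CFA-able becomes a comparison against the current chunks (`combined.matches_chunks(array.chunks)`) rather than an analysis of the dask graph. It does not by itself show that a chunk's values are unaltered, so some extra bookkeeping would still be needed for that.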

@davidhassell davidhassell linked a pull request Apr 24, 2023 that will close this issue
@davidhassell
Collaborator

This is all fixed by #630. New files are added by aggregating an existing CFA file with the new files, and then writing out the result as a new CFA file.
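
A rough sketch of that workflow is below. The function name `update_aggregation` and its arguments are made up for illustration, and the `cfa=True` keyword to `cf.write` is assumed to behave as in the 3.15.0 release; check the documentation for the version you have installed before relying on it.

```python
def update_aggregation(existing_cfa, new_files, updated_cfa):
    """Aggregate an existing CFA file with new files into a new CFA file.

    `existing_cfa` and `updated_cfa` are file names, and `new_files` is a
    file name, glob pattern, or sequence of these. All three are
    hypothetical placeholders.
    """
    import cf  # cf-python; imported here so the sketch loads without it

    # Read without aggregating, so that the old aggregation and the new
    # fragments are combined in one explicit step below.
    fields = cf.read(existing_cfa, aggregate=False)
    fields.extend(cf.read(new_files, aggregate=False))

    aggregated = cf.aggregate(fields)

    # Assumption: cfa=True writes the field data as CFA aggregation
    # variables, as in cf-python 3.15.0.
    cf.write(aggregated, updated_cfa, cfa=True)
    return aggregated
```

Writing to a new file name rather than overwriting `existing_cfa` matches the description above, and avoids clobbering a file that is still being read lazily.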

@davidhassell davidhassell added the aggregation Relating to metadata-based field and domain aggregation label Apr 24, 2023
@davidhassell davidhassell added this to the 3.15.0 milestone Apr 24, 2023