Support for generating a set of tracking-ids from a slicing operation into an aggregation. #451

Closed
bnlawrence opened this issue Sep 22, 2022 · 4 comments · Fixed by #630

Labels
aggregation Relating to metadata-based field and domain aggregation · CFA Relating to CFA datasets · enhancement New feature or request

Milestone
3.15.0

Comments

@bnlawrence

Consider the following use case:

  • A cf aggregation points to 365 daily files, each of which holds a high-resolution 3D grid for a variable covering 24 hours.
  • A user does a cf.within (or any other sort of valid slice) into that aggregation to extract a mean value of a particular variable over a week.
  • The calculation will touch seven files. We can think of those seven files as the data necessary to reproduce the calculation, so these are the digital artifacts we want to save for reproduction, identify in a workflow, and cite in a paper.

(This is obviously a trivial case; it gets more interesting if, say, the calculations are carried out across ensembles from multiple institutions.)

The feature request is that

  1. the aggregation metadata includes the tracking-ids (if they are present) in such a way that the same cf.within (or other slice) can return the set of tracking-ids to be added to a list of "provenance sources", so that a series of cf calculations can potentially generate a list of all the files needed for reproduction and/or citation; and
  2. cf-python supports the slicing operation in a way that makes this possible (see the sketch below).
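
As an illustration only, here is a minimal sketch of the desired workflow, assuming a hypothetical aggregation file name and assuming the tracking-ids end up exposed through some sliceable construct (neither of which is current cf-python behaviour):

>>> import cf
>>> f = cf.read('year_of_daily_files.nc')[0]  # hypothetical aggregation over 365 daily files
>>> # Extract one week; only seven of the 365 fragment files are touched
>>> week = f.subspace(T=cf.wi(cf.dt('2022-01-01'), cf.dt('2022-01-08')))
>>> # Desired outcome: the tracking-ids of just those seven files, ready to be
>>> # appended to a running list of "provenance sources"
>>> provenance = set(week.coord('long_name=tracking_id').array.flatten())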
@bnlawrence bnlawrence added the enhancement New feature or request label Sep 22, 2022
@sadielbartholomew
Member

sadielbartholomew commented Sep 22, 2022

Thanks @bnlawrence, great write-up of what we all discussed today.

No comments as yet, since I will need time to think this over and study some background, but just FYI for now, I'm creating a 'CFA' label and assigning it to these issues so we can pick them out more easily from the Issue Tracker, etc.

@sadielbartholomew sadielbartholomew added aggregation Relating to metadata-based field and domain aggregation CFA Relating to CFA datasets labels Sep 22, 2022
@davidhassell
Collaborator

davidhassell commented Nov 14, 2022

This is resolvable with NCAS-CMS/cfa-conventions#41. With this change to the CFA conventions, cf-python could automatically create auxiliary coordinate constructs from any non-standardised aggregation metadata, making them available for slicing. With cf.read creating this new auxiliary coordinate construct, all of the cf-python machinery kicks in unchanged. E.g. if we were to read the file from the new CFA example 1b:

>>> f = cf.read('example_1b.nc')[0]  # aggregated array has 12 months split over two files
>>> f.coord('long_name=tracking_id')
<CF AuxiliaryCoordinate: long_name=tracking_id(12, 1, 73, 144) >

# Each element of "f" has a tracking_id, but there are only two different values
>>> print(f.coord('long_name=tracking_id').array[:, 0, 0, 0])
['764489ad-7bee-4228' '764489ad-7bee-4228' '764489ad-7bee-4228'
 '764489ad-7bee-4228' '764489ad-7bee-4228' '764489ad-7bee-4228'
 'a4f8deb3-fae1-26b6' 'a4f8deb3-fae1-26b6' 'a4f8deb3-fae1-26b6'
 'a4f8deb3-fae1-26b6' 'a4f8deb3-fae1-26b6' 'a4f8deb3-fae1-26b6']

# Find unique tracking IDs
>>> print(f.coord('long_name=tracking_id').data.unique())
<CF Data(2): [764489ad-7bee-4228, a4f8deb3-fae1-26b6]>

# Find unique tracking IDs corresponding to a subspace: 
>>> g = f.subspace(T=cf.wi(cf.dt('1959-12-01'), cf.dt('1960-03-01')))
>>> print(g.coord('long_name=tracking_id').data.unique())
<CF Data(1): [764489ad-7bee-4228]>

Memory-wise this is cheap, because each fragment's tracking ID array will be a cf.FullArray instance, which just stores the scalar value common to that fragment. However, when we come to get the (unique) values, the array will be expanded in memory to the full shape of the subspace. This will be managed by dask, though, so it will always work, but it may not be as efficient as we might imagine.
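
For illustration only, here is a rough sketch (plain dask, not cf-python internals, with made-up fragment shapes) of the storage pattern described above, where each fragment contributes a single scalar tracking ID that is held lazily and only expanded when the unique values are computed:

import dask.array as da

# Assumed fragment shape: 6 time steps per fragment on a 73 x 144 grid
fragment_shape = (6, 1, 73, 144)

# Each fragment's tracking ID is a constant, broadcast lazily to the
# fragment's shape - analogous to the cf.FullArray mentioned above
frag1 = da.full(fragment_shape, '764489ad-7bee-4228', dtype='U18')
frag2 = da.full(fragment_shape, 'a4f8deb3-fae1-26b6', dtype='U18')

# Concatenate along the time axis to mimic the aggregated auxiliary coordinate
tracking = da.concatenate([frag1, frag2], axis=0)

# Nothing is expanded in memory until the unique values are actually computed
print(da.unique(tracking).compute())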

@davidhassell davidhassell removed the aggregation Relating to metadata-based field and domain aggregation label Nov 14, 2022
@davidhassell davidhassell added this to the 3.15.0 milestone Apr 24, 2023
@davidhassell
Collaborator

This is all implemented in #630

@davidhassell davidhassell added the aggregation Relating to metadata-based field and domain aggregation label Apr 24, 2023
@davidhassell davidhassell linked a pull request Apr 24, 2023 that will close this issue
@davidhassell
Collaborator

Closing now that #630 is merged.
