Dataset.encode_cf function #4412

@eric-czech

Description

I would like to be able to apply CF encoding to an existing DataArray (or multiple in a Dataset) and then store the encoded forms elsewhere. Is this already possible?

More specifically, I would like to encode a large array of 32-bit floats as 8-bit ints and then write them to a Zarr store using rechunker.
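For readers unfamiliar with the packing being asked about, here is a minimal NumPy-only sketch of the CF `scale_factor`/`add_offset` convention. The helper names `pack_int8` and `unpack_int8`, the fill value of -128, and the sample numbers are assumptions made for illustration; they are not part of Xarray or rechunker.

```python
import numpy as np


def pack_int8(values, scale_factor, add_offset=0.0, fill_value=-128):
    """Hypothetical helper: pack floats into int8 using CF-style scaling."""
    values = np.asarray(values, dtype="float32")
    scaled = (values - add_offset) / scale_factor
    # NaN marks missing data; store the integer fill value in its place.
    # Note: values outside the int8 range are not clipped in this sketch.
    packed = np.where(np.isnan(scaled), fill_value, np.around(scaled))
    return packed.astype("int8")


def unpack_int8(packed, scale_factor, add_offset=0.0, fill_value=-128):
    """Hypothetical helper: invert pack_int8, restoring NaN for fill values."""
    packed = np.asarray(packed)
    values = packed.astype("float32") * scale_factor + add_offset
    return np.where(packed == fill_value, np.nan, values).astype("float32")


data = np.array([0.0, 0.5, 1.5, np.nan], dtype="float32")
packed = pack_int8(data, scale_factor=0.5)
restored = unpack_int8(packed, scale_factor=0.5)
```

The packed array uses one byte per element instead of four, at the cost of quantizing values to multiples of `scale_factor`.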

I'm essentially after pangeo-data/rechunker#45 (Xarray support in rechunker), but I'd like to know what functionality already exists in Xarray to make this possible in the meantime.

Activity

dcherian (Contributor) commented on Sep 8, 2020

Not at the moment.

I think we should add an xr.encode_cf that wraps conventions.cf_encoder (this may have already come up in the "flexible backends" discussions). This would parallel xr.decode_cf.

def cf_encoder(variables, attributes):
    """
    Encode a set of CF encoded variables and attributes.
    Takes a dicts of variables and attributes and encodes them
    to conform to CF conventions as much as possible.
    This includes masking, scaling, character array handling,
    and CF-time encoding.
    Parameters
    ----------
    variables : dict
        A dictionary mapping from variable name to xarray.Variable
    attributes : dict
        A dictionary mapping from attribute name to value
    Returns
    -------
    encoded_variables : dict
        A dictionary mapping from variable name to xarray.Variable,
    encoded_attributes : dict
        A dictionary mapping from attribute name to value
    See also
    --------
    decode_cf_variable, encode_cf_variable
    """
    # add encoding for time bounds variables if present.
    _update_bounds_encoding(variables)
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
    # Remove attrs from bounds variables (issue #2921)
    for var in new_vars.values():
        bounds = var.attrs["bounds"] if "bounds" in var.attrs else None
        if bounds and bounds in new_vars:
            # see http://cfconventions.org/cf-conventions/cf-conventions.html#cell-boundaries
            for attr in [
                "units",
                "standard_name",
                "axis",
                "positive",
                "calendar",
                "long_name",
                "leap_month",
                "leap_year",
                "month_lengths",
            ]:
                if attr in new_vars[bounds].attrs and attr in var.attrs:
                    if new_vars[bounds].attrs[attr] == var.attrs[attr]:
                        new_vars[bounds].attrs.pop(attr)
    return new_vars, attributes

It'll also need to wrap this logic:

xarray/xarray/backends/api.py

Lines 1113 to 1127 in 66259d1

if encoding is None:
    encoding = {}
variables, attrs = conventions.encode_dataset_coordinates(dataset)
check_encoding = set()
for k, enc in encoding.items():
    # no need to shallow copy the variable again; that already happened
    # in encode_dataset_coordinates
    variables[k].encoding = enc
    check_encoding.add(k)
if encoder:
    variables, attrs = encoder(variables, attrs)

For simple use cases, you could write a small wrapper around conventions.cf_encoder that takes a Dataset and returns a Dataset, and it should work just fine (look at conventions.decode_cf for a model).
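A minimal sketch of such a wrapper might look like the following. It assumes that `conventions.encode_dataset_coordinates` and `conventions.cf_encoder` behave as in the snippets quoted above. The function name `encode_cf`, its `encoding` parameter, and the sample data are hypothetical, not an existing Xarray API.

```python
import numpy as np
import xarray as xr
from xarray import conventions


def encode_cf(ds, encoding=None):
    """Hypothetical wrapper: return a Dataset of CF-encoded variables."""
    if encoding is None:
        encoding = {}
    # Record coordinate information in attributes, as the backends do before
    # writing. This also shallow-copies the variables, so `ds` is not modified.
    variables, attrs = conventions.encode_dataset_coordinates(ds)
    # Apply per-variable encoding overrides, mirroring the api.py logic.
    for name, enc in encoding.items():
        variables[name].encoding = enc
    # Apply masking, scaling, and time encoding according to each .encoding.
    variables, attrs = conventions.cf_encoder(variables, attrs)
    return xr.Dataset(variables, attrs=attrs)


# Example: pack a float32 variable into int8 (sample values are made up).
ds = xr.Dataset(
    {"temp": (("x",), np.array([0.0, 0.5, 1.5], dtype="float32"))},
    coords={"x": [0, 1, 2]},
)
encoded = encode_cf(
    ds,
    encoding={
        "temp": {
            "dtype": "int8",
            "scale_factor": 0.5,
            "add_offset": 0.0,
            "_FillValue": -128,
        }
    },
)
```

The returned Dataset holds the packed integer data, with `scale_factor`, `add_offset`, and `_FillValue` moved into each variable's attributes. Writing it out without applying the encoding a second time is a separate question that this sketch does not address.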

eric-czech (Author) commented on Sep 8, 2020

Ok thanks @dcherian! I'll try that (feel free to close this).

dcherian (Contributor) commented on May 10, 2023

Related request for to_zarr(..., encode_cf=False): #5405

This came up in the discussion today.

cc @tom-white @kmuehlbauer

Dataset.encode_cf function · Issue #4412 · pydata/xarray