Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_grid returns a file that cannot be written to netCDF #45

Closed
matt-long opened this issue Apr 8, 2020 · 6 comments
Closed

get_grid returns a file that cannot be written to netCDF #45

matt-long opened this issue Apr 8, 2020 · 6 comments

Comments

@matt-long
Copy link
Collaborator

Currently, get_grid returns a file that cannot be directly written to disk.

For example

scrip_grid_file = '/glade/scratch/mclong/tmp/grid_file.nc'
grid = 'POP_gx1v7'
ds = pop_tools.get_grid(grid, scrip=True)
ds.to_netcdf(scrip_grid_file)

Return

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-a30dfa131a8f> in <module>
      6         print(f'generating {scrip_grid_file}')
      7         ds = pop_tools.get_grid(grid, scrip=True)
----> 8         ds.to_netcdf(scrip_grid_file)

/glade/work/mclong/miniconda3/envs/marbl-forcing/lib/python3.7/site-packages/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1552             unlimited_dims=unlimited_dims,
   1553             compute=compute,
-> 1554             invalid_netcdf=invalid_netcdf,
   1555         )
   1556 

/glade/work/mclong/miniconda3/envs/marbl-forcing/lib/python3.7/site-packages/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1038     # validate Dataset keys, DataArray names, and attr keys/values
   1039     _validate_dataset_names(dataset)
-> 1040     _validate_attrs(dataset)
   1041 
   1042     try:

/glade/work/mclong/miniconda3/envs/marbl-forcing/lib/python3.7/site-packages/xarray/backends/api.py in _validate_attrs(dataset)
    212     # Check attrs on the dataset itself
    213     for k, v in dataset.attrs.items():
--> 214         check_attr(k, v)
    215 
    216     # Check attrs on each variable within the dataset

/glade/work/mclong/miniconda3/envs/marbl-forcing/lib/python3.7/site-packages/xarray/backends/api.py in check_attr(name, value)
    207                 "a string, an ndarray or a list/tuple of "
    208                 "numbers/strings for serialization to netCDF "
--> 209                 "files".format(value)
    210             )
    211 

TypeError: Invalid value for attr: {'Black Sea': -13, 'Baltic Sea': -12, 'Red Sea': -5, 'Southern Ocean': 1, 'Pacific Ocean': 2, 'Indian Ocean': 3, 'Persian Gulf': 4, 'Atlantic Ocean': 6, 'Mediterranean Sea': 7, 'Lab. Sea & Baffin Bay': 8, 'GIN Seas': 9, 'Arctic Ocean': 10, 'Hudson Bay': 11} must be a number, a string, an ndarray or a list/tuple of numbers/strings for serialization to netCDF files

Deleting the problematic attribute before writing fixes this

del ds.attrs['region_mask_regions']

We should fix this!

@klindsay28
Copy link
Collaborator

netCDF's data model for attributes is fairly simple. It supports 0D and 1D (non-labeled) arrays of values. I don't see how a python dict, indexed by strings, could be mapped into that data model. So it seems that any solution to this involves either

  1. changing the data type of region_mask_regions or
  2. change region_mask_regions from a Dataset attribute to a DataArray within the Dataset, and also possibly changing its data type.

Either changes the way users access the content of region_mask_regions.
Question 1: Is there existing code relying on how region_mask_regions is accessed?
If so, implementing 1) or 2) would break that code.
(rant regarding backwards incompatibility omitted)

I think the narrow scope of data model of attributes in netCDF makes 2) preferable.

That said, a candidate for 1) is make region_mask_regions an array of strings, sorted by absolute value of the region value. The user can then index into the array using is the absolute value of the values currently in region_mask_regions. This candidate for 1) makes the mapping region_val -> region_name straightforward (though the user needs to know to apply abs to region_val).

A candidate for 2) to introduce a mask_regions dimension with a corresponding string valued coordinate array, and have a DataArray with the region values. Here's pseudo-CDL for this:

dimensions:
        mask_regions = ... ;
        nchar = ... ;
variables:
        char mask_regions(mask_regions, nchar) ;
        int mask_region_val(mask_regions) ;

I use this approach when I store region names when I store regional timeseries (though I have no need to store the original region integer value). This candidate for 2) makes the mapping region_name -> region_val straightforward, using .sel.

If you also sort the regions by absolute value of the region value, then I think you can use isel for straightforward region_val -> region_name mapping (the user again needs to remember to apply abs to the region value).

Question 2: How is region_mask_regions used in practice? Which mapping direction do you think would be used more frequently by users? I propose that the code should make the more frequently used use case easier to use.

Question 0: What is the use case for writing the Dataset returned by pop_tools.get_grid to a netCDF file?

@matt-long
Copy link
Collaborator Author

Thanks @klindsay28.

Question 0: When generating scrip grid files for remapping using ESMF, for instance. In this case, the attribute probably shouldn't be there at all.

Question 1: I am not aware of any code the explicitly uses the attributes of the returned grid-variable dataset.

Question 2: I like your candidate solution (2) as it's most flexible and supports operations in the way you suggest.

@klindsay28
Copy link
Collaborator

FYI, the pseudo-CDL that I wrote for 2) yields mask_regions looking like a coordinate variable that has non-numeric type. It turns out that this is frowned upon by CF. They want coordinate variables to have numeric type. Instead, mask_regions needs to be a string-valued auxiliary coordinate variable. In particular, it should not have the same name as the dimension. See the lengthy conversation on this at
cf-convention/cf-conventions#174
So if you wanted the file to be CF compliant, I think you'd need to be more like

dimensions:
        mask_regions = ... ;
        nchar = ... ;
variables:
        char region_names(mask_regions, nchar) ;
        int mask_region_val(mask_regions) ;
                mask_region_val:coordinate="region_names"

See http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#_labels_and_alternative_coordinates for CF's approach to handling this.

@kmpaul
Copy link
Contributor

kmpaul commented Sep 3, 2020

I've proposed a solution with PR #64.

@kmpaul
Copy link
Contributor

kmpaul commented Sep 9, 2020

Does #64 provide a satisfactory solution to this issue? If so, I'll close this.

@mnlevy1981
Copy link
Collaborator

Does #64 provide a satisfactory solution to this issue? If so, I'll close this.

Sorry to wait two weeks to reply, but I do think #64 addressed this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants