Description
What happened:
It appears that either to_zarr or open_zarr is incorrectly concatenating the trailing dimension of single-byte character arrays, so that the last dimension is dropped:
import xarray as xr
import numpy as np
xr.set_options(display_style='text')
chrs = np.array([
['A', 'B'],
['C', 'D'],
['E', 'F'],
], dtype='S1')
ds = xr.Dataset(dict(x=(('dim0', 'dim1'), chrs)))
ds.x
<xarray.DataArray 'x' (dim0: 3, dim1: 2)>
array([[b'A', b'B'],
[b'C', b'D'],
[b'E', b'F']], dtype='|S1')
Dimensions without coordinates: dim0, dim1
ds.to_zarr('/tmp/test.zarr', mode='w')
xr.open_zarr('/tmp/test.zarr').x.compute()
# The second dimension is lost and the values end up being concatenated
<xarray.DataArray 'x' (dim0: 3)>
array([b'AB', b'CD', b'EF'], dtype='|S2')
Dimensions without coordinates: dim0
For a 2D array with N columns, you end up with a 1D "|SN" array. This does not happen with "S2" or any other fixed length greater than 1.
Interestingly, it only affects the trailing dimension: with 3 dimensions, you get a 2D result in which the third dimension has been dropped:
chrs = np.array([[
['A', 'B'],
['C', 'D'],
['E', 'F'],
]], dtype='S1')
ds = xr.Dataset(dict(x=(('dim0', 'dim1', 'dim2'), chrs)))
ds
<xarray.Dataset>
Dimensions: (dim0: 1, dim1: 3, dim2: 2)
Dimensions without coordinates: dim0, dim1, dim2
Data variables:
x (dim0, dim1, dim2) |S1 b'A' b'B' b'C' b'D' b'E' b'F'
ds.to_zarr('/tmp/test.zarr', mode='w')
xr.open_zarr('/tmp/test.zarr').x.compute()
# `dim2` is gone and the data concatenated to `dim1`
<xarray.DataArray 'x' (dim0: 1, dim1: 3)>
array([[b'AB', b'CD', b'EF']], dtype='|S2')
Dimensions without coordinates: dim0, dim1
In short, this only affects the "S1" data type; "U1" is fine, as is "SN" where N > 1.
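For reference, the result after the round trip has the same shape and dtype you would get by reinterpreting the trailing axis of the S1 array as one fixed-width byte string per row. The NumPy-only sketch below reproduces that shape and dtype for the 2D and 3D cases above. The helper name `join_trailing_chars` is hypothetical and used only for illustration; this is not a claim about where or how xarray or zarr performs the concatenation.

```python
import numpy as np


def join_trailing_chars(arr):
    """Merge the last axis of an S1 array into fixed-width byte strings.

    Hypothetical helper for illustration only; it is not part of xarray or zarr.
    """
    arr = np.ascontiguousarray(arr)
    if arr.dtype != np.dtype("S1"):
        raise ValueError("expected an array of dtype S1")
    width = arr.shape[-1]
    # Reinterpret each run of `width` single bytes as one S<width> element,
    # then drop the now length-1 trailing axis.
    return arr.view(f"S{width}").reshape(arr.shape[:-1])


# 2D case from the report: (3, 2) S1 becomes (3,) S2
chrs_2d = np.array([["A", "B"], ["C", "D"], ["E", "F"]], dtype="S1")
joined_2d = join_trailing_chars(chrs_2d)
print(joined_2d.shape, joined_2d.dtype, joined_2d)

# 3D case from the report: (1, 3, 2) S1 becomes (1, 3) S2
chrs_3d = chrs_2d[np.newaxis]
joined_3d = join_trailing_chars(chrs_3d)
print(joined_3d.shape, joined_3d.dtype, joined_3d)
```

In both cases only the last axis is consumed, which matches the observation that leading dimensions survive the round trip unchanged.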
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-42-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: None
xarray: 0.16.0
pandas: 1.0.5
numpy: 1.19.0
scipy: 1.5.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.21.0
distributed: 2.21.0
matplotlib: 3.3.0
cartopy: None
seaborn: 0.10.1
numbagg: None
pint: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: 4.8.2
pytest: 5.4.3
IPython: 7.15.0
sphinx: 3.2.1