What options do I have for <U# type data saving in zarr? #10077

Open
doronbehar opened this issue Feb 25, 2025 · 4 comments
Labels
topic-zarr Related to zarr storage library

Comments

@doronbehar

What is your issue?

So I tried out xarray today with zarr version 3.0.4, and encountered these scary warnings:

/nix/store/qasysgiacqplrbda5yl65wg7jrs0gcjl-python3-3.12.9-env/lib/python3.12/site-packages/zarr/codecs/vlen_utf8.py:44: UserWarning: The codec `vlen-utf8` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  return cls(**configuration_parsed)
/nix/store/qasysgiacqplrbda5yl65wg7jrs0gcjl-python3-3.12.9-env/lib/python3.12/site-packages/zarr/core/array.py:3991: UserWarning: The dtype `<U5` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  meta = AsyncArray._create_metadata_v3(
/nix/store/qasysgiacqplrbda5yl65wg7jrs0gcjl-python3-3.12.9-env/lib/python3.12/site-packages/zarr/api/asynchronous.py:203: UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  warnings.warn(

An MWE is:

import xarray as xr
import numpy as np

xr.DataArray(np.array([
    "hello",
    "world",
])).to_zarr("test_utf8_strings.zarr")

Is <U5 a variable length utf8 type? It shouldn't be... Also, what are my alternatives?

@doronbehar doronbehar added the needs triage Issue that has not been reviewed by xarray team member label Feb 25, 2025
@shoyer
Member

shoyer commented Mar 5, 2025

> Is <U5 a variable length utf8 type? It shouldn't be...

This is the easy question! It's a fixed-length Unicode string (NumPy stores each character in 4 bytes, i.e. UTF-32), but I believe Zarr does encode it as UTF-8.
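To make the fixed-length point concrete, here is a minimal sketch using only NumPy. It inspects the dtype of the array from the MWE and compares NumPy's fixed per-element storage with the variable byte length of UTF-8. The variable names are illustrative, and it says nothing about how Zarr lays the data out on disk.

```python
import numpy as np

arr = np.array(["hello", "world"])

# kind 'U' is NumPy's fixed-width Unicode string type.
kind = arr.dtype.kind

# Every element reserves the same number of bytes: 4 bytes per code point.
bytes_per_item = arr.dtype.itemsize
chars_per_item = bytes_per_item // 4

# UTF-8, by contrast, spends 1 to 4 bytes per code point,
# so the encoded length depends on the content ("\u00e9" is e-acute).
utf8_lengths = [len(s.encode("utf-8")) for s in ["hello", "h\u00e9llo"]]

print(kind, bytes_per_item, chars_per_item, utf8_lengths)
```

So the in-memory dtype is fixed-width, while a UTF-8 encoding of the same strings is inherently variable-width, which is presumably why a variable-length UTF-8 codec shows up in the warnings.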

For what it's worth, the future-proof way to create NumPy arrays of UTF-8 data is to use the UTF-8 string dtype (np.dtypes.StringDType, which requires NumPy 2.0 or later). However, this is not (yet) the default in NumPy or Xarray.
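A minimal sketch of opting into that dtype, assuming NumPy 2.0 or later is installed. The helper `make_strings` is a hypothetical name used only for illustration, and the `getattr` lookup is there so that the snippet still runs (falling back to NumPy's default) on older versions.

```python
import numpy as np

# StringDType only exists in NumPy >= 2.0, so look it up defensively.
StringDType = getattr(getattr(np, "dtypes", None), "StringDType", None)


def make_strings(values):
    """Hypothetical helper: variable-width strings if available, else NumPy's default."""
    if StringDType is None:
        return np.array(values)  # falls back to a fixed-width Unicode dtype
    # Note the call: StringDType() creates a dtype instance.
    return np.array(values, dtype=StringDType())


words = make_strings(["hello", "world"])
uses_string_dtype = StringDType is not None and isinstance(words.dtype, StringDType)
print(words.dtype, uses_string_dtype)
```

With StringDType, elements are variable-length, so assigning a longer string to `words[0]` keeps the whole string instead of truncating it to five characters.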

> Also, what are my alternatives?

You can write Zarr v2 files by passing zarr_version=2, which will silence most of these warnings, but not really resolve them, given that these are non-standard Zarr v2 conventions, too.

Otherwise, Zarr v3 needs a way to silence these warnings. And perhaps an advocate to push through the Zarr standardization process :).
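Until something like that exists, one user-side workaround is Python's standard `warnings` filter. The sketch below is not a Zarr feature: the helper name `ignore_zarr_v3_spec_warnings` is hypothetical, and the pattern is taken from the warning text quoted at the top of this issue, so it would stop matching if that wording changes. The warning is simulated here so that the snippet runs without zarr installed.

```python
import warnings

# Regex matched against the start of the warning message (wording from this issue).
ZARR_V3_SPEC_PATTERN = r".*not part in the Zarr format 3 specification"


def ignore_zarr_v3_spec_warnings():
    """Hypothetical helper: hide only the 'not in the v3 spec' UserWarnings."""
    warnings.filterwarnings(
        "ignore", message=ZARR_V3_SPEC_PATTERN, category=UserWarning
    )


# Demonstration with simulated warnings.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ignore_zarr_v3_spec_warnings()
    warnings.warn(
        "The dtype `<U5` is currently not part in the Zarr format 3 specification.",
        UserWarning,
    )
    warnings.warn("some unrelated warning", UserWarning)

remaining = [str(w.message) for w in caught]
print(remaining)
```

In real use you would call the helper once before `to_zarr`. This only hides the messages; it does not make the stored data any more portable across Zarr implementations.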

@shoyer shoyer added topic-zarr Related to zarr storage library and removed needs triage Issue that has not been reviewed by xarray team member labels Mar 5, 2025
@jhamman
Member

jhamman commented Mar 5, 2025

@d-v-b is working on expanding the dtype story in zarr3 now, including fixed-length strings. Expect an update here within a month or so.

@doronbehar
Author

> For what it's worth, the future proof way to create NumPy arrays of UTF-8 data is to use the UTF-8 string dtype (np.dtypes.StringDType, which requires numpy v2). However, this is not (yet) the default in NumPy or Xarray.

If by "not the default" you mean that np.array(["hello", "world"]), without an explicit dtype argument, uses <U5 rather than np.dtypes.StringDType, then I understand what you are saying. However, personally I don't think it should be the default :). Also, just to clear up a bit of ambiguity I found in that sentence, I tried:

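import xarray as xr
import numpy as np

# Same MWE as above, but requesting NumPy's variable-width string dtype
xr.DataArray(np.array(
    ["hello", "world"],
    dtype=np.dtypes.StringDType,
)).to_zarr("test_utf8_strings.zarr")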

And it failed miserably:

Traceback (most recent call last):
  File "/home/doron/repos/lab-ion-trap-simulations/./t.py", line 9, in <module>
    )).to_zarr("test_utf8_strings.zarr")
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/core/dataarray.py", line 4428, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/api.py", line 2216, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/api.py", line 1952, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/zarr.py", line 1022, in store
    self.set_variables(
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/zarr.py", line 1194, in set_variables
    zarr_array = self._create_new_array(
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/zarr.py", line 1089, in _create_new_array
    zarr_array = self.zarr_group.create(
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/hierarchy.py", line 1195, in create
    return self._write_op(self._create_nosync, name, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/hierarchy.py", line 952, in _write_op
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/hierarchy.py", line 1201, in _create_nosync
    return create(store=self._store, path=path, chunk_store=self._chunk_store, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/creation.py", line 209, in create
    init_array(
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/storage.py", line 455, in init_array
    _init_array_metadata(
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/storage.py", line 584, in _init_array_metadata
    raise ValueError("missing object_codec for object array")
ValueError: missing object_codec for object array

The above was obtained with Zarr v2. With Zarr v3, I got the same warnings as in the top-level comment of this issue.

> @d-v-b is working on expanding the dtype story in zarr3 now -- including fixed-length-strings. Expect an update within a month or so here.

OK, that's comforting, thanks :).

@shoyer
Member

shoyer commented Mar 6, 2025

> For what it's worth, the future proof way to create NumPy arrays of UTF-8 data is to use the UTF-8 string dtype (np.dtypes.StringDType, which requires numpy v2). However, this is not (yet) the default in NumPy or Xarray.

> If you mean by "not the default" that np.array(["hello", "world"]) without explicitly specifying a dtype argument, doesn't use np.dtypes.StringDType, but uses <U5 by default, then I understand what you are saying.

Yes, this is how things currently work.
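As a small illustration of that default behaviour, the sketch below shows what NumPy infers when no dtype argument is given: a fixed-width Unicode dtype whose width is the length of the longest string in the input. The variable names are illustrative only.

```python
import numpy as np

# No dtype argument: NumPy picks a fixed-width Unicode dtype.
default = np.array(["hello", "world"])
wider = np.array(["hello", "worldwide"])

# The width (in code points) follows the longest element in each array.
widths = (default.dtype.itemsize // 4, wider.dtype.itemsize // 4)

# np.dtype("U5") uses native byte order, which prints as <U5 on little-endian machines.
matches_u5 = default.dtype == np.dtype("U5")

print(default.dtype, wider.dtype, widths, matches_u5)
```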

> However, personally I don't think it should be the default :).

I agree, UTF-8 would be a much saner default! It's just a relatively new NumPy feature, and NumPy is very conservative about making breaking changes.
