a codec simplification plan

I propose that this code should work:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "zarr",
#     "pytest"
# ]
# ///

import zarr
from numcodecs import GZip
import pytest

compressors = (GZip(), zarr.codecs.GzipCodec())
zarr_formats = (2,3)
@pytest.mark.parametrize('compressor', compressors)
@pytest.mark.parametrize('zarr_format', zarr_formats)
def test(compressor, zarr_format):
    x = zarr.create_array(
        {},
        shape=(10,), 
        dtype='uint8', 
        compressors=compressor, 
        zarr_format=zarr_format)

if __name__ == '__main__':
    pytest.main([__file__, f'-c {__file__}'])
```
```
../../.cache/uv/environments-v2/test-48d946355ee2ef42/lib/python3.11/site-packages/zarr/core/array.py:4723: TypeError
==================================================================================================== short test summary info ====================================================================================================
FAILED  /home/bennettd/dev/zarr-python::test[2-compressor1] - ValueError: Invalid compressor. Expected None, a numcodecs.abc.Codec, or a dict representation of a numcodecs.abc.Codec. Got <class 'zarr.codecs.gzip.GzipCodec'> instead.
FAILED  /home/bennettd/dev/zarr-python::test[3-compressor0] - TypeError: 'GZip' object is not iterable
```


Ignoring the details of the errors here, they exist because our codec handling is weird: it requires separate objects (`zarr.codecs.GzipCodec` for v3, `numcodecs.GZip` for v2) to express the same thing (gzip compression).

Here is a proposal to fix this:

- We add methods to our codec base class that enable it to handle zarr v2 and zarr v3 metadata. This means the exact same codec class can be used for zarr v2 or zarr v3. This is how the new dtypes work and I think it's a good design.
- We define a protocol, in this repo, that models the structure of the numcodecs codec abstract base class. That protocol would look like this:
```python

from collections.abc import Mapping
from typing import ClassVar, Protocol, Self, runtime_checkable

from numpy.typing import ArrayLike  # assumption: the buffer type could also be defined locally


@runtime_checkable
class Numcodec(Protocol):
    codec_id: ClassVar[str]

    def encode(self, buf: ArrayLike) -> ArrayLike:
        ...

    def decode(self, buf: ArrayLike) -> ArrayLike:
        ...

    # Returning a Mapping (rather than a dict) enables typed dicts.
    def get_config(self) -> Mapping[str, object]:
        ...

    @classmethod
    def from_config(cls, data: Mapping[str, object]) -> Self:
        ...
```
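
To make the first bullet concrete, here is a rough sketch of one codec class that can emit both metadata layouts. `GzipSketch`, `to_json`, and `from_json` are hypothetical names I made up for illustration, not the existing zarr-python API; the only thing taken as given is the shape of gzip metadata in each format.

```python
from dataclasses import dataclass
from typing import Any, Literal


@dataclass(frozen=True)
class GzipSketch:
    """Hypothetical codec that serializes to both Zarr v2 and v3 metadata."""

    level: int = 5

    def to_json(self, zarr_format: Literal[2, 3]) -> dict[str, Any]:
        if zarr_format == 2:
            # Zarr v2 keeps the codec id and its parameters in one flat dict.
            return {"id": "gzip", "level": self.level}
        if zarr_format == 3:
            # Zarr v3 separates the codec name from its configuration.
            return {"name": "gzip", "configuration": {"level": self.level}}
        raise ValueError(f"Unsupported zarr_format: {zarr_format!r}")

    @classmethod
    def from_json(cls, data: dict[str, Any]) -> "GzipSketch":
        # A flat dict with an "id" key is treated as v2, anything else as v3.
        if "id" in data:
            return cls(level=data["level"])
        return cls(level=data["configuration"]["level"])


codec = GzipSketch(level=3)
print(codec.to_json(2))
print(codec.to_json(3))
```

With something like this, the same `GzipSketch()` instance could be passed regardless of `zarr_format`, and the array constructor would pick the serialization.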

We then define routines, like the ones in @brokkoli71's [PR](https://github.com/zarr-developers/zarr-python/pull/3037), that automatically wrap user input: objects that implement `numcodecs.abc.Codec` get wrapped in the respective zarr-python codec class. Because `create_array` takes separate `filters`, `serializer`, and `compressors` kwargs, we know which codec class (array-array, array-bytes, bytes-bytes) is the correct output for wrapping.
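
A minimal sketch of what such a wrapping routine could look like, assuming a trimmed-down version of the protocol. The wrapper classes, `wrap_codec`, and `FakeGzip` are all hypothetical stand-ins so the example runs without zarr or numcodecs installed; the real routines in the linked PR may look quite different.

```python
from dataclasses import dataclass
from typing import Any, ClassVar, Mapping, Protocol, runtime_checkable


@runtime_checkable
class Numcodec(Protocol):
    """Trimmed-down version of the protocol, enough for an isinstance check."""

    codec_id: ClassVar[str]

    def encode(self, buf: Any) -> Any: ...
    def decode(self, buf: Any) -> Any: ...
    def get_config(self) -> Mapping[str, object]: ...


@dataclass(frozen=True)
class WrappedCodec:
    """Hypothetical base for zarr-side wrappers around a numcodecs-style codec."""

    codec: Any


class ArrayArrayWrapper(WrappedCodec):
    """Wrapper chosen for the `filters` kwarg."""


class ArrayBytesWrapper(WrappedCodec):
    """Wrapper chosen for the `serializer` kwarg."""


class BytesBytesWrapper(WrappedCodec):
    """Wrapper chosen for the `compressors` kwarg."""


# The kwarg the codec arrived through tells us which codec class to produce.
_WRAPPERS = {
    "filters": ArrayArrayWrapper,
    "serializer": ArrayBytesWrapper,
    "compressors": BytesBytesWrapper,
}


def wrap_codec(codec: object, role: str) -> WrappedCodec:
    if isinstance(codec, WrappedCodec):
        return codec  # already wrapped, pass through unchanged
    if isinstance(codec, Numcodec):
        return _WRAPPERS[role](codec)
    raise TypeError(f"Cannot use {type(codec).__name__} as a codec")


class FakeGzip:
    """Stand-in for a numcodecs codec; it does no real compression."""

    codec_id = "gzip"

    def encode(self, buf: Any) -> Any:
        return buf

    def decode(self, buf: Any) -> Any:
        return buf

    def get_config(self) -> Mapping[str, object]:
        return {"id": self.codec_id, "level": 1}


print(type(wrap_codec(FakeGzip(), "compressors")).__name__)
```

The structural `isinstance` check is what lets this work without importing numcodecs at all: any object with the right attributes and methods qualifies.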

We can also define class methods on the Codec base class that enable construction of the codec from any Python object that implements the `Numcodec` protocol.
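
Roughly, such a class method might look like the sketch below. `CodecSketch` and `from_numcodec` are hypothetical names, and `FakeNumcodecsGzip` stands in for `numcodecs.GZip` so the example is self-contained. It relies only on the fact that `get_config()` on numcodecs codecs returns the parameters plus an `"id"` key.

```python
from typing import Any, ClassVar, Mapping


class CodecSketch:
    """Hypothetical stand-in for the zarr-python codec base class."""

    codec_id: ClassVar[str]

    def __init__(self, **config: Any) -> None:
        self.config = dict(config)

    @classmethod
    def from_numcodec(cls, codec: Any) -> "CodecSketch":
        # Copy the config so the caller's codec is never mutated.
        config = dict(codec.get_config())
        found_id = config.pop("id", None)
        if found_id != cls.codec_id:
            raise ValueError(f"Expected codec id {cls.codec_id!r}, got {found_id!r}")
        return cls(**config)


class GzipSketch(CodecSketch):
    codec_id = "gzip"


class FakeNumcodecsGzip:
    """Stand-in for numcodecs.GZip; only the parts this sketch needs."""

    codec_id = "gzip"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def get_config(self) -> Mapping[str, object]:
        return {"id": self.codec_id, "level": self.level}


codec = GzipSketch.from_numcodec(FakeNumcodecsGzip(level=7))
print(codec.config)
```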

This will allow zarr-python to eventually drop the numcodecs requirement entirely if we see fit, without any compatibility loss. Given the anemic maintenance of numcodecs, I see this as a very good thing.

There is one remaining concern -- how to handle codecs defined in the zarr v3 spec. For example, this case:

```python
from numcodecs import GZip
import zarr
zarr.create_array(..., compressors=GZip(), zarr_format=3)
```

Here we should inspect the `codec_id` attribute of the user-provided codec and see if that codec is one of the core codecs enshrined in the spec. If so, we should replace the user-provided codec with the one defined in the zarr spec. This ensures that, even if something changes about the codec configuration in numcodecs, zarr-python does not propagate invalid metadata. Users who want to circumvent this behavior can either subclass the zarr-python codec classes, or use a lower-level array constructor that doesn't do these checks.
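
As a sketch of that check, assuming a lookup table keyed by `codec_id`: `CORE_CODECS`, `normalize_codec`, `SpecGzip`, and the two fake codec classes are hypothetical and exist only to show the control flow, with only gzip registered as a core codec here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class SpecGzip:
    """Hypothetical zarr-python codec for the gzip codec defined in the v3 spec."""

    level: int = 5

    def to_v3_metadata(self) -> dict[str, Any]:
        return {"name": "gzip", "configuration": {"level": self.level}}


# Maps a numcodecs-style codec_id to a builder for the spec-defined codec.
CORE_CODECS: dict[str, Callable[[Mapping[str, Any]], Any]] = {
    "gzip": lambda config: SpecGzip(level=int(config["level"])),
}


def normalize_codec(codec: Any) -> Any:
    builder = CORE_CODECS.get(getattr(codec, "codec_id", None))
    if builder is None:
        # Not a core codec: leave it alone so it can be wrapped instead.
        return codec
    # Drop the "id" key; the spec codec controls its own name in metadata.
    config = {k: v for k, v in codec.get_config().items() if k != "id"}
    return builder(config)


class FakeGzip:
    """Stand-in for numcodecs.GZip."""

    codec_id = "gzip"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def get_config(self) -> Mapping[str, Any]:
        return {"id": self.codec_id, "level": self.level}


class FakeDelta:
    """Stand-in for a numcodecs codec that is not a v3 core codec."""

    codec_id = "delta"

    def get_config(self) -> Mapping[str, Any]:
        return {"id": self.codec_id, "dtype": "<i4"}


print(normalize_codec(FakeGzip(level=9)).to_v3_metadata())
```

Because the metadata is always produced by the spec-side class, a change to how numcodecs names or configures gzip would surface as an error in the builder instead of leaking into stored metadata.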

Together these changes will achieve the following goals:
- allow users to use `numcodecs`-compatible codecs with zarr v3
- simplify our codebase
- protect users from surprising numcodecs changes that can cause compatibility issues (examples: https://github.com/zarr-developers/numcodecs/pull/519, https://github.com/zarr-developers/numcodecs/pull/713). Numcodecs does not offer any compatibility guarantees for its codecs. zarr-python needs to be stricter.
- weaken our reliance on numcodecs, potentially allowing us to declare it an optional dependency

