Description
Hey! I was very interested to read about this package, as I have always wanted a way to "cache" xarray's open_mfdataset. My use case is that I have big climate datacubes sharing the exact same grid and data_vars, but split into yearly zarr stores. As you can probably tell, I run into an issue when I try to concatenate these stores into a virtual dataset, because the chunk shape along time is inconsistent in leap years (8784 vs. 8760 hourly steps, i.e. 366 × 24 vs. 365 × 24). The FAQ says:
> Some of your variables have inconsistent-length chunks, and you want to be able to concatenate them together. For example you might have multiple virtual datasets with coordinates of inconsistent length (e.g., leap years within multi-year daily data). Loading them allows you to rechunk them however you like.
but I could not really get this to work (my best-guess interpretation is sketched after the traceback below). I am trying to create a minimal example - and would love to contribute this as example documentation in a Jupyter notebook if you'd like - but I am not sure whether what I want is possible here.
I have created a mini example reproducing a similar use case. First, create a couple of dummy netCDF (HDF5) files with different time-axis lengths:
```python
from datetime import datetime, timedelta

import numpy as np
import xarray as xr

# Hourly timestamps: 31 * 24 = 744 steps for January, 28 * 24 = 672 for February.
january_times = [datetime(2025, 1, 1) + timedelta(hours=i) for i in range(744)]
february_times = [datetime(2025, 2, 1) + timedelta(hours=i) for i in range(672)]

lat = np.linspace(0, 10, 10)
lon = np.linspace(0, 10, 10)

# Each file is written as a single chunk along time, so the two files end up
# with different chunk shapes: (744, 10, 10) vs (672, 10, 10).
xr.Dataset(
    coords=dict(time=january_times, lon=lon, lat=lat),
    data_vars=dict(temperature=(("time", "lon", "lat"), np.ones((744, 10, 10)))),
).chunk(dict(time=-1)).to_netcdf("january.nc")

xr.Dataset(
    coords=dict(time=february_times, lon=lon, lat=lat),
    data_vars=dict(temperature=(("time", "lon", "lat"), np.ones((672, 10, 10)))),
).chunk(dict(time=-1)).to_netcdf("february.nc")
```

Then try to concatenate them using VirtualiZarr and xarray:
```python
from pathlib import Path

import xarray as xr
from obstore.store import LocalStore
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry

store_path = Path.cwd()
file_paths = [str(store_path / "january.nc"), str(store_path / "february.nc")]
file_urls = [f"file://{file_path}" for file_path in file_paths]

store = LocalStore(prefix=store_path)
registry = ObjectStoreRegistry({file_url: store for file_url in file_urls})
parser = HDFParser()

# Open each file as a virtual dataset, loading only the coordinates.
datasets = []
for url in file_urls:
    vds = open_virtual_dataset(
        url=url,
        parser=parser,
        registry=registry,
        loadable_variables=["time", "lat", "lon"],
    )
    datasets.append(vds)

mds = xr.concat(datasets, dim="time")
mds
```

This raises the exception:
```
ValueError: Cannot concatenate arrays with inconsistent chunk shapes: (672, 10, 10) vs (744, 10, 10). Requires ZEP003 (Variable-length Chunks).
```
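Based on the FAQ entry quoted above, my best guess was that the inconsistently chunked variable itself needs to be loaded rather than kept virtual, so that xarray gets a regular (re-chunkable) array. A minimal sketch of that reading, reusing the parser/registry from the reproducer above (I am not sure this is the intended approach):

```python
# Sketch of my reading of the FAQ: also load the data variable
# ("temperature" from the reproducer), not just the coordinates,
# so it is no longer a virtual array with a fixed chunk shape.
datasets = []
for url in file_urls:
    vds = open_virtual_dataset(
        url=url,
        parser=parser,
        registry=registry,
        loadable_variables=["time", "lat", "lon", "temperature"],
    )
    datasets.append(vds)

mds = xr.concat(datasets, dim="time")  # hoping this avoids the chunk-shape conflict
```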
I have also tried to rechunk the vds objects before the concat, but have not yet figured out how this should be done. Is what I am trying to do possible?
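For concreteness, this is the kind of rechunking I mean (the daily chunk size is arbitrary):

```python
# Hypothetical rechunk before concat (chunk size chosen arbitrarily);
# I could not figure out how to make something like this work while
# the variables are still virtual.
rechunked = [vds.chunk({"time": 24}) for vds in datasets]
mds = xr.concat(rechunked, dim="time")
```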