Description
Hey! I was very interested to read about this package, as I have always wanted a way to "cache" xarray's open_mfdataset. My use case is that I have big climate datacubes sharing the exact same grid and data_vars, but split into yearly zarr stores. As you can probably tell, I run into an issue when I try to concatenate these stores into a virtual dataset, because the chunk shape along time is inconsistent in leap years (8784 vs. 8760 hourly steps, i.e. 366 × 24 vs. 365 × 24). The FAQ says:
> Some of your variables have inconsistent-length chunks, and you want to be able to concatenate them together. For example you might have multiple virtual datasets with coordinates of inconsistent length (e.g., leap years within multi-year daily data). Loading them allows you to rechunk them however you like.
but I could not really get this to work (my best-guess interpretation is sketched after the traceback below). I am trying to create a minimal example - and would love to contribute this as example documentation in a Jupyter notebook if you'd like - but I am not sure whether what I want is possible here.
I have created a mini example reproducing a similar use case. First, create a couple of dummy netCDF (HDF5) files with different time-axis lengths:
```python
from datetime import datetime, timedelta

import numpy as np
import xarray as xr

# Hourly timestamps: 31 * 24 = 744 steps for January, 28 * 24 = 672 for February.
january_times = [datetime(2025, 1, 1) + timedelta(hours=i) for i in range(744)]
february_times = [datetime(2025, 2, 1) + timedelta(hours=i) for i in range(672)]

lat = np.linspace(0, 10, 10)
lon = np.linspace(0, 10, 10)

# Each file is written as a single chunk along time, so the two files end up
# with different chunk shapes: (744, 10, 10) vs (672, 10, 10).
xr.Dataset(
    coords=dict(time=january_times, lon=lon, lat=lat),
    data_vars=dict(temperature=(("time", "lon", "lat"), np.ones((744, 10, 10)))),
).chunk(dict(time=-1)).to_netcdf("january.nc")

xr.Dataset(
    coords=dict(time=february_times, lon=lon, lat=lat),
    data_vars=dict(temperature=(("time", "lon", "lat"), np.ones((672, 10, 10)))),
).chunk(dict(time=-1)).to_netcdf("february.nc")
```

Then try to concatenate them using VirtualiZarr and xarray:
```python
from pathlib import Path

import xarray as xr
from obstore.store import LocalStore
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry

store_path = Path.cwd()
file_paths = [str(store_path / "january.nc"), str(store_path / "february.nc")]
file_urls = [f"file://{file_path}" for file_path in file_paths]

store = LocalStore(prefix=store_path)
registry = ObjectStoreRegistry({file_url: store for file_url in file_urls})
parser = HDFParser()

# Open each file as a virtual dataset, loading only the coordinates.
datasets = []
for url in file_urls:
    vds = open_virtual_dataset(
        url=url,
        parser=parser,
        registry=registry,
        loadable_variables=["time", "lat", "lon"],
    )
    datasets.append(vds)

mds = xr.concat(datasets, dim="time")
mds
```

This raises the exception:
```
ValueError: Cannot concatenate arrays with inconsistent chunk shapes: (672, 10, 10) vs (744, 10, 10). Requires ZEP003 (Variable-length Chunks).
```
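Based on the FAQ entry quoted above, my best guess was that the inconsistently chunked variable itself needs to be loaded rather than kept virtual, so that xarray gets a regular (re-chunkable) array. A minimal sketch of that reading, reusing the parser/registry from the reproducer above (I am not sure this is the intended approach):

```python
# Sketch of my reading of the FAQ: also load the data variable
# ("temperature" from the reproducer), not just the coordinates,
# so it is no longer a virtual array with a fixed chunk shape.
datasets = []
for url in file_urls:
    vds = open_virtual_dataset(
        url=url,
        parser=parser,
        registry=registry,
        loadable_variables=["time", "lat", "lon", "temperature"],
    )
    datasets.append(vds)

mds = xr.concat(datasets, dim="time")  # hoping this avoids the chunk-shape conflict
```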
I have also tried to rechunk the vds objects before the concat, but have not yet figured out how this should be done. Is what I am trying to do possible?
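For concreteness, this is the kind of rechunking I mean (the daily chunk size is arbitrary):

```python
# Hypothetical rechunk before concat (chunk size chosen arbitrarily);
# I could not figure out how to make something like this work while
# the variables are still virtual.
rechunked = [vds.chunk({"time": 24}) for vds in datasets]
mds = xr.concat(rechunked, dim="time")
```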