
Question about loadable_variables for leap-year daily-data concatenation #813


Description

@simentha

Hey! I was very interested to read about this package, as I have always wanted a way to "cache" open_mfdataset from xarray. My use case is that I have big climate datacubes sharing the exact same grid and data_vars, but divided into yearly zarr stores. As you can probably tell, I run into an issue when I try to concatenate these stores into a virtual dataset, because the chunk shape along time is inconsistent for leap years (8784 vs. 8760 time steps). In the FAQ it is written that:

Some of your variables have inconsistent-length chunks, and you want to be able to concatenate them together. For example you might have multiple virtual datasets with coordinates of inconsistent length (e.g., leap years within multi-year daily data). Loading them allows you to rechunk them however you like.

but I could not really get this to work. I am trying to create a minimal example (and would love to contribute it as example documentation in a Jupyter notebook if you'd like), but I am not sure whether it is possible to do what I want here.
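For context, my mental model of that FAQ sentence is the plain-xarray situation below (a toy sketch with made-up hourly arrays, no VirtualiZarr involved): once the inconsistently sized arrays are actually loaded into memory, concatenating them and then rechunking to whatever uniform chunk size I like is straightforward, so the difficulty presumably only exists while the arrays stay virtual.

import numpy as np
import xarray as xr

# Toy illustration only: a leap year and a non-leap year of hourly data.
a = xr.Dataset({"temperature": ("time", np.ones(8784))})  # 366 * 24
b = xr.Dataset({"temperature": ("time", np.ones(8760))})  # 365 * 24

# Once both are in memory, concatenation and uniform rechunking just work.
combined = xr.concat([a, b], dim="time").chunk({"time": 24 * 365})
print(combined["temperature"].chunks)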

I have created a mini example to reproduce a similar use case to what I want. First, create a couple of dummy netCDF (HDF5) datasets with different time-axis lengths:

from datetime import datetime, timedelta
import numpy as np
import xarray as xr

january_times = [datetime(2025, 1, 1) + timedelta(hours=i) for i in range(0, 744)]
february_times = [datetime(2025, 2, 1) + timedelta(hours=i) for i in range(0, 672)]
lat = np.linspace(0, 10, 10)
lon = np.linspace(0, 10, 10)

ds_january = xr.Dataset(
    coords=dict(time=january_times, lon=lon, lat=lat),
    data_vars=dict(temperature=(("time", "lon", "lat"), np.ones((744, 10, 10)))),
).chunk(dict(time=-1)).to_netcdf("january.nc")

ds_february = xr.Dataset(
    coords=dict(time=february_times, lon=lon, lat=lat),
    data_vars=dict(temperature=(("time", "lon", "lat"), np.ones((672, 10, 10)))),
).chunk(dict(time=-1)).to_netcdf("february.nc")

Then try to concat them using VirtualiZarr and xarray:

from obstore.store import LocalStore

from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry
import xarray as xr

from pathlib import Path

store_path = Path.cwd()
file_paths = [str(store_path / "january.nc"), str(store_path / "february.nc")]
file_urls = [f"file://{file_path}" for file_path in file_paths]

store = LocalStore(prefix=store_path)
registry = ObjectStoreRegistry({file_url: store for file_url in file_urls})
parser = HDFParser()
datasets = []
for url in file_urls:
    vds = open_virtual_dataset(
        url=url, parser=parser, registry=registry, loadable_variables=["time", "lat", "lon"]
    )
    datasets.append(vds)

mds = xr.concat(datasets, dim="time")
mds

This raises the exception:

ValueError: Cannot concatenate arrays with inconsistent chunk shapes: (672, 10, 10) vs (744, 10, 10) .Requires ZEP003 (Variable-length Chunks).

I have tried re-chunking the vds before concatenating, but have not yet figured out how this should be done. Is what I am trying to do here possible?
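For what it's worth, my best guess from the FAQ wording is that the data variable itself also needs to be listed in loadable_variables, so that temperature is loaded into memory instead of staying a virtual array with a fixed chunk shape. Something like the sketch below, which is an assumption on my part and not verified:

# Sketch of my guess (untested): also load the variable whose chunk shape
# differs between files, so concat is no longer constrained by on-disk chunks.
datasets = []
for url in file_urls:
    vds = open_virtual_dataset(
        url=url,
        parser=parser,
        registry=registry,
        loadable_variables=["time", "lat", "lon", "temperature"],
    )
    datasets.append(vds)

mds = xr.concat(datasets, dim="time")

If that is the intended pattern, I would also be curious how the loaded variable should be rechunked before writing the combined dataset out.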
