Remote access patterns using xarray. #237

Open
@betolink

Description

I'm not sure if this will fit in the upcoming (potential) SciPy tutorial or somewhere else, but I think it would be helpful to include a mini-guide on access patterns for remote storage. One of xarray's key strengths is, in a way, also a weakness: its abstractions for opening multi-file datasets are so powerful that they can hide the nuances of the different back-end storage types.

When a new user sees this and gets a data cube back, it's like magic!

ds = xr.open_dataset(reference, engine="zarr")

and although this is the cloud-native way, a considerable amount of data is still in archival formats or only available through a service like OPeNDAP. In an ideal world, users shouldn't have to care what format their data is in or where it lives, but I've run into multiple instances where it's not that xarray isn't doing its job; the data is simply in HDF on a slow server on another continent.
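To make the format point concrete, here's a minimal, illustrative sketch (not tied to any real server): xarray can open plain file-like objects, and that is essentially the code path fsspec uses when it hands xarray an open handle to a remote NetCDF/HDF5 file. The in-memory round-trip below assumes the scipy backend; with a genuinely remote handle, every one of these byte-range reads would be a network round trip, which is where the slowness comes from.

```python
import io

import numpy as np
import xarray as xr

# A small dataset standing in for a real granule.
ds = xr.Dataset({"t": ("x", np.arange(3.0))})

# With no target path, to_netcdf() returns the file contents as bytes
# (this uses the scipy backend, i.e. netCDF3).
payload = ds.to_netcdf()

# xarray accepts file-like objects; this is the same code path used when
# fsspec hands xarray an open handle to a remote file, except that here
# every byte-range read is local and therefore fast.
reopened = xr.open_dataset(io.BytesIO(payload))
print(reopened["t"].values)  # [0. 1. 2.]
```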

Sometimes there are workarounds, from using different sources (e.g. Planetary Computer, GEE) that serve the same data in a cloud-optimized format, to using Kerchunk or clever caching strategies. I feel that some of these topics are buried in GitHub threads and not necessarily exposed in the documentation.
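As one runnable sketch of such a caching strategy: fsspec's `simplecache::` protocol copies the target to a local cache on first open, so repeated byte-range reads never go back to the original store. The example uses a local stand-in file so it runs anywhere; for a real remote source, only the URL scheme would change.

```python
import os
import tempfile

import fsspec
import numpy as np
import xarray as xr

# Write a small netCDF file locally to stand in for a slow remote archive.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.nc")
xr.Dataset({"t": ("x", np.arange(4.0))}).to_netcdf(path, engine="scipy")

# "simplecache::" makes fsspec copy the target to a local cache on first
# open, so subsequent byte-range reads never hit the original store again.
# For a real remote source you would swap file:// for https:// or s3://.
with fsspec.open(f"simplecache::file://{path}", mode="rb") as f:
    ds = xr.open_dataset(f)
    print(ds["t"].values)  # [0. 1. 2. 3.]
```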

The idea would be to quickly illustrate what xarray does if I have files of type X and this access pattern:

file_set = [fsspec.open(f) for f in files]
ds = xr.open_mfdataset(file_set) 

What happens if my files are HDF4, NetCDF, or HDF5? What are steps 1, 2, 3...? Can we make it faster, and how?
What if the data is behind OPeNDAP? etc.
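To ground the "steps 1, 2, 3" question, here is a rough, eagerly evaluated sketch of what `open_mfdataset` amounts to: open each file, then combine along the shared coordinate. This is only an approximation; the real implementation additionally wraps each variable in a lazy dask array (which is where chunking and parallel reads come in) and supports preprocessing hooks, which this omits.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Two small "granules" standing in for a set of remote files.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    p = os.path.join(tmpdir, f"part{i}.nc")
    xr.Dataset(
        {"t": ("time", np.arange(2.0) + 2 * i)},
        coords={"time": np.arange(2) + 2 * i},
    ).to_netcdf(p, engine="scipy")
    paths.append(p)

# Roughly the eager equivalent of open_mfdataset(paths):
# open each file, then combine along the shared "time" coordinate.
parts = [xr.open_dataset(p) for p in paths]
combined = xr.combine_by_coords(parts)
print(combined["t"].values)  # [0. 1. 2. 3.]
```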

I also wonder if this information is already out there in the docs and perhaps just needs to be compiled into a single notebook; I volunteer to start one if it's not.
