Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrievals in combination with xarray depend on the used cluster #19

Open
Tracked by #12
observingClouds opened this issue Dec 8, 2022 · 4 comments
Open
Tracked by #12

Comments

@observingClouds
Copy link
Owner

xarray seems to request different amounts of files concurrently depending on the cluster configuration:

levante interactive

import xarray as xr
ds=xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")
ds.air.mean().compute()
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.0.0|0.0.1"}}]}'
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.1.0|0.1.1"}}]}'
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"1.0.0|1.0.1"}}]}'
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"1.1.0|1.1.1"}}]}'

levante compute

import xarray as xr
ds=xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")
ds.air.mean().compute()
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.1
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.1
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.1
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.0.0|0.0.1|0.1.0|0.1.1|1.0.0|1.0.1|1.1.0|1.1.1"}}]}'
@antarcticrainforest
Copy link
Contributor

antarcticrainforest commented Jan 12, 2023

Again, this is related to my comment in #10 . I think

@observingClouds
Copy link
Owner Author

A few more insights. The different is likely caused by dask and how the dask task graph looks like. Depending on the available resources the task graph is created differently. If more resources are available then more data is requested.

#21 scans the task graph and gathers all open-dataset requests and is thereby independent of the available resources.

@observingClouds
Copy link
Owner Author

With #21 being merged the recommended way to retrieve files is to use the ds.slk.stage() command which operates independent of the available resources.

@observingClouds
Copy link
Owner Author

This issue unfortunately seems to remain. Dask still schedules the retrievals depending on the available resources to the cluster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants