-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Open
Labels
Description
What happened:
While using xr.save_mfdataset()
function with compute=False
I noticed that the function returns a dask.delayed
object, but it doesn't actually defer the computation i.e. it actually writes datasets right away.
What you expected to happen:
I expect the datasets to be written when I explicitly call .compute()
on the returned delayed object.
Minimal Complete Verifiable Example:
In [2]: import xarray as xr
In [3]: ds = xr.tutorial.open_dataset('rasm', chunks={})
In [4]: ds
Out[4]:
<xarray.Dataset>
Dimensions: (time: 36, x: 275, y: 205)
Coordinates:
* time (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
xc (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
yc (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
Dimensions without coordinates: x, y
Data variables:
Tair (time, y, x) float64 dask.array<chunksize=(36, 205, 275), meta=np.ndarray>
Attributes:
title: /workspace/jhamman/processed/R1002RBRxaaa01a/l...
institution: U.W.
source: RACM R1002RBRxaaa01a
output_frequency: daily
output_mode: averaged
convention: CF-1.4
references: Based on the initial model of Liang et al., 19...
comment: Output from the Variable Infiltration Capacity...
nco_openmp_thread_number: 1
NCO: "4.6.0"
history: Tue Dec 27 14:15:22 2016: ncatted -a dimension...
In [5]: path = "test.nc"
In [7]: ls -ltrh test.nc
ls: cannot access test.nc: No such file or directory
In [8]: tasks = xr.save_mfdataset(datasets=[ds], paths=[path], compute=False)
In [9]: tasks
Out[9]: Delayed('list-aa0b52e0-e909-4e65-849f-74526d137542')
In [10]: ls -ltrh test.nc
-rw-r--r-- 1 abanihi ncar 14K Jul 8 10:29 test.nc
Anything else we need to know?:
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.21.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4
xarray: 0.15.1
pandas: 0.25.3
numpy: 1.18.5
scipy: 1.5.0
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.2.0
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.20.0
distributed: 2.20.0
matplotlib: 3.2.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 49.1.0.post20200704
pip: 20.1.1
conda: None
pytest: None
IPython: 7.16.1
sphinx: None
charlesbmi
Metadata
Metadata
Assignees
Labels
Type
Projects
Milestone
Relationships
Development
Select code repository
Activity
shoyer commentedon Jul 9, 2020
The way
compute=False
currently works may be a little confusing. It doesn't actually delay creating files, it just delays writing the array data.andersy005 commentedon Jul 9, 2020
Interesting... I always assumed that all operations (including file creation) were delayed. So, this is a feature and not a bug then?
shoyer commentedon Jul 9, 2020
Well, there is certainly a case for file creation also being lazy -- it definitely is more intuitive! This was more of an oversight than an intentional omission. Metadata generally needs to be written from a single process, anyways, so we never got around to doing it with Dask.
That said, there are also some legitimate use cases where it is nice to be able to eagerly write metadata only without any array data. This is what we were proposing to do with
compute=False
into_zarr
:#4035
dcherian commentedon Jul 9, 2020
Here's an alternative
map_blocks
solution:There are two workarounds here though.
template=ds.time
because it has no chunk information andds.time.chunk({"time": 100})
silently does nothing because it is an IndexVariable. So the user function still needs thelen(ds.time) > 0
workaround.I think a cleaner API may be to have
dask.compute([write_block(block) for block in ds.to_delayed()])
whereds.to_delayed()
yields a bunch of tasks; each of which gives a Dataset wrapping one block of the underlying array.