xr.save_mfdataset() doesn't honor compute=False argument #4209

Description

@andersy005 (Member)

What happened:

While using the xr.save_mfdataset() function with compute=False, I noticed that it returns a dask.delayed object but doesn't actually defer the computation, i.e. it writes the datasets right away.

What you expected to happen:

I expected the datasets to be written only when I explicitly call .compute() on the returned delayed object.

Minimal Complete Verifiable Example:

In [2]: import xarray as xr

In [3]: ds = xr.tutorial.open_dataset('rasm', chunks={})

In [4]: ds
Out[4]:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
    yc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 dask.array<chunksize=(36, 205, 275), meta=np.ndarray>
Attributes:
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...
    institution:               U.W.
    source:                    RACM R1002RBRxaaa01a
    output_frequency:          daily
    output_mode:               averaged
    convention:                CF-1.4
    references:                Based on the initial model of Liang et al., 19...
    comment:                   Output from the Variable Infiltration Capacity...
    nco_openmp_thread_number:  1
    NCO:                       "4.6.0"
    history:                   Tue Dec 27 14:15:22 2016: ncatted -a dimension...

In [5]: path = "test.nc"

In [7]: ls -ltrh test.nc
ls: cannot access test.nc: No such file or directory

In [8]: tasks = xr.save_mfdataset(datasets=[ds], paths=[path], compute=False)

In [9]: tasks
Out[9]: Delayed('list-aa0b52e0-e909-4e65-849f-74526d137542')

In [10]: ls -ltrh test.nc
-rw-r--r-- 1 abanihi ncar 14K Jul  8 10:29 test.nc

Anything else we need to know?:

Environment:

Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun  1 2020, 18:57:50)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.21.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.15.1
pandas: 0.25.3
numpy: 1.18.5
scipy: 1.5.0
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.2.0
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.20.0
distributed: 2.20.0
matplotlib: 3.2.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 49.1.0.post20200704
pip: 20.1.1
conda: None
pytest: None
IPython: 7.16.1
sphinx: None

Activity

shoyer (Member) commented on Jul 9, 2020

The way compute=False currently works may be a little confusing. It doesn't actually delay creating files, it just delays writing the array data.
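To make these semantics concrete, here is a minimal sketch using a small synthetic dataset in place of the tutorial data (variable names and sizes are made up):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Small synthetic stand-in for the tutorial dataset.
ds = xr.Dataset(
    {"Tair": (("time", "y", "x"), np.random.rand(4, 3, 3))}
).chunk({"time": 2})

path = os.path.join(tempfile.mkdtemp(), "test.nc")
delayed = xr.save_mfdataset(datasets=[ds], paths=[path], compute=False)

# Per the behavior described above, the file (with its metadata) may
# already exist at this point; only the array data is deferred.
created_eagerly = os.path.exists(path)

delayed.compute()  # the deferred array writes happen here

with xr.open_dataset(path) as written:
    assert written["Tair"].shape == (4, 3, 3)
```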

andersy005 (Member, Author) commented on Jul 9, 2020

> The way compute=False currently works may be a little confusing. It doesn't actually delay creating files, it just delays writing the array data.

Interesting... I always assumed that all operations (including file creation) were delayed. So, this is a feature and not a bug then?

shoyer (Member) commented on Jul 9, 2020

>> The way compute=False currently works may be a little confusing. It doesn't actually delay creating files, it just delays writing the array data.
>
> Interesting... I always assumed that all operations (including file creation) were delayed. So, this is a feature and not a bug then?

Well, there is certainly a case for file creation also being lazy -- it is definitely more intuitive! This was more of an oversight than an intentional omission. Metadata generally needs to be written from a single process anyway, so we never got around to doing it with Dask.

That said, there are also some legitimate use cases where it is nice to be able to eagerly write metadata only without any array data. This is what we were proposing to do with compute=False in to_zarr:
#4035

dcherian (Contributor) commented on Jul 9, 2020

Here's an alternative map_blocks solution:

import xarray as xr


def write_block(ds, t0):
    # Only write non-empty blocks; map_blocks may pass an empty template.
    if len(ds.time) > 0:
        fname = (ds.time[0] - t0).values.astype("timedelta64[h]").astype(int)
        ds.to_netcdf(f"temp/file-{fname:06d}.nc")

    # map_blocks requires the function to return something; return a dummy.
    return ds.time


ds = xr.tutorial.open_dataset("air_temperature", chunks={"time": 100})
ds.map_blocks(write_block, kwargs=dict(t0=ds.time[0])).compute(scheduler="processes")

Two workarounds are needed here, though:

  1. The user function always has to return something.
  2. We can't provide template=ds.time because it has no chunk information, and ds.time.chunk({"time": 100}) silently does nothing because it is an IndexVariable. So the user function still needs the len(ds.time) > 0 check.

I think a cleaner API may be to have dask.compute([write_block(block) for block in ds.to_delayed()]) where ds.to_delayed() yields a bunch of tasks; each of which gives a Dataset wrapping one block of the underlying array.
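A ds.to_delayed() method does not exist in xarray; purely to illustrate the idea, the sketch below approximates it with a hypothetical iter_blocks helper that slices a synthetic dataset along its chunked dimension and wraps each per-block write in dask.delayed:

```python
import os
import tempfile

import dask
import numpy as np
import xarray as xr

# Synthetic dataset with two chunks along "time" (8 timesteps, 4 per chunk).
ds = xr.Dataset(
    {"air": (("time", "y", "x"), np.random.rand(8, 3, 3))}
).chunk({"time": 4})
outdir = tempfile.mkdtemp()


def iter_blocks(ds, dim="time"):
    # Hypothetical stand-in for the proposed ds.to_delayed(): yield one
    # Dataset per chunk along `dim`, following the existing chunk layout.
    start = 0
    for size in ds.chunks[dim]:
        yield ds.isel({dim: slice(start, start + size)})
        start += size


def write_block(block, i):
    # Write one block to its own file and return the path.
    path = os.path.join(outdir, f"file-{i:06d}.nc")
    block.to_netcdf(path)
    return path


tasks = [dask.delayed(write_block)(b, i) for i, b in enumerate(iter_blocks(ds))]
paths = dask.compute(*tasks)
assert len(paths) == 2  # 8 timesteps / 4 per chunk
```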
