-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
What is your issue?
@keewis pointed out that it's weird that xarray.apply_ufunc
supports passing numpy/dask arrays directly, and I'm inclined to agree. I don't understand why we do, and think we should consider removing that feature.
Two arguments in favour of removing it:
- It exposes users to transposition errors
Consider this example:
In [1]: import xarray as xr
In [2]: import numpy as np
In [3]: arr = np.arange(12).reshape(3, 4)
In [4]: def mean(obj, dim):
...: # note: apply always moves core dimensions to the end
...: return xr.apply_ufunc(
...: np.mean, obj, input_core_dims=[[dim]], kwargs={"axis": -1}
...: )
...:
In [5]: mean(arr, dim='time')
Out[5]: array([1.5, 5.5, 9.5])
In [6]: mean(arr.T, dim='time')
Out[6]: array([4., 5., 6., 7.])
Transposing the input leads to a different result, with the value of the dim
kwarg effectively ignored. This kind of error is what xarray code is supposed to prevent by design.
- There is an alternative input pattern that doesn't require accepting bare arrays
Instead, any numpy/dask array can just be wrapped up into an xarray Variable
/NamedArray
before passing it to apply_ufunc
.
In [7]: from xarray.core.variable import Variable
In [8]: var = Variable(data=arr, dims=['time', 'space'])
In [9]: mean(var, dim='time')
Out[9]:
<xarray.Variable (space: 4)> Size: 32B
array([4., 5., 6., 7.])
In [10]: mean(var.T, dim='time')
Out[10]:
<xarray.Variable (space: 4)> Size: 32B
array([4., 5., 6., 7.])
This now guards against the transposition error, and puts the onus on the user to be clear about which axes of their array correspond to which dimension.
With Variable
/NamedArray
as public API, this latter pattern can handle every case that passing bare arrays in could.
I suggest we deprecate accepting bare arrays in favour of having users wrap them in Variable
/NamedArray
/DataArray
objects instead.
(Note 1: We also accept raw scalars, but this doesn't expose anyone to transposition errors.)
(Note 2: In a quick scan of the apply_ufunc
docstring, the docs on it in computation.rst
, and the extensive guide that @dcherian wrote in the xarray tutorial repository, I can't see any examples that actually pass bare arrays to apply_ufunc
.)
Activity
gmoutso commentedon Jun 11, 2024
I find myself defining two versions of functions often because I need to use the numpy versions with scipy.optimise and use the xarray versions as a convenience because it is cleaner with named dims. So this topic interests me. Dressing a numpy with Variable and using an xarray function is slower that the apply_ufunc and both slower than using a numpy ufunc.
TomNicholas commentedon Jun 12, 2024
Hi @gmoutso - I'm not clear from your comment whether you are for or against this change!
Well yes, there will be some overhead from wrapping with a higher-level API. But that overhead shouldn't scale with the size of the array. Here's results when I tried your just now for arrays of different sizes.
As expected they converge to the same execution time, so I don't think this overhead is a problem.
gmoutso commentedon Jun 28, 2024
Hi @TomNicholas . Thank you for your reply and benchmarks.
Actually I am neither for or against. I wanted to register why this topic would interest me, even if it is not clear to me either!
Currently I define two versions of functions for the same calculation. One for numpy for use in atomic calculations that will be wrapped with a
scipy.optimise
function. The other for xarray because it is easier to write, clearer to use, and much easier to use with xarray arrays. Although the time difference is due to the overhead (as your plots suggest), this is not good enough for me because scipy.optimise will call the function many many times for a single call. Eg minimise an output (here an atomic float as all output dims are output_core_dims that are none - the objective of minimise).Using xarray functions with numpy arrays (as the topic is here) is potentially useful but too slow in my use compared to pure numpy. I am not suggesting a solution, only registering my interest if useful to the discussion. Thank you!