
Why does xr.apply_ufunc support numpy/dask.arrays? #8995

TomNicholas (Member)

What is your issue?

@keewis pointed out that it's weird that xarray.apply_ufunc supports passing numpy/dask arrays directly, and I'm inclined to agree. I don't understand why we do, and think we should consider removing that feature.

Two arguments in favour of removing it:

  1. It exposes users to transposition errors

Consider this example:

In [1]: import xarray as xr

In [2]: import numpy as np

In [3]: arr = np.arange(12).reshape(3, 4)

In [4]: def mean(obj, dim):
   ...:     # note: apply always moves core dimensions to the end
   ...:     return xr.apply_ufunc(
   ...:         np.mean, obj, input_core_dims=[[dim]], kwargs={"axis": -1}
   ...:     )
   ...: 

In [5]: mean(arr, dim='time')
Out[5]: array([1.5, 5.5, 9.5])

In [6]: mean(arr.T, dim='time')
Out[6]: array([4., 5., 6., 7.])

Transposing the input leads to a different result, with the value of the dim kwarg effectively ignored. This kind of error is what xarray code is supposed to prevent by design.

  2. There is an alternative input pattern that doesn't require accepting bare arrays

Instead, any numpy/dask array can just be wrapped up into an xarray Variable/NamedArray before passing it to apply_ufunc.

In [7]: from xarray.core.variable import Variable

In [8]: var = Variable(data=arr, dims=['time', 'space'])

In [9]: mean(var, dim='time')
Out[9]: 
<xarray.Variable (space: 4)> Size: 32B
array([4., 5., 6., 7.])

In [10]: mean(var.T, dim='time')
Out[10]: 
<xarray.Variable (space: 4)> Size: 32B
array([4., 5., 6., 7.])

This now guards against the transposition error, and puts the onus on the user to be clear about which axes of their array correspond to which dimension.

With Variable/NamedArray as public API, this latter pattern can handle every case that passing bare arrays in could.
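To make the guard concrete, here is a quick check (a sketch of current behaviour, re-using the `mean` helper defined above): a core dimension name that a Variable doesn't have raises immediately, while the same name on a bare array is silently accepted, since unlabeled arrays are assumed to already have their core dimensions as trailing axes.

```python
import numpy as np
import xarray as xr

arr = np.arange(12).reshape(3, 4)
var = xr.Variable(dims=["time", "space"], data=arr)

def mean(obj, dim):
    # note: apply always moves core dimensions to the end
    return xr.apply_ufunc(np.mean, obj, input_core_dims=[[dim]], kwargs={"axis": -1})

# Bare array: the dim name is never checked, so a nonsense name
# "works" and the last axis is reduced regardless.
bare_result = mean(arr, dim="depth")

# Variable: a core dimension that doesn't exist raises immediately.
try:
    mean(var, dim="depth")
    raised = False
except ValueError:
    raised = True
```

With the Variable input the mistake surfaces at the call site instead of producing a silently wrong reduction.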

I suggest we deprecate accepting bare arrays in favour of having users wrap them in Variable/NamedArray/DataArray objects instead.

(Note 1: We also accept raw scalars, but this doesn't expose anyone to transposition errors.)

(Note 2: In a quick scan of the apply_ufunc docstring, the docs on it in computation.rst, and the extensive guide that @dcherian wrote in the xarray tutorial repository, I can't see any examples that actually pass bare arrays to apply_ufunc.)
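If deprecation is the route taken, the warning behaviour might look roughly like the following sketch. `apply_ufunc_strict` is a hypothetical name for illustration only, not xarray API; a real deprecation would live inside `apply_ufunc` itself.

```python
import warnings

import numpy as np
import xarray as xr

def apply_ufunc_strict(func, *args, **kwargs):
    # Hypothetical shim: warn when a bare array is passed.
    # Scalars stay exempt (they pose no transposition risk).
    for arg in args:
        if isinstance(arg, np.ndarray) and arg.ndim > 0:
            warnings.warn(
                "passing bare arrays to apply_ufunc is deprecated; "
                "wrap them in a Variable/DataArray with named dims instead",
                FutureWarning,
                stacklevel=2,
            )
    return xr.apply_ufunc(func, *args, **kwargs)

arr = np.arange(12).reshape(3, 4)
var = xr.Variable(dims=["time", "space"], data=arr)

# Bare array input triggers the deprecation warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    apply_ufunc_strict(np.mean, arr, input_core_dims=[["time"]], kwargs={"axis": -1})

# Wrapped input passes through silently.
with warnings.catch_warnings(record=True) as caught_var:
    warnings.simplefilter("always")
    apply_ufunc_strict(np.mean, var, input_core_dims=[["time"]], kwargs={"axis": -1})
```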

Activity

gmoutso commented on Jun 11, 2024

I often find myself defining two versions of a function, because I need the numpy version with scipy.optimize and use the xarray version as a convenience, since it is cleaner with named dims. So this topic interests me. Dressing a numpy array in a Variable and calling the xarray function is slower than calling apply_ufunc on the bare array, and both are slower than using a numpy ufunc directly.

import numpy as np
import xarray as xr

rnd = np.random.default_rng()
x1 = rnd.normal(size=(10, 100))
f1 = lambda x: x.mean(axis=-1)
x2 = xr.Variable(dims=["a", "b"], data=x1)
f2 = lambda x: xr.apply_ufunc(np.mean, x,
                              input_core_dims=[["b"]],
                              output_core_dims=[[]],
                              output_dtypes=[float],
                              kwargs={"axis": -1})
%timeit f1(x1)
%timeit f2(x2)
%timeit f2(x1)
TomNicholas (Member, Author) commented on Jun 12, 2024

Hi @gmoutso - I'm not clear from your comment whether you are for or against this change!

Dressing a numpy array in a Variable and calling the xarray function is slower than calling apply_ufunc on the bare array, and both are slower than using a numpy ufunc directly.

Well yes, there will be some overhead from wrapping with a higher-level API. But that overhead shouldn't scale with the size of the array. Here are the results when I ran your example just now for arrays of different sizes.

import time

import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

rnd = np.random.default_rng()
np_fn = lambda x: x.mean(axis=-1)
xr_fn = lambda x: xr.apply_ufunc(np.mean, x,
                                 input_core_dims=[["b"]],
                                 kwargs={"axis": -1})

sizes = [10, 100, 1000, 10000]
times_np_fn_np_arr = []
times_xr_fn_np_arr = []
times_xr_fn_xr_var = []
for size in sizes:
    np_arr = rnd.normal(size=(size, 10 * size))
    xr_var = xr.Variable(dims=["a", "b"], data=np_arr)

    start_time = time.time()
    np_fn(np_arr)
    times_np_fn_np_arr.append(time.time() - start_time)

    start_time = time.time()
    xr_fn(np_arr)
    times_xr_fn_np_arr.append(time.time() - start_time)

    start_time = time.time()
    xr_fn(xr_var)
    times_xr_fn_xr_var.append(time.time() - start_time)

# Plot the results
plt.figure(figsize=(7, 5))
plt.plot(sizes, times_np_fn_np_arr, label='np_fn_np_arr', marker='o')
plt.plot(sizes, times_xr_fn_np_arr, label='xr_fn_np_arr', marker='o')
plt.plot(sizes, times_xr_fn_xr_var, label='xr_fn_xr_var', marker='o')
plt.xlabel('Size')
plt.ylabel('Execution Time (seconds)')
plt.title('Execution Time vs. Size for Different Functions')
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.grid(True)
plt.show()
[Screenshot (2024-06-12): log-log plot of execution time vs. array size for the three functions; the curves converge at large sizes]

As expected they converge to the same execution time, so I don't think this overhead is a problem.

gmoutso commented on Jun 28, 2024

Hi @TomNicholas . Thank you for your reply and benchmarks.

I'm not clear from your comment whether you are for or against this change!

Actually I am neither for nor against. I wanted to register why this topic interests me, even if it is not clear to me either!

Currently I define two versions of each function for the same calculation. One is in numpy, for use in atomic calculations that get wrapped by a scipy.optimize routine. The other is in xarray, because it is easier to write, clearer to use, and much easier to apply to xarray arrays. Although the time difference is just overhead (as your plots suggest), that is not good enough for me, because scipy.optimize will call the function many, many times within a single call. E.g. minimising an output (here an atomic float, since all output dims are output_core_dims that are empty: the objective of minimize).

Using xarray functions with numpy arrays (the topic here) is potentially useful, but in my use case too slow compared to pure numpy. I am not suggesting a solution, only registering my interest in case it is useful to the discussion. Thank you!
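For what it's worth, the two-versions burden can sometimes be reduced to one core function plus a thin wrapper: the bare numpy function goes straight into the scipy.optimize hot loop with no xarray overhead per call, and apply_ufunc wraps it once for named-dim convenience. A minimal sketch with a made-up objective (`residual_np` and `residual_xr` are illustrative names, not anyone's API):

```python
import numpy as np
import xarray as xr

# Single numpy core function; cheap per call, so it can be handed
# straight to a scipy.optimize callback without xarray overhead.
def residual_np(x):
    return ((x - 3.0) ** 2).mean(axis=-1)

# Thin xarray wrapper for named-dim convenience outside the hot loop.
def residual_xr(obj, dim):
    return xr.apply_ufunc(residual_np, obj, input_core_dims=[[dim]])

x = np.linspace(0.0, 6.0, 5)
var = xr.Variable(dims=["b"], data=x)

# Both paths compute the same value.
assert np.isclose(residual_np(x), residual_xr(var, dim="b").values)
```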

          Why does xr.apply_ufunc support numpy/dask.arrays? · Issue #8995 · pydata/xarray