Description
What happened?
I ran into a large memory allocation when doing an isel on a zarr dataset with dask turned off (i.e. opened with `open_dataset("mydataset.zarr", chunks=None)`), which caused the Linux OOM killer to kill my process.
I used some memory profilers (memray + scalene) to dig in a little; it appears that a fancy isel on a `LazilyIndexedArray` ultimately causes `indexing._combine_indexers` to run `np.broadcast_arrays(...)` on a set of tuples, which causes the broadcast to return a numpy array of the same shape as the underlying dataset. See the MCVE for an example isel which I think tries to allocate exabytes of RAM. (Not when loading the data; when doing the lazy isel.)
The `VectorizedIndexer` created at the end of `_combine_indexers` also ends up being made of ndarrays the size of the data-to-be-selected, which could also be large.
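To illustrate the scaling, here is a minimal numpy-only sketch (not xarray's actual code path, and with `n` scaled down from the MCVE's 1,000,000 so it runs quickly): each `slice(None)` becomes an `arange`, and broadcasting it against the small fancy index yields index arrays whose shape grows with the dataset's dimensions rather than with the selection.

```python
import numpy as np

n = 1_000            # stand-in for the 1_000_000-long dims in the MCVE
t = np.array([[0]])  # the fancy index, shape (1, 1)
y = np.arange(n)     # a slice(None) turned into an index array
z = np.arange(n)

bt, by, bz = np.broadcast_arrays(
    t[:, :, None, None],
    y[None, None, :, None],
    z[None, None, None, :],
)
# The broadcast results are views, but any copy of them (or fancy
# indexing with them) materializes n * n elements per index array.
print(bt.shape)         # (1, 1, 1000, 1000)
print(bt.size * 3 * 8)  # ~24 MB of int64 indices even at n=1000
```

At the MCVE's n=1,000,000 the same arithmetic gives index arrays twelve orders of magnitude larger, which matches the blow-up observed above.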
What did you expect to happen?
I was hoping that I could index into a large dataset cheaply before loading it (e.g. by tracking ranges to be selected as slices). The original example and the one below both try to select a relatively small amount of the total data.
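For contrast, a hedged numpy-only sketch of the behaviour I was hoping for: slice-based (basic) indexing is just stride arithmetic and returns a view, so its cost is independent of the underlying array's size.

```python
import numpy as np

# A huge zero-stride array; creating it allocates almost nothing.
big = np.broadcast_to([1], (1_000_000, 1_000_000))

# Basic indexing with slices returns a view: no index arrays are
# materialized, regardless of how large `big` is.
sub = big[0:2, 0:2]
print(sub.base is not None)  # True: a view, not a copy
print(sub.shape)             # (2, 2)
```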
Minimal Complete Verifiable Example
```python
import xarray as xr
from xarray.core import indexing
import numpy as np


def test_dataset_indexing_memory() -> None:
    # Mimics open_dataset with a large backing array and no dask.
    big_backing_array = indexing.LazilyIndexedArray(
        np.broadcast_to([1], (1_000_000, 1_000_000, 1_000_000))
    )
    data_array = xr.DataArray(
        big_backing_array,
        coords={
            "t": np.arange(1_000_000),
            "y": np.arange(1_000_000),
            "z": np.arange(1_000_000),
        },
    )
    print("running isel")
    # I think any fancy isel will do
    data_array.isel(t=xr.Variable(("new", "t"), [[0]]))
```
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
- Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
Anything else we need to know?
Happy to share memory profiles or other debugging info if that would be helpful. Thank you for taking a look!
Environment
INSTALLED VERSIONS
commit: c8affb3
python: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:23:34) [Clang 18.1.8 ]
python-bits: 64
OS: Darwin
OS-release: 24.4.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.4
libnetcdf: 4.9.2
xarray: 2025.4.1.dev1+g070af11f
pandas: 2.2.3
numpy: 2.2.5
scipy: 1.15.2
netCDF4: 1.7.2
pydap: 3.5.5
h5netcdf: 1.6.1
h5py: 3.12.1
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: 3.11.1
bottleneck: 1.4.2
dask: 2024.8.2
distributed: 2024.8.2
matplotlib: 3.10.1
cartopy: 0.24.0
seaborn: 0.13.2
numbagg: 0.9.0
fsspec: 2025.3.2
cupy: None
pint: None
sparse: 0.16.0
flox: 0.10.1
numpy_groupies: 0.11.2
setuptools: 80.1.0
pip: 25.1
conda: 24.9.2
pytest: 8.3.5
mypy: 1.15.0
IPython: None
sphinx: None
Activity
dcherian commented on May 28, 2025
@jder is this fixed now?
jder commented on May 28, 2025
@dcherian sorry, no, the other fix I submitted was related (and made this worse) but doesn't solve the underlying broadcasting issue here. I haven't had a chance yet to dig into how to fix this one.
dcherian commented on May 28, 2025
After some exploring, this will take some major reworking.

`data_array.isel(t=xr.Variable(("new", "t"), [[0]]))` is translated to indexing with

```
VectorizedIndexer([[0]], slice(None), slice(None))
```

which becomes

```
VectorizedIndexer(array([[[[0]]]]), array([[[[ 0], [ 1], ..., [9999]]]], shape=(1, 1, 10000, 1)), array([[[[ 0, 1, 2, ..., 9997, 9998, 9999]]]], shape=(1, 1, 1, 10000)))
BasicIndexer(slice(None), slice(None), slice(None))
```

in xarray/xarray/core/indexing.py, Lines 977 to 979 in 303748c.

We could preserve any slice objects when creating a `VectorizedIndexer` instead of array-izing them (IIUC this is the fundamental problem). I think this would help in this case.

This will be pretty involved. For example, the logic for this loop would be more complicated to handle the slicing case, because `old_key.tuple` will contain slices:

xarray/xarray/core/indexing.py, Lines 977 to 979 in 303748c
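A rough sketch of the idea, using a hypothetical helper (`combine_preserving_slices` is not xarray's API): broadcast only the ndarray entries of a mixed key and carry the slice objects through untouched, so the combined key's memory cost scales with the fancy indices alone rather than with the sliced dimensions.

```python
import numpy as np


def combine_preserving_slices(key):
    """Hypothetical sketch: broadcast only the ndarray entries of a
    mixed key, leaving slice objects as-is instead of array-izing
    them (the rework proposed above)."""
    arrays = [k for k in key if isinstance(k, np.ndarray)]
    if not arrays:
        return key
    broadcast = iter(np.broadcast_arrays(*arrays))
    return tuple(
        next(broadcast) if isinstance(k, np.ndarray) else k
        for k in key
    )


key = (np.array([[0]]), slice(None), slice(None))
combined = combine_preserving_slices(key)
print([type(k).__name__ for k in combined])  # ['ndarray', 'slice', 'slice']
print(combined[0].shape)  # (1, 1): independent of the sliced dims
```

The hard part, as noted above, is that all downstream consumers of `VectorizedIndexer` would then have to handle keys containing slices, not just arrays.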