
Fancy isel on LazilyIndexedArray allocates memory of about the size of backing dataset #10311


Description

@jder
Contributor

What happened?

I ran into a large memory allocation when doing an isel on a zarr dataset with dask turned off (i.e. opened with open_dataset("mydataset.zarr", chunks=None)), which caused the Linux OOM killer to kill my process.

I used some memory profilers (memray + scalene) to dig in a little; it appears that a fancy isel on a LazilyIndexedArray ultimately causes indexing._combine_indexers to run np.broadcast_arrays(...) on a tuple of index arrays, which returns numpy arrays of the same shape as the underlying dataset. See the MCVE for an example isel which I think tries to allocate exabytes of RAM. (Not when loading the data; during the lazy isel itself.)

The VectorizedIndexer created at the end of _combine_indexers also ends up being made of ndarrays the size of the data-to-be-selected, which could also be large.
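To see why the broadcast blows up, here is a small-scale numpy sketch (my own illustration, with dimension sizes 3, 4 and 5 standing in for 1_000_000): the arrayized version of BasicIndexer(slice(None), slice(None), slice(None)) is a set of arange arrays with mutually orthogonal shapes, so broadcasting them against each other yields index arrays whose common shape is the full backing-array shape, not the selection shape.

```python
import numpy as np

# Arrayized full slices for a (3, 4, 5) backing array: each slice(None)
# becomes an arange reshaped orthogonally to the others.
x = np.arange(3).reshape(3, 1, 1)
y = np.arange(4).reshape(1, 4, 1)
z = np.arange(5).reshape(1, 1, 5)

# Broadcasting them yields the full backing shape. With three
# 1_000_000-sized dims this would be 1e18 elements per index array
# (~8 exabytes each, if materialized).
bx, by, bz = np.broadcast_arrays(x, y, z)
print(bx.shape)  # (3, 4, 5)
```

Note that np.broadcast_arrays itself returns zero-stride views, so the allocation is deferred until the broadcast arrays are fancy-indexed (as _combine_indexers does next), which copies them into real arrays.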

What did you expect to happen?

I was hoping that I could index into a large dataset cheaply before loading it (e.g. by tracking ranges to be selected as slices). The original example and the one below both try to select a relatively small amount of the total data.

Minimal Complete Verifiable Example

import xarray as xr
from xarray.core import indexing
import numpy as np

def test_dataset_indexing_memory() -> None:
    # Mimics open_dataset with a large backing array and no dask.
    big_backing_array = indexing.LazilyIndexedArray(
        np.broadcast_to([1], (1_000_000, 1_000_000, 1_000_000))
    )
    data_array = xr.DataArray(
        big_backing_array,
        coords={
            "t": np.arange(1_000_000),
            "y": np.arange(1_000_000),
            "z": np.arange(1_000_000),
        },
    )
    print("running isel")
    # I think any fancy isel will do
    data_array.isel(t=xr.Variable(("new", "t"), [[0]]))

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Anything else we need to know?

Happy to share memory profiles or other debugging info if that would be helpful. Thank you for taking a look!

Environment

INSTALLED VERSIONS

commit: c8affb3
python: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:23:34) [Clang 18.1.8 ]
python-bits: 64
OS: Darwin
OS-release: 24.4.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.4
libnetcdf: 4.9.2

xarray: 2025.4.1.dev1+g070af11f
pandas: 2.2.3
numpy: 2.2.5
scipy: 1.15.2
netCDF4: 1.7.2
pydap: 3.5.5
h5netcdf: 1.6.1
h5py: 3.12.1
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: 3.11.1
bottleneck: 1.4.2
dask: 2024.8.2
distributed: 2024.8.2
matplotlib: 3.10.1
cartopy: 0.24.0
seaborn: 0.13.2
numbagg: 0.9.0
fsspec: 2025.3.2
cupy: None
pint: None
sparse: 0.16.0
flox: 0.10.1
numpy_groupies: 0.11.2
setuptools: 80.1.0
pip: 25.1
conda: 24.9.2
pytest: 8.3.5
mypy: 1.15.0
IPython: None
sphinx: None

Activity

The "needs triage" label (issue that has not been reviewed by an xarray team member) was added on May 12, 2025 and removed on May 27, 2025.
dcherian commented on May 28, 2025

@jder is this fixed now?

jder (Author) commented on May 28, 2025

@dcherian sorry, no, the other fix I submitted was related (and made this worse) but doesn't solve the underlying broadcasting issue here. I haven't had a chance yet to dig into how to fix this one.

dcherian commented on May 28, 2025

After some exploring, this will take some major reworking:

  1. data_array.isel(t=xr.Variable(("new", "t"), [[0]])) is translated to indexing with VectorizedIndexer([[0]], slice(None), slice(None)).
  2. This is then "arrayized" to VectorizedIndexer(array([[[[0]]]]), array([[[[0], [1], ..., [9999]]]], shape=(1, 1, 10000, 1)), array([[[[0, 1, 2, ..., 9998, 9999]]]], shape=(1, 1, 1, 10000))).
  3. This is "combined" with the arrayized version of BasicIndexer(slice(None), slice(None), slice(None)) in

         return VectorizedIndexer(
             tuple(o[new_key.tuple] for o in np.broadcast_arrays(*old_key.tuple))
         )

     at which point we realize large indexer arrays in memory.
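These steps can be replayed at a small scale (my own sketch, with a 4 x 5 backing shape instead of 1_000_000 cubed) to show where the arrays get realized: the broadcast old-key views span the whole backing shape, and fancy-indexing them with the new key copies out selection-shaped arrays.

```python
import numpy as np

# old_key: arrayized BasicIndexer(slice(None), slice(None)) for a 4 x 5 array
old_key = (np.arange(4).reshape(4, 1), np.arange(5).reshape(1, 5))
# new_key: arrayized VectorizedIndexer([[0]], slice(None))
new_key = (np.array([[[0]]]), np.arange(5).reshape(1, 1, 5))

# The combine step: broadcasting old_key spans the full 4 x 5 backing
# shape (as views), and the fancy indexing then copies out arrays of
# the selection shape (1, 1, 5).
combined = tuple(o[new_key] for o in np.broadcast_arrays(*old_key))
print([c.shape for c in combined])  # [(1, 1, 5), (1, 1, 5)]
```

With the real dimension sizes, the broadcast spans 1_000_000**3 elements, which is where the huge allocation is triggered.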

We could preserve any slice objects when creating a VectorizedIndexer instead of array-izing them (IIUC this is the fundamental problem). I think this would help in this case.

This will be pretty involved. For example, the logic for this loop would become more complicated to handle the slicing case, because old_key.tuple will contain slices:

    return VectorizedIndexer(
        tuple(o[new_key.tuple] for o in np.broadcast_arrays(*old_key.tuple))
    )
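A rough sketch of the slice-preserving idea (compose_component is a hypothetical helper, not xarray's API, and stop handling is omitted for brevity): composing a new index onto a slice is just arithmetic, so no full-length arange ever needs to be allocated.

```python
import numpy as np

def compose_component(old, new):
    # Hypothetical helper, not part of xarray: compose one indexer
    # component lazily instead of arrayizing slices up front.
    if isinstance(old, slice):
        start = old.start or 0
        step = old.step or 1
        if isinstance(new, slice):
            # A slice of a slice is still a slice (stop handling elided
            # here; real code would have to clip against old.stop).
            new_start = start + (new.start or 0) * step
            return slice(new_start, None, step * (new.step or 1))
        # An integer array applied to a slice is plain arithmetic:
        # no dimension-length index array is ever allocated.
        return start + step * np.asarray(new)
    # Both array-like: ordinary fancy indexing of the (already small) old.
    return np.asarray(old)[new]
```

For instance, compose_component(slice(None), np.array([[0]])) returns the [[0]] array unchanged, and compose_component(slice(2, None, 3), np.array([0, 1])) returns array([2, 5]), in both cases without touching the backing dimension sizes.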

