Faster fastpath #9001

hmaarrfk · 2024-05-05T14:51:22Z

I noticed that slicing into datetime64[ns] arrays was really starting to limit our overall performance.

What is strangest is that it is faster to do it on a lazy array than an in-memory array.

I think it comes down to the code on the main branch triggers:

xarray/xarray/core/variable.py

Line 220 in aaa778c

def _possibly_convert_objects(values):

which does a numpy->pandas -> numpy conversion for safety.

But that seems a little over the top.

My benchmark of interest is:

Indexing into a 1D array with a scalar to get a single entry from it out.

This often occurs when we slice into our larger datasets.

on main

python software_timestamp_benchmark.py
software_timestamp.nc
h5netcdf  lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 8664.24it/s]
h5netcdf  computed  : 100%|███████████| 100000/100000 [00:16<00:00, 5976.98it/s]
netcdf4   lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 8453.77it/s]
netcdf4   computed  : 100%|███████████| 100000/100000 [00:16<00:00, 5924.96it/s]
software_timestamp_float64.nc
h5netcdf  lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 9042.25it/s]
h5netcdf  computed  : 100%|██████████| 100000/100000 [00:07<00:00, 14157.30it/s]
netcdf4   lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 9005.22it/s]
netcdf4   computed  : 100%|██████████| 100000/100000 [00:07<00:00, 13935.46it/s]

on this branch:

$ python software_timestamp_benchmark.py
software_timestamp.nc
h5netcdf  lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 8951.20it/s]
h5netcdf  computed  : 100%|██████████| 100000/100000 [00:06<00:00, 14793.26it/s]
netcdf4   lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 8774.30it/s]
netcdf4   computed  : 100%|██████████| 100000/100000 [00:06<00:00, 14671.77it/s]
software_timestamp_float64.nc
h5netcdf  lazy      : 100%|███████████| 100000/100000 [00:10<00:00, 9364.32it/s]
h5netcdf  computed  : 100%|██████████| 100000/100000 [00:06<00:00, 14955.89it/s]
netcdf4   lazy      : 100%|███████████| 100000/100000 [00:10<00:00, 9324.61it/s]
netcdf4   computed  : 100%|██████████| 100000/100000 [00:06<00:00, 14697.09it/s]

The benchmark

import xarray
# from tqdm import tqdm
tqdm = lambda x, *args, **kwargs: x

# dataset = xarray.open_dataset('.nc')
for filename in [
    # 'software_timestamp_float64.nc',
    'software_timestamp.nc',
]:
    print(filename)
    for engine in [
        'h5netcdf', 
        # 'netcdf4',
    ]:
        for compute in [
            # False, 
            True,
        ]:
            dataset = xarray.open_dataset(filename, engine='h5netcdf')
            N_frames = dataset.sizes['frame_number']
            N_iters = 100_000
    
            software_timestamp = dataset['software_timestamp']
            if compute:
                software_timestamp = software_timestamp.compute()
                compute_str = "computed"
            else:
                compute_str = "lazy"
            desc = f"{engine:10s}{compute_str:10s}"
            for i in tqdm(range(N_iters), desc=desc):
                software_timestamp.isel(frame_number=i % N_frames)
            dataset.close()

I can try to expand the benchmark, though it is difficult to "see" these slowdowns with toy examples sometimes.

software_timestamp.zip

Closes #xxxx
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

xref: #2799
xref: #7045

max-sixty · 2024-05-05T18:04:13Z

xarray/core/variable.py

-    if fastpath and getattr(data, "ndim", 0) > 0:
-        # can't use fastpath (yet) for scalars
-        return cast("T_DuckArray", _maybe_wrap_data(data))
+    ndim = getattr(data, "ndim", None)


Small point relative to avoiding _maybe_wrap_data:

Wouldn't we only want to get ndim if fastpath is True? So we could only change the comparison to 0 to a comparison to None on L272?

Thank you for the thorough and quick review. you are correct.

Looks like both pd.Index and ExtensionArray define ndim so should be OK. It'd be nice to add this as a comment.

max-sixty · 2024-05-05T22:54:11Z

Merging pending tests. I'm not 100% on the implications of skipping _maybe_wrap_data in some cases, so if anyone knows whether this was required, we can reassess. Thanks @hmaarrfk

This targets optimization for datasets with many "scalar" variables (that is variables without any dimensions). This can happen in the context where you have many pieces of small metadata that relate to various facts about an experimental condition. For example, we have about 80 of these in our datasets (and I want to incrase this number) Our datasets are quite large (On the order of 1TB uncompresed) so we often have one dimension that is in the 10's of thousands. However, it has become quite slow to index in the dataset. We therefore often "carefully slice out the matadata we need" prior to doing anything with our dataset, but that isn't quite possible with you want to orchestrate things with a parent application. These optimizations are likely "minor" but considering the results of the benchmark, I think they are quite worthwhile: * main (as of pydata#9001) - 2.5k its/s * With pydata#9002 - 4.2k its/s * With this Pull Request (on top of pydata#9002) -- 6.1k its/s Thanks for considering.

* main: Avoid auto creation of indexes in concat (#8872) Fix benchmark CI (#9013) Avoid extra read from disk when creating Pandas Index. (#8893) Add a benchmark to monitor performance for large dataset indexing (#9012) Zarr: Optimize `region="auto"` detection (#8997) Trigger CI only if code files are modified. (#9006) Fix for ruff 0.4.3 (#9007) Port negative frequency fix for `pandas.date_range` to `cftime_range` (#8999) Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (#9004) Speed up localize (#8536) Simplify fast path (#9001) Add argument check_dims to assert_allclose to allow transposed inputs (#5733) (#8991) Fix syntax error in test related to cupy (#9000)

* backend-indexing: Trigger CI only if code files are modified. (pydata#9006) Enable explicit use of key tuples (instead of *Indexer objects) in indexing adapters and explicitly indexed arrays (pydata#8870) add `.oindex` and `.vindex` to `BackendArray` (pydata#8885) temporary enable CI triggers on feature branch Avoid auto creation of indexes in concat (pydata#8872) Fix benchmark CI (pydata#9013) Avoid extra read from disk when creating Pandas Index. (pydata#8893) Add a benchmark to monitor performance for large dataset indexing (pydata#9012) Zarr: Optimize `region="auto"` detection (pydata#8997) Trigger CI only if code files are modified. (pydata#9006) Fix for ruff 0.4.3 (pydata#9007) Port negative frequency fix for `pandas.date_range` to `cftime_range` (pydata#8999) Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (pydata#9004) Speed up localize (pydata#8536) Simplify fast path (pydata#9001) Add argument check_dims to assert_allclose to allow transposed inputs (pydata#5733) (pydata#8991) Fix syntax error in test related to cupy (pydata#9000)

This targets optimization for datasets with many "scalar" variables (that is variables without any dimensions). This can happen in the context where you have many pieces of small metadata that relate to various facts about an experimental condition. For example, we have about 80 of these in our datasets (and I want to incrase this number) Our datasets are quite large (On the order of 1TB uncompresed) so we often have one dimension that is in the 10's of thousands. However, it has become quite slow to index in the dataset. We therefore often "carefully slice out the matadata we need" prior to doing anything with our dataset, but that isn't quite possible with you want to orchestrate things with a parent application. These optimizations are likely "minor" but considering the results of the benchmark, I think they are quite worthwhile: * main (as of pydata#9001) - 2.5k its/s * With pydata#9002 - 4.2k its/s * With this Pull Request (on top of pydata#9002) -- 6.1k its/s Thanks for considering.

hmaarrfk force-pushed the fast_path branch from 4c6ce2f to ebf9fed Compare May 5, 2024 14:59

hmaarrfk closed this May 5, 2024

hmaarrfk reopened this May 5, 2024

hmaarrfk mentioned this pull request May 5, 2024

Micro optimizations to improve indexing #9002

Merged

4 tasks

max-sixty added the plan to merge Final call for comments label May 5, 2024

max-sixty reviewed May 5, 2024

View reviewed changes

Simplify fast path

a2bb9d8

hmaarrfk force-pushed the fast_path branch from f6fed86 to a2bb9d8 Compare May 5, 2024 22:37

max-sixty enabled auto-merge (squash) May 5, 2024 22:53

max-sixty merged commit 50f8726 into pydata:main May 5, 2024

hmaarrfk mentioned this pull request May 6, 2024

Micro optimize dataset.isel for speed on large datasets #9003

Open

4 tasks

andersy005 pushed a commit that referenced this pull request May 10, 2024

Simplify fast path (#9001)

bce79d5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Faster fastpath #9001

Faster fastpath #9001

Uh oh!

hmaarrfk commented May 5, 2024 •

edited

Loading

Uh oh!

max-sixty May 5, 2024

Uh oh!

hmaarrfk May 5, 2024

Uh oh!

dcherian May 6, 2024

Uh oh!

max-sixty commented May 5, 2024

Uh oh!

Uh oh!

Uh oh!

Faster fastpath #9001

Faster fastpath #9001

Uh oh!

Conversation

hmaarrfk commented May 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

max-sixty May 5, 2024

Choose a reason for hiding this comment

Uh oh!

hmaarrfk May 5, 2024

Choose a reason for hiding this comment

Uh oh!

dcherian May 6, 2024

Choose a reason for hiding this comment

Uh oh!

max-sixty commented May 5, 2024

Uh oh!

Uh oh!

hmaarrfk commented May 5, 2024 •

edited

Loading