Micro optimizations to improve indexing #9002
Conversation
Looks great!
Force-pushed the branch from d975fba to c2a065d.
xarray/core/indexes.py
Outdated
        return self.__id_coord_names

    def _create_id_coord_names(self) -> None:
It's strange that the function `_create_id_coord_names` gets hit all the time by the profiler. I would have thought that once the dataset is created, this function would only be called once.
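The caching pattern under discussion can be sketched as follows. This is a hypothetical, heavily simplified stand-in for xarray's `Indexes` class (the attribute and method names mirror the diff above, but the body is illustrative, not xarray's actual implementation): build the id-to-coord-names mapping once, lazily, and return the cached result on every later access.

```python
class Indexes:
    """Simplified sketch: cache the id -> coord-names mapping.

    In the profile discussed above, rebuilding this mapping on every
    access was the hot spot; building it once fixes that.
    """

    def __init__(self, indexes, coord_names):
        self._indexes = indexes          # mapping: coord name -> index object
        self._coord_names = coord_names
        self.__id_coord_names = None     # lazily-built cache

    @property
    def _id_coord_names(self):
        if self.__id_coord_names is None:
            self._create_id_coord_names()
        return self.__id_coord_names

    def _create_id_coord_names(self) -> None:
        # Group coordinate names by the identity of their index object.
        id_coord_names = {}
        for name, index in self._indexes.items():
            id_coord_names.setdefault(id(index), []).append(name)
        self.__id_coord_names = {
            k: tuple(v) for k, v in id_coord_names.items()
        }
```

With this shape, `_create_id_coord_names` should appear in a profile at most once per `Indexes` instance, which is what the comment above expects.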
Force-pushed the branch from 8c49aba to 2b5e936.
@hmaarrfk Sorry Marc, moving this to "ready for review" was an accident. I still have trouble using the GitHub Android app properly.

The GitHub app is terrible... I keep making similar mistakes on my phone.
Force-pushed the branch from 5d4e9b2 to f958953.
Ok, enough for today. Generally speaking, I think the main problem, broad compatibility with Pandas, is a known issue. However, I think that some things are simply strange, in the sense that certain functions get hit far more than expected. You can see some of the numbers that I use. The fix drops the performance on my Threadripper 2950x from 14 kits/sec to 12 kits/sec, but that is still an improvement over the 9 kits/sec where pull request #9001 leaves off.
Force-pushed the branch from 2c202ff to 1a659bc.
Feel free to push any cleanups.
This targets optimization for datasets with many "scalar" variables (that is, variables without any dimensions). This can happen when you have many pieces of small metadata that relate to various facts about an experimental condition. For example, we have about 80 of these in our datasets (and I want to increase this number). Our datasets are quite large (on the order of 1 TB uncompressed), so we often have one dimension that is in the tens of thousands. However, it has become quite slow to index into the dataset. We therefore often "carefully slice out the metadata we need" prior to doing anything with our dataset, but that isn't quite possible when you want to orchestrate things with a parent application. These optimizations are likely "minor", but considering the results of the benchmark, I think they are quite worthwhile:

* main (as of pydata#9001): 2.5k its/s
* With pydata#9002: 4.2k its/s
* With this pull request (on top of pydata#9002): 6.1k its/s

Thanks for considering.
xarray/core/variable.py
Outdated
@@ -2624,6 +2624,10 @@ def __init__(self, dims, data, attrs=None, encoding=None, fastpath=False):
         if self.ndim != 1:
             raise ValueError(f"{type(self).__name__} objects must be 1-dimensional")

+        # Avoid further checks if fastpath is True
+        if fastpath:
+            return
This is causing:
=============================================================================== FAILURES ===============================================================================
___________________________________________________________________ test_field_access[365_day-year] ____________________________________________________________________
data = <xarray.DataArray 'data' (lon: 10, lat: 10, time: 100)> Size: 80kB
array([[[0.5488135 , 0.71518937, 0.60276338, ..., 0....222 4.444 6.667 ... 13.33 15.56 17.78 20.0
* time (time) object 800B 2000-01-01 00:00:00 ... 2000-01-05 03:00:00
field = 'year'
@requires_cftime
@pytest.mark.parametrize(
"field", ["year", "month", "day", "hour", "dayofyear", "dayofweek"]
)
def test_field_access(data, field) -> None:
result = getattr(data.time.dt, field)
expected = xr.DataArray(
getattr(xr.coding.cftimeindex.CFTimeIndex(data.time.values), field),
name=field,
coords=data.time.coords,
dims=data.time.dims,
)
> assert_equal(result, expected)
E AssertionError: Left and right DataArray objects are not equal
/home/mark/git/xarray/xarray/tests/test_accessor_dt.py:442: AssertionError
With the removal of the fastpath we still get most of the speedups; we just seem to lose out on a little. Honestly, I'll take it. I understand that fastpaths are not always good to add, for maintainability's sake. Let me clean up and rerun the benchmarks again.
Running the benchmarks again:
Thanks for the fix, I guess that was a merge conflict?

Ah, one last request: can you add a whats-new entry?

Done.
* upstream/main:
  * [skip-ci] Try fixing hypothesis CI trigger (pydata#9112)
  * Undo custom padding-top. (pydata#9107)
  * add remaining core-dev citations [skip-ci][skip-rtd] (pydata#9110)
  * Add user survey announcement to docs (pydata#9101)
  * skip the `pandas` datetime roundtrip test with `pandas=3.0` (pydata#9104)
  * Adds Matt Savoie to CITATION.cff (pydata#9103)
  * [skip-ci] Fix skip-ci for hypothesis (pydata#9102)
  * open_datatree performance improvement on NetCDF, H5, and Zarr files (pydata#9014)
  * Migrate datatree io.py and common.py into xarray/core (pydata#9011)
  * Micro optimizations to improve indexing (pydata#9002)
  * (fix): don't handle time-dtypes as extension arrays in `from_dataframe` (pydata#9042)
* conda instead of mamba
* Make speedups using fastpath
* Change core logic to apply_indexes_fast
* Always have fastpath=True in one path
* Remove basicindexer fastpath=True
* Duplicate a comment
* Add comments
* revert asv changes
* Avoid fastpath=True assignment
* Remove changes to basicindexer
* Do not do fast fastpath for IndexVariable
* Remove one unecessary change
* Remove one more fastpath
* Revert uneeded change to PandasIndexingAdapter
* Update xarray/core/indexes.py
* Update whats-new.rst
* Update whats-new.rst
* fix whats-new

Co-authored-by: Deepak Cherian <[email protected]>
These are some micro optimizations to improve indexing within xarray objects. Generally speaking, the use of `isinstance` in Python is pretty slow, and it's pretty hard to avoid with the amount of "niceness" that xarray provides. However, I feel like these changes are "ok", and they kick up the performance of #9001.

With my "controversial" fix for `_apply_indexes_fast` [benchmark output not captured]; without it [benchmark output not captured]: a nice 40% boost!

Benchmarked on [CPU not captured]; it's not the best processor, but it doesn't throttle!
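Since part of the win comes from avoiding repeated `isinstance` checks in hot paths, here is a minimal, hypothetical sketch of the general pattern (the names are illustrative, not from xarray): decide the type once, outside the hot loop, instead of re-checking on every call.

```python
class Wrapped:
    """Illustrative wrapper type, standing in for xarray's adapter classes."""

    def __init__(self, value):
        self.value = value


def unwrap_per_call(x):
    # isinstance runs on every call -- costly inside a hot indexing loop.
    return x.value if isinstance(x, Wrapped) else x


def make_unwrapper(x):
    # Decide once and return a specialized closure: the hot path that
    # calls the closure repeatedly contains no isinstance check at all.
    if isinstance(x, Wrapped):
        return lambda: x.value
    return lambda: x
```

The cached-attribute and fastpath changes in this PR are variations on the same idea: hoist a type or validity decision out of the code that runs per indexing operation.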
I'll mark this as not-draft when it is ready for review. Done!

Concepts used:

* `LazilyIndexedArray` now stores the shape at creation time. Otherwise the shape was recomputed many times during indexing operations that use operators like `len` or `shape` to broadcast the indexing tuple.
* Cache `_id_coord_names` by avoiding `xindexes.group_by_index()`, giving a 20% speedup.

Findings:

* `_id_coord_names` has terrible performance... I had to avoid it to get `isel` to speed up. It is likely that other code paths can be sped up the same way too. I just don't know how to use those code paths, so I'm hesitant to go down that rabbit hole myself.

Checklist:

* whats-new.rst
* api.rst
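The shape-caching concept for `LazilyIndexedArray` can be sketched as follows. This is a simplified, hypothetical version handling only slice keys (xarray's real class supports far more key types); the point is only to show the shape being computed once in `__init__` rather than on every `len()`/`shape` access.

```python
class LazilyIndexedArraySketch:
    """Sketch: a lazy view over an array, with the shape cached up front."""

    def __init__(self, array, key):
        self.array = array
        self.key = key
        # Computing the shape walks the whole key. Previously (in spirit)
        # this ran on every shape/len access during indexing; caching it
        # at creation makes those accesses a plain attribute read.
        shape = []
        for size, k in zip(array.shape, key):
            if isinstance(k, slice):
                shape.append(len(range(*k.indices(size))))
        self._shape = tuple(shape)

    @property
    def shape(self):
        return self._shape

    def __len__(self):
        return self._shape[0]
```

Broadcasting an indexing tuple consults `len`/`shape` repeatedly, so turning the recomputation into a cached tuple lookup is where the speedup described above comes from.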
xref: #2799
xref: #7045