(feat): Support for pandas `ExtensionArray` #8723
Merged
101 commits
b2712f1
(feat): first pass supporting extension arrays
ilan-gold 47bddd2
(feat): categorical tests + functionality
ilan-gold dc8b788
(feat): use multiple dispatch for unimplemented ops
ilan-gold 75524c8
(feat): implement (not really) broadcasting
ilan-gold c9ab452
(chore): add more `groupby` tests
ilan-gold 1f3d0fa
(fix): fix more groupby incompatibility
ilan-gold 8a70e3c
(bug): fix unused categories
ilan-gold f5a6505
(chore): refactor dispatched methods + tests
ilan-gold 08a4feb
(fix): shared type should check for extension arrays first and then f…
ilan-gold d5b218b
(refactor): tests moved
ilan-gold 00256fa
(chore): more higher level tests
ilan-gold b7ddbd6
(feat): to/from dataframe
ilan-gold a165851
(chore): check for plum import
ilan-gold a826edd
(fix): `__setitem__`/`__getitem__`
ilan-gold fde19ea
(chore): disallow stacking
ilan-gold 4c55707
(fix): `pyproject.toml`
ilan-gold 58ba17d
(fix): `as_shared_type` fix
ilan-gold a255310
(chore): add variable tests
ilan-gold 4e78b7e
(fix): dask + categoricals
ilan-gold d9cedf5
(chore): notes/docs
ilan-gold 426664d
(chore): remove old testing file
ilan-gold 22ca77d
(chore): remove ocmmented out code
ilan-gold f32cfdf
Merge branch 'main' into extension_arrays
ilan-gold 60f8927
(fix): import plum dispatch
ilan-gold ff22d76
Merge branch 'extension_arrays' of github.com:ilan-gold/xarray into e…
ilan-gold 2153e81
Merge branch 'main' into extension_arrays
ilan-gold b6d0b31
(refactor): use `is_extension_array_dtype` as much as possible
ilan-gold d285871
Merge branch 'extension_arrays' of github.com:ilan-gold/xarray into e…
ilan-gold d847277
Merge branch 'main' into extension_arrays
ilan-gold 8238c64
(refactor): `extension_array`->`array` + move to `indexing`
ilan-gold 1260cd4
Merge branch 'extension_arrays' of github.com:ilan-gold/xarray into e…
ilan-gold b04ef98
(refactor): change order of classes
ilan-gold b9937bf
(chore): add small pyarrow test
ilan-gold 0bba03f
(fix): fix some mypy issues
ilan-gold b714549
(fix): don't register unregisterable method
ilan-gold a3a678c
(fix): appease mypy
ilan-gold e521844
(fix): more sensible default implemetations allow most use without `p…
ilan-gold 2d3e930
(fix): handling `pyarrow` tests
ilan-gold 04c9969
(fix): actually do import correctly
ilan-gold 5514539
Merge branch 'main' into extension_arrays
ilan-gold bedfa5c
(fix): `reduce` condition
ilan-gold e6c2690
Merge branch 'main' into extension_arrays
ilan-gold 82dbda9
(fix): column ordering for dataframes
ilan-gold 12217ed
(refactor): remove encoding business
ilan-gold dd5b87d
(refactor): raise error for dask + extension array
ilan-gold 761a874
Merge branch 'extension_arrays' of github.com:ilan-gold/xarray into e…
ilan-gold 52cabc8
Merge branch 'main' into extension_arrays
ilan-gold e0d58fa
(fix): only wrap `ExtensionDuckArray` that has a `.array` which is a …
ilan-gold c1e0e64
(fix): use duck array equality method, not pandas
ilan-gold 17e3390
(refactor): bye plum!
ilan-gold dd2ef39
Merge branch 'main' into extension_arrays
ilan-gold c8e6bfe
(fix): `and` to `or` for casting to `ExtensionDuckArray`
ilan-gold b2a9517
(fix): check for class, not type
ilan-gold f5e1bd0
Merge branch 'main' into extension_arrays
ilan-gold 407fad1
(fix): only support native endianness
ilan-gold 3a47f09
Merge branch 'extension_arrays' of github.com:ilan-gold/xarray into e…
ilan-gold fdd3de4
Merge branch 'main' into extension_arrays
ilan-gold 6b23629
Merge branch 'main' into extension_arrays
ilan-gold 1c9047f
(refactor): no need for superfluous checks in `_maybe_wrap_data`
ilan-gold 9be6b03
Merge branch 'extension_arrays' of github.com:ilan-gold/xarray into e…
ilan-gold d9304f1
(chore): clean up docs to no longer reference `plum`
ilan-gold 6ec6725
(fix): no longer allow `ExtensionDuckArray` to wrap `ExtensionDuckArray`
ilan-gold bc9ac4c
(refactor): move `implements` logic to `indexing`
ilan-gold 1e906db
Merge branch 'main' into extension_arrays
ilan-gold 6fb8668
(refactor): `indexing.py` -> `extension_array.py`
ilan-gold 8f034b4
(refactor): `ExtensionDuckArray` -> `PandasExtensionArray`
ilan-gold 90a6de6
Merge branch 'main' into extension_arrays
dcherian 2bd422a
Merge branch 'main' into extension_arrays
ilan-gold ff67943
Merge branch 'main' into extension_arrays
ilan-gold 661d9f2
(fix): add writeable property
ilan-gold caee1c6
(fix): don't check writeable for `PandasExtensionArray`
ilan-gold 1d12f5e
(fix): move check eariler
ilan-gold 31dfbb5
Merge branch 'main' into extension_arrays
ilan-gold 23b347f
Merge branch 'main' into extension_arrays
ilan-gold 902c74b
(refactor): correct guard clause
ilan-gold 0b64506
(chore): remove unnecessary `AttributeError`
ilan-gold 0c7e023
(feat): singleton wrapped as array
ilan-gold dd7fe98
(feat): remove shared dtype casting
ilan-gold f0df768
(feat): loop once over `dataframe.items`
ilan-gold e2f0487
(feat): add `__len__` attribute
ilan-gold 1eb6741
(fix): ensure constructor recieves `pd.Categorical`
ilan-gold 2a7300a
Merge branch 'extension_arrays' of github.com:ilan-gold/xarray into e…
ilan-gold 9cceadc
Update xarray/core/extension_array.py
ilan-gold f2588c1
Update xarray/core/extension_array.py
ilan-gold a0a63bd
(fix): drop condition for categorical corrected
ilan-gold 5bb2bde
Merge branch 'main' into extension_arrays
ilan-gold f85f166
Merge branch 'main' into extension_arrays
ilan-gold 7ecdeba
Merge branch 'main' into extension_arrays
ilan-gold 6bc40fc
Merge branch 'main' into extension_arrays
ilan-gold e9dc53f
Apply suggestions from code review
dcherian 4791799
(chore): test `chunk` behavior
ilan-gold c649362
Merge branch 'extension_arrays' of github.com:ilan-gold/xarray into e…
ilan-gold fc60dcf
Merge branch 'main' into extension_arrays
ilan-gold 0374086
Update xarray/core/variable.py
dcherian b9515a6
Merge branch 'main' into extension_arrays
dcherian 72bf807
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 63b6c42
(fix): bring back error
ilan-gold 1d18439
(chore): add test for dropping cat for mean
ilan-gold 17f05da
Update whats-new.rst
dcherian c906c81
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] e6db83b
Merge branch 'main' into extension_arrays
ilan-gold
@@ -129,6 +129,7 @@ module = [
  "opt_einsum.*",
  "pandas.*",
  "pooch.*",
  "pyarrow.*",
  "pydap.*",
  "pytest.*",
  "scipy.*",
@@ -0,0 +1,136 @@
from __future__ import annotations

from collections.abc import Sequence
from typing import Callable, Generic

import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype

from xarray.core.types import DTypeLikeSave, T_ExtensionArray

HANDLED_EXTENSION_ARRAY_FUNCTIONS: dict[Callable, Callable] = {}


def implements(numpy_function):
    """Register an __array_function__ implementation for PandasExtensionArray objects."""

    def decorator(func):
        HANDLED_EXTENSION_ARRAY_FUNCTIONS[numpy_function] = func
        return func

    return decorator


@implements(np.issubdtype)
def __extension_duck_array__issubdtype(
    extension_array_dtype: T_ExtensionArray, other_dtype: DTypeLikeSave
) -> bool:
    return False  # never want a function to think a pandas extension dtype is a subtype of numpy


@implements(np.broadcast_to)
def __extension_duck_array__broadcast(arr: T_ExtensionArray, shape: tuple):
    if shape[0] == len(arr) and len(shape) == 1:
        return arr
    raise NotImplementedError("Cannot broadcast 1d-only pandas categorical array.")


@implements(np.stack)
def __extension_duck_array__stack(arr: T_ExtensionArray, axis: int):
    raise NotImplementedError("Cannot stack 1d-only pandas categorical array.")


@implements(np.concatenate)
def __extension_duck_array__concatenate(
    arrays: Sequence[T_ExtensionArray], axis: int = 0, out=None
) -> T_ExtensionArray:
    return type(arrays[0])._concat_same_type(arrays)


@implements(np.where)
def __extension_duck_array__where(
    condition: np.ndarray, x: T_ExtensionArray, y: T_ExtensionArray
) -> T_ExtensionArray:
    if (
        isinstance(x, pd.Categorical)
        and isinstance(y, pd.Categorical)
        and x.dtype != y.dtype
    ):
        x = x.add_categories(set(y.categories).difference(set(x.categories)))
        y = y.add_categories(set(x.categories).difference(set(y.categories)))
    return pd.Series(x).where(condition, pd.Series(y)).array


class PandasExtensionArray(Generic[T_ExtensionArray]):
    array: T_ExtensionArray

    def __init__(self, array: T_ExtensionArray):
        """NEP-18 compliant wrapper for pandas extension arrays.

        Parameters
        ----------
        array : T_ExtensionArray
            The array to be wrapped upon e.g. :py:class:`xarray.Variable` creation.
        """
        if not isinstance(array, pd.api.extensions.ExtensionArray):
            raise TypeError(f"{array} is not a pandas ExtensionArray.")
        self.array = array

    def __array_function__(self, func, types, args, kwargs):
        def replace_duck_with_extension_array(args) -> list:
            args_as_list = list(args)
            for index, value in enumerate(args_as_list):
                if isinstance(value, PandasExtensionArray):
                    args_as_list[index] = value.array
                elif isinstance(
                    value, tuple
                ):  # should handle more than just tuple? iterable?
                    args_as_list[index] = tuple(
                        replace_duck_with_extension_array(value)
                    )
                elif isinstance(value, list):
                    args_as_list[index] = replace_duck_with_extension_array(value)
            return args_as_list

        args = tuple(replace_duck_with_extension_array(args))
        if func not in HANDLED_EXTENSION_ARRAY_FUNCTIONS:
            return func(*args, **kwargs)
        res = HANDLED_EXTENSION_ARRAY_FUNCTIONS[func](*args, **kwargs)
        if is_extension_array_dtype(res):
            return type(self)[type(res)](res)
        return res

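    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap wrapped inputs first; passing the wrapper back in would recurse.
        inputs = tuple(x.array if isinstance(x, PandasExtensionArray) else x for x in inputs)
        return getattr(ufunc, method)(*inputs, **kwargs)
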
    def __repr__(self):
        return f"{type(self)}(array={repr(self.array)})"

    def __getattr__(self, attr: str) -> object:
        return getattr(self.array, attr)

    def __getitem__(self, key) -> PandasExtensionArray[T_ExtensionArray]:
        item = self.array[key]
        if is_extension_array_dtype(item):
            return type(self)(item)
        if np.isscalar(item):
            return type(self)(type(self.array)([item]))
        return item

    def __setitem__(self, key, val):
        self.array[key] = val

    def __eq__(self, other):
        if np.isscalar(other):
            other = type(self)(type(self.array)([other]))
        if isinstance(other, PandasExtensionArray):
            return self.array == other.array
        return self.array == other

    def __ne__(self, other):
        return ~(self == other)

    def __len__(self):
        return len(self.array)
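
The registry-plus-wrapper pattern in this file can be tried in isolation. The following is a minimal sketch, not the PR's implementation: `HANDLED`, `Wrapper`, and `_concatenate` are hypothetical names used only for illustration, and unlike the diff above it returns `NotImplemented` for unregistered functions instead of falling back to calling them. It assumes a NumPy version with NEP-18 dispatch enabled and relies on the same pandas `_concat_same_type` classmethod that the diff uses.

```python
import numpy as np
import pandas as pd

# Hypothetical registry and wrapper names, chosen for this sketch only.
HANDLED = {}


def implements(numpy_function):
    """Register `func` as the handler for `numpy_function`."""

    def decorator(func):
        HANDLED[numpy_function] = func
        return func

    return decorator


@implements(np.concatenate)
def _concatenate(arrays, axis=0, out=None):
    # pandas extension arrays are 1-D, so only same-type concatenation applies.
    return type(arrays[0])._concat_same_type(arrays)


class Wrapper:
    def __init__(self, array):
        self.array = array

    def __array_function__(self, func, types, args, kwargs):
        if func not in HANDLED:
            # NumPy turns this into a TypeError for unregistered functions.
            return NotImplemented
        # Unwrap one level of nesting: np.concatenate receives a sequence.
        unwrapped = [
            [a.array if isinstance(a, Wrapper) else a for a in arg]
            if isinstance(arg, (list, tuple))
            else (arg.array if isinstance(arg, Wrapper) else arg)
            for arg in args
        ]
        return Wrapper(HANDLED[func](*unwrapped, **kwargs))


left = Wrapper(pd.Categorical(["a", "b"], categories=["a", "b"]))
right = Wrapper(pd.Categorical(["b", "a"], categories=["a", "b"]))

# np.concatenate sees __array_function__ on its inputs and defers to Wrapper.
combined = np.concatenate([left, right])
```

Here `combined` is again a `Wrapper` whose `.array` is a `pd.Categorical`, so the categorical dtype survives a call that would otherwise coerce the inputs to a NumPy object array.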
Calling `.join()` in a loop will make this method take quadratic time. Can you rewrite this to join all the extension arrays together once, e.g., with `pd.concat`?

pandas-dev/pandas#57676 Not sure what to do. I don't think `concat` is meant for this? In any case very open to other ideas!

Also not sure `join` with a list is faster now that I think of it. I couldn't figure out how to do `concat` though... maybe I should make the index on the `extension_array_df` the correct multi-index but this seems tricky?

It'd be good to sort this out.
@shoyer Could you maybe give some details on using `concat` here? I think we truly do want a join, no?

Let's open an issue to remind ourselves to make this more efficient.
I guess the core problem is that extension arrays cannot be broadcast to nD with `.set_dims`? Maybe we could raise an error if `len(ordered_dims) > 1`?

#8950 done!
I think this is true.
I think this currently handles the case where this is >1, so why error out? I think `join` is acceptable here IMO.
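
For readers following the `join` versus `concat` question above, here is a small sketch of the three options on toy data. It is not the code under review: the frames, column names, and shared index are hypothetical, and it only covers the simple case where every frame has the same unique index. Frames with differing or multi-level indexes, as raised in the thread, may genuinely need a join.

```python
import pandas as pd

index = pd.Index([0, 1, 2], name="x")

# Hypothetical per-variable frames, each holding one extension-array column.
frames = [
    pd.DataFrame({"cat": pd.Categorical(["a", "b", "a"])}, index=index),
    pd.DataFrame({"flag": pd.array([True, None, False], dtype="boolean")}, index=index),
    pd.DataFrame({"count": pd.array([1, 2, None], dtype="Int64")}, index=index),
]

# Incremental approach: one join per frame, each producing a new DataFrame.
looped = frames[0]
for frame in frames[1:]:
    looped = looped.join(frame)

# Single call: DataFrame.join also accepts a list of frames.
joined_once = frames[0].join(frames[1:])

# When every frame shares the same index, a column-wise concat is equivalent.
concatenated = pd.concat(frames, axis=1)
```

In this simple case all three results hold the same columns with their extension dtypes intact; whether the single-call forms are actually faster for the method under review would need to be measured.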