Pass indexes directly to the DataArray and Dataset constructors #7214
Conversation
Big moves!!
Hmm, I'm wondering what would be best among the options below regarding the types accepted for the `indexes` argument.

Option 1 is nice for passing multiple indexes, e.g.,

```python
pd_midx1 = pd.MultiIndex.from_arrays(..., names=("one", "two"))
pd_midx2 = pd.MultiIndex.from_arrays(..., names=("three", "four"))

indexes1 = PandasMultiIndex.from_pandas_index(pd_midx1, "x")
indexes2 = PandasMultiIndex.from_pandas_index(pd_midx2, "y")

ds = xr.Dataset(indexes=[indexes1, indexes2])
```

With option 1 it feels odd passing an empty list in order to avoid creating default indexes. Option 3 actually works in all cases, so I'm leaning towards option 3. For passing multiple indexes at once we could probably expand the API.

What do people think?
Maybe with something else than …
- easily create an empty `Indexes` collection
- check consistency between indexes and variables

TODO: check indexes shapes / dims for DataArray
I implemented option 3. We can still change or revert it later if it's not the best one. A few examples:

```python
import pandas as pd
import xarray as xr
from xarray.indexes import wrap_pandas_multiindex

midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two"))
```

It is now possible to pass a pandas multi-index to a Dataset like this:

```python
# this returns an `Indexes` object (indexes + coordinates)
indexes = wrap_pandas_multiindex(midx, "x")

ds = xr.Dataset(indexes=indexes)
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2
# Data variables:
#     *empty*
```

IMO the above should be preferred over passing it as a coordinate (should we deprecate it now?):

```python
ds_deprecated = xr.Dataset(coords={"x": midx})
ds_deprecated.identical(ds)
# True

# eventually this would behave like this:
ds_midx_as_array = xr.Dataset(coords={"x": midx})
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object ('a', 1) ('a', 2) ('b', 1) ('b', 2)
# Data variables:
#     *empty*
```

We can pass indexes around from one Xarray object to another, e.g.,

```python
da = xr.DataArray([1, 2, 3, 4], dims="x", indexes=ds.xindexes)
# <xarray.DataArray (x: 4)>
# array([1, 2, 3, 4])
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2
```

Skip creating pandas indexes for dimension coordinates:

```python
ds_noindex = xr.Dataset(coords={"x": [0, 1, 2]}, indexes={})
# <xarray.Dataset>
# Dimensions:  (x: 3)
# Coordinates:
#     x        (x) int64 0 1 2
# Data variables:
#     *empty*

ds_noindex.xindexes
# Indexes:
#     *empty*
```
Passing multiple indexes:

```python
midx1 = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two"))
midx2 = pd.MultiIndex.from_product([["c", "d"], [3, 4]], names=("three", "four"))

indexes1 = wrap_pandas_multiindex(midx1, "x")
indexes2 = wrap_pandas_multiindex(midx2, "y")

indexes = Indexes(
    indexes=dict(**indexes1, **indexes2),
    variables=dict(**indexes1.variables, **indexes2.variables),
)

ds = xr.Dataset(indexes=indexes)
# <xarray.Dataset>
# Dimensions:  (x: 4, y: 4)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2
#   * y        (y) object MultiIndex
#   * three    (y) object 'c' 'c' 'd' 'd'
#   * four     (y) int64 3 4 3 4
# Data variables:
#     *empty*
```

That's not looking super nice, but probably we can add some convenience function or method for this.
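One possible shape for such a convenience function, sketched below over plain dicts rather than real `Indexes` objects. The name `combine_mappings` and its structure are hypothetical, not part of xarray; the point is simply to merge the index and variable mappings while refusing silently-overlapping names.

```python
def combine_mappings(*mappings):
    """Merge several mappings, rejecting duplicate keys.

    A hypothetical helper for combining `Indexes` collections would
    need to do this twice: once for the index mapping and once for
    the coordinate-variable mapping.
    """
    merged = {}
    for mapping in mappings:
        duplicates = merged.keys() & mapping.keys()
        if duplicates:
            raise ValueError(f"duplicate names: {sorted(duplicates)}")
        merged.update(mapping)
    return merged


# emulate combining two wrapped multi-indexes (values are placeholders)
indexes1 = {"x": "midx1", "one": "midx1", "two": "midx1"}
indexes2 = {"y": "midx2", "three": "midx2", "four": "midx2"}
combined = combine_mappings(indexes1, indexes2)
```

Raising on duplicates (rather than letting the last mapping win) matches the strictness discussed elsewhere in this thread about not silently overriding coordinates.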
Since its constructor can now be used publicly, `Indexes` copies its input mappings and checks the type of input indexes.

I also added an `assign_indexes` method:

```python
ds = xr.Dataset(coords={"x": [4, 5, 6, 7]})
ds2 = xr.Dataset(coords={"x": [1, 2, 3, 4]})

ds.assign_indexes(ds2.xindexes)
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) int64 1 2 3 4
# Data variables:
#     *empty*

midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two"))
indexes = wrap_pandas_multiindex(midx, "x")

ds.assign_indexes(indexes)
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2
# Data variables:
#     *empty*
```
@pydata/xarray I'd be very happy if you could share your thoughts about the examples shown in the last three comments. If you think the API looks good like that, then I will work on adding some tests and on the documentation.
How about … I like the last spelling.
Yes, that would make sense. However, it would be adding another method.

Indexes are not merged together, but the new / replaced coordinate variables must be compatible with the other variables of the dataset:

```python
def assign_indexes(self, indexes: Indexes[Index]):
    ds_indexes = Dataset(indexes=indexes)
    return (
        self
        # prepare drop-in index / coordinate replacement
        .drop_vars(indexes, errors="ignore")
        # ensure the new indexes / coordinates are compatible with the Dataset
        .merge(
            ds_indexes,
            compat="minimal",  # probably not the right option?
            join="override",  # fastest option? (no real effect because of `drop_vars`)
            combine_attrs="no_conflicts",
        )
    )
```

That is actually a good idea for #7214 (comment)! Not sure I would reuse …
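The net effect of the drop-then-merge strategy in `assign_indexes` can be simulated over plain dicts. This is an illustrative sketch, not xarray code: `assign_indexes_sim` stands in for the real method, and the dict values stand in for coordinate variables.

```python
def assign_indexes_sim(coords, new_indexed_coords):
    """Simulate drop-then-merge on plain dicts.

    Coordinates being re-indexed are dropped first (mirroring
    `drop_vars(..., errors="ignore")`), then the replacement
    coordinates are merged in, leaving unrelated coordinates intact.
    """
    remaining = {k: v for k, v in coords.items() if k not in new_indexed_coords}
    return {**remaining, **new_indexed_coords}


# "x" is replaced, "a" is untouched
result = assign_indexes_sim({"x": [4, 5], "a": [0, 1]}, {"x": [1, 2]})
```

Dropping first is what makes the subsequent merge a clean replacement rather than a conflict between the old and new "x" coordinates.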
```python
elif len(indexes) == 0:
    create_default_indexes = False
    indexes = Indexes()
else:
    create_default_indexes = True
    if not isinstance(indexes, Indexes):
        raise TypeError("non-empty indexes must be an instance of `Indexes`")
    elif indexes._index_type != Index:
        raise TypeError("indexes must only contain Xarray `Index` objects")
```
I don't like the special case for size-0 indexes. These sorts of special cases that violate type stability can lead to very hard-to-debug bugs.

Instead, how about never creating default pandas indexes if `indexes` is provided? Maybe there could also be a special method (e.g., `assign_default_indexes()`) for manually adding in pandas indexes later.
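The type-stable alternative suggested here could look roughly like the sketch below (the helper name `resolve_indexes` is hypothetical, not actual xarray code): any explicitly passed `indexes`, empty or not, disables default index creation, and defaults are only created when the argument is omitted.

```python
def resolve_indexes(indexes):
    """Return (indexes_mapping, create_default_indexes).

    No size-based special case: an explicitly passed collection,
    even an empty one, always means "don't create defaults".
    """
    if indexes is None:
        # argument omitted: fall back to default pandas indexes
        return {}, True
    # argument provided: use it as-is, never create defaults
    return dict(indexes), False
```

With this dispatch, `xr.Dataset(coords=..., indexes={})` and `xr.Dataset(coords=..., indexes=some_indexes)` behave uniformly, and default indexes can be added back later via an explicit method call.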
Yes, I like that!

I also like `assign_default_indexes()`, or maybe `assign_indexes(indexes=None)` if we don't want to add another method.
One concern I have is when a user wants to explicitly provide a pandas multi-index or a custom single-coordinate index for a dimension and still wants the default pandas indexes for the other dimension coordinates. In this case, calling `assign_default_indexes()` or `assign_indexes(indexes=None)` later will likely overwrite the explicitly provided indexes?

After removing the pandas multi-index dimension coordinate this won't be an issue anymore, but the issue remains for any custom 1-d dimension coordinate index. Should we add an `overwrite=True` keyword argument to `assign_(default_)indexes`?
`assign_default_indexes()` should probably include an optional list of dimension names for which to assign default indexes. By default, we might make it only add indexes for dimensions that don't already have an index.
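That selection logic could be sketched as a small helper (names are illustrative, not xarray API): pick only the dimension coordinates that lack an index, optionally restricted to a user-supplied list of dimensions.

```python
def default_index_candidates(dim_coords, existing_indexes, dims=None):
    """Return dimension names that should receive a default index.

    dim_coords:       names of dimension coordinates present
    existing_indexes: names that already have an index
    dims:             optional restriction to these dimensions
    """
    candidates = dims if dims is not None else list(dim_coords)
    # skip names that aren't dimension coordinates or are already indexed,
    # so explicitly provided (e.g. custom) indexes are never overwritten
    return [d for d in candidates if d in dim_coords and d not in existing_indexes]
```

This preserves the behavior discussed above: an explicitly provided custom index on "x" survives a later `assign_default_indexes()` call, unless the user opts in by naming "x" in `dims` (which a real implementation might additionally gate behind an `overwrite=True` flag).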
```python
both_indexes_and_coords = set(indexes) & coord_names
if both_indexes_and_coords:
    raise ValueError(
        f"{both_indexes_and_coords} are found in both indexes and coords"
    )
```
This check surprises me. As a user, I would expect that I can write something like `xarray.Dataset(data_vars=ds.data_vars, coords=ds.coords, attrs=ds.attrs, indexes=ds.xindexes)` to make a new object.
Good point. Would it be reasonable that the coordinates provided via `indexes=...` (if we use it) override the coordinates provided via `coords=...`?
> Would it be reasonable that the coordinates provided via `indexes=...` (if we use it) override the coordinates provided via `coords=...`?

Can you give an example of how this would happen? Do indexes already provide enough information to create a coordinate (this would slightly surprise me)?

I think an error in these cases, or creating a Dataset with inconsistent coordinates/indexes (if it is hard to check), would be fine. I would not silently override coordinates.
I would lean against this, only because it's easier to explicitly manipulate indexes in the form of a …

Explicitly providing indexes is an advanced-user feature. I think it's OK to require users to do a bit more work in this case and to not necessarily do consistency checks (beyond verifying that the coordinate variables exist).
Agreed. However, …

While generally I also prefer handling plain …

EDIT -- For more context: initially an …
An …
I'd just want to add that, from my experience with debugging multi-index issues, it is hard even for advanced users to see what's going wrong when coordinates and indexes are not consistent.
True, but it's neatly encapsulated in one place. And if we are going to support many Index classes that build on numpy, scipy, etc., then support for pandas indexes seems entirely in-scope.

Should we expose …
I agree -- we should support this for backwards compatibility (even if we deprecate it).
OK, this totally makes sense. I don't love that it is possible to express invalid states in Xarray's data model. This motivated the creation of …

I wonder if we should consider the broader refactor of merging the `Indexes` and `Coordinates` classes. This would have a number of benefits:
Thanks for the suggestion @shoyer, in general I like it very much! "Coordinates possibly backed by one or more indexes" feels much more natural than "indexes and their corresponding coordinates". Even though indexes have been promoted as 1st-class citizens in the data model, their right place should still be in the background compared to coordinates. So having a …

My main concern is about the timing, as such a broader refactor might postpone some work in progress on the public API and the documentation. Ideally this shouldn't discourage users from experimenting with custom indexes and building an ecosystem around them as soon as possible.

There might be a fast path towards your suggestion, at least regarding the public-facing API (your points 1 and 4): …

This would leave us the possibility of achieving a broader (mostly internal) refactor of … later.

Alternatively, we could just wait for that refactor to finish before implementing explicit assignment of coordinates and indexes. We already have …
Superseded by #7368.
From #6392 (comment):

I'm thinking of only accepting one or more instances of `Indexes` as the `indexes` argument in the Dataset and DataArray constructors. The only exception is when `fastpath=True`, where a mapping can be given directly. Also, when an empty collection of indexes is passed, this skips the creation of default pandas indexes for dimension coordinates.

- check that the names in `Indexes.variables` do not conflict with the coordinate names in the `coords` argument
- coordinate variables and index objects are passed together in an `Indexes` object, thus with less chance to accidentally provide coordinate variables and index objects that do not relate to each other (we could probably add some safeguards in the `Indexes` class itself)
- an `Index` may provide a factory method that returns an instance of `Indexes` that we just need to pass as indexes, and we could also do something like `ds = xr.Dataset(indexes=other_ds.xindexes)`
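The safeguards mentioned above could amount to something like the simplified check below. It is a sketch over plain mappings, not real `Indexes` internals, and the helper name is hypothetical: it verifies that every indexed name has a corresponding coordinate variable and vice versa.

```python
def check_indexes_match_variables(indexes, variables):
    """Safeguard sketch: indexes and coordinate variables must cover
    exactly the same set of names, so an `Indexes` object cannot be
    built from unrelated index objects and coordinate variables.
    """
    missing_vars = set(indexes) - set(variables)
    unindexed_vars = set(variables) - set(indexes)
    if missing_vars or unindexed_vars:
        raise ValueError(
            "indexes and variables do not match: "
            f"missing variables {sorted(missing_vars)}, "
            f"unindexed variables {sorted(unindexed_vars)}"
        )
```

A real implementation would likely also check shapes and dimensions (per the TODO earlier in this thread), but the name-matching invariant is the minimal consistency guarantee discussed here.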