-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose "Coordinates" as part of Xarray's public API #7368
Conversation
- easily create an empty Indexes collection - check consistency between indexes and variables
TODO: check indexes shapes / dims for DataArray
Since its constructor can now be used publicly. Copy input mappings and check the type of input indexes.
+ add `IndexedCoordinates.from_pandas_multiindex` helper.
Drop the `indexes` argument or keep it as private API. When a `Coordinates` object is passed as `coords` argument, extract both coordinate variables and indexes and add them to the new Dataset or DataArray.
Alternatively to an
What if the |
I added midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two"))
coords = xr.IndexedCoordinates.from_pandas_multiindex(midx, "x")
coords = coords.merge_coords({"y": [0, 1, 2]})
# Coordinates:
# * x (x) object MultiIndex
# * one (x) object 'a' 'a' 'b' 'b'
# * two (x) int64 1 2 1 2
# * y (y) int64 0 1 2
ds = xr.Dataset(coords=coords)
# <xarray.Dataset>
# Dimensions: (x: 4)
# Coordinates:
# * x (x) object MultiIndex
# * one (x) object 'a' 'a' 'b' 'b'
# * two (x) int64 1 2 1 2
# * y (y) int64 0 1 2
# Data variables:
# *empty*
Or should we just use
|
Generally this looks great to me!
My suggestion would be:
Yes, this makes more sense to me!
Yes, I also agree! This makes more sense. |
Long term, do you think it would make sense to merge together Indexes, Coordinates and IndexedCoordinates? They are sort of all containers for the same thing. |
Yes I think so. I'm actually trying to merge Ideally, I'd see class Coordinates:
def __init__(
self,
coords: Mapping[Any, Any] | None = None,
indexes: Mapping[Any, Index] | None = None,
):
# Similar to Dataset.__init__ but without the need
# to merge coords and data vars...
# Probably ok to allow more flexibility / less safety here?
...
@classmethod
from_pandas_multiindex(cls, index: pd.MultiIndex, dim: str):
... |
In the long term, I think we should refactor For now, it's worth noting that the current
|
Thanks @shoyer, I've been thinking about similar short/long term plans although so far I haven't figured out how to implement your point 3. I'll give it another try. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally forgot that you can use Protocols to solve this typing issue, nice!
Obviously need to fix the issues found by the tests :)
xarray/core/types.py
Outdated
import sys | ||
|
||
if sys.version_info >= (3, 11): | ||
from typing import Self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Finally this is merged.
This means we should be able to remove lots of the code that annotated self with a typevar (in another PR ofc).
@@ -93,10 +94,63 @@ | |||
DTypeLikeSave: Any = None | |||
|
|||
|
|||
class Alignable(Protocol): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Thanks for your feedback @headtr1ck, your previous comments have been helpful. Yeah I gave it another try using |
xarray/core/coordinates.py
Outdated
|
||
@classmethod | ||
def _construct_direct( | ||
cls: type[T_Coordinates], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you use Self here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes 👍 I'll go trough the coordinate classes and see if I can replace type variables by Self.
xarray/core/coordinates.py
Outdated
|
||
@classmethod | ||
def from_pandas_multiindex( | ||
cls: type[T_Coordinates], midx: pd.MultiIndex, dim: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here?
xarray/core/coordinates.py
Outdated
return cast(T_Coordinates, results.coords) | ||
|
||
def _reindex_callback( | ||
self: T_Coordinates, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here (not sure if it works with cast though..., Can always use a type: ignore)
xarray/core/coordinates.py
Outdated
exclude_dims, | ||
exclude_vars, | ||
) | ||
return cast(T_Coordinates, aligned.coords) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not checked this, but maybe mypy is right to complain?
If you start with a DataArrayCoordinates, do you not end up with a DatasetCoordinates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you start with a DataArrayCoordinates, do you not end up with a DatasetCoordinates?
This case shouldn't happen because Dataset and DataArray have their own _reindex_callback()
called internally in Aligner
. The callback here is called when aligning a Coordinates
object, so we need to cast DatasetCoordinates
into a Coordinates
.
I should probably add a comment here to explain this.
RTD failure is real:
|
@dcherian yes this PR would enable #6633 by passing a
import xarray as xr
ds = xr.Dataset(
data_vars={"foo": ("x", [0.0, 1.0])},
coords=xr.Coordinates({"x": ("x", [1, 2])}),
)
xr.Dataset(
data_vars={"foo": ("x", [1.0, 2.0]), "x": [1, 2]},
coords=xr.Coordinates(),
) This might be slightly confusing but I don't know if it is really worth handling this special case (i.e., dimension coordinate variables in data_vars + empty Coordinates). I think we should keep promoting data variables as coordinates since |
I updated the docstrings of I took this opportunity to improve the glossary in the documentation and make the concepts of (non)-dimension / (non)-indexed coordinate more in line with the current state of Xarray since the index refactor. I'm sure there are still other inconsistent parts in the documentation, though. I'll review and address that in a follow-up PR. |
Pyright displays an info message "Self is not valid in this context" but most important this should avoid runtime errors with python < 3.11.
I just want to say this is another amazing piece of work @benbovy 👏 👏 |
@benbovy shall we merge? |
@dcherian yes! |
…lazy-array * upstream/main: (153 commits) Add HDF5 Section to read/write docs page (pydata#8012) [pre-commit.ci] pre-commit autoupdate (pydata#8014) Update interpolate_na in dataset.py (pydata#7974) improved docstring of to_netcdf (issue pydata#7127) (pydata#7947) Expose "Coordinates" as part of Xarray's public API (pydata#7368) Core team member guide (pydata#7999) join together duplicate entries in the text `repr` (pydata#7225) Update copyright year in README (pydata#8007) Allow opening datasets with nD dimenson coordinate variables. (pydata#7989) Move whats-new entry [pre-commit.ci] pre-commit autoupdate (pydata#7997) Add documentation on custom indexes (pydata#6975) Use variable name in all exceptions raised in `as_variable` (pydata#7995) Bump pypa/gh-action-pypi-publish from 1.8.7 to 1.8.8 (pydata#7994) New whatsnew section Remove future release notes before this release Update whats-new.rst for new release (pydata#7993) Remove hue_style from plot1d docstring (pydata#7925) Add new what's new section (pydata#7986) Release summary for v2023.07.0 (pydata#7979) ...
whats-new.rst
api.rst
This is a rework of #7214. It follows the suggestions made in #7214 (comment), #7214 (comment) and #7214 (comment):
indexes
argument is added toDataset.__init__
, and theindexes
argument ofDataArray.__init__
is kept private (i.e., valid only if fastpath=True)Coordinates
object is passed to a new Dataset or DataArray via thecoords
argument, both coordinate variables and indexes are copied/extracted and added to the new objectanIndexedCoordinates
subclassCoordinates
public constructors used to create Xarray coordinates and indexes from non-Xarray objects. For example, theCoordinates.from_pandas_multiindex()
class method creates a new set of index and coordinates from an existingpd.MultiIndex
.EDIT:
IndexCoordinates
has been merged withCoordinates
EDIT2: it ended up as a pretty big refactor with the promotion of
Coordinates
has a 2nd-class Xarray container that supports alignment like Dataset and DataArray. It is still quite advanced API, useful for passing coordinate variables and indexes around. Internally,Coordinates
objects are still "virtual" containers (i.e., proxies for coordinate variables and indexes stored in their corresponding DataArray or Dataset objects). For now, a "stand-alone"Coordinates
object created from scratch wraps a Dataset with no data variables.Some examples of usage:
TODO:
assign_coords
too so it has the same behavior if aCoordinates
object is passed?indexes
argument just for that purpose?We could address that later.Solution: wrap the coordinates dict in a Coordinates objects, e.g.,ds = xr.Dataset(coords=xr.Coordinates(coords_dict))
.@shoyer, @dcherian, anyone -- what do you think about the approach proposed here? I'd like to check that with you before going further with tests, docs, etc.