From 9e3d3bdcde2cb5a005b3d2282bac1c70ed97a62e Mon Sep 17 00:00:00 2001
From: Tom Nicholas
Date: Sun, 13 Oct 2024 11:53:43 -0600
Subject: [PATCH] Datatree alignment docs (#9501)

* remove too-long underline
* draft section on data alignment
* fixes
* draft section on coordinate inheritance
* various improvements
* more improvements
* link from other page
* align call include all 3 datasets
* link back to use cases
* clarification
* small improvements
* remove TODO after #9532
* add todo about #9475
* correct xr.align example call
* add links to netCDF4 documentation
* Consistent voice

Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>

* keep indexes in lat lon selection to dodge #9475
* unpack generator properly

Co-authored-by: Stephan Hoyer

* ideas for next section
* briefly summarize what alignment means
* clarify that it's the data in each node that was previously unrelated
* fix incorrect indentation of code block
* display the tree with redundant coordinates again
* remove content about non-inherited coords for a follow-up PR
* remove todo
* remove todo now that aggregations are re-implemented
* remove link to (unmerged) migration guide
* remove todo about improving error message
* correct statement in data-structures docs
* fix internal link

---------

Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Co-authored-by: Stephan Hoyer
---
 doc/user-guide/data-structures.rst   |   3 +-
 doc/user-guide/hierarchical-data.rst | 151 ++++++++++++++++++++++++++-
 2 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/doc/user-guide/data-structures.rst b/doc/user-guide/data-structures.rst
index 3a6e698f754..e5e89b0fbbd 100644
--- a/doc/user-guide/data-structures.rst
+++ b/doc/user-guide/data-structures.rst
@@ -771,7 +771,7 @@ Here there are four different coordinate variables, which apply to variables in
 ``station`` is used only for ``weather`` variables
 ``lat`` and ``lon`` are only use for ``satellite`` images

-Coordinate variables are inherited to descendent nodes, which means that
+Coordinate variables are inherited to descendent nodes, which is only possible because
 variables at different levels of a hierarchical DataTree are always
 aligned. Placing the ``time`` variable at the root node automatically indicates
 that it applies to all descendent nodes. Similarly, ``station`` is in the base
@@ -800,6 +800,7 @@ included by default unless you exclude them with the ``inherit`` flag:

     dt2["/weather/temperature"].to_dataset(inherit=False)

+For more examples and further discussion see :ref:`alignment and coordinate inheritance `.

 .. _coordinates:

diff --git a/doc/user-guide/hierarchical-data.rst b/doc/user-guide/hierarchical-data.rst
index 84016348676..4b3a7260567 100644
--- a/doc/user-guide/hierarchical-data.rst
+++ b/doc/user-guide/hierarchical-data.rst
@@ -1,7 +1,7 @@
-.. _hierarchical-data:
+.. _userguide.hierarchical-data:

 Hierarchical data
-==============================
+=================

 .. ipython:: python
     :suppress:
@@ -15,6 +15,8 @@ Hierarchical data

     %xmode minimal

+.. _why:
+
 Why Hierarchical Data?
 ----------------------

@@ -644,3 +646,148 @@ We could use this feature to quickly calculate the electrical power in our signa

     power = currents * voltages
     power
+
+.. _hierarchical-data.alignment-and-coordinate-inheritance:
+
+Alignment and Coordinate Inheritance
+------------------------------------
+
+.. _data-alignment:
+
+Data Alignment
+~~~~~~~~~~~~~~
+
+The data in different datatree nodes are not totally independent. In particular, dimensions (and indexes) in child nodes must be exactly aligned with those in their parent nodes.
+Exact alignment means that shared dimensions must be the same length, and indexes along those dimensions must be equal.
+
+.. note::
+    If you were a previous user of the prototype `xarray-contrib/datatree `_ package, this is different from what you're used to!
+    In that package the data model treated the data stored in each node as completely unrelated. The data model is now slightly stricter.
+    This allows us to provide features like :ref:`coordinate-inheritance`.
+
+To demonstrate, let's first generate some example datasets which are not aligned with one another:
+
+.. ipython:: python
+
+    # (drop the attributes just to make the printed representation shorter)
+    ds = xr.tutorial.open_dataset("air_temperature").drop_attrs()
+
+    ds_daily = ds.resample(time="D").mean("time")
+    ds_weekly = ds.resample(time="W").mean("time")
+    ds_monthly = ds.resample(time="ME").mean("time")
+
+These datasets have different lengths along the ``time`` dimension, and are therefore not aligned along that dimension.
+
+.. ipython:: python
+
+    ds_daily.sizes
+    ds_weekly.sizes
+    ds_monthly.sizes
+
+We cannot store these non-alignable variables on a single :py:class:`~xarray.Dataset` object, because they do not exactly align:
+
+.. ipython:: python
+    :okexcept:
+
+    xr.align(ds_daily, ds_weekly, ds_monthly, join="exact")
+
+But we :ref:`previously said ` that multi-resolution data is a good use case for :py:class:`~xarray.DataTree`, so surely we should be able to store these in a single :py:class:`~xarray.DataTree`?
+If we first try to create a :py:class:`~xarray.DataTree` with these different-length time dimensions present in both parents and children, we will still get an alignment error:
+
+.. ipython:: python
+    :okexcept:
+
+    xr.DataTree.from_dict({"daily": ds_daily, "daily/weekly": ds_weekly})
+
+This is because DataTree checks that data in child nodes align exactly with their parents.
+
+.. note::
+    This requirement of aligned dimensions is similar to netCDF's concept of `inherited dimensions `_, as in netCDF-4 files dimensions are `visible to all child groups `_.
+
+This alignment check is performed up through the tree, all the way to the root, and is therefore equivalent to requiring that this :py:func:`~xarray.align` command succeeds:
+
+.. code:: python
+
+    xr.align(child.dataset, *(parent.dataset for parent in child.parents), join="exact")
+
+To represent our unalignable data in a single :py:class:`~xarray.DataTree`, we must instead place all variables which are a function of these different-length dimensions into nodes that are not direct descendents of one another, e.g. organize them as siblings.
+
+.. ipython:: python
+
+    dt = xr.DataTree.from_dict(
+        {"daily": ds_daily, "weekly": ds_weekly, "monthly": ds_monthly}
+    )
+    dt
+
+Now we have a valid :py:class:`~xarray.DataTree` structure which contains all the data at each different time frequency, stored in a separate group.
+
+This is a useful way to organize our data because we can still operate on all the groups at once.
+For example we can extract all three timeseries at a specific lat-lon location:
+
+.. ipython:: python
+
+    dt.sel(lat=75, lon=300)
+
+or compute the standard deviation of each timeseries to find out how it varies with sampling frequency:
+
+.. ipython:: python
+
+    dt.std(dim="time")
+
+.. _coordinate-inheritance:
+
+Coordinate Inheritance
+~~~~~~~~~~~~~~~~~~~~~~
+
+Notice that in the trees we constructed above there is some redundancy - the ``lat`` and ``lon`` variables appear in each sibling group, but are identical across the groups.
+
+.. ipython:: python
+
+    dt
+
+We can use "Coordinate Inheritance" to define them only once in a parent group and remove this redundancy, whilst still being able to access those coordinate variables from the child groups.
+
+.. note::
+    This is also a new feature relative to the prototype `xarray-contrib/datatree `_ package.
+
+Let's instead place only the time-dependent variables in the child groups, and put the non-time-dependent ``lat`` and ``lon`` variables in the parent (root) group:
+
+.. ipython:: python
+
+    dt = xr.DataTree.from_dict(
+        {
+            "/": ds.drop_dims("time"),
+            "daily": ds_daily.drop_vars(["lat", "lon"]),
+            "weekly": ds_weekly.drop_vars(["lat", "lon"]),
+            "monthly": ds_monthly.drop_vars(["lat", "lon"]),
+        }
+    )
+    dt
+
+This is preferred to the previous representation because it now makes it clear that all of these datasets share common spatial grid coordinates.
+Defining the common coordinates just once also ensures that the spatial coordinates for each group cannot become out of sync with one another during operations.
+
+We can still access the coordinates defined in the parent groups from any of the child groups as if they were actually present on the child groups:
+
+.. ipython:: python
+
+    dt.daily.coords
+    dt["daily/lat"]
+
+As we can still access them, we say that the ``lat`` and ``lon`` coordinates in the child groups have been "inherited" from their common parent group.
+
+If we print just one of the child nodes, it will still display inherited coordinates, but explicitly mark them as such:
+
+.. ipython:: python
+
+    print(dt["/daily"])
+
+This helps to differentiate which variables are defined on the datatree node that you are currently looking at, and which were defined somewhere above it.
+
+We can also still perform all the same operations on the whole tree:
+
+.. ipython:: python
+
+    dt.sel(lat=[75], lon=[300])
+
+    dt.std(dim="time")
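
The two rules documented in this patch (exact alignment with every ancestor, and coordinate lookup that falls back to parent groups) can be illustrated without xarray. The following is only a rough pure-Python sketch, not xarray's implementation: ``Node``, ``check_alignment`` and ``lookup_index`` are hypothetical names invented for this illustration, and each dimension is simplified to a plain list of index labels, so that equal lists stand in for "same length and equal indexes".

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Node:
    """Hypothetical toy node: maps each dimension name to its index labels."""

    name: str
    indexes: dict
    parent: Optional["Node"] = None

    @property
    def parents(self) -> Iterator["Node"]:
        """Yield ancestors, from the immediate parent up to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


def check_alignment(child: Node) -> None:
    """Raise ValueError if a dimension shared with any ancestor differs."""
    for ancestor in child.parents:
        # Only dimensions present on both nodes are compared.
        for dim in sorted(child.indexes.keys() & ancestor.indexes.keys()):
            if list(child.indexes[dim]) != list(ancestor.indexes[dim]):
                raise ValueError(
                    f"{child.name!r} is not aligned with "
                    f"{ancestor.name!r} along {dim!r}"
                )


def lookup_index(node: Node, dim: str) -> list:
    """Return the index for dim from the node itself or its nearest ancestor."""
    for candidate in (node, *node.parents):
        if dim in candidate.indexes:
            return candidate.indexes[dim]
    raise KeyError(dim)


# Made-up example values, loosely mirroring the daily/weekly layout above.
root = Node("root", {"lat": [75.0, 72.5], "lon": [200.0, 202.5]})
daily = Node("daily", {"time": list(range(14))}, parent=root)
weekly = Node("weekly", {"time": [0, 7]}, parent=root)  # sibling of daily
nested_weekly = Node("weekly", {"time": [0, 7]}, parent=daily)  # child of daily

check_alignment(daily)  # passes: no dimension is shared with the root
check_alignment(weekly)  # passes: siblings are never compared with each other
try:
    check_alignment(nested_weekly)  # fails: "time" differs from the parent
except ValueError as err:
    print(err)

print(lookup_index(daily, "lat"))  # found on the root, i.e. "inherited"
```

In this toy model, as in the documented behaviour, placing the weekly data beside the daily data rather than beneath it avoids the comparison entirely, which is why the sibling layout is valid while the nested one is rejected.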