Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate iterators.py for datatree. #8879

Merged
merged 12 commits into from
Apr 11, 2024

Conversation

owenlittlejohns
Copy link
Contributor

@owenlittlejohns owenlittlejohns commented Mar 26, 2024

This PR continues the overall work of migrating DataTree into xarray.

iterators.py does not have direct tests. In discussions with @TomNicholas and @flamingbear, we concurred that other unit tests utilise this functionality.

  • Closes migration step for iterators.py Track merging datatree into xarray #8572
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

from abc import abstractmethod
from collections import abc
from typing import Callable, Iterator, List, Optional
from collections.abc import Iterator
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looked a bit unexpected, but typing.Iterator is a deprecated alias for collections.abc.Iterator

xarray/core/iterators.py Outdated Show resolved Hide resolved
@@ -11,9 +14,9 @@ class AbstractIter(abc.Iterator):
def __init__(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering this __init__, using @dataclass could be an option here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Illviljan - sorry for not replying to this comment before. In the latest PR, the AbstractIter class was removed in favour or importing directly from anytree, so I think this comment is now addressed. Let me know if not, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want, LevelOrderIter can still use @dataclass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't familiar with this decorator prior to this PR, but it looks like it generates some magic methods for the class. I'm a little wary of adding them (for example, do we want LevelOrderIter.__eq__?). Is there a strong argument here for this class needing them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make this a dataclass, but I don't think we need to bother. The main advantage of using dataclass is automatically defining a bunch of methods that you know you want, but here we aren't defining a bunch of property methods / comparison operators so it wouldn't save us many lines of code. It's also nice to be able to just say "this came directly from anytree as-is".

xarray/core/iterators.py Outdated Show resolved Hide resolved
xarray/core/iterators.py Outdated Show resolved Hide resolved
@TomNicholas
Copy link
Member

As this module comes directly from the anytree library, we might need to copy its license into xarray/licenses.

@Illviljan Illviljan added the topic-DataTree Related to the implementation of a DataTree class label Mar 27, 2024

return iterators.PreOrderIter(self)
return PreOrderIter(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we currently have two patterns of iteration used: depth-first (PreOrderIter) and breadth-first (LevelOrderIter). It looks like PreOrderIter is used in TreeNode.subtree, and LevelOrderIter is used in mapping.diff_treestructure. diff_treestructure is called inside check_isomorphic, and every other time we iterate over the tree it just calls .subtree.

Why the distinction? In diff_treestructure when comparing two trees with multiple points at which their structure deviates, the order of iteration will affect which deviation is raised as an error first. @flamingbear and I think here is it more intuitive to raise errors from the top-level of the tree first, so we agree that LevelOrderIter is the right choice here.

For mapping over the nodes of the tree in .subtree, I had previously thought that there was a good reason to use PreOrderIter. My reasoning was to do with how dt['/a/b/c/d/'] = data would immediately create the nodes /a, /a/b, /a/b/c even if they didn't already exist. But actually I don't think that's relevant here.

Another reason to use PreOrderIter might be around using map_over_subtree to map an operation that then fails. For example taking dt.mean(dim='time') when some of the nodes doesn't have a time dimension. Again the order of iteration affects which node will raise the first error.

However, if we actually think that LevelOrderIter is fine for the .subtree case too, we can potentially simplify this PR considerably by replacing the whole AbstractIter/PreOrderIter/LevelOrderIter framework with just a single iterate_breadth_first function that returns an iterator, and use that in all cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also then we wouldn't need the anytree license)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomNicholas - I've made some updates (or rather nabbed some work @flamingbear did and made some mypy tweaks).

A couple of things here:

  • This now contains the single class, but hasn't wrapped it in a function. I think it gets at what you were after, but just FYI. (Although, part of me wonders if this now warrants a module all of it's own?)
  • I left the anytree license in the PR, because the LevelOrderIter is still 99% the code from anytree.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now contains the single class, but hasn't wrapped it in a function.

That's fine, thanks.

(Although, part of me wonders if this now warrants a module all of it's own?)

If you wanted to move it to treenode.py instead that would also make sense. But I don't have a strong opinion, and it's trivial to change later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay - I'm happy to leave this in it's own module for now.

@TomNicholas TomNicholas mentioned this pull request Apr 9, 2024
27 tasks
Comment on lines 13 to 14
"""Iterate over tree applying level-order strategy starting at `node`.
This is the iterator used by `DataTree` to traverse nodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Iterate over tree applying level-order strategy starting at `node`.
This is the iterator used by `DataTree` to traverse nodes.
"""
Iterate over tree applying level-order strategy starting at `node`.
This is the iterator used by `DataTree` to traverse nodes.

for ``node``.
maxlevel : int, optional
Maximum level to descend in the node hierarchy.
Examples
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Examples
Examples

Comment on lines 62 to 63

"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""
"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the suggested whitespace changes in this commit.

Well I made 2.5 of them - I left the first line of the class documentation string for LevelOrderIter on the same line, as this seems consistent with other documentation strings in the repository. I did de-dent the second line, though, because that definitely looked a bit off.

@@ -337,12 +320,12 @@ def test_descendants(self):
descendants = root.descendants
expected = [
"b",
"c",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test has no typing, why doesn't mypy complain?

New test files shouldn't be allowed to stay untyped:

"xarray.tests.*",

Copy link
Contributor Author

@owenlittlejohns owenlittlejohns Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the values of expected, I'm guessing that because it's all just built-in Python types (a list of str values) mypy can understand that (here is a random similar example from test_dataset.py.

For the places where create_test_tree is called (so root in this test), the typing is done within that test function, so my guess is that mypy is pulling the typing from there (see L242 - L250).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave worrying about mypy on tests for #8926

Comment on lines +497 to +498
"/set2",
"/set3",
Copy link
Member

@TomNicholas TomNicholas Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Interesting how few tests explicit rely on the order of iteration.

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

@@ -337,12 +320,12 @@ def test_descendants(self):
descendants = root.descendants
expected = [
"b",
"c",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave worrying about mypy on tests for #8926

@@ -11,9 +14,9 @@ class AbstractIter(abc.Iterator):
def __init__(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make this a dataclass, but I don't think we need to bother. The main advantage of using dataclass is automatically defining a bunch of methods that you know you want, but here we aren't defining a bunch of property methods / comparison operators so it wouldn't save us many lines of code. It's also nice to be able to just say "this came directly from anytree as-is".

@flamingbear
Copy link
Member

This looks good to me now too @TomNicholas are you waiting for another approval?

@TomNicholas TomNicholas merged commit 1d43672 into pydata:main Apr 11, 2024
31 checks passed
@owenlittlejohns owenlittlejohns deleted the DAS-2063-migrate-iterators branch April 15, 2024 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Projects
Development

Successfully merging this pull request may close these issues.

4 participants