Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

map_over_datasets: skip empty nodes #10042

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

mathause
Copy link
Collaborator

@mathause mathause commented Feb 10, 2025


  • misses tests and docs but I'd like to get some feedback first
  • needs some add. logic to only check the output on non-empty nodes and to ensure multi-output functions are correct
  • no good way for a proper deprecation without a keyword

@mathause mathause marked this pull request as draft February 10, 2025 17:37
@Illviljan
Copy link
Contributor

A interpolation use case that doesn't crash with this PR:

import numpy as np

import xarray as xr

number_of_files = 700
number_of_groups = 5
number_of_variables = 10

datasets = {}
for f in range(number_of_files):
    for g in range(number_of_groups):
        # Create random data
        time = np.linspace(0, 50 + f, 1 + 1000 * g)
        y = f * time + g

        # Create dataset:
        ds = xr.Dataset(
            data_vars={
                f"temperature_{g}{i}": ("time", y)
                for i in range(number_of_variables // number_of_groups)
            },
            coords={"time": ("time", time)},
        ).chunk()

        # Prepare for xr.DataTree:
        name = f"file_{f}/group_{g}"
        datasets[name] = ds
dt = xr.DataTree.from_dict(datasets)

# %% Interpolate to same time coordinate
def ds_interp(ds, *args, **kwargs):
    return ds.interp(*args, **kwargs)


new_time = np.linspace(0, 100, 50)
dt_interp = dt.map_over_datasets(
    ds_interp, kwargs=dict(time=new_time, assume_sorted=True)
)

@mathause
Copy link
Collaborator Author

Thanks for the example. This PR would also close #10013. This would be a huge plus for me. Not being able to subtract a ds from a datatree makes it extremely cumbersome. However, this implies that the binary ops are implemented using map_over_datasets and means there is a considerable behavior change.

@mathause
Copy link
Collaborator Author

@TomNicholas do you see any chance this PR might get merged (after adding tests etc. obviously)? Are there discussions beside #9693 that I am missing?

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Feb 27, 2025
@TomNicholas
Copy link
Member

Hey @mathause - sorry for forgetting about this - I've been busy.

I think something like this should get merged, but there are various small and fairly arbitrary choices to quibble over. They are basically all already mentioned in #9693 though.

this implies that the binary ops are implemented using map_over_datasets and means there is a considerable behavior change.

I don't understand this statement though - aren't binary ops already implemented using map_over_datasets?

return map_over_datasets(ds_binop, self, other)

We're changing the behaviour, but changing it to be closer to the old datatree, which is what a lot of users expect anyway.

@mathause mathause closed this Mar 6, 2025
@mathause mathause reopened this Mar 6, 2025
@mathause mathause marked this pull request as ready for review March 7, 2025 04:42
@mathause
Copy link
Collaborator Author

mathause commented Mar 7, 2025

I think this is ready for review


Hey @mathause - sorry for forgetting about this - I've been busy.

No worries! Thanks for considering this PR!

I think something like this should get merged, but there are various small and fairly arbitrary choices to quibble over. They are basically all already mentioned in #9693 though.

The one unclear choice from #9693 was the comment by @shoyer #9693 (comment):

I'm not sure whether or not to call the mapped over function for nodes that only define coordinates. Certainly I would not blindly copy coordinates from otherwise empty nodes onto the result, because those coordinates may no longer be relevant on the result.

I currently use DataTree.has_data, this includes nodes that only have coords (although nodes which inherit coords are excluded (I think)). I don't see a way to be clever about these nodes.

this implies that the binary ops are implemented using map_over_datasets and means there is a considerable behavior change.

I don't understand this statement though - aren't binary ops already implemented using map_over_datasets?

Yes sorry that was not clear. I just wanted to say that binary ops are also affected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Projects
None yet
Development

Successfully merging this pull request may close these issues.

datatree gets dis-aligned in binary op map_over_datasets throws error on nodes without datasets
3 participants