Adding open_groups to BackendEntryPointEngine, NetCDF4BackendEntrypoint, and H5netcdfBackendEntrypoint #9243
Conversation
I know we touched on this a bit during our meeting on Tuesday, but I think tests for … Let me know if I have that right and what you all think.
Sounds good @eni-awowale! You may even be able to use basically the same groups test for multiple backends by adding it to …
Having an included file to test against is a good idea, we just want it to be small.
Where are these currently? We don't want to add files to the main repo, preferring to put them here: https://github.com/pydata/xarray-data
So, instead of adding the same type of tests to … @TomNicholas
Okay, great, there might already be some sample files in there with groups that I can use!
I didn't even know those existed... @max-sixty I know you have thought about example datasets for testing purposes - do you have an opinion on whether new files for testing should go in that directory or the separate repository?
We want tests to be runnable without a network connection. If we need new files for testing, which I agree would be a good idea in this case, please add them here. Just keep them as small as possible (the current test files are all under 10 KB in size, most under 1 KB).
xarray-data is for sample/tutorial datasets, which can be much larger (enough data to make an interesting plot) than what we use for test data.
…reate netcdf4 file, on the fly
invalid_netcdf=None,
phony_dims=None,
decode_vlen_strings=True,
driver=None,
driver_kwds=None,
These shouldn't be here, right? They should all fall under `**kwargs`.
Or maybe they should be in the specific backend but not in `common.py`?
Yeah, so these were added from PR #9199 for adding back the backend-specific keyword arguments. I pulled this into my branch after it was merged to main. They are not listed individually in `common.py`; there they are consolidated as `**kwargs`.
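For illustration, here is a minimal sketch of that consolidation pattern under assumed names (the classes and keywords below are hypothetical, not xarray's actual entrypoints): the generic entrypoint only sees backend-specific options through `**kwargs`, while the concrete backend spells them out.

```python
# Hypothetical sketch of the pattern described above; class names and
# keyword arguments are illustrative, not xarray's actual backend classes.


class GenericEntrypoint:
    def open_datatree(self, filename_or_obj, **kwargs):
        # Backend-specific options (driver, phony_dims, ...) travel inside
        # **kwargs here and are never listed individually.
        raise NotImplementedError


class H5StyleEntrypoint(GenericEntrypoint):
    def open_datatree(self, filename_or_obj, *, driver=None, phony_dims=None, **kwargs):
        # The concrete backend names the keyword arguments it understands.
        return {"driver": driver, "phony_dims": phony_dims}


print(H5StyleEntrypoint().open_datatree("file.nc", driver="core"))
```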
@@ -466,19 +494,23 @@ def open_datatree(
driver=driver,
driver_kwds=driver_kwds,
)
# Check for a group and make it a parent if it exists
if group:
    parent = NodePath("/") / NodePath(group)
@eni-awowale this is how you should join paths
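For reference, this is the joining behavior that line relies on, sketched with `pathlib.PurePosixPath`; xarray's internal `NodePath` is a pure-path style class, so the `/` operator composes group paths the same way (a sketch, not the PR's code).

```python
from pathlib import PurePosixPath

# Joining a root path with a group path, as in the diff above
# (NodePath("/") / NodePath(group)); PurePosixPath shows the same behavior.
group = "child/grandchild"
parent = PurePosixPath("/") / PurePosixPath(group)
print(parent)        # /child/grandchild
print(parent.parts)  # ('/', 'child', 'grandchild')
```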
@@ -837,6 +837,43 @@ def open_datatree(
return backend.open_datatree(filename_or_obj, **kwargs)
We could have a default implementation here that calls `open_groups`, i.e. replace the line above with:

groups_dict = backend.open_groups(filename_or_obj, **kwargs)
return DataTree.from_dict(groups_dict)
The idea being that then backend developers don't actually have to implement `open_datatree` if they have implemented `open_groups`...
This was sort of discussed here (@keewis) #7437 (comment), but this seems like a rabbit hole that should be left for a future PR.
Not really, I was arguing that having any one of `open_dataarray`, `open_dataset`, and `open_datatree` allows us to provide (somewhat inefficient) default implementations for the others. However, `open_groups` has a much closer relationship to `open_datatree`, so I think having a default implementation for `open_datatree` is fine (we just need to make sure that a backend that provides neither `open_groups` nor `open_datatree` doesn't complain about `open_groups` not existing if you call `open_datatree`).
So yeah, this might become a rabbit hole.
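As a rough sketch of that default-implementation idea (class name, method bodies, and the import path are assumptions, not the code in this PR): the base class could build the tree from the groups dict and phrase its error in terms of `open_datatree`, so a backend that implements neither method still gets a sensible message.

```python
# Rough sketch only; names and the import location are assumptions,
# not xarray's actual backend base class.


class BackendEntrypointSketch:
    def open_groups_as_dict(self, filename_or_obj, **kwargs):
        # Backends that support groups override this to return a mapping
        # like {"/": Dataset, "/child": Dataset, ...}.
        raise NotImplementedError(
            f"{type(self).__name__} does not implement open_datatree/open_groups"
        )

    def open_datatree(self, filename_or_obj, **kwargs):
        # Default implementation: build the tree from the groups dict, so a
        # backend only has to implement open_groups_as_dict.  The error
        # raised above mentions open_datatree, so a backend implementing
        # neither method still gets a sensible message from this call.
        from xarray.core.datatree import DataTree  # assumed import path

        groups_dict = self.open_groups_as_dict(filename_or_obj, **kwargs)
        return DataTree.from_dict(groups_dict)
```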
Okay I see. That seems related, but also like a totally optional convenience feature that we should defer to later.
Hi folks! I think this PR is a couple of commits away from merging, but there are a couple of things to sort out.
Edit:
Maybe the typing crowd can help? cc @max-sixty, @Illviljan, @headtr1ck
It's this type of problem: https://stackoverflow.com/questions/73603289/why-doesnt-parameter-type-dictstr-unionstr-int-accept-value-of-type-di
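For context, a minimal sketch of the invariance issue behind that link (function names are made up): `dict` is invariant in its value type, so mypy rejects a `dict[str, str]` where a `dict[str, str | int]` is expected, while the read-only `Mapping` is covariant in its value type and accepts it.

```python
from __future__ import annotations

from collections.abc import Mapping


def takes_dict(d: dict[str, str | int]) -> None: ...
def takes_mapping(d: Mapping[str, str | int]) -> None: ...


d: dict[str, str] = {"a": "b"}
takes_dict(d)     # mypy error: incompatible type (dict is invariant)
takes_mapping(d)  # OK: Mapping is covariant in its value type
```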
xarray/core/datatree.py (outdated)
d_cast = cast(dict, d)
root_data = d_cast.pop("/", None)
Mypy is correct here: a Mapping does not include a `.pop`, and ignoring the typing errors doesn't solve the bug. xarray uses Mappings frequently, and for example `xr.core.utils.FrozenDict(dict(a=3)).pop("a", None)` fails, so it's a real issue.
Either explicitly convert it to a dict, i.e. `d_cast = dict(d)`, or refactor the code to not use the `.pop`, since I'm not so sure it's needed.
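A small sketch of the point, using `types.MappingProxyType` as a stand-in for a read-only mapping in the spirit of `FrozenDict` (the function names are hypothetical): `.pop` is not part of the `Mapping` interface and fails at runtime on read-only mappings, while copying into a real `dict` first works.

```python
from collections.abc import Mapping
from types import MappingProxyType


def pop_root_buggy(d: Mapping[str, int]):
    # mypy error: "Mapping[str, int]" has no attribute "pop"; also an
    # AttributeError at runtime for read-only mappings such as mappingproxy.
    return d.pop("/", None)


def pop_root_fixed(d: Mapping[str, int]):
    d_copy = dict(d)  # explicit copy gives a real dict, which has .pop
    root = d_copy.pop("/", None)
    return root, d_copy


print(pop_root_fixed(MappingProxyType({"/": 1, "/child": 2})))
# (1, {'/child': 2})
```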
Thanks, I updated it to explicitly convert the type to a dict. The mypy 3.9 tests are still passing, but the CI mypy check seems to be returning the same error as before explicitly converting it to a dict.
@TomNicholas and @keewis, the issue we just moved over, #9336, seems to be related to the latest batch of test failures.
Yay, all checks are passing! Does someone want to give this a quick look before merging?
def open_groups_as_dict(
    self,
    filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
    **kwargs: Any,
Maybe we should not make the same mistake as with `open_dataset` and prevent Liskov errors by dropping the `**kwargs: Any,` line here. If the abstract method supports any kwargs, so must all subclass implementations, which is not what we want.
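A sketch of the Liskov point with hypothetical, simplified classes: once the base method accepts `**kwargs: Any`, mypy flags any override that drops them, because the override no longer accepts every call the base signature promises to accept.

```python
from __future__ import annotations

from typing import Any


class BaseEntrypoint:
    def open_groups_as_dict(self, filename_or_obj: str, **kwargs: Any) -> dict:
        raise NotImplementedError


class MyBackend(BaseEntrypoint):
    # mypy: Signature of "open_groups_as_dict" incompatible with supertype
    # "BaseEntrypoint" -- the override would have to keep **kwargs to be a
    # valid substitute for the base class.
    def open_groups_as_dict(self, filename_or_obj: str, *, group: str | None = None) -> dict:
        return {}
```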
@headtr1ck from my understanding, I think the `**kwargs` were added back to fix this issue: #9135
Hmmm, not sure.
But since this is the same problem in all other backend methods, I'm fine with leaving it as it is (and possibly changing it in a future PR altogether).
Sounds good, we can revisit this in another PR.
decode_vlen_strings=True,
driver=None,
driver_kwds=None,
**kwargs,
This should be obsolete as well, when you remove it from the abstract method.
persist=False,
lock=None,
autoclose=False,
**kwargs,
Same here.
whats-new.rst
api.rst
@TomNicholas, @shoyer, @owenlittlejohns, and @flamingbear