Open mfdataset enhancement #9955

Open · pratiman-91 wants to merge 39 commits into main

Conversation

@pratiman-91 (Author)

Added a new argument to open_mfdataset to better handle invalid files.

errors : {'ignore', 'raise', 'warn'}, default 'raise'
        - If 'raise', an invalid dataset will raise an exception.
        - If 'ignore', invalid datasets will be ignored.
        - If 'warn', a warning will be issued for each invalid dataset.
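
A minimal usage sketch of the proposed argument (the file names are hypothetical, and errors= only exists with this PR):

import xarray as xr

# Skip files that cannot be opened instead of failing the whole call;
# errors="warn" would additionally emit one warning per invalid file,
# and errors="raise" (the default) keeps the current behaviour.
ds = xr.open_mfdataset(
    ["good_1.nc", "corrupt.nc", "good_2.nc"],  # hypothetical file names
    errors="ignore",
)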

welcome bot commented Jan 16, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@max-sixty (Collaborator)

I'm not the expert, but this looks reasonable! Any other thoughts?

Assuming no one thinks it's a bad idea, we would need tests.
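
For what it's worth, a rough sketch of what such a test might look like, assuming the proposed errors argument and that the "warn" mode emits a UserWarning (both assumptions at this stage):

import pytest
import xarray as xr

def test_open_mfdataset_errors_warn(tmp_path):
    good = tmp_path / "good.nc"
    xr.Dataset({"v": ("x", [1, 2])}).to_netcdf(good)
    # the second path does not exist; "warn" should skip it with a warning
    with pytest.warns(UserWarning):
        ds = xr.open_mfdataset(
            [str(good), str(tmp_path / "missing.nc")],
            combine="nested",
            concat_dim="x",
            errors="warn",
        )
    assert "v" in ds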

@headtr1ck (Collaborator) left a comment

I think it is a good idea.

But the way it is implemented here seems overly complicated and repetitive. I would suggest reverting the logic: first build up the list wrapped in a single try, and then handle the three cases in the except block.
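
A self-contained sketch of that shape (open_all and its arguments are hypothetical stand-ins for the relevant open_mfdataset internals, not the actual code):

import warnings

import xarray as xr

def open_all(paths1d, errors="raise", **open_kwargs):
    # build up the list, wrapping each open in a single try,
    # and handle the three error modes in one except block
    datasets = []
    for path in paths1d:
        try:
            datasets.append(xr.open_dataset(path, **open_kwargs))
        except OSError:
            if errors == "raise":
                raise
            if errors == "warn":
                warnings.warn(f"Could not open {path!r}; skipping.")
            # errors == "ignore": skip the file silently
    return datasets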

@headtr1ck (Collaborator) left a comment

Almost there.

Also, we should add tests for this.

@pratiman-91 (Author)

@headtr1ck Thanks for the suggestions. I have added two tests (ignore and warn). Also, while testing, I found that the new argument broke combine="nested" due to invalid ids. I have now modified it to produce the correct ids, and it passes the tests. Please review the tests and the latest version.

@pratiman-91 (Author)

Hi @headtr1ck, I have been thinking about the handling of ids. The current version looks like patchwork (I am not happy with it). I think we can create the ids after removing all the invalid datasets from paths1d within the combine == "nested" block. Please let me know what you think.
Thanks!

@pratiman-91 (Author)

@max-sixty Can you please go through the PR? Thanks!

@max-sixty (Collaborator)

I'm admittedly much less familiar with this section of the code. Nothing seems wrong though!

I think we should bias towards merging, so if no one has concerns then I'd vote to merge.

Could we fix the errors in the docs?

@pratiman-91 (Author)

It seems one test failed: test_sparse_dask_dataset_repr (xarray.tests.test_sparse.TestSparseDataArrayAndDataset). It is not related to this PR.

@max-sixty (Collaborator)

Can we fix the errors in the docs?

@pratiman-91 (Author)

pratiman-91 commented May 30, 2025

Can we fix the errors in the docs?

I just rebased the branch. Let's see if this resolves the docs errors. Ah, found the problem.

@pratiman-91 (Author)

@max-sixty all tests passed.

@kmuehlbauer (Contributor) left a comment

@pratiman-91 Thanks for sticking with this. This seems all reasonable to me. I've added a couple of comments and suggestions.

For the tests, I think it would be great to also test for "raise".

@pratiman-91 (Author)

pratiman-91 commented May 30, 2025

@kmuehlbauer Thank you for the suggestions and for fixing the typos. I have made changes based on your suggestions. Please check.

For the tests, I did not write a test for "raise" since it is the default and already being tested by various other tests. If you think it would be beneficial to add another test, then I can write it.

pratiman-91 requested a review from kmuehlbauer on May 30, 2025, 12:26
@kmuehlbauer (Contributor)

For the tests, I did not write a test for "raise" since it is the default and already being tested by various other tests. If you think it would be beneficial to add another test, then I can write it.

Yes, you are right, this is and was the default. No need to add another test.

@kmuehlbauer (Contributor)

kmuehlbauer commented May 30, 2025

@pratiman-91 I've thought a bit about removing the problematic files in the nested case.

To my understanding, we have two corresponding 1D lists (ids and paths1d).

Wouldn't it work to just remember the indices for the failing files and remove them from the existing lists afterwards? Along these lines:

remove = []
for i, path in enumerate(paths1d):
    try:
        open(...)  # shorthand: attempt to open the dataset at path
    except OSError:
        if combine == "nested":
            remove.append(i)

# later, in the combine == "nested" branch

if remove:
    for i in sorted(remove, reverse=True):
        del ids[i]
        del paths1d[i]

Update: we could even use remove for the checks; no need for another variable.

@pratiman-91 (Author)

pratiman-91 commented May 30, 2025

@kmuehlbauer I have thought about this approach before. However, it creates a problem with the 2x2 nested case. Therefore, I settled on recreating the IDs and paths. You can also check here: 3bfaaee

@kmuehlbauer (Contributor)

Thanks for the pointer @pratiman-91. Yes, this has a twist.

I see this is a special case for the nested combine. All our good faith doesn't help in these cases: if the user provides a 2x2 (or NxM) nested list and one or more files are missing from one or more of the sublists, the whole thing explodes. So I'd rather go for "raise" in those cases than have the user see ValueError: The supplied objects do not form a hypercube because sub-lists do not have consistent lengths along dimension X.
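
For illustration (not from the PR), a minimal reproduction of the ragged nested case that ends in this error:

import numpy as np
import xarray as xr

ds = xr.Dataset({"v": ("x", np.arange(2))})
# a ragged nested list, as left behind after silently dropping one file
try:
    xr.combine_nested([[ds, ds], [ds]], concat_dim=["y", "x"])
except ValueError as e:
    print(e)  # ... do not form a hypercube ... inconsistent lengths ...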

Maybe others have some opinion here?

original.isel(x=slice(5), y=slice(4, 8)).to_netcdf(tmp3)
original.isel(x=slice(5, 10), y=slice(4, 8)).to_netcdf(tmp4)
with open_mfdataset(
    [[tmp1, tmp2], ["non-existent-file.nc", tmp3, tmp4]],
@kmuehlbauer (Contributor)

If you remove tmp3 here, your test will fail. So it seems that the user has to provide the correct number of files for the concat/merges they want to achieve, or am I missing something?

@pratiman-91 (Author)

You are correct @kmuehlbauer, removing tmp3 will cause the test to fail. The current implementation expects the input to align with the intended concat/merge structure. However, if a file is missing or invalid and its removal still results in a valid nested list structure, the process should still work as expected. Otherwise, an error is raised to indicate that the input is inconsistent and the concat/merge is not possible.
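
Schematically (reusing the test's variable names):

# [[tmp1, tmp2], ["non-existent-file.nc", tmp3, tmp4]]
#   -> errors="ignore" drops the bad entry ->
# [[tmp1, tmp2], [tmp3, tmp4]]    # consistent sub-lists: combine works
#
# [[tmp1, tmp2], ["non-existent-file.nc", tmp4]]
#   -> dropping the bad entry leaves ->
# [[tmp1, tmp2], [tmp4]]          # ragged sub-lists: the hypercube error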

@kmuehlbauer (Contributor)

I agree with that. The question is whether xarray should take on that maintenance burden for this IMHO rare corner case of nested combines. But let's hear others' opinions.

@pratiman-91 (Author)

@kmuehlbauer @max-sixty Any update on this PR?
Thanks!

-Pratiman

Successfully merging this pull request may close these issues:

better handling of invalid files in open_mfdataset