Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hdf5 read with broken links #904

Merged
merged 14 commits into from
Mar 11, 2025

Conversation

toqduj
Copy link
Contributor

@toqduj toqduj commented Mar 4, 2025

Fix suggestion for #888
Fixed by adding a try.. except block to catch the key error, changing it to a warning, and returning the warning as a string encapsulated in a numpy array to expose to the user.

Checklist

  • Add a Changelog entry
  • Add the ticket number which this PR closes to the comment section

toqduj and others added 11 commits March 4, 2025 16:31
Resolves issue with loading Dectris-compressed data (lzf compression if I'm not mistaken).
Added import of hdf5plugin
ignore flake8 here.
Fighting flake8
Fixed the years of previous changelog entries, and added mine.
Fixing a trailing whitespace for flake8's pleasure.
merging changes from web (hdf5plugin) with local changes (HDF5 passing).
message on the graceful handling of broken external links in HDF5 files.
…BAMresearch/tiled into fix_hdf5_read_with_broken_links

syncing web and local commits.
@toqduj
Copy link
Contributor Author

toqduj commented Mar 4, 2025

Accidentally has the fixes from #903 as well, I think.

@danielballan
Copy link
Member

danielballan commented Mar 5, 2025

I worry that this error will pass too silently, and may show up as a confusing bug far downstream, such as in some plotting code trying to operate on a busted data set.

Let me take another look at how to make tree pass over errors more cleanly.

@toqduj
Copy link
Contributor Author

toqduj commented Mar 6, 2025

I understand your concerns, would be good to list issues in an easily accessible location.

But the problems run deeper than just with the tree functionality (I'll explain in a moment), so the suggested fix solves a real practical issue with data access but at the cost of a potential increased risk of difficult troubleshooting later.

Without a fix in place, I can access some keys and data in the files, but not get keys at all when I approach the broken link:

sorted(client.get("")["MOUSE_20250124_2_104_processed_250128_092417/"].keys())
# ['entry1', 'processed', 'saxs']
sorted(client.get("")["MOUSE_20250124_2_104_processed_250128_092417/entry1"].keys())
# error
# accessing data if you happen to know where it is, seems to work: 
client.get("MOUSE_20250124_2_104_processed_250128_092417/entry1/instrument/detector00/data")[:,:,:].compute()
# array... 

With the fix in place, I can access the element keys of the files with broken links right up until the broken link, but otherwise explore the rest of the content perfectly normally:

sorted(client.get("")["MOUSE_20250124_2_104_processed_250128_092417/entry1/processing/direct_beam_profile"].keys())
# ['beam_analysis', 'data', 'instrument']

(the offending item is in 'data' and 'instrument', which links to the beam measurement file).

So the question I guess boils down to what to do with non-critical warnings about data content.. In an ideal world, we would never have broken external links, but in reality, this happens easily when data gets reorganised or copied to other locations. When this does not affect the main data content but "just" some supplementary data put there for completeness, My opinion is that this should not invalidate access to the entire file.

Is there a good way to warn users of broken data in a more obvious way?

@toqduj
Copy link
Contributor Author

toqduj commented Mar 6, 2025

(and yes, I'll fix my file structuring to add the data to the internal tree rather than rely on external links.. disk space is cheap).

@toqduj
Copy link
Contributor Author

toqduj commented Mar 6, 2025

One more comment, a similar resolution is essentially being applied here:

/Users/bpauw/Code/tiled/tiled/adapters/hdf5.py:157: UserWarning: The dataset connection is of object type, using a Python-only feature of h5py that is not supported by HDF5 in general. Read more about that feature at https://docs.h5py.org/en/stable/special.html. Consider using a fixed-length field instead. Tiled will serve an empty placeholder, unless the object is of size 1, where it will attempt to repackage the data into a numpy array.

@danielballan
Copy link
Member

The mysterious issue where things break when you approach the broken link is due to a performance optimization where the server will proactively traverse the file and send a batch of nested content in a single request. Before we added this, traversing highly nested HDF5 file (e.g. common NeXus files) was extremely slow due to the stacked-up latency of many requests with small responses.

As an aside, that behavior is enabled/controlled by this method on the Adapter.

def inlined_contents_enabled(self, depth: int) -> bool:
return depth <= INLINED_DEPTH


You make a fair point that there is precedent for a similar approach.

if value.dtype == numpy.dtype("O"):
warnings.warn(
f"The dataset {key} is of object type, using a "
"Python-only feature of h5py that is not supported by "
"HDF5 in general. Read more about that feature at "
"https://docs.h5py.org/en/stable/special.html. "
"Consider using a fixed-length field instead. "
"Tiled will serve an empty placeholder, unless the "
"object is of size 1, where it will attempt to repackage "
"the data into a numpy array."
)
check_str_dtype = h5py.check_string_dtype(value.dtype)
if check_str_dtype.length is None:
dataset_names = value.file[self._file.name + "/" + key][...][()]
if value.size == 1:
arr = numpy.array(dataset_names)
return from_dataset(arr)
return from_dataset(numpy.array([]))

Is there a good way to warn users of broken data in a more obvious way?

There is an HTTP status code 410 Gone which we've discussed adding for various reasons, such as tombstone pages for data that has been deleted for space reasons. I could imagine returning a sentinel here which the server represents to the client in a special way and then the client can represent clearly to the user, to communicate, "Yes three was data here but it is not accessible."

That would require some wide-range changes (adapter, server, and client working together) and might best be approached as part of a cohesive effort to support this kind of thing in Tiled.

So, I'd be OK with merging this PR as a stop-gap at least until we get around to anything better. What do you think?

@toqduj
Copy link
Contributor Author

toqduj commented Mar 7, 2025

That's a perfectly workable solution for me, I'll add a TODO: note to the segment to indicate that that needs to be replaced with a better solution in due time. I like the 410 Gone status code, it allows the end user to deal with the error as they see fit.

In the mean time, I'll work on fixing my data ;)

@danielballan danielballan merged commit ed90338 into bluesky:main Mar 11, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants