HDF5 support #7690

Open · wants to merge 8 commits into main

Conversation

@klamike commented Jul 18, 2025

This PR adds support for tabular HDF5 files by converting each batch of rows to an Arrow table. It supports columns with the usual dtypes, including arrays of up to 5 dimensions, and handles complex/compound types by splitting them into several columns. All datasets within the HDF5 file must have rows on the first dimension (groups/subgroups are still allowed). Closes #3113.

Replaces #7625, which supported only a relatively small subset of HDF5.

@klamike klamike marked this pull request as ready for review July 19, 2025 03:52
@klamike (Author) commented Jul 19, 2025

@lhoestq This is ready for review now. Note that it doesn't support all HDF5 files (and I don't think that's worth attempting)... the biggest assumption is that the first dimension of each dataset corresponds to rows in the split.
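That first-dimension assumption implies every dataset in the file must agree on the number of rows. A hypothetical validation helper (not from the PR; the name `check_row_counts` is made up) could look like this:

```python
import h5py
import numpy as np

def check_row_counts(path):
    """Check that every dataset's first dimension has the same length."""
    lengths = {}

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            lengths[name] = obj.shape[0] if obj.shape else 0

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    counts = set(lengths.values())
    if len(counts) > 1:
        raise ValueError(f"Inconsistent row counts: {lengths}")
    return counts.pop() if counts else 0

# Two datasets, both with 5 rows on the first axis: consistent.
with h5py.File("rows.h5", "w") as f:
    f.create_dataset("a", data=np.zeros(5))
    f.create_dataset("b", data=np.zeros((5, 2)))

print(check_row_counts("rows.h5"))  # 5
```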

@klamike (Author) commented Jul 23, 2025

A few to-dos which I think can be left for future PRs (which I am happy to do/help with -- just this one is already huge 😄 ):

@@ -166,6 +166,7 @@
"aiohttp",
"elasticsearch>=7.17.12,<8.0.0", # 8.0 asks users to provide hosts or cloud_id when instantiating ElasticSearch(); 7.9.1 has legacy numpy.float_ which was fixed in https://github.com/elastic/elasticsearch-py/pull/2551.
"faiss-cpu>=1.8.0.post1", # Pins numpy < 2
"h5py", # FIXME: probably needs a lower bound
@klamike (Author) commented Jul 23, 2025

Suggested change
"h5py", # FIXME: probably needs a lower bound
"h5py>=2.3",

Probably the most recent feature we need is vlen/complex/compound support, which according to the docs was added in h5py 2.3. The current version as of writing is 3.14.
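For context on the compound/complex handling the version bound is about, here is a small sketch of what such data looks like on the h5py side and how it can be split into per-field columns (the column-naming scheme here is hypothetical, not necessarily the one the PR uses):

```python
import h5py
import numpy as np

# A compound dtype with two named fields, plus a complex-valued dataset
# (h5py stores complex numbers as a compound of real/imaginary parts).
compound = np.dtype([("x", np.float64), ("y", np.float64)])
data = np.zeros(3, dtype=compound)
data["x"] = [1.0, 2.0, 3.0]
data["y"] = [4.0, 5.0, 6.0]

with h5py.File("compound.h5", "w") as f:
    f.create_dataset("points", data=data)
    f.create_dataset("z", data=np.array([1 + 2j, 3 + 4j, 5 + 6j]))

with h5py.File("compound.h5", "r") as f:
    pts = f["points"][...]
    z = f["z"][...]

# Split into several flat columns, one per field/component.
cols = {
    "points.x": pts["x"],
    "points.y": pts["y"],
    "z.real": z.real,
    "z.imag": z.imag,
}
print(sorted(cols))
```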

Comment on lines +119 to +122
has_zero_dims = any(_has_zero_dimensions(feature) for feature in relevant_features.values())
# FIXME: pyarrow.lib.ArrowInvalid: list_size needs to be a strict positive integer
if not has_zero_dims:
pa_table = table_cast(pa_table, self.info.features.arrow_schema)
@klamike (Author) commented Jul 23, 2025

This is not ideal -- table_cast does not support zero-dim arrays, so I skip calling it if there are any zero-dim arrays in the table. Probably table_cast should just be updated to support zero-dim arrays.

end = min(start + effective_batch, num_rows)
batch_dict = {}
for path, dset in dataset_map.items():
arr = dset[start:end]
@klamike (Author)

This is where the actual read from disk happens.
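The snippet above is the core of a batched read loop: slicing `dset[start:end]` is what triggers the on-disk HDF5 read, one batch at a time, so memory stays bounded by the batch size. A self-contained sketch of the pattern (the file name and batch size are made up for illustration):

```python
import h5py
import numpy as np

with h5py.File("batched.h5", "w") as f:
    f.create_dataset("x", data=np.arange(10))

batch_size = 4
total = 0
with h5py.File("batched.h5", "r") as f:
    dset = f["x"]
    num_rows = dset.shape[0]
    for start in range(0, num_rows, batch_size):
        end = min(start + batch_size, num_rows)
        arr = dset[start:end]  # the actual read from disk happens here
        total += int(arr.sum())

print(total)  # 45 = sum of 0..9, read in batches of at most 4
```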

Successfully merging this pull request may close these issues.

Loading Data from HDF files