HDF5 support #7690
Conversation
@lhoestq This is ready for review now. Note that it doesn't support all HDF5 files (and I don't think that's worth attempting); the biggest assumption is that the first dimension of each dataset corresponds to rows in the split.

A few to-dos can, I think, be left for future PRs (which I am happy to do/help with — just this one is already huge 😄):
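The row-axis assumption can be sketched as follows. This is a hypothetical illustration, not code from the PR: `infer_num_rows` and the plain-dict `dataset_map` are stand-ins for how a loader might validate that every dataset in the file agrees on its first dimension.

```python
import numpy as np

def infer_num_rows(dataset_map):
    """Return the shared length of axis 0, raising if datasets disagree.

    Hypothetical helper: axis 0 of every dataset is treated as the row axis,
    so all datasets in one file must have the same first dimension.
    """
    lengths = {path: arr.shape[0] for path, arr in dataset_map.items()}
    unique = set(lengths.values())
    if len(unique) != 1:
        raise ValueError(f"Datasets disagree on row count: {lengths}")
    return unique.pop()

# Stand-ins for h5py datasets (slicing/shape semantics are the same):
datasets = {
    "/images": np.zeros((100, 32, 32)),  # 100 rows of 32x32 arrays
    "/labels": np.zeros((100,)),         # 100 scalar rows
}
print(infer_num_rows(datasets))  # → 100
```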
@@ -166,6 +166,7 @@
    "aiohttp",
    "elasticsearch>=7.17.12,<8.0.0",  # 8.0 asks users to provide hosts or cloud_id when instantiating ElasticSearch(); 7.9.1 has legacy numpy.float_ which was fixed in https://github.com/elastic/elasticsearch-py/pull/2551.
    "faiss-cpu>=1.8.0.post1",  # Pins numpy < 2
    "h5py",  # FIXME: probably needs a lower bound
"h5py", # FIXME: probably needs a lower bound | |
"h5py>=2.3", |
Probably the most recent feature we need is vlen/complex/compound support, which according to the docs was added in 2.3. The current version as of writing is 3.14.
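For context on the complex/compound handling mentioned above, here is a rough sketch of how such columns could be split into several plain columns. The helper names and the dotted naming scheme are hypothetical, not taken from the PR:

```python
import numpy as np

def split_complex(name, arr):
    # A complex column becomes two float columns (naming scheme is hypothetical).
    return {f"{name}.real": arr.real, f"{name}.imag": arr.imag}

def split_compound(name, arr):
    # A compound (structured) column becomes one column per field.
    return {f"{name}.{field}": arr[field] for field in arr.dtype.names}

z = np.array([1 + 2j, 3 + 4j])
rec = np.array([(1, 2.5), (3, 4.5)], dtype=[("a", "i4"), ("b", "f8")])
cols = {**split_complex("z", z), **split_compound("rec", rec)}
print(sorted(cols))  # → ['rec.a', 'rec.b', 'z.imag', 'z.real']
```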
has_zero_dims = any(_has_zero_dimensions(feature) for feature in relevant_features.values())
# FIXME: pyarrow.lib.ArrowInvalid: list_size needs to be a strict positive integer
if not has_zero_dims:
    pa_table = table_cast(pa_table, self.info.features.arrow_schema)
This is not ideal -- `table_cast` does not support zero-dim arrays, so I skip calling it if there are any zero-dim arrays in the table. Probably `table_cast` should just be updated to support zero-dim arrays.
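The guard described here can be illustrated with a minimal sketch. Everything in it is a stand-in: `has_zero_dims`, the `F` feature class, and its `.shape` attribute are hypothetical, showing only the control flow of "skip the cast when any feature has a 0-length dimension".

```python
def has_zero_dims(features):
    # Hypothetical check: any feature whose shape contains a 0-length dim
    # would make the fixed-size-list cast fail, so the cast is skipped.
    return any(0 in getattr(f, "shape", ()) for f in features.values())

class F:
    """Stand-in for an array feature with a shape."""
    def __init__(self, shape):
        self.shape = shape

feats = {"a": F((3, 0)), "b": F((2, 2))}
print(has_zero_dims(feats))  # → True, so the cast would be skipped
```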
end = min(start + effective_batch, num_rows)
batch_dict = {}
for path, dset in dataset_map.items():
    arr = dset[start:end]
This is where the actual read from disk happens.
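The batched read loop can be sketched in isolation. Plain NumPy arrays stand in for h5py datasets below (slicing semantics are the same); with real h5py datasets, the `dset[start:end]` slice is the point where data is actually read from disk.

```python
import numpy as np

# Stand-ins for h5py datasets; names are illustrative only.
dataset_map = {"x": np.arange(10), "y": np.arange(10) * 2}
num_rows, effective_batch = 10, 4

batches = []
for start in range(0, num_rows, effective_batch):
    end = min(start + effective_batch, num_rows)
    # With h5py, this slice triggers the on-disk read for this batch.
    batch_dict = {path: dset[start:end] for path, dset in dataset_map.items()}
    batches.append(batch_dict)

print([len(b["x"]) for b in batches])  # → [4, 4, 2]
```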
This PR adds support for tabular HDF5 files by converting their rows to Arrow tables. It supports columns of the usual dtypes, including arrays of up to 5 dimensions, and handles complex/compound types by splitting them into several columns. All datasets within the HDF5 file should have rows on the first dimension (groups/subgroups are still allowed). Closes #3113.
Replaces #7625, which only supported a relatively small subset of HDF5.