HDF5 support #7690
Conversation
@lhoestq This is ready for review now. Note that it doesn't support all HDF5 files (and I don't think that's worth attempting); the biggest assumption is that the first dimension of each dataset corresponds to rows in the split.

A few to-dos can, I think, be left for future PRs (which I am happy to do/help with — just this one is already huge 😄):
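The row-axis assumption can be sketched as follows. This is a hypothetical illustration, not code from the PR: `infer_num_rows` and the plain-dict `dataset_map` are stand-ins for how a loader might validate that every dataset in the file agrees on its first dimension.

```python
import numpy as np

def infer_num_rows(dataset_map):
    """Return the shared length of axis 0, raising if datasets disagree.

    Hypothetical helper: axis 0 of every dataset is treated as the row axis,
    so all datasets in one file must have the same first dimension.
    """
    lengths = {path: arr.shape[0] for path, arr in dataset_map.items()}
    unique = set(lengths.values())
    if len(unique) != 1:
        raise ValueError(f"Datasets disagree on row count: {lengths}")
    return unique.pop()

# Stand-ins for h5py datasets (slicing/shape semantics are the same):
datasets = {
    "/images": np.zeros((100, 32, 32)),  # 100 rows of 32x32 arrays
    "/labels": np.zeros((100,)),         # 100 scalar rows
}
print(infer_num_rows(datasets))  # → 100
```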
@@ -166,6 +166,7 @@
    "aiohttp",
    "elasticsearch>=7.17.12,<8.0.0",  # 8.0 asks users to provide hosts or cloud_id when instantiating ElasticSearch(); 7.9.1 has legacy numpy.float_ which was fixed in https://github.com/elastic/elasticsearch-py/pull/2551.
    "faiss-cpu>=1.8.0.post1",  # Pins numpy < 2
    "h5py",  # FIXME: probably needs a lower bound
"h5py", # FIXME: probably needs a lower bound | |
"h5py>=2.3", |
Probably the most recent feature we need is vlen/complex/compound support, which according to the docs was added in 2.3. The current version as of writing is 3.14.
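For context on the complex/compound handling mentioned above, here is a rough sketch of how such columns could be split into several plain columns. The helper names and the dotted naming scheme are hypothetical, not taken from the PR:

```python
import numpy as np

def split_complex(name, arr):
    # A complex column becomes two float columns (naming scheme is hypothetical).
    return {f"{name}.real": arr.real, f"{name}.imag": arr.imag}

def split_compound(name, arr):
    # A compound (structured) column becomes one column per field.
    return {f"{name}.{field}": arr[field] for field in arr.dtype.names}

z = np.array([1 + 2j, 3 + 4j])
rec = np.array([(1, 2.5), (3, 4.5)], dtype=[("a", "i4"), ("b", "f8")])
cols = {**split_complex("z", z), **split_compound("rec", rec)}
print(sorted(cols))  # → ['rec.a', 'rec.b', 'z.imag', 'z.real']
```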
has_zero_dims = any(_has_zero_dimensions(feature) for feature in relevant_features.values())
# FIXME: pyarrow.lib.ArrowInvalid: list_size needs to be a strict positive integer
if not has_zero_dims:
    pa_table = table_cast(pa_table, self.info.features.arrow_schema)
This is not ideal -- `table_cast` does not support zero-dim arrays, so I skip calling it if there are any zero-dim arrays in the table. Probably `table_cast` should just be updated to support zero-dim arrays.
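The guard described here can be illustrated with a minimal sketch. Everything in it is a stand-in: `has_zero_dims`, the `F` feature class, and its `.shape` attribute are hypothetical, showing only the control flow of "skip the cast when any feature has a 0-length dimension".

```python
def has_zero_dims(features):
    # Hypothetical check: any feature whose shape contains a 0-length dim
    # would make the fixed-size-list cast fail, so the cast is skipped.
    return any(0 in getattr(f, "shape", ()) for f in features.values())

class F:
    """Stand-in for an array feature with a shape."""
    def __init__(self, shape):
        self.shape = shape

feats = {"a": F((3, 0)), "b": F((2, 2))}
print(has_zero_dims(feats))  # → True, so the cast would be skipped
```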
end = min(start + effective_batch, num_rows)
batch_dict = {}
for path, dset in dataset_map.items():
    arr = dset[start:end]
This is where the actual read from disk happens.
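The batched read loop can be sketched in isolation. Plain NumPy arrays stand in for h5py datasets below (slicing semantics are the same); with real h5py datasets, the `dset[start:end]` slice is the point where data is actually read from disk.

```python
import numpy as np

# Stand-ins for h5py datasets; names are illustrative only.
dataset_map = {"x": np.arange(10), "y": np.arange(10) * 2}
num_rows, effective_batch = 10, 4

batches = []
for start in range(0, num_rows, effective_batch):
    end = min(start + effective_batch, num_rows)
    # With h5py, this slice triggers the on-disk read for this batch.
    batch_dict = {path: dset[start:end] for path, dset in dataset_map.items()}
    batches.append(batch_dict)

print([len(b["x"]) for b in batches])  # → [4, 4, 2]
```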
This PR adds support for tabular HDF5 files by converting their rows to Arrow tables. It supports columns of the usual dtypes, including arrays of up to 5 dimensions, and handles complex/compound types by splitting them into several columns. All datasets within the HDF5 file should have rows on the first dimension (groups/subgroups are still allowed). Closes #3113.
Replaces #7625, which only supported a relatively small subset of HDF5.