Add hdf5 response format #1292

JPBergsma · 2022-07-29T15:48:52Z

With this PR, I want to add the hdf5 format as a response format.
This format stores numbers in binary form, which gives a smaller, and therefore faster, response for datasets with a lot of numerical data. It does add some extra overhead, so for structure files with just a few atoms the response is larger than a JSON response. I have therefore made support for this response format optional: a new config parameter defines which response formats are enabled.
For the future trajectory endpoint, this will be very useful.
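
The size trade-off described above can be illustrated with a minimal sketch that uses only the Python standard library. It does not use HDF5 or h5py; it only compares JSON text with raw 8-byte binary floats. The `FIXED_OVERHEAD` constant is a made-up stand-in for container metadata, not a measured HDF5 figure, and the list sizes are arbitrary.

```python
import json
import random
from array import array

# Hypothetical fixed cost for container metadata; NOT a measured HDF5 value.
FIXED_OVERHEAD = 2048


def json_size(values):
    """Bytes needed to send the values as JSON text."""
    return len(json.dumps(values).encode("utf-8"))


def binary_size(values):
    """Bytes needed for raw 64-bit floats plus the assumed fixed overhead."""
    return len(array("d", values).tobytes()) + FIXED_OVERHEAD


random.seed(0)
small = [random.random() for _ in range(30)]       # e.g. a structure with a few atoms
large = [random.random() for _ in range(300_000)]  # e.g. a long trajectory

for name, data in (("small", small), ("large", large)):
    print(name, "json:", json_size(data), "binary:", binary_size(data))
```

Under these assumptions the binary form is only smaller once the numerical payload is large enough to outweigh the fixed overhead, which is the reason for making the format optional.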

I prefer to not add it to the trajectory PR as I think that PR is becoming quite large already.

Some things I was still wondering about:

Should I make installing the dependencies of this PR optional?
I have added docstrings, but I am not sure how I can check that they are rendered properly in the documentation on the site.

I look forward to hearing your feedback.

closes #1285

codecov · 2022-07-29T17:57:04Z

Codecov Report

Merging #1292 (9597cca) into master (0decac5) will decrease coverage by 0.47%.
The diff coverage is 90.85%.

@@            Coverage Diff             @@
##           master    #1292      +/-   ##
==========================================
- Coverage   91.45%   90.98%   -0.48%     
==========================================
  Files          72       73       +1     
  Lines        4366     4525     +159     
==========================================
+ Hits         3993     4117     +124     
- Misses        373      408      +35

Flag	Coverage Δ
project	`90.98% <90.85%> (-0.48%)`	⬇️
validator	`91.13% <90.85%> (+0.27%)`	⬆️

Impacted Files	Coverage Δ
...made/server/entry_collections/entry_collections.py	`98.48% <ø> (+0.75%)`	⬆️
optimade/server/routers/utils.py	`95.90% <87.50%> (-1.35%)`	⬇️
optimade/adapters/hdf5.py	`89.76% <89.76%> (ø)`
optimade/models/jsonapi.py	`93.75% <100.00%> (+0.25%)`	⬆️
optimade/server/config.py	`93.93% <100.00%> (+0.39%)`	⬆️
optimade/server/middleware.py	`94.96% <100.00%> (+0.09%)`	⬆️
optimade/server/routers/info.py	`96.42% <100.00%> (+0.77%)`	⬆️
optimade/adapters/structures/cif.py	`84.44% <0.00%> (-15.56%)`	⬇️
optimade/adapters/structures/proteindatabank.py	`89.41% <0.00%> (-10.59%)`	⬇️
optimade/adapters/structures/utils.py	`80.00% <0.00%> (-3.85%)`	⬇️
... and 1 more

ml-evs

Hi @JPBergsma, just had a quick look at the bits that affect the existing server (haven't tested or looked at the actual HDF5 stuff). I can't really speak for how useful or effective this approach is, perhaps it would be good to survey other members of the consortium at the next meeting to see who would be able to use this?

Do you still think adding trajectories (and HDF5 support) directly to optimade-python-tools is the right approach? To me it feels like you have now developed a lot of useful extra functionality, but with a separate audience, so it might be better served as its own separate package? This would give you more freedom and allow you to have your own reference server etc. Of course we could still develop it in effectively the same way (and much of the CI/docs config etc. could just be copied across to the new repo). Let me know what you think, I could help you set it up, of course (and we can delay the decision until you are happy with the functionality as it is).

I'm afraid I'm not sure when I will get the time to take a closer look at this PR as I am not currently under contract, so you may want to try to pull others in too.

ml-evs · 2022-08-04T11:09:47Z

setup.py

+        "numpy~=1.23",
+        "h5py~=3.7",


Should these both still be optional?

JPBergsma · 2022-08-04

I am not entirely sure what you mean by this remark.

Did I not edit the setup file correctly so that the NumPy and hdf5 dependencies are not installed by default, or do you want me to change the code so that it won't give an error when these dependencies are not present?

ml-evs · 2022-08-04

This comment picks out the lines where numpy and h5py are listed under install_requires, which means they are always installed. You have already added them above as an extra named hdf5, so these bits are unnecessary.

JPBergsma · 2022-08-04

I tried to remove them from "install_requires", but in that case the setup of the docker image fails.
So I have now removed the hdf5_deps instead.

ml-evs · 2022-08-04

The docker image is failing because you are using numpy outside of the hdf5 additions (see #1292 (comment)).

JPBergsma · 2022-08-10

I have now added NumPy to "install_requires", so the setup of the docker image is successful.

ml-evs · 2022-08-04T11:11:42Z

optimade/server/routers/utils.py

@@ -277,6 +278,14 @@ def get_entries(
        ),
        included=included,
    )
+    if params.response_format == "json":
+        return response_object
+    elif params.response_format == "hdf5":


This needs to check whether hdf5 is also enabled in CONFIG.enabled_response_formats, right?

ml-evs · 2022-08-04

(I now see that this is done in handle_query_params, but perhaps another guard is needed here so that implementations can pick and choose which bits of the reference server they use.)

JPBergsma · 2022-08-10

I have added an extra check.
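
A guard of the kind discussed in this thread could look roughly like the sketch below. It is not the code from the PR: `render`, `ResponseFormatNotEnabled` and `SERIALIZERS` are hypothetical names, and the "hdf5" serializer is a placeholder that does not produce real HDF5 output.

```python
import json
from typing import Any, Callable, Dict, Iterable


class ResponseFormatNotEnabled(ValueError):
    """Hypothetical error type; a real server would map this to an HTTP error."""


def render(
    response_object: Dict[str, Any],
    response_format: str,
    enabled_formats: Iterable[str],
    serializers: Dict[str, Callable[[Dict[str, Any]], bytes]],
) -> bytes:
    """Serialize the response only if the requested format is enabled."""
    enabled = set(enabled_formats)
    if response_format not in enabled:
        raise ResponseFormatNotEnabled(
            f"Response format {response_format!r} is not enabled; "
            f"enabled formats: {sorted(enabled)}"
        )
    return serializers[response_format](response_object)


# Stand-in serializers; the "hdf5" one is a placeholder, not real HDF5 output.
SERIALIZERS = {
    "json": lambda obj: json.dumps(obj).encode("utf-8"),
    "hdf5": lambda obj: b"<binary payload placeholder>",
}
```

Checking at the point of serialization, in addition to the query-parameter check, means the guard still applies when only this part of the reference server is reused.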

ml-evs · 2022-08-04T11:14:03Z

optimade/server/middleware.py

+        if response.raw_headers[1][1] == b"application/vnd.api+json":
+            body = body.decode(charset)


Is this always guaranteed to be at [1][1]? Probably better to check via the header keys.

JPBergsma · 2022-08-04

Good point, I have changed the code so that it now loops over all entries in the header.
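
The header lookup described here can be sketched as follows, assuming the headers are available as a list of `(name, value)` byte pairs, as in the `raw_headers` attribute used in the diff above. The helper names `get_header` and `is_json_api` are illustrative and are not taken from the PR.

```python
from typing import Iterable, Optional, Tuple

JSON_API_MEDIA_TYPE = b"application/vnd.api+json"


def get_header(raw_headers: Iterable[Tuple[bytes, bytes]], name: str) -> Optional[bytes]:
    """Return the value of the first header matching `name`, case-insensitively."""
    wanted = name.lower().encode("latin-1")
    for key, value in raw_headers:
        if key.lower() == wanted:
            return value
    return None


def is_json_api(raw_headers: Iterable[Tuple[bytes, bytes]]) -> bool:
    """Check the content type by header name rather than by list position."""
    content_type = get_header(raw_headers, "content-type")
    if content_type is None:
        return False
    # Ignore optional parameters such as "; charset=utf-8".
    media_type = content_type.split(b";", 1)[0].strip().lower()
    return media_type == JSON_API_MEDIA_TYPE
```

Looking the header up by name keeps the check correct regardless of the order in which the headers were added.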

ml-evs · 2022-08-04T11:16:15Z

optimade/server/routers/info.py

-    output_fields_by_format = {"json": list(properties.keys())}
+    output_fields_by_format = {}
+    for outputformat in CONFIG.enabled_response_formats:
+        output_fields_by_format[outputformat] = list(properties.keys())


Suggested change

-        output_fields_by_format[outputformat] = list(properties.keys())
+        output_fields_by_format[outputformat] = list(properties)

.keys() is unnecessary if you just want a list of all keys (I see we use it above too, could be removed)

JPBergsma · 2022-08-10

I have removed the unnecessary .keys() from this file.
It would probably be good to do a regex search for the "list(*.keys())" pattern, so we can remove these in all our code.
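
As a rough illustration of both points, the sketch below shows that `list(d)` and `list(d.keys())` give the same result for a dict, and one possible regular expression for finding the pattern. The regex and the `simplify` helper are assumptions made for this example; they only handle simple single-line expressions without nested parentheses.

```python
import re

properties = {"id": {}, "type": {}, "elements": {}}

# Iterating over a dict yields its keys, so both forms build the same list.
assert list(properties) == list(properties.keys())

# Matches e.g. list(foo.keys()) or list(self.bar.keys()); expressions that
# themselves contain parentheses are deliberately not matched.
PATTERN = re.compile(r"list\((?P<expr>[^()]+)\.keys\(\)\)")


def simplify(line: str) -> str:
    """Rewrite list(<expr>.keys()) to list(<expr>) within one line of source."""
    return PATTERN.sub(r"list(\g<expr>)", line)


print(simplify("fields = list(properties.keys())"))
```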

ml-evs · 2022-08-04T11:17:01Z

optimade/server/config.py

@@ -280,6 +280,10 @@ class ServerConfig(BaseSettings):
        True,
        description="If True, the server will check whether the query parameters given in the request are correct.",
    )
+    enabled_response_formats: Optional[List[str]] = Field(


Should make an enum of supported formats, then do Optional[List[SupportedFormats]] like some of the other options

JPBergsma · 2022-08-05

OK, I am trying to do this, but it does make things more complicated because I now have to convert the enums to strings before I can do the comparisons in my code. It would be easier to use a Literal["json", "hdf5"] instead.

JPBergsma · 2022-08-10

I am now using an Enum class to restrict which values can be specified for enabled_response_formats.
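
One way to restrict the allowed values while keeping plain string comparisons is an enum that also subclasses `str`. The sketch below uses only the standard library; the class name `SupportedResponseFormats` and the helper `parse_enabled_formats` are hypothetical and may differ from what the PR ended up using.

```python
from enum import Enum
from typing import List


class SupportedResponseFormats(str, Enum):
    """Hypothetical enum of response formats; members are also str instances."""

    JSON = "json"
    HDF5 = "hdf5"


def parse_enabled_formats(raw: List[str]) -> List[SupportedResponseFormats]:
    """Convert configured strings to enum members; unknown values raise ValueError."""
    return [SupportedResponseFormats(item) for item in raw]


enabled = parse_enabled_formats(["json", "hdf5"])

# Because of the str mixin, members compare equal to plain strings,
# so no explicit conversion is needed before comparing.
print("hdf5" in enabled)
print(SupportedResponseFormats.HDF5 == "hdf5")
```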

ml-evs · 2022-08-04T11:19:49Z

optimade/models/jsonapi.py

+            numpy.int32: lambda v: int(v),
+            numpy.float32: lambda v: float(v),
+            numpy.int64: lambda v: int(v),
+            numpy.float64: lambda v: float(v),
+            numpy.bool_: lambda v: bool(v),
+            numpy.ndarray: lambda v: v.tolist(),


This seems to introduce a mandatory dependency on numpy. I would suggest that the HDF5Response is in a separate module and inherits from the JSON:API one. In the best case, it will just contain this additional config, but it may also make it easier to modify where necessary.

JPBergsma · 2022-08-05

This is not directly related to the hdf5 format, so it would be strange to place it in an HDF5Response.
I want to be able to handle NumPy numbers internally, so the format of the numbers does not need to change when they are read from a file.

I can make it so that these encoders are only loaded when NumPy is present. However, I am not sure how we should indicate optional dependencies in setup.py or requirements.txt.
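
Loading the encoders only when NumPy is installed could be sketched as below. This is an assumption-laden illustration, not the PR's code: it plugs the encoders into the standard-library `json.dumps` via a `default` function instead of the model configuration shown in the diff, and `build_numpy_encoders` and `default` are hypothetical names.

```python
import json

try:
    import numpy
except ImportError:  # NumPy is treated as an optional dependency in this sketch
    numpy = None


def build_numpy_encoders():
    """Return a type -> converter mapping, empty when NumPy is not installed."""
    if numpy is None:
        return {}
    return {
        numpy.int32: int,
        numpy.float32: float,
        numpy.int64: int,
        numpy.float64: float,
        numpy.bool_: bool,
        numpy.ndarray: lambda v: v.tolist(),
    }


NUMPY_ENCODERS = build_numpy_encoders()


def default(obj):
    """Fallback for json.dumps: convert NumPy values to built-in Python types."""
    for numpy_type, convert in NUMPY_ENCODERS.items():
        if isinstance(obj, numpy_type):
            return convert(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")
```

With this pattern the module imports cleanly without NumPy, and a call such as `json.dumps(payload, default=default)` only applies the extra conversions when NumPy is available.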

JPBergsma · 2022-08-04T17:30:23Z

> perhaps it would be good to survey other members of the consortium at the next meeting to see who would be able to use this?

We can discuss it at the next OPTIMADE meeting in September.

> Do you still think adding trajectories (and HDF5 support) directly to optimade-python-tools is the right approach?

I could make a separate version for the trajectory stuff. But I am wondering whether this is worth it. The question is: does the extra code for the trajectory endpoint slow things down? I do not think this is the case, although I have not checked it.

Having two different python tools may also be confusing.

There may also be cases where a server has both trajectory and structure data. A structure could for example be referred to by a trajectory, as it could be the input or output of a simulation. So I think I would want to keep all the functionality of the present optimade-python-tools.

I am planning to make a separate fork for the database of the IRB, where all the code that is specific for that database will be stored.

ml-evs · 2022-08-04T18:13:52Z

> I could make a separate version for the trajectory stuff. But I am wondering whether this is worth it. The question is: does the extra code for the trajectory endpoint slow things down? I do not think this is the case, although I have not checked it.

As long as it is not enabled by default/can be disabled then there will be no speed penalty, but it may overcomplicate adoption for the majority of databases that will not serve trajectories. This would not be a separate fork of optimade-python-tools, but an extension that contains the relevant bits.

> Having two different python tools may also be confusing.
> There may also be cases where a server has both trajectory and structure data. A structure could for example be referred to by a trajectory, as it could be the input or output of a simulation. So I think I would want to keep all the functionality of the present optimade-python-tools.

Sure, I'm imagining the hypothetical optimade-python-trajectories would depend on optimade-python-tools and would just contain trajectory-specific stuff and a copy of the reference server that enables the trajectories endpoint. I don't think it is useful to add trajectories to the default reference server given how specific they are.

JPBergsma · 2022-08-10T18:06:08Z

@ml-evs I have processed your remarks.

Some checks still have to be made to check whether a feature is enabled. So it would take a bit of time, but not much.

So far I have not implemented the trajectories with the idea of turning them into a separate package, so the code is still quite interwoven with the current optimade-python-tools. I also do not yet have a clear picture of what I would have to do to make it into a separate package. The Trajectories #1065 and hdf5 #1292 PRs also make changes at quite a low level, such as in the middleware and server.main.py.

It may therefore be quite some work to convert it into an extra package.

JPBergsma · 2022-09-15T15:24:00Z

The program vitables gives an error when I try to open the generated hdf5 file. So I first want to figure out why this is happening before I merge this PR.

JPBergsma · 2022-09-21T13:55:42Z

I am still not satisfied with the size of the files I get. I am now checking whether the hdf5 file becomes smaller when I use attributes instead of datasets to store the smaller optimade fields.

JPBergsma added 11 commits July 28, 2022 12:48

Added support for returning optimade data in the hdf5 format.

e2d8010

Added extra doctstrings to hdf5.py and made setting for enabling/disa…

079bd71

…bling hdf5 response format.

Added dependancies for hdf5 response to requirements.txt and setup.py.

0b71e9e

Added enabled_response_formats to test config and disabled hdf5 tests…

9167351

… incase hdf5 is not in the enabled_response_formats.

Added enabled_response_formats to test config and disabled hdf5 tests…

7551132

… incase hdf5 is not in the enabled_response_formats.

Merge branch 'master' into JPBergsma/add_HDF5_output_format

e43297e

merges changes from master.

d811457

checking whether the not installing of numpy on github server was cau…

7952092

…sed by having two different versions in setup.py

added hdf5_deps to extras_require.

694894f

Added numpy and h5py to install_requirements in setup.py

8d51f55

Use a query that does not have an _exampl_ field to test response for…

12b79e0

…mat.

JPBergsma added 2 commits August 3, 2022 19:26

Added extra test and the supported response formats are now listed at…

9fe4dcc

… the info endpoint.

Made some changes to the docstrings and type definitions so it will h…

1981032

…opefully pass the docs test on Github.

JPBergsma marked this pull request as ready for review August 4, 2022 10:06

JPBergsma requested review from CasperWA and ml-evs as code owners August 4, 2022 10:06

ml-evs reviewed Aug 4, 2022

The test for the single entry point did not work. This is fixed now

79b48d6

Added more thorough check to see whetehr the response contnet type is…

687ea78

… json.

JPBergsma added 8 commits August 4, 2022 22:07

Remove numpy and h5py from 'install_requires'.

fbfe0f7

Revert "Remove numpy and h5py from 'install_requires'."

a55bd82

This reverts commit fbfe0f7.

Remove h5py_deps and put numpy and h5py back in install_requires.

43e326f

Processed comments from code review.

1e7e3f9

Fixed test_response_format.py

50cacf0

Added extra test values, and added support for handling nested lists …

82f2b31

…of strings.

Merge branch 'master' into JPBergsma/add_HDF5_output_format

15770f9

Added extra test to check if response_format is in the enabled_respon…

42864cb

…se_formats.

Merge branch 'JPBergsma/add_HDF5_output_format' of https://github.com…

7c6a562

…/JPBergsma/optimade-python-tools into JPBergsma/add_HDF5_output_format

JPBergsma added 2 commits August 15, 2022 15:53

Added filenames to the header.

30af05a

Changed the way the collection name is determined for the file name o…

47fa9ad

…f the hdf5 file.

JPBergsma added the on-hold label (for PRs/issues that are on-hold for an unspecified time) Sep 15, 2022

JPBergsma added 2 commits September 15, 2022 17:24

Merge branch 'master' into JPBergsma/add_HDF5_output_format

9ef6b05

Update requirements.txt

4ada284

put requirements in alphabetical order

JPBergsma removed the on-hold label Sep 18, 2022

updated version requirement numpy in requirements.txt

f1c309d

JPBergsma added the on-hold label Sep 21, 2022

JPBergsma added 2 commits September 21, 2022 18:02

Small fields are now stored as attributes rather than datasets.

b32278f

Merge branch 'master' into JPBergsma/add_HDF5_output_format

9597cca
