
Add hdf5 response format #1292

Open · JPBergsma wants to merge 31 commits into main

Conversation

@JPBergsma (Contributor) commented Jul 29, 2022

With this PR, I want to add HDF5 as a response format.
This format stores numbers in binary form, which gives a smaller response and is therefore faster for datasets with a lot of numerical data. It does add some overhead, so for structures with just a few atoms the response is larger than the equivalent JSON response. I have therefore made support for this response format optional, by adding a config parameter with which the enabled response formats can be defined.
For the future trajectory endpoint, this will be very useful.
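As a rough illustration of how a client could use the new format (not part of this PR; the base URL and the use of requests/h5py here are just assumptions for the example):

```python
import io

import h5py
import requests

# Hypothetical example: ask a server that has enabled the format for the HDF5
# representation of a structures query. The base URL is a placeholder.
response = requests.get(
    "https://example.org/optimade/v1/structures",
    params={"response_format": "hdf5", "page_limit": 5},
)
response.raise_for_status()

# h5py can read the binary payload directly from memory via a file-like object.
with h5py.File(io.BytesIO(response.content), "r") as handle:
    print(list(handle))
```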

I prefer to not add it to the trajectory PR as I think that PR is becoming quite large already.

Some things I was still wondering about:

- Should I make installing the dependencies added in this PR optional?
- I have added docstrings, but I am not sure how to check that they are rendered properly in the documentation on the site.

I look forward to hearing your feedback.

closes #1285


codecov bot commented Jul 29, 2022

Codecov Report

Merging #1292 (9597cca) into master (0decac5) will decrease coverage by 0.47%.
The diff coverage is 90.85%.

@@            Coverage Diff             @@
##           master    #1292      +/-   ##
==========================================
- Coverage   91.45%   90.98%   -0.48%     
==========================================
  Files          72       73       +1     
  Lines        4366     4525     +159     
==========================================
+ Hits         3993     4117     +124     
- Misses        373      408      +35     
Flag Coverage Δ
project 90.98% <90.85%> (-0.48%) ⬇️
validator 91.13% <90.85%> (+0.27%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...made/server/entry_collections/entry_collections.py 98.48% <ø> (+0.75%) ⬆️
optimade/server/routers/utils.py 95.90% <87.50%> (-1.35%) ⬇️
optimade/adapters/hdf5.py 89.76% <89.76%> (ø)
optimade/models/jsonapi.py 93.75% <100.00%> (+0.25%) ⬆️
optimade/server/config.py 93.93% <100.00%> (+0.39%) ⬆️
optimade/server/middleware.py 94.96% <100.00%> (+0.09%) ⬆️
optimade/server/routers/info.py 96.42% <100.00%> (+0.77%) ⬆️
optimade/adapters/structures/cif.py 84.44% <0.00%> (-15.56%) ⬇️
optimade/adapters/structures/proteindatabank.py 89.41% <0.00%> (-10.59%) ⬇️
optimade/adapters/structures/utils.py 80.00% <0.00%> (-3.85%) ⬇️
... and 1 more


@JPBergsma marked this pull request as ready for review on August 4, 2022, 10:06
@ml-evs (Member) left a comment

Hi @JPBergsma, I've just had a quick look at the bits that affect the existing server (I haven't tested or looked at the actual HDF5 stuff). I can't really speak for how useful or effective this approach is; perhaps it would be good to survey other members of the consortium at the next meeting to see who would be able to use this?

Do you still think adding trajectories (and HDF5 support) directly to optimade-python-tools is the right approach? To me it feels like you have now developed a lot of useful extra functionality, but with a separate audience, so it might be better served as its own separate package? This would give you more freedom and allow you to have your own reference server etc. Of course we could still develop it in effectively the same way (and much of the CI/docs config etc. could just be copied across to the new repo). Let me know what you think, I could help you set it up, of course (and we can delay the decision until you are happy with the functionality as it is).

I'm afraid I'm not sure when I will get the time to take a closer look at this PR as I am not currently under contract, so you may want to try to pull others in too.

setup.py (Outdated), comment on lines 114 to 115:

    "numpy~=1.23",
    "h5py~=3.7",

@ml-evs (Member):
Should these both still be optional?

@JPBergsma (Contributor, author):

I am not entirely sure what you mean by this remark.

Did I not edit the setup file correctly, so that the NumPy and hdf5 dependencies are not installed by default, or do you want me to change the code so that it won't give an error when these dependencies are not present?

@ml-evs (Member):

This comment picks out the lines where numpy and h5py are listed under install_requires, which means they are always installed. You have already added them above as an extra named hdf5 so these bits are unnecessary.

@JPBergsma (Contributor, author):

I tried to remove them from "install_requires", but then the setup of the Docker image fails.
So I have now removed hdf5_deps instead.

@ml-evs (Member):

The docker image is failing because you are using numpy outside of the hdf5 additions (see #1292 (comment)).

@JPBergsma (Contributor, author):

I have now added NumPy to "install_requires", so the setup of the Docker image succeeds.
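For reference, a minimal sketch of the dependency layout this exchange converges on (not the actual setup.py of the repository): NumPy in install_requires, h5py behind the optional hdf5 extra.

```python
from setuptools import find_packages, setup

setup(
    name="optimade",
    packages=find_packages(),
    install_requires=[
        # NumPy is used outside the HDF5-specific code, so it is a core dependency.
        "numpy~=1.23",
        # ... other core dependencies ...
    ],
    extras_require={
        # h5py is only needed when the HDF5 response format is enabled.
        "hdf5": ["h5py~=3.7"],
        # ... other extras ...
    },
)
```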

@@ -277,6 +278,14 @@ def get_entries(
),
included=included,
)
if params.response_format == "json":
return response_object
elif params.response_format == "hdf5":
@ml-evs (Member):

We need to check whether hdf5 is also enabled in CONFIG.enabled_response_formats, right?

@ml-evs (Member):

(I now see that this is done in handle_query_params, but perhaps another guard is needed here so that implementations can pick and choose which bits of the reference server they use.)

@JPBergsma (Contributor, author):

I have added an extra check.
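For illustration, the extra guard could look roughly like the sketch below; the BadRequest exception and the exact placement are assumptions, not the literal code of this PR.

```python
from optimade.server.config import CONFIG
from optimade.server.exceptions import BadRequest  # assumed 400-style exception


def serialize_response(params, response_object):
    """Sketch: dispatch on the requested response format, guarded by the config.

    If CONFIG.enabled_response_formats stores enum members rather than plain
    strings, they may first need to be converted to their string values.
    """
    if params.response_format not in CONFIG.enabled_response_formats:
        raise BadRequest(
            detail=f"Response format '{params.response_format}' is not enabled on this server."
        )
    if params.response_format == "json":
        return response_object
    if params.response_format == "hdf5":
        # Convert response_object to HDF5 bytes and wrap it in a binary response.
        ...
```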

Comment on lines 450 to 451:

    if response.raw_headers[1][1] == b"application/vnd.api+json":
        body = body.decode(charset)
@ml-evs (Member):

Is this always guaranteed to be at [1][1]? Probably better to check via the header keys.

@JPBergsma (Contributor, author):

Good point; I have changed the code so that it now loops over all entries in the header.
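Something along these lines, i.e. looking the header up by key instead of by position (a sketch, not the exact code in the PR):

```python
def content_type_is_json_api(response) -> bool:
    """Sketch: check Content-Type by key instead of assuming it sits at raw_headers[1]."""
    for key, value in response.raw_headers:
        if key.lower() == b"content-type":
            return value.startswith(b"application/vnd.api+json")
    return False
```

The body would then only be decoded when this check returns True.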

output_fields_by_format = {"json": list(properties.keys())}
output_fields_by_format = {}
for outputformat in CONFIG.enabled_response_formats:
output_fields_by_format[outputformat] = list(properties.keys())
@ml-evs (Member):

Suggested change:
-    output_fields_by_format[outputformat] = list(properties.keys())
+    output_fields_by_format[outputformat] = list(properties)

.keys() is unnecessary if you just want a list of all keys (I see we use it above too, could be removed)

@JPBergsma (Contributor, author):

I have removed the unnecessary .keys() calls from this file.
It would probably be good to do a regex search for the "list(*.keys()" pattern so we can remove these throughout the codebase.
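A throwaway sketch of such a search (not part of the PR; the exact regex is just an approximation):

```python
import re
from pathlib import Path

# Flag redundant list(<expr>.keys()) calls anywhere in the package.
PATTERN = re.compile(r"list\([^()]*\.keys\(\)\s*\)")

for path in Path("optimade").rglob("*.py"):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if PATTERN.search(line):
            print(f"{path}:{lineno}: {line.strip()}")
```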

@@ -280,6 +280,10 @@ class ServerConfig(BaseSettings):
True,
description="If True, the server will check whether the query parameters given in the request are correct.",
)
enabled_response_formats: Optional[List[str]] = Field(
@ml-evs (Member):

We should make an enum of supported formats and then use Optional[List[SupportedFormats]], like some of the other options.

@JPBergsma (Contributor, author):

OK, I am trying to do this, but it makes things more complicated, because I now have to convert the enums to strings before I can do the comparisons in my code. It would be easier to use a Literal["json", "hdf5"] instead.

@JPBergsma (Contributor, author):

I am now using an Enum class to restrict which values can be specified for enabled_response_formats.
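For illustration, the config field could now look something like the sketch below (class and member names are illustrative, not necessarily those used in the PR):

```python
from enum import Enum
from typing import List, Optional

from pydantic import BaseSettings, Field


class SupportedResponseFormats(str, Enum):
    """Response formats the reference server knows how to produce."""

    JSON = "json"
    HDF5 = "hdf5"


class ServerConfig(BaseSettings):
    enabled_response_formats: Optional[List[SupportedResponseFormats]] = Field(
        [SupportedResponseFormats.JSON],
        description="A list of the response formats that are enabled on this server.",
    )
```

Inheriting from str makes each member compare equal to its string value, so checks like `"hdf5" in CONFIG.enabled_response_formats` would keep working without an explicit conversion.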

Comment on lines 369 to 374:

    numpy.int32: lambda v: int(v),
    numpy.float32: lambda v: float(v),
    numpy.int64: lambda v: int(v),
    numpy.float64: lambda v: float(v),
    numpy.bool_: lambda v: bool(v),
    numpy.ndarray: lambda v: v.tolist(),
@ml-evs (Member):

This seems to introduce a mandatory dependency on numpy. I would suggest that the HDF5Response live in a separate module and inherit from the JSON:API one. In the best case it will just contain this additional config, but it may also make it easier to modify things where necessary.

@JPBergsma (Contributor, author):

This is not directly related to the HDF5 format, so it would be strange to place it in an HDF5Response.
I want to be able to handle NumPy numbers internally, so the format of the numbers does not need to change when they are read from a file.

I can make it so that these encoders are only loaded when NumPy is present. However, I am not sure how we should indicate optional dependencies in setup.py or requirements.txt.
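One way to do that, as a sketch (not the code in this PR): build the encoder mapping only when NumPy can be imported, so the core models keep working without it.

```python
try:
    import numpy
except ImportError:  # NumPy is treated as an optional dependency here
    numpy = None

numpy_json_encoders = {}
if numpy is not None:
    numpy_json_encoders.update(
        {
            numpy.int32: int,
            numpy.int64: int,
            numpy.float32: float,
            numpy.float64: float,
            numpy.bool_: bool,
            numpy.ndarray: lambda v: v.tolist(),
        }
    )
# numpy_json_encoders can then be merged into the model's json_encoders config.
```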

@JPBergsma (Contributor, author):

> perhaps it would be good to survey other members of the consortium at the next meeting to see who would be able to use this?

We can discuss it at the next OPTIMADE meeting in September.

> Do you still think adding trajectories (and HDF5 support) directly to optimade-python-tools is the right approach?

I could make a separate version for the trajectory stuff. But I am wondering whether this is worth it. The question is: does the extra code for the trajectory endpoint slow things down? I do not think this is the case, although I have not checked it.

Having two different python tools may also be confusing.

There may also be cases where a server has both trajectory and structure data. A structure could for example be referred to by a trajectory, as it could be the input or output of a simulation. So I think I would want to keep all the functionality of the present optimade-python-tools.

I am planning to make a separate fork for the database of the IRB, where all the code that is specific to that database will be stored.

@ml-evs (Member) commented Aug 4, 2022

> I could make a separate version for the trajectory stuff. But I am wondering whether this is worth it. The question is: does the extra code for the trajectory endpoint slow things down? I do not think this is the case, although I have not checked it.

As long as it is not enabled by default (or can be disabled), there will be no speed penalty, but it may overcomplicate adoption for the majority of databases that will not serve trajectories. This would not be a separate fork of optimade-python-tools, but an extension that contains the relevant bits.

> Having two different python tools may also be confusing.
> There may also be cases where a server has both trajectory and structure data. A structure could for example be referred to by a trajectory, as it could be the input or output of a simulation. So I think I would want to keep all the functionality of the present optimade-python-tools.

Sure, I'm imagining the hypothetical optimade-python-trajectories would depend on optimade-python-tools and would just contain trajectory-specific stuff and a copy of the reference server that enables the trajectories endpoint. I don't think it is useful to add trajectories to the default reference server given how specific they are.

@JPBergsma (Contributor, author) commented Aug 10, 2022

@ml-evs I have addressed your remarks.

A few more checks would still need to be added to test whether a feature is enabled, so that would take a bit of time, but not much.

So far I have not implemented the trajectories with the idea of turning them into a separate package, so the code is still quite interwoven with the current optimade-python-tools. I also do not yet have a clear picture of what I would have to do to make it into a separate package. The Trajectories #1065 and hdf5 #1292 PRs also make changes at quite a low level, such as in the middleware and server.main.py.

It may therefore be quite some work to convert it into an extra package.

@JPBergsma added the on-hold label ("For PRs/issues that are on-hold for an unspecified time") on Sep 15, 2022
@JPBergsma (Contributor, author):

The program ViTables gives an error when I try to open the generated HDF5 file, so I first want to figure out why this is happening before merging this PR.

@JPBergsma removed the on-hold label on Sep 18, 2022
@JPBergsma added the on-hold label again on Sep 21, 2022
@JPBergsma (Contributor, author):

I am still not satisfied with the size of the files I get. I am now checking whether the HDF5 file becomes smaller when I use attributes instead of datasets to store the smaller OPTIMADE fields.
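For comparison, a small h5py sketch of the two storage options (purely illustrative field names and values):

```python
import h5py

with h5py.File("example.hdf5", "w") as handle:
    entry = handle.create_group("structures/example_id")

    # Option 1: one dataset per field, as done so far; each dataset carries its
    # own HDF5 object overhead, which dominates for small scalar fields.
    entry.create_dataset("nelements", data=2)

    # Option 2: store small scalar fields as attributes on the group; attributes
    # live in the object header and avoid the per-dataset overhead.
    entry.attrs["nsites"] = 8
    entry.attrs["chemical_formula_reduced"] = "ClNa"
```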

Labels: on-hold ("For PRs/issues that are on-hold for an unspecified time")
Projects: None yet
Development: Successfully merging this pull request may close these issues: "Adding option to return binary data"
2 participants