-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hdf5 response format #1292
base: main
Are you sure you want to change the base?
Add hdf5 response format #1292
Conversation
…bling hdf5 response format.
… incase hdf5 is not in the enabled_response_formats.
… incase hdf5 is not in the enabled_response_formats.
…sed by having two different versions in setup.py
Codecov Report
@@ Coverage Diff @@
## master #1292 +/- ##
==========================================
- Coverage 91.45% 90.98% -0.48%
==========================================
Files 72 73 +1
Lines 4366 4525 +159
==========================================
+ Hits 3993 4117 +124
- Misses 373 408 +35
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
… the info endpoint.
…opefully pass the docs test on Github.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @JPBergsma, just had a quick look at the bits that affect the existing server (haven't tested or looked at the actual HDF5 stuff). I can't really speak for how useful or effective this approach is, perhaps it would be good to survey other members of the consortium at the next meeting to see who would be able to use this?
Do you still think adding trajectories (and HDF5 support) directly to optimade-python-tools is the right approach? To me it feels like you have now developed a lot of useful extra functionality, but with a separate audience, so it might be better served as its own separate package? This would give you more freedom and allow you to have your own reference server etc. Of course we could still develop it in effectively the same way (and much of the CI/docs config etc. could just be copied across to the new repo). Let me know what you think, I could help you set it up, of course (and we can delay the decision until you are happy with the functionality as it is).
I'm afraid I'm not sure when I will get the time to take a closer look at this PR a I am not currently under contract, so you may want to try to pull others in too.
setup.py
Outdated
"numpy~=1.23", | ||
"h5py~=3.7", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these both still be optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not entirely sure what you mean with this remark.
Did I not edit the setup file correctly so that the NumPy and hdf5 dependencies are not installed by default, or do you want we to change the code so that it won't give an error when these dependencies are not present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment picks out the lines where numpy and h5py are listed under install_requires
, which means they are always installed. You have already added them above as an extra named hdf5
so these bits are unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to remove them from "install_requires" but in that case the setup of the docker-image fails.
So I now removed the hdf5_deps instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docker image is failing because you are using numpy
outside of the hdf5 additions (see #1292 (comment)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now added NumPy, to "install_requires", so the setup of the docker image is successful.
optimade/server/routers/utils.py
Outdated
@@ -277,6 +278,14 @@ def get_entries( | |||
), | |||
included=included, | |||
) | |||
if params.response_format == "json": | |||
return response_object | |||
elif params.response_format == "hdf5": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to check whether hdf5 is also enabled in the CONFIG.enabled_response_formats
too right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I now see that this is done in the handle_query_params
, but perhaps another guard is needed here so that implementations can pick and choose which bit of the reference server they use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added an extra check.
optimade/server/middleware.py
Outdated
if response.raw_headers[1][1] == b"application/vnd.api+json": | ||
body = body.decode(charset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this always guaranteed to be at [1][1]
? Probably better to check via the header keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I have changed the code, so it now loops over all entries in the header.
optimade/server/routers/info.py
Outdated
output_fields_by_format = {"json": list(properties.keys())} | ||
output_fields_by_format = {} | ||
for outputformat in CONFIG.enabled_response_formats: | ||
output_fields_by_format[outputformat] = list(properties.keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output_fields_by_format[outputformat] = list(properties.keys()) | |
output_fields_by_format[outputformat] = list(properties) |
.keys()
is unnecessary if you just want a list of all keys (I see we use it above too, could be removed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed the unnecessary .keys() from this file.
It would probably be good to do a regex search for the "list(*.keys()" pattern, so we can remove these in all our code.
optimade/server/config.py
Outdated
@@ -280,6 +280,10 @@ class ServerConfig(BaseSettings): | |||
True, | |||
description="If True, the server will check whether the query parameters given in the request are correct.", | |||
) | |||
enabled_response_formats: Optional[List[str]] = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should make an enum of supported formats, then do Optional[List[SupportedFormats]]
like some of the other options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I am trying to do this, but it does make things more complicated because I now have to convert the enums to a string before I can do the comparisons in my code. It would be easier to use a Literal["json", "hdf5"] instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am now using an ENUM class to restrict which values can be specified for enabled_response_formats.
optimade/models/jsonapi.py
Outdated
numpy.int32: lambda v: int(v), | ||
numpy.float32: lambda v: float(v), | ||
numpy.int64: lambda v: int(v), | ||
numpy.float64: lambda v: float(v), | ||
numpy.bool_: lambda v: bool(v), | ||
numpy.ndarray: lambda v: v.tolist(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to introduce a mandatory dependency on numpy. I would suggest that the HDF5Response is in a separate module and inherits from the JSON:API one. In the best case, it will just contain this additional config, but it may also make it easier to modify where necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not directly related to the hdf5 format, so it would be strange to place it in a HDF5Response.
I want to be able to handle NumPy numbers internally, so the format of the numbers does not need to change when they are read from a file.
I can make it so, that these encoders are only loaded when NumPy is present. However, I am not sure how we should indicate optional dependencies in setup.py or requirements.txt.
We can discuss it at the next OPTIMADE meeting in September.
I could make a separate version for the trajectory stuff. But I am wondering whether this is worth it. The question is: does the extra code for the trajectory endpoint slow things down? I do not think this is the case, although I have not checked it. Having two different python tools may also be confusing. There may also be cases where a server has both trajectory and structure data. A structure could for example be referred to by a trajectory, as it could be the input or output of a simulation. So I think I would want to keep all the functionality of the present optimade-python-tools. I am planning to make a separate fork for the database of the IRB, where all the code that is specific for that database will be stored. |
As long as it is not enabled by default/can be disabled then there will be no speed penalty, but it may overcomplicate adoption for the majority of databases that will not serve trajectories. This would not be a separate fork of optimade-python-tools, but an extension that contains the relevant bits.
Sure, I'm imagining the hypothetical |
This reverts commit fbfe0f7.
…/JPBergsma/optimade-python-tools into JPBergsma/add_HDF5_output_format
@ml-evs I have processed your remarks. Some checks still have to be made to check whether a feature is enabled. So it would take a bit of time, but not much. So far I have not implemented the trajectories with the idea of turning it into a separate package, so the code is still quite interwoven with the current optimade-python-tools. I also do not yet have a clear picture of what I would have to do to make it into a separate package. The Trajectories #1065 and hdf5 #1292 PR's also make changes at quite a low level such as in the middleware and server.main.py. It may therefore be quite some work to convert it into an extra package. |
The program vitables gives an error when I try to open the generated hdf5 file. So I first want to figure out why this is happening before I merge this PR. |
put requirements in alphabetical order
I am still not satisfied with the size of the files I get. I am now checking whether the hdf5 file becomes smaller when I use attributes instead of datasets to store the smaller optimade fields. |
With this PR, I want to add the hdf5 format as a response format.
This format allows numbers to be stored in binary form, which gives a smaller response and is therefore faster for datasets with a lot of numerical data. It does add some extra overhead, so for structure files with just a few atoms the response is larger than a json response. I have therefore made it optional to support this response format by adding a config parameter with which the enabled response formats can be defined.
For the future trajectory endpoint, this will be very useful.
I prefer to not add it to the trajectory PR as I think that PR is becoming quite large already.
Some things I was still wondering about:
Should I make installing the dependencies of this PR optional?
I have added doc strings, but I am not sure how I can see these are rendered properly for the documentation on the site.
I look forward to hearing your feedback.
closes #1285