Aliases now also work for nested fields; Only retrieve data required for constructing a response from the database. #1304

Open · wants to merge 28 commits into base: main · Changes from 22 commits

Commits (28)
b949945
Allow aliases to be nested fields.
JPBergsma Aug 17, 2022
7adbf4a
Added a line to docs to explain how to alias nested fields.
JPBergsma Aug 17, 2022
8c05579
Merge branch 'master' into JPBergsma/allow_nested_aliases
JPBergsma Aug 17, 2022
6c8294b
Merge branch 'master' into JPBergsma/allow_nested_aliases
JPBergsma Aug 27, 2022
0ebc724
Merge branch 'master' into JPBergsma/allow_nested_aliases
JPBergsma Sep 14, 2022
4388c9d
Merge branch 'Materials-Consortia:master' into JPBergsma/allow_nested…
JPBergsma Oct 19, 2022
c29cdbb
Merge branch 'master' into JPBergsma/allow_nested_aliases
JPBergsma Nov 1, 2022
368792a
1. Added support for more versatile aliasing allowing nested fields.…
JPBergsma Nov 11, 2022
38fcde7
Merge branch 'master' into JPBergsma/allow_nested_aliases
JPBergsma Nov 11, 2022
5cec6d6
Added type hint to alias_and_prefix function.
JPBergsma Nov 14, 2022
cbba008
Added missing..
JPBergsma Nov 14, 2022
67f2dea
Fixed bug in elastic search where the requested fields were determine…
JPBergsma Nov 15, 2022
775009d
Use https://github.com/pycqa/flake8 instead of https://gitlab.com/pyc…
JPBergsma Nov 15, 2022
1358b9d
Added pyyaml to install requirements and added python 3.11 classifier.
JPBergsma Nov 16, 2022
bb228ee
Made a small change to the descriptions of the nested dict functions …
JPBergsma Nov 16, 2022
8245884
Removed pyyaml from serverdeps as I placed it already under installre…
JPBergsma Nov 16, 2022
7b2320b
Removed get_value function from entries.py as it was no longer needed…
JPBergsma Nov 16, 2022
d87bc1b
Removed get_value function from entries.py as it was no longer needed…
JPBergsma Nov 16, 2022
6b99822
Simplified set_field_to_none_if_missing_in_dict and moved it out of t…
JPBergsma Nov 16, 2022
841520d
Moved functions related to handling nested dicts to utils.py.
JPBergsma Nov 16, 2022
c22089f
Updated docstring remove_exclude_fields and removed unnecessary brack…
JPBergsma Nov 16, 2022
1146273
solved merge conflict.
JPBergsma Nov 16, 2022
4c46f55
Update of the explanation of the handling of nested fields
JPBergsma Nov 17, 2022
3a908bc
Merged changes from master.
JPBergsma Nov 30, 2022
1e49fb8
fix bug introduced by merge with master.
JPBergsma Nov 30, 2022
596bb73
Made changes to satisfy mypy.
JPBergsma Nov 30, 2022
703a9df
Added option to automatically set missing but required fields to the …
JPBergsma Nov 30, 2022
f174850
Removed get_non_optional_fields and get_schema as they are no longer …
JPBergsma Nov 30, 2022
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
Expand Up @@ -23,7 +23,7 @@ repos:
- id: trailing-whitespace
args: [--markdown-linebreak-ext=md]

- repo: https://gitlab.com/pycqa/flake8
- repo: https://github.com/pycqa/flake8
rev: '3.9.2'
hooks:
- id: flake8
1 change: 1 addition & 0 deletions docs/getting_started/setting_up_an_api.md
Expand Up @@ -37,6 +37,7 @@ Instead, if you are storing chemical formulae as an unreduced count per simulati
This would then instead require option 2 above, namely either the addition of auxiliary fields that store the correct (or mappable) OPTIMADE format in the database, or the creation of a secondary database that returns the pre-converted structures.

In the simplest case, the mapper classes can be used to define aliases between fields in the database and the OPTIMADE field name; these can be configured via the [`aliases`][optimade.server.config.ServerConfig.aliases] option as a dictionary mapping stored in a dictionary under the appropriate endpoint name, e.g. `"aliases": {"structures": {"chemical_formula_reduced": "my_chem_form"}}`, or defined as part of a custom mapper class.
If the alias refers to a nested field, the individual field names should be separated by `.`, e.g. `"aliases": {"structures": {"OPTIMADE_field": "field.nested_field"}}`.
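As a rough illustration of how such a `.`-separated alias could be resolved against a stored document (a hypothetical helper for illustration only, not the actual mapper implementation):

```python
def resolve_dotted_alias(document: dict, dotted_field: str):
    """Follow a "."-separated alias like "field.nested_field" into a nested document."""
    value = document
    for key in dotted_field.split("."):
        value = value[key]
    return value

# A backend document whose OPTIMADE field is stored one level deep:
doc = {"field": {"nested_field": "C6H12Ag2ClN3"}}
resolve_dotted_alias(doc, "field.nested_field")  # "C6H12Ag2ClN3"
```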

In either option, you should now be able to insert your data into the corresponding MongoDB (or otherwise) collection.

4 changes: 2 additions & 2 deletions optimade/models/links.py
Expand Up @@ -39,11 +39,11 @@ class Aggregate(Enum):
class LinksResourceAttributes(Attributes):
"""Links endpoint resource object attributes"""

name: str = StrictField(
name: Optional[str] = StrictField(
...,
description="Human-readable name for the OPTIMADE API implementation, e.g., for use in clients to show the name to the end-user.",
)
description: str = StrictField(
description: Optional[str] = StrictField(
...,
description="Human-readable description for the OPTIMADE API implementation, e.g., for use in clients to show a description to the end-user.",
)
7 changes: 0 additions & 7 deletions optimade/models/references.py
Expand Up @@ -2,7 +2,6 @@
from pydantic import ( # pylint: disable=no-name-in-module
BaseModel,
AnyUrl,
validator,
)
from typing import List, Optional

Expand Down Expand Up @@ -264,9 +263,3 @@ class ReferenceResource(EntryResource):
queryable=SupportLevel.MUST,
)
attributes: ReferenceResourceAttributes

@validator("attributes")
def validate_attributes(cls, v):
if not any(prop[1] is not None for prop in v):
raise ValueError("reference object must have at least one field defined")
return v
8 changes: 8 additions & 0 deletions optimade/server/data/test_structures.json
Expand Up @@ -5,6 +5,7 @@
},
"assemblies": null,
"chemsys": "Ac",
"dichtheid": 10.07,
"cartesian_site_positions": [
[
0.17570227444196573,
Expand Down Expand Up @@ -1222,6 +1223,7 @@
"nelements": 5,
"nsites": 24,
"pretty_formula": "Ag2C6ClH12N3",
"fancy_formulas": {"hill": "C6H12Ag2ClN3"},
"species": [
{
"chemical_symbols": [
Expand Down Expand Up @@ -1475,6 +1477,9 @@
"nelements": 5,
"nsites": 25,
"pretty_formula": "Ag2C2H2N6O13",
"fancy_formulas" : {
"hill": "C2H2Ag2N6O13"
},
"species": [
{
"chemical_symbols": [
Expand Down Expand Up @@ -1723,6 +1728,7 @@
"nelements": 7,
"nsites": 23,
"pretty_formula": "Ag2C2ClH8N5O3S2",
"fancy_formulas": {"hill": "C2H8Ag2ClN5O3S2"},
"species": [
{
"chemical_symbols": [
Expand Down Expand Up @@ -2467,6 +2473,7 @@
"nelements": 8,
"nsites": 74,
"pretty_formula": "AgB10C15Cl2H40NO3P2",
"fancy_formulas": {"hill": "C15H40AgB10Cl2NO3P2"},
"species": [
{
"chemical_symbols": [
Expand Down Expand Up @@ -2821,6 +2828,7 @@
"nelements": 7,
"nsites": 29,
"pretty_formula": "AgC3ClH14N6OS3",
"fancy_formulas":{"hill": "C3H14AgClN6OS3"},
"species": [
{
"chemical_symbols": [
4 changes: 1 addition & 3 deletions optimade/server/entry_collections/elasticsearch.py
Expand Up @@ -169,9 +169,7 @@ def _run_db_query(
page_offset = criteria.get("skip", 0)
limit = criteria.get("limit", CONFIG.page_limit)

all_aliased_fields = [
self.resource_mapper.get_backend_field(field) for field in self.all_fields
]
all_aliased_fields = [field for field in criteria.get("projection", [])]
search = search.source(includes=all_aliased_fields)

elastic_sort = [
75 changes: 52 additions & 23 deletions optimade/server/entry_collections/entry_collections.py
Expand Up @@ -4,7 +4,7 @@
import re

from lark import Transformer

from functools import lru_cache
from optimade.filterparser import LarkParser
from optimade.models import EntryResource
from optimade.server.config import CONFIG, SupportedBackend
Expand All @@ -16,6 +16,7 @@
UnknownProviderProperty,
QueryParamNotUsed,
)
from optimade.utils import set_field_to_none_if_missing_in_dict


def create_collection(
Expand Down Expand Up @@ -119,9 +120,7 @@ def count(self, **kwargs: Any) -> int:

def find(
self, params: Union[EntryListingQueryParams, SingleEntryQueryParams]
) -> Tuple[
Union[List[EntryResource], EntryResource, None], int, bool, Set[str], Set[str]
]:
) -> Tuple[Union[List[EntryResource], EntryResource, None], int, bool, Set[str]]:
"""
Fetches results and indicates if more data is available.

Expand All @@ -145,6 +144,14 @@ def find(
criteria, single_entry
)

exclude_fields = self.all_fields - response_fields

results = [self.resource_mapper.map_back(doc) for doc in results]
self.check_and_add_missing_fields(results, response_fields)

if results:
results = self.resource_mapper.deserialize(results)

if single_entry:
results = results[0] if results else None

Expand All @@ -153,10 +160,20 @@ def find(
detail=f"Instead of a single entry, {data_returned} entries were found",
)

exclude_fields = self.all_fields - response_fields
return results, data_returned, more_data_available, exclude_fields

def check_and_add_missing_fields(self, results: List[dict], response_fields: set):
"""Checks whether the response_fields and mandatory fields are present.
If they are not present, the values are set to None so that deserialization works correctly.
It also checks whether all fields in the response have been defined either in the model or in the config file.
If not, it raises an appropriate error or warning."""
include_fields = (
response_fields - self.resource_mapper.TOP_LEVEL_NON_ATTRIBUTES_FIELDS
)
) | set(self.get_non_optional_fields())
# Include missing fields
for result in results:
for field in include_fields:
set_field_to_none_if_missing_in_dict(result["attributes"], field)
Member commented on lines +181 to +183:
This is going to incur a significant performance overhead, but I guess you want to do it so that you don't have to pull e.g., entire trajectories from the database each time, yet you still want to deserialize the JSON into your classes? I think I would suggest we instead have a per-collection deserialization flag, as presumably you only want to deserialize trajectories once (on database insertion) anyway. Does that make sense?

If you want to retain this approach, it might be cleaner to do it at the pydantic level, e.g., a root_validator that explicitly sets all missing fields to None (see also https://pydantic-docs.helpmanual.io/usage/models/#creating-models-without-validation as an option).

Contributor (Author) replied:
I do not think this is particularly heavy compared to all the other things we do in the code.
It is what we previously did in handle_response_fields. I only moved it here, so we can do it before the deserialization and added code to handle nested fields.

For biomolecular data, a structure can easily have 10,000 atoms, so retrieving them from the database and putting them in the model would take some time. This way we can avoid that if the species_at_sites and cartesian_site_positions are not in the response_fields. (I also made a patch in the code for Barcelona that allowed them to specify the default response fields, so they can choose to not have these fields in the response by default.)

I did not want to make the change even bigger by bypassing the rest of the validator (as in your second suggestion).
But from a performance viewpoint, bypassing the validation would be good.
Do you want me to add this to this PR or to a future PR?

I tried the root validator idea, but it seems I already get an error before the root_validator is executed, so I do not think this solution will work.
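The approach under discussion — filling in requested-but-absent fields before deserialization — can be sketched in isolation. This is a guess at the helper's behaviour for nested ("."-separated) fields, not the actual `set_field_to_none_if_missing_in_dict` implementation:

```python
def set_missing_to_none(attributes: dict, field: str) -> None:
    # Walk down a "."-separated path, creating intermediate dicts as needed,
    # and insert None where the requested leaf field is absent.
    # Fields that already hold a value are left untouched.
    keys = field.split(".")
    for key in keys[:-1]:
        attributes = attributes.setdefault(key, {})
    attributes.setdefault(keys[-1], None)

attributes = {"nsites": 24}
set_missing_to_none(attributes, "fancy_formulas.hill")
# attributes is now {"nsites": 24, "fancy_formulas": {"hill": None}}
```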

Member replied:
Hmmm, fair enough, just looks a bit scarier as a double for loop alongside the recursive descent into dictionaries to get the nested aliases. It's quite hard to reason about this, so I might set up a separate repo for measuring performance in the extreme limits (1 structure of 10000 atoms vs 10000 structures of a few atoms -- i.e., what we have now, ignoring pagination of course).

I tried the root validator idea, but it seems I already get an error before the root_validator is executed, so I do not think this solution will work.

Did you use @root_validator(pre=True) to make sure it gets run before anything else? Perhaps bypassing the validation altogether can wait for another PR, as you suggest, but I'd like to make the performance tests first at least (doesn't necessarily hold up this PR but it might hold up the next release).

Contributor (Author) replied:
I just noticed I made a mistake in my test script.
So it may be possible to do this with a root_validator after all.

JPBergsma (Contributor, Author) replied on Dec 1, 2022:
I have added a root_validator to the attributes class that, if a flag is set, checks whether all required fields are present and if not adds them and sets them to 0.

I'll try to put the handling of the other include fields back to the place where it happened originally so the code changes less.


bad_optimade_fields = set()
bad_provider_fields = set()
Expand All @@ -183,17 +200,6 @@ def find(
detail=f"Unrecognised OPTIMADE field(s) in requested `response_fields`: {bad_optimade_fields}."
)

if results:
results = self.resource_mapper.deserialize(results)

return (
results,
data_returned,
more_data_available,
exclude_fields,
include_fields,
)

@abstractmethod
def _run_db_query(
self, criteria: Dict[str, Any], single_entry: bool = False
Expand Down Expand Up @@ -236,6 +242,26 @@ def all_fields(self) -> Set[str]:

return self._all_fields

@lru_cache(maxsize=4)
def get_non_optional_fields(self) -> List[str]:
"""
Returns those fields that should be set before a response class can be initialized.

Returns:
Property names.
"""

schema = self.get_schema()
attributes = schema["properties"]["attributes"]
if "$ref" in attributes:
path = attributes["$ref"].split("/")[1:]
attributes = schema.copy()
while path:
next_key = path.pop(0)
attributes = attributes[next_key]
return attributes["required"]
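The `$ref` handling in `get_non_optional_fields` above can be illustrated standalone with a toy schema (a sketch of the same walk, with made-up field names, not the optimade code itself):

```python
def required_attribute_fields(schema: dict) -> list:
    # If "attributes" is a local JSON-schema reference such as
    # "#/definitions/Attributes", follow its path segments from the schema
    # root, then read the referenced object's "required" list.
    attributes = schema["properties"]["attributes"]
    if "$ref" in attributes:
        node = schema
        for key in attributes["$ref"].split("/")[1:]:
            node = node[key]
        attributes = node
    return attributes["required"]

toy_schema = {
    "properties": {"attributes": {"$ref": "#/definitions/Attributes"}},
    "definitions": {"Attributes": {"required": ["chemical_formula_reduced"]}},
}
required_attribute_fields(toy_schema)  # ["chemical_formula_reduced"]
```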

@lru_cache(maxsize=4)
def get_attribute_fields(self) -> Set[str]:
"""Get the set of attribute fields

Expand All @@ -252,7 +278,7 @@ def get_attribute_fields(self) -> Set[str]:

"""

schema = self.resource_cls.schema()
schema = self.get_schema()
attributes = schema["properties"]["attributes"]
if "allOf" in attributes:
allOf = attributes.pop("allOf")
Expand All @@ -266,6 +292,10 @@ def get_attribute_fields(self) -> Set[str]:
attributes = attributes[next_key]
return set(attributes["properties"].keys())

@lru_cache(maxsize=4)
def get_schema(self):
return self.resource_cls.schema()

def handle_query_params(
self, params: Union[EntryListingQueryParams, SingleEntryQueryParams]
) -> Dict[str, Any]:
Expand Down Expand Up @@ -319,16 +349,15 @@ def handle_query_params(
cursor_kwargs["limit"] = CONFIG.page_limit

# response_fields
cursor_kwargs["projection"] = {
f"{self.resource_mapper.get_backend_field(f)}": True
for f in self.all_fields
}

if getattr(params, "response_fields", False):
response_fields = set(params.response_fields.split(","))
response_fields |= self.resource_mapper.get_required_fields()
else:
response_fields = self.all_fields.copy()
cursor_kwargs["projection"] = {
f"{self.resource_mapper.get_backend_field(f)}": True
for f in response_fields
}

cursor_kwargs["fields"] = response_fields
