Aliases now also work for nested fields; only retrieve the data required for constructing a response from the database. #1304

Open
wants to merge 28 commits into base: main

Commits (28)
b949945
Alow aliases to be nested fields.
JPBergsma Aug 17, 2022
7adbf4a
Added a line to docs to explain how to alias nested fields.
JPBergsma Aug 17, 2022
8c05579
Merge branch 'master' into JPBergsma/allow_nested_aliases
JPBergsma Aug 17, 2022
6c8294b
Merge branch 'master' into JPBergsma/allow_nested_aliases
JPBergsma Aug 27, 2022
0ebc724
Merge branch 'master' into JPBergsma/allow_nested_aliases
JPBergsma Sep 14, 2022
4388c9d
Merge branch 'Materials-Consortia:master' into JPBergsma/allow_nested…
JPBergsma Oct 19, 2022
c29cdbb
Merge branch 'master' into JPBergsma/allow_nested_aliases
JPBergsma Nov 1, 2022
368792a
1. Added support for more versatile aliassing allowing nested fields.…
JPBergsma Nov 11, 2022
38fcde7
Merge branch 'master' into JPBergsma/allow_nested_aliases
JPBergsma Nov 11, 2022
5cec6d6
Added type hint to alias_and_prefix function.
JPBergsma Nov 14, 2022
cbba008
Added missing..
JPBergsma Nov 14, 2022
67f2dea
Fixed bug in elastic search where the requested fields were determine…
JPBergsma Nov 15, 2022
775009d
Use https://github.com/pycqa/flake8 instead of https://gitlab.com/pyc…
JPBergsma Nov 15, 2022
1358b9d
Added pyyaml to install requirements and added python 3.11 classifier.
JPBergsma Nov 16, 2022
bb228ee
Made a small change to the descriptions of the nested dict functions …
JPBergsma Nov 16, 2022
8245884
Removed pyyaml from serverdeps as I placed it already under installre…
JPBergsma Nov 16, 2022
7b2320b
Removed get_value function from entries.py as it was no longer needed…
JPBergsma Nov 16, 2022
d87bc1b
Removed get_value function from entries.py as it was no longer needed…
JPBergsma Nov 16, 2022
6b99822
Simplified set_field_to_none_if_missing_in_dict and moved it out of t…
JPBergsma Nov 16, 2022
841520d
Moved functions related to handling nested dicts to utils.py.
JPBergsma Nov 16, 2022
c22089f
Updated docstring remove_exclude_fields and removed unneccesary brack…
JPBergsma Nov 16, 2022
1146273
solved merge conflict.
JPBergsma Nov 16, 2022
4c46f55
Updateof the explanation of the handling of nested fields
JPBergsma Nov 17, 2022
3a908bc
Merged changes from master.
JPBergsma Nov 30, 2022
1e49fb8
fix bug introduced by merge with master.
JPBergsma Nov 30, 2022
596bb73
Made changes to satisfy mypy.
JPBergsma Nov 30, 2022
703a9df
Added option to automatically set missing but required fields to the …
JPBergsma Nov 30, 2022
f174850
Removed get_non_optional_fields and get_schema as they are no longer …
JPBergsma Nov 30, 2022
1 change: 1 addition & 0 deletions docs/getting_started/setting_up_an_api.md
Expand Up @@ -37,6 +37,7 @@ Instead, if you are storing chemical formulae as an unreduced count per simulati
This would then instead require option 2 above, namely either the addition of auxiliary fields that store the correct (or mappable) OPTIMADE format in the database, or the creation of a secondary database that returns the pre-converted structures.

In the simplest case, the mapper classes can be used to define aliases between fields in the database and the OPTIMADE field name; these can be configured via the [`aliases`][optimade.server.config.ServerConfig.aliases] option as a dictionary mapping stored in a dictionary under the appropriate endpoint name, e.g. `"aliases": {"structures": {"chemical_formula_reduced": "my_chem_form"}}`, or defined as part of a custom mapper class.
If the alias is a nested field (i.e., a field within a dictionary), the field names should be separated by `"."`, for example: `"aliases": { "structures": {"chemical_formula_reduced": "formulae.reduced"}}`.

In either option, you should now be able to insert your data into the corresponding MongoDB (or otherwise) collection.

Expand Down
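To make the new option concrete, a server config snippet along these lines (a sketch following the documentation example above, not taken from a real deployment) would alias the OPTIMADE field chemical_formula_reduced to the nested database field formulae.reduced:

{
  "aliases": {
    "structures": {
      "chemical_formula_reduced": "formulae.reduced"
    }
  }
}
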
13 changes: 13 additions & 0 deletions optimade/models/jsonapi.py
Expand Up @@ -298,6 +298,19 @@ def check_illegal_attributes_fields(cls, values):
)
return values

@root_validator(pre=True)
def set_missing_to_none(cls, values):
if "set_missing_to_none" in values and values.pop("set_missing_to_none"):
for field in cls.schema()["required"]:
if field not in values:
if (
field == "structure_features"
): # It would be nice if there would be a more universal way to handle special cases like this.
values[field] = []
else:
values[field] = None
return values

Comment on lines +301 to +313
Member:

This looks on the right track! I've just played around with something too after spotting something in the pydantic docs about default factories. Would the snippet also solve this issue?

>>> from pydantic import BaseModel, Field
>>> from typing import Optional
>>> class Model(BaseModel):
...     test: Optional[str] = Field(default_factory=lambda: None)
...
>>> Model()
Model(test=None)
>>> Model(test=None)
Model(test=None)
>>> Model(test="value")
Model(test='value')
>>> Model.schema()
{'title': 'Model', 'type': 'object', 'properties': {'test': {'title': 'Test', 'type': 'string'}}}

We can then patch the underlying OptimadeField and StrictField wrappers to default to having a default_factory that returns null in cases where there is no default value for the field to fall back on, and we can do this without modifying the schema or the models.

The only concern is that this functionality might get removed from pydantic:

The default_factory argument is in beta, it has been added to pydantic in v1.5 on a provisional basis. It may change significantly in future releases and its signature or behaviour will not be concrete until v2.

Though it has already lasted a few versions.
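
For illustration, a rough sketch of the wrapper-patching idea described above. This is not the actual StrictField from optimade.models.utils, just a minimal stand-in showing how a None-returning default_factory could be substituted whenever no default value is supplied:

from typing import Any
from pydantic import Field

def StrictField(default: Any = ..., **kwargs: Any) -> Any:
    # If the caller supplied neither a default nor a default_factory,
    # fall back to a factory that returns None, so that missing values
    # deserialize as null instead of failing validation.
    if default is ... and "default_factory" not in kwargs:
        return Field(default_factory=lambda: None, **kwargs)
    return Field(default, **kwargs)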


class Resource(BaseResource):
"""Resource objects appear in a JSON API document to represent resources."""
Expand Down
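For context, a minimal standalone sketch of the pattern implemented by the set_missing_to_none validator added above, using a toy model rather than the actual optimade classes: a transient flag in the input dict triggers None-filling of missing required fields before field validation runs.

from typing import List, Optional
from pydantic import BaseModel, Field, root_validator

class ToyAttributes(BaseModel):
    nsites: Optional[int] = Field(...)
    elements: Optional[List[str]] = Field(...)

    @root_validator(pre=True)
    def set_missing_to_none(cls, values):
        # Pop the transient flag so it never becomes part of the model,
        # then fill in any missing required field with None.
        if values.pop("set_missing_to_none", False):
            for field in cls.schema()["required"]:
                values.setdefault(field, None)
        return values

ToyAttributes(set_missing_to_none=True, nsites=3)  # -> nsites=3, elements=None
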
4 changes: 2 additions & 2 deletions optimade/models/links.py
Expand Up @@ -35,11 +35,11 @@ class Aggregate(Enum):
class LinksResourceAttributes(Attributes):
"""Links endpoint resource object attributes"""

name: str = StrictField(
name: Optional[str] = StrictField(
...,
description="Human-readable name for the OPTIMADE API implementation, e.g., for use in clients to show the name to the end-user.",
)
description: str = StrictField(
description: Optional[str] = StrictField(
...,
description="Human-readable description for the OPTIMADE API implementation, e.g., for use in clients to show a description to the end-user.",
)
Expand Down
8 changes: 1 addition & 7 deletions optimade/models/references.py
@@ -1,7 +1,7 @@
# pylint: disable=line-too-long,no-self-argument
from typing import List, Optional

from pydantic import AnyUrl, BaseModel, validator # pylint: disable=no-name-in-module
from pydantic import AnyUrl, BaseModel # pylint: disable=no-name-in-module

from optimade.models.entries import EntryResource, EntryResourceAttributes
from optimade.models.utils import OptimadeField, SupportLevel
Expand Down Expand Up @@ -260,9 +260,3 @@ class ReferenceResource(EntryResource):
queryable=SupportLevel.MUST,
)
attributes: ReferenceResourceAttributes

@validator("attributes")
def validate_attributes(cls, v):
if not any(prop[1] is not None for prop in v):
raise ValueError("reference object must have at least one field defined")
return v
8 changes: 8 additions & 0 deletions optimade/server/data/test_structures.json
Expand Up @@ -5,6 +5,7 @@
},
"assemblies": null,
"chemsys": "Ac",
"dichtheid": 10.07,
"cartesian_site_positions": [
[
0.17570227444196573,
Expand Down Expand Up @@ -1222,6 +1223,7 @@
"nelements": 5,
"nsites": 24,
"pretty_formula": "Ag2C6ClH12N3",
"fancy_formulas": {"hill": "C6H12Ag2ClN3"},
"species": [
{
"chemical_symbols": [
Expand Down Expand Up @@ -1475,6 +1477,9 @@
"nelements": 5,
"nsites": 25,
"pretty_formula": "Ag2C2H2N6O13",
"fancy_formulas" : {
"hill": "C2H2Ag2N6O13"
},
"species": [
{
"chemical_symbols": [
Expand Down Expand Up @@ -1723,6 +1728,7 @@
"nelements": 7,
"nsites": 23,
"pretty_formula": "Ag2C2ClH8N5O3S2",
"fancy_formulas": {"hill": "C2H8Ag2ClN5O3S2"},
"species": [
{
"chemical_symbols": [
Expand Down Expand Up @@ -2467,6 +2473,7 @@
"nelements": 8,
"nsites": 74,
"pretty_formula": "AgB10C15Cl2H40NO3P2",
"fancy_formulas": {"hill": "C15H40AgB10Cl2NO3P2"},
"species": [
{
"chemical_symbols": [
Expand Down Expand Up @@ -2821,6 +2828,7 @@
"nelements": 7,
"nsites": 29,
"pretty_formula": "AgC3ClH14N6OS3",
"fancy_formulas":{"hill": "C3H14AgClN6OS3"},
"species": [
{
"chemical_symbols": [
Expand Down
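The new test-data fields above ("dichtheid", Dutch for "density", and the nested "fancy_formulas.hill") are presumably exercised through matching alias entries in the test server config. A hypothetical mapping (the actual test config is not part of this diff) could look like:

{
  "aliases": {
    "structures": {
      "chemical_formula_hill": "fancy_formulas.hill",
      "_exmpl_density": "dichtheid"
    }
  }
}
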
4 changes: 1 addition & 3 deletions optimade/server/entry_collections/elasticsearch.py
Expand Up @@ -168,9 +168,7 @@ def _run_db_query(
page_offset = criteria.get("skip", 0)
limit = criteria.get("limit", CONFIG.page_limit)

all_aliased_fields = [
self.resource_mapper.get_backend_field(field) for field in self.all_fields
]
all_aliased_fields = [field for field in criteria.get("projection", [])]
search = search.source(includes=all_aliased_fields)

elastic_sort = [
Expand Down
64 changes: 37 additions & 27 deletions optimade/server/entry_collections/entry_collections.py
@@ -1,6 +1,7 @@
import re
import warnings
from abc import ABC, abstractmethod
from functools import lru_cache
from typing import Any, Dict, Iterable, List, Set, Tuple, Type, Union

from lark import Transformer
Expand All @@ -11,6 +12,7 @@
from optimade.server.config import CONFIG, SupportedBackend
from optimade.server.mappers import BaseResourceMapper
from optimade.server.query_params import EntryListingQueryParams, SingleEntryQueryParams
from optimade.utils import set_field_to_none_if_missing_in_dict
from optimade.warnings import (
FieldValueNotRecognized,
QueryParamNotUsed,
Expand Down Expand Up @@ -121,13 +123,7 @@ def count(self, **kwargs: Any) -> int:

def find(
self, params: Union[EntryListingQueryParams, SingleEntryQueryParams]
) -> Tuple[
Union[List[EntryResource], EntryResource],
int,
bool,
Set[str],
Set[str],
]:
) -> Tuple[Union[List[EntryResource], EntryResource], int, bool, Set[str]]:
"""
Fetches results and indicates if more data is available.

Expand All @@ -146,23 +142,49 @@ def find(
criteria = self.handle_query_params(params)
single_entry = isinstance(params, SingleEntryQueryParams)
response_fields = criteria.pop("fields")
response_fields_set = criteria.pop("response_fields_set", False)

raw_results, data_returned, more_data_available = self._run_db_query(
criteria, single_entry
)

exclude_fields = self.all_fields - response_fields

results: List = [self.resource_mapper.map_back(doc) for doc in raw_results]

self.check_and_add_missing_fields(results, response_fields, response_fields_set)

if results:
results = self.resource_mapper.deserialize(results)

if single_entry:
raw_results = raw_results[0] if raw_results else None # type: ignore[assignment]
results = results[0] if results else None # type: ignore[assignment]

if data_returned > 1:
raise NotFound(
detail=f"Instead of a single entry, {data_returned} entries were found",
)

exclude_fields = self.all_fields - response_fields
return results, data_returned, more_data_available, exclude_fields

def check_and_add_missing_fields(
self, results: List[dict], response_fields: set, response_fields_set: bool
):
"""Checks whether the response_fields and mandatory fields are present.
If they are not present the values are set to None, so the deserialization works correctly.
It also checks whether all fields in the response have been defined either in the model or in the config file.
If not it raises an appropriate error or warning."""
include_fields = (
response_fields - self.resource_mapper.TOP_LEVEL_NON_ATTRIBUTES_FIELDS
)
# Include missing fields
for result in results:
for field in include_fields:
set_field_to_none_if_missing_in_dict(result["attributes"], field)
Comment on lines +181 to +183
Member:

This is going to incur a significant performance overhead, but I guess you want to do it so that you don't have to pull e.g., entire trajectories from the database each time, yet you still want to deserialize the JSON into your classes? I think I would suggest we instead have a per-collection deserialization flag, as presumably you only want to deserialize trajectories once (on database insertion) anyway. Does that make sense?

If you want to retain this approach, it might be cleaner to do it at the pydantic level, e.g., a root_validator that explicitly sets all missing fields to None (see also https://pydantic-docs.helpmanual.io/usage/models/#creating-models-without-validation as an option).
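
The "creating models without validation" link above refers to pydantic v1's construct(), which builds a model instance without running any validators. A minimal illustrative sketch (not code from this PR) of how already-validated database documents could skip re-validation:

from optimade.models import StructureResource

# construct() sets the fields directly from pre-validated data and
# bypasses all validators, including any None-filling root_validator.
mapped_document = {"id": "example-1", "type": "structures", "attributes": {"nsites": 2}}
resource = StructureResource.construct(**mapped_document)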

Contributor Author:

I do not think this is particularly heavy compared to all the other things we do in the code.
It is what we previously did in handle_response_fields; I only moved it here so that it happens before deserialization, and added code to handle nested fields.

For biomolecular data, a structure can easily have 10,000 atoms, so retrieving them from the database and putting them in the model would take some time. This way we can avoid that if the species_at_sites and cartesian_site_positions are not in the response_fields. (I also made a patch in the code for Barcelona that allowed them to specify the default response fields, so they can choose to not have these fields in the response by default.)

I did not want to make the change even bigger by bypassing the rest of the validator (as in your second suggestion).
But from a performance viewpoint, bypassing the validation would be good.
Do you want me to add this to this PR or to a future PR?

I tried the root validator idea, but it seems I already get an error before the root_validator is executed, so I do not think this solution will work.

Member:

Hmmm, fair enough, just looks a bit scarier as a double for loop alongside the recursive descent into dictionaries to get the nested aliases. It's quite hard to reason about this, so I might set up a separate repo for measuring performance in the extreme limits (1 structure of 10000 atoms vs 10000 structures of a few atoms -- i.e., what we have now, ignoring pagination of course).

I tried the root validator idea, but it seems I already get an error before the root_validator is executed, so I do not think this solution will work.

Did you use @root_validator(pre=True) to make sure it gets run before anything else? Perhaps bypassing the validation altogether can wait for another PR, as you suggest, but I'd like to make the performance tests first at least (doesn't necessarily hold up this PR but it might hold up the next release).

Contributor Author:

I just noticed I made a mistake in my test script.
So it may be possible to do this with a root_validator after all.

Contributor Author (JPBergsma, Dec 1, 2022):

I have added a root_validator to the attributes class that, if a flag is set, checks whether all required fields are present and, if not, adds them and sets them to None.

I'll try to move the handling of the other include fields back to where it happened originally, so the code changes less.


if response_fields_set:
for result in results:
result["attributes"]["set_missing_to_none"] = True

bad_optimade_fields = set()
bad_provider_fields = set()
Expand All @@ -189,19 +211,6 @@ def find(
detail=f"Unrecognised OPTIMADE field(s) in requested `response_fields`: {bad_optimade_fields}."
)

if raw_results is not None:
results = self.resource_mapper.deserialize(raw_results)
else:
results = None

return (
results,
data_returned,
more_data_available,
exclude_fields,
include_fields,
)

@abstractmethod
def _run_db_query(
self, criteria: Dict[str, Any], single_entry: bool = False
Expand Down Expand Up @@ -244,6 +253,7 @@ def all_fields(self) -> Set[str]:

return self._all_fields

@lru_cache(maxsize=4)
def get_attribute_fields(self) -> Set[str]:
"""Get the set of attribute fields

Expand Down Expand Up @@ -327,16 +337,16 @@ def handle_query_params(
cursor_kwargs["limit"] = CONFIG.page_limit

# response_fields
cursor_kwargs["projection"] = {
f"{self.resource_mapper.get_backend_field(f)}": True
for f in self.all_fields
}

if getattr(params, "response_fields", False):
cursor_kwargs["response_fields_set"] = True
response_fields = set(params.response_fields.split(","))
response_fields |= self.resource_mapper.get_required_fields()
else:
response_fields = self.all_fields.copy()
cursor_kwargs["projection"] = {
f"{self.resource_mapper.get_backend_field(f)}": True
for f in response_fields
}

cursor_kwargs["fields"] = response_fields

Expand Down
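To illustrate the projection handling in handle_query_params above, a simplified sketch (hypothetical names, not the PR's exact code) of how requested response_fields are turned into a backend projection once aliases may point at nested fields:

# Requested OPTIMADE fields and their (possibly nested) backend aliases.
response_fields = {"chemical_formula_reduced", "nsites"}
aliases = {"chemical_formula_reduced": "formulae.reduced"}  # "."-separated nested field

projection = {aliases.get(field, field): True for field in response_fields}
# -> {"formulae.reduced": True, "nsites": True}
# MongoDB accepts dotted keys in projections, so only the required
# sub-field of a nested document is fetched from the database.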