This is a Flask-based API that fetches data from an ICAT instance and has two sets of endpoints, for two different use cases. The first is for DataGateway, which has two methods of interfacing with an ICAT stack: using a Python-based ICAT wrapper library, or using SQLAlchemy to communicate directly with an ICAT database.
The other use case is the PaNOSC Search API, required to be implemented and deployed for ICAT facilities that are part of the PaNOSC and ExPaNDS projects. In short, the search API is a limited-functionality version of DataGateway API (in terms of the number of endpoints and query filters available to a user), but adheres more strictly to Loopback than DataGateway API does (due to the specification of the search API). Like DataGateway API, the search API uses Python ICAT to fetch data from ICAT, and code is reused from DataGateway API where possible.
Both use cases can be run under the same API instance, and this is fully configurable. Alternatively, a user can choose to run only one of the use cases (referred to as modes) if they only require one of the products.
The recommended development environment for this API has taken lots of inspiration from the Hypermodern Python guide found online. It is assumed the commands shown in this part of the README are executed in the root directory of this repo once it has been cloned to your local machine.
To start, install pyenv. There is a Windows version of this tool (pyenv-win), however this is currently untested on this repo. This is used to manage the various versions of Python that will be used to test/lint Python during development. Install by executing the following:
curl https://pyenv.run | bash
The following lines need to be added to `~/.bashrc`; either open a new terminal or execute `source ~/.bashrc` to make these changes apply:
export PATH="~/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
Various Python build dependencies need to be installed next. These will vary dependent on the platform of your system (see the common pyenv build problems for the relevant command for your OS), but the following shows the bash command to install the requirements for a CentOS/RHEL machine:
sudo yum install @development zlib-devel bzip2 bzip2-devel readline-devel sqlite \
sqlite-devel openssl-devel xz xz-devel libffi-devel findutils
To make use of pyenv, let's install different versions of Python onto the system. In
production, DataGateway API uses Python 3.6, so this should definitely be part of a
development environment for this repo. This stage might take some time as each Python
version needs to be downloaded and built individually:
pyenv install 3.6.8
pyenv install 3.7.7
pyenv install 3.8.2
pyenv install 3.9.0
To verify the installation commands worked:
python3.6 --version
python3.7 --version
python3.8 --version
python3.9 --version
These Python versions need to be made available to the local version of the repository.
They will be used during the Nox sessions, explained further down this file. Executing
the following command will create a `.python-version` file inside the repo (this file is
currently listed in `.gitignore`):
pyenv local 3.6.8 3.7.7 3.8.2 3.9.0
To maintain records of the API's dependencies, Poetry is used. To install, use the following command:
curl -sSL https://install.python-poetry.org | python3 -
The installation requires the following to be added to your `~/.bashrc` file so the
installation folder is on your path:
export PATH="~/.local/bin:$PATH"
Then run `source ~/.bashrc` or open a new terminal, and check Poetry works by running
`poetry --version`.
If you encounter this error when installing poetry:
ERROR: No matching distribution found for poetry==1.2.0
You can try running the installer with python 3.8 with the command below:
curl -sSL https://install.python-poetry.org | python3.8 -
Or you can specify the version you want to install from the listed versions with the command below:
curl -sSL https://install.python-poetry.org | python3 - --version 1.2.0
The dependencies for this repo are stored in pyproject.toml
, with a more detailed
version of this data in poetry.lock
. The lock file is used to maintain the exact
versions of dependencies from system to system. To install the dependencies, execute the
following command (add --no-dev
if you don't want the dev dependencies):
poetry install
To add a dependency to Poetry, run the following command (add --dev
if it's a
development related dependency). The
official docs give good detail regarding the
intricacies of this command:
poetry add [PACKAGE-NAME]
When developing new features for the API, there are a number of Nox sessions that can be
used to lint/format/test the code in the included noxfile.py
. To install Nox, use Pip
as shown below. Nox is not listed as a Poetry dependency because this has the potential
to cause issues if Nox was executed inside Poetry (see
here
for more detailed reasoning). When using the --user
option, ensure your user's Python
installation is added to the system PATH
variable, remembering to reboot your system
if you need to change the PATH
. If you do choose to install these packages within a
virtual environment, you do not need the --user
option:
pip install --user --upgrade nox
To run the sessions defined in nox.options.sessions
(see noxfile.py
), simply run:
nox
To execute a specific nox session, the following will do that:
nox -s [SESSION/FUNCTION NAME]
Currently, the following Nox sessions have been created:
- `black` - this uses Black to format Python code to a pre-defined style.
- `lint` - this uses flake8 with a number of additional plugins (see the included `noxfile.py` to see which plugins are used) to lint the code to keep it Pythonic. `.flake8` configures flake8 and the plugins.
- `safety` - this uses safety to check the dependencies (pulled directly from Poetry) for any known vulnerabilities. This session gives the output in a full ASCII style report.
- `unit_tests` - this uses pytest to execute the automated tests in `test/unit` (tests for the database and ICAT backends, and non-backend specific tests). More details about the tests themselves here.
- `integration_tests` - this uses pytest to execute the automated tests in `test/integration` (tests for the database and ICAT backends, and non-backend specific tests). Requires an ICAT backend. More details about the tests themselves here.
Each Nox session builds an environment using the repo's dependencies (defined using
Poetry) using install_with_constraints()
. This stores the dependencies in a
requirements.txt
-like format temporarily during this process, using the OS' default
temporary location. These files are manually deleted in noxfile.py
(as opposed to
being automatically removed by Python) to minimise any potential permission-related
issues as documented
here.
To make use of Git's ability to run custom hooks, pre-commit is used. Like Nox, Pip is used to install this tool:
pip install --user --upgrade pre-commit
This repo contains an existing config file for pre-commit
(.pre-commit-config.yaml
)
which needs to be installed using:
pre-commit install
When you commit work on this repo, the configured commit hooks will be executed, but only on the changed files. This is good because it keeps the process of committing a simple one, but to run the hooks on all the files locally, execute the following command:
pre-commit run --all-files
As a summary, these are the steps needed to create a dev environment for this repo compressed into a single code block:
# Install pyenv
curl https://pyenv.run | bash
# Paste into ~/.bashrc
export PATH="~/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
# Apply changes made in ~/.bashrc
source ~/.bashrc
# Install Python build tools
sudo yum install @development zlib-devel bzip2 bzip2-devel readline-devel sqlite \
sqlite-devel openssl-devel xz xz-devel libffi-devel findutils
# Install different versions of Python and verify they work
pyenv install 3.6.8
python3.6 --version
pyenv install 3.7.7
python3.7 --version
pyenv install 3.8.2
python3.8 --version
# Make installed Python versions available to repo
pyenv local 3.6.8 3.7.7 3.8.2
# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -
# Add Poetry's install location to PATH (paste into ~/.bashrc) and apply the change
export PATH="~/.local/bin:$PATH"
source ~/.bashrc
# Install API's dependencies
poetry install
# Install Nox
pip install --user --upgrade nox
# Install Pre Commit
pip install --user --upgrade pre-commit
# Install commit hooks
pre-commit install
By default, the API will run on http://localhost:5000
and all requests are made here
e.g. http://localhost:5000/datagateway-api/sessions
.
Depending on the backend you want to use (either db
or python_icat
, more details
about backends here) the connection URL for the backend needs to be set.
These are set in config.yaml
(an example file is provided in the base directory of
this repository). While both db_url
and icat_url
should have values assigned to them
(for best practice), db_url
will only be used for the database backend, and icat_url
will only be used for the Python ICAT backend. Copy config.yaml.example
to
config.yaml
and set the values as needed. If you need to create an instance of ICAT,
there are a number of markdown-formatted tutorials that can be found on the
icat.manual
repository.
Since adding the search API, the endpoints for each type of API can be configured using
extension
in the respective JSON object. For example, if extension
is set to
/search-api
, then requests for the search API can be sent to
http://localhost:5000/search-api
(assuming default host and port configuration). This
option is made configurable for both DataGateway API and the search API.
In addition to the configuration options in config.yaml
, the mappings between the
PaNOSC and ICAT data models need configuring. An example file exists in
datagateway_api/
which can be copied from as a starting point. Further explanation of
this file is given here.
Within the search API, there are various entities that would need ICAT 5 to work. Despite this, ICAT 5 is not required to use the search API; however, not every single piece of functionality (e.g. getting technique data) will work, because that functionality/data simply doesn't exist in ICAT 4. The only strict ICAT related requirement for the search API is that the ICAT instance being used must have the anon authenticator installed. This is because the search API only deals with public data, so the anon/anon user will have the relevant permissions to not show embargoed data.
Ideally, the API should be run using the following command; the alternative (detailed below) should only be used for development purposes.
poetry run python -m datagateway_api.src.main
However, it can also be run with the flask run
command (installed with Flask). To use
flask run
, the environment variable FLASK_APP
should be set to
datagateway_api/src/main.py
. Once this is set, the API can be run with flask run
while inside the root directory of the project. This shouldn't be used in production, as
detailed in Flask's documentation, this method of running the API is only
"provided for convenience".
WARNING: the host, port and debug config options will not be respected when the API is run this way
Examples:
Unix:
$ export FLASK_APP=datagateway_api/src/main.py
$ poetry run flask run
CMD:
> set FLASK_APP=datagateway_api/src/main.py
> poetry run flask run
PowerShell:
> $env:FLASK_APP = "datagateway_api/src/main.py"
> poetry run flask run
The Flask app can be configured so that code changes are monitored and the server will
reload itself when a change is detected. This setting can be toggled using
flask_reloader
in config.yaml
. This is useful for development purposes. It should be
noted that when this setting is enabled, the API will go through the startup process
twice. In the case of the ICAT backend, this could dramatically increase startup time if
the API is configured with a large initial client pool size.
If you get the following error when starting the API, changes need to be made to your Poetry environment:
ModuleNotFoundError: No module named 'urlparse'
If using Python 3.10, please use Payara 5 on the ICAT stack which the API is being pointed at. There is a known issue when making HTTPS connections to Payara (via Python ICAT).
It is also possible to run the API inside Docker. The Dockerfile
can be used to build
a Docker image which in turn can be used to create a container. The Dockerfile
is
configured to create a production image and runs a Gunicorn server on port 8000
when a
container is started. Environment variables have also been defined in the Dockerfile
to allow for values to be passed at runtime to future running containers. These values
are used by the docker/docker-entrypoint.sh
script to update the config values in the
config.yaml
file. The environment variables are:
- `ICAT_URL` (Default value: `http://localhost`)
- `ICAT_CHECK_CERT` (Default value: `false`)
- `LOG_LOCATION` (Default value: `/dev/stdout`)
To build an image, run:
docker build -t datagateway_api_image .
To start a container on port 8000
from the image that you just built, run:
docker run -p 8000:8000 --name datagateway_api_container datagateway_api_image
If you want to pass values for the environment variables then instead run:
docker run -p 8000:8000 --name datagateway_api_container --env ICAT_URL=https://127.0.0.1:8181 --env ICAT_CHECK_CERT=true --env LOG_LOCATION=/datagateway-api-run/logs.log datagateway_api_image
Each request requires a valid session ID to be provided in the Authorization header.
This header should take the form of `{"Authorization":"Bearer <session_id>"}`.
A session
ID can be obtained by sending a POST request to /sessions
. All endpoint methods that
require a session id are decorated with @requires_session_id
.
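As an illustration, the minimal sketch below (using the `requests` library) obtains a
session ID and then uses it in the Authorization header. The credentials, mechanism and
the exact shape of the login request/response (e.g. the `sessionID` key) are assumptions
and will vary per facility.

```python
# Hedged example: endpoint paths follow the defaults described above, but the
# login payload/response keys ("username", "password", "mechanism", "sessionID")
# are assumptions and may differ for your deployment.
import requests

BASE_URL = "http://localhost:5000/datagateway-api"

# Obtain a session ID by POSTing credentials to /sessions
login_response = requests.post(
    f"{BASE_URL}/sessions",
    json={"username": "user", "password": "password", "mechanism": "simple"},
)
session_id = login_response.json()["sessionID"]

# Provide the session ID in the Authorization header on subsequent requests
investigations = requests.get(
    f"{BASE_URL}/investigations",
    headers={"Authorization": f"Bearer {session_id}"},
)
print(investigations.status_code, investigations.json())
```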
At each of the API's base paths (http://localhost:5000/datagateway-api
and
http://localhost:5000/search-api
by default), a representation of each API will be
shown using Swagger UI. This uses an OpenAPI
specification to visualise and allow users to easily interact with the API without
building their own requests. It's great for gaining an understanding of what endpoints
are available and what inputs the requests can receive, all from an interactive
interface.
For DataGateway API, this specification is built with the Database Backend in mind (e.g. attribute names on example outputs are capitalised), however the Swagger interface can also be used with the Python ICAT Backend. More details on how the API's OpenAPI specification is built can be found here. An issue has been created for the Swagger interface to be up to date when using the Python ICAT backend.
There are two separate test runners provided: the integration tests and the unit tests. The unit tests do not require an ICAT stack to be set up in order to run, whereas the integration tests do. To cover all of the code, you will need to run both.
To run the unit test use nox -s unit_tests
, and to run the integration tests use nox -s integration_tests
The repository contains a variety of tests to check that the functionality of the API works as intended. For convenience and quicker GitHub Actions runs, these are additionally split into unit and integration tests.
The tests are split into 3 main sections: non-backend specific (testing features such as the date handler), ICAT backend
tests (containing tests for backend specific components, including tests for the
different types of endpoints) and Database Backend tests (like the ICAT backend tests,
but covering only the most used aspects of the API).
The configuration file (config.yaml
) contains two options that will be used during the
testing of the API. Set test_user_credentials
and test_mechanism
appropriately for your test environment, using config.yaml.example
as a reference. The tests require a
connection to an instance of ICAT, so set the rest of the config as needed.
By default, this will execute the repo's tests in Python 3.6, 3.7, 3.8, 3.9 and 3.10. For most cases, running the tests in a single Python version will be sufficient:
nox -p 3.6 -s unit_tests
nox -p 3.6 -s integration_tests
This repository also utilises pytest-cov
to check how much of the codebase is covered by the tests in test/
:
nox -p 3.6 -s unit_tests -- --cov-report term --cov=./datagateway_api
nox -p 3.6 -s integration_tests -- --cov-report term --cov=./datagateway_api
With pytest
, you can output the duration for each test, useful for showing the slower
tests in the collection (sorted from slowest to fastest). The test duration is split
into setup, call and teardown to more easily understand where the tests are being slowed
down:
nox -p 3.6 -s unit_tests -- --durations=0
nox -p 3.6 -s integration_tests -- --durations=0
To test a specific test class (or even a specific test function), you will need to use
pytest itself through Poetry. If you want to change the Python version, use
`poetry env use 3.6`, which will generate a virtual environment with that version.
# Test a specific file
poetry run pytest test/integration/datagateway_api/icat/test_query.py
# Test a specific test class
poetry run pytest test/integration/datagateway_api/icat/test_query.py::TestICATQuery
# Test a specific test function
poetry run pytest test/integration/datagateway_api/icat/test_query.py::TestICATQuery::test_valid_query_exeuction
The project consists of 5 main packages:
- `datagateway_api.src.datagateway_api` - code for DataGateway API, for both database and Python ICAT backends
- `datagateway_api.src.search_api` - Search API specific code e.g. `NestedWhereFilters` for the OR functionality for WHERE clauses
- `datagateway_api.src.common` - code that is shared between DataGateway API and the search API
- `datagateway_api.src.resources` - contains the API resources and their HTTP method definitions (e.g. GET, POST)
- `test` - mixture of automated unit and integration tests written using Pytest
main.py
is where the flask_restful API is set up. This is where each endpoint resource
class is generated and mapped to an endpoint.
Example:
api.add_resource(get_endpoint_resource, f"/{entity_name.lower()}")
The logic for each endpoint is within /src/resources
- they're split into entities and
non_entities.
The entities package contains entity_map
which
maps entity names to their field name used in backend-specific code. The Database
Backend uses this for its mapped classes (explained below) and the Python ICAT Backend
uses this for interacting with ICAT objects within Python ICAT. In most instances, the
dictionary found in entity_map.py
is simply mapping the plural entity name (used to
build the entity endpoints) to the singular version. The entity_endpoint
module
contains the function that is used to generate endpoints at start up. Finally,
non_entities
contains the session endpoint for session handling.
Logging configuration can be found in datagateway_api.src.common.logger_setup
. This
contains a typical dictionary-based config for the standard Python logging
library
that rotates files after they become 5MB in size.
The default logging location is in the root directory of this repo. This location (and
filename) can be changed by editing the log_location
value in config.yaml
. The log
level (set to WARN
by default) can also be changed using the appropriate value in that
file.
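The following is a minimal sketch of what such a dictionary-based rotating-file config
looks like; the handler names, format string and backup count are illustrative rather
than copies of `logger_setup`.

```python
# Illustrative sketch only; values such as the filename, format and backupCount
# are assumptions, not the exact contents of logger_setup.
import logging.config

LOG_CONFIG = {
    "version": 1,
    "formatters": {
        "default": {"format": "[%(asctime)s] %(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "file": {
            "class": "logging.handlers.RotatingFileHandler",
            "filename": "logs.log",       # taken from log_location in config.yaml
            "maxBytes": 5 * 1024 * 1024,  # rotate files once they reach 5MB
            "backupCount": 10,
            "formatter": "default",
        },
    },
    "root": {"level": "WARN", "handlers": ["file"]},
}

logging.config.dictConfig(LOG_CONFIG)
```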
This is a class containing static methods to deal with dates within the API. The date
handler can be used to convert dates between string and datetime objects (using a format
agreed in datagateway_api.src.common.constants
) and uses a parser from dateutil
to
detect if an input contains a date. This is useful for determining if a JSON value given
in a request body is a date, at which point it can be converted to a datetime object,
ready for storing in ICAT. The handler is currently only used in the Python ICAT
Backend, however this is a non-backend specific class.
Exceptions custom to DataGateway API are defined in
datagateway_api.src.common.exceptions
. Each exception has a status code and a default
message (which can be changed when raising the exception in code). None of them are
backend specific, however some are only used in a single backend because their meaning
becomes irrelevant anywhere else.
When the API is setup in main.py
, a custom API object is created (inheriting
flask_restful's Api
object) so handle_error()
can be overridden. A previous
iteration of the API registered an error handler with the Api
object, however this
meant DataGateway API's custom error handling only worked as intended in debug mode (as
detailed in a
GitHub issue). This
solution prevents any exception returning a 500 status code (no matter the defined
status code in exceptions.py
) in production mode. This is explained in a
Stack Overflow answer.
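To illustrate the pattern, a hedged sketch is shown below; the class and attribute names
are illustrative and not the exact definitions in `exceptions.py` or `main.py`.

```python
# Hedged sketch: illustrates exceptions carrying a status code plus a default
# message, and an Api subclass overriding handle_error(); names are assumptions.
from flask_restful import Api


class ApiError(Exception):
    status_code = 500

    def __init__(self, msg="Something went wrong"):
        super().__init__(msg)


class MissingRecordError(ApiError):
    status_code = 404

    def __init__(self, msg="No such record in table"):
        super().__init__(msg)


class CustomErrorHandledApi(Api):
    def handle_error(self, e):
        # Return the status code defined on the raised exception even when the
        # app is not running in debug mode, instead of a blanket 500
        if isinstance(e, ApiError):
            return {"message": str(e)}, e.status_code
        return super().handle_error(e)
```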
Filters available for use in the API are defined in datagateway_api.src.common.filters
.
These filters are all based on `QueryFilter`, an abstract class to define any filter
for the API. Precedence is used to prioritise in which order filters should be applied,
but is only needed for the Database Backend.
Filtering logic is located in datagateway_api.src.common.helpers
.
get_filters_from_query_string()
uses the request query parameters to form filters to
be used within the API. A QueryFilterFactory
is used to build filters for the correct
backend and the static method within this class is called in
get_filters_from_query_string()
.
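A hedged sketch of this filter abstraction is shown below; only `QueryFilter`, the
precedence attribute and the general idea are taken from the text above, while the
concrete filter and its behaviour are assumptions.

```python
# Illustrative sketch of the QueryFilter abstraction; the LimitFilter shown here
# and its apply_filter() behaviour are assumptions, not the repo's definitions.
from abc import ABC, abstractmethod


class QueryFilter(ABC):
    # Precedence prioritises the order filters are applied in (only needed for
    # the Database Backend)
    precedence = None

    @abstractmethod
    def apply_filter(self, query):
        """Apply this filter to the given query object."""


class LimitFilter(QueryFilter):
    precedence = 2

    def __init__(self, limit_value):
        self.limit_value = limit_value

    def apply_filter(self, query):
        query.limit = self.limit_value


# e.g. a factory/helper builds filters from the request's query parameters and
# applies them in precedence order
filters = [LimitFilter(10)]
filters.sort(key=lambda f: f.precedence)
```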
As described at the top of this file, there are currently two ways that DataGateway API creates/fetches/updates/deletes data from ICAT. The intention is each backend allows a different method to communicate with ICAT, but results in a very similarly behaving DataGateway API.
The abstract class can be found in datagateway_api.src.datagateway_api.backend
and
contains all the abstract methods that should be found in a class which implements
Backend
. The typical architecture across both backends is that the implemented
functions call a helper function to process the request and the result of that is
returned to the user.
Each backend module contains the following files which offer similar functionality, implemented in their own ways:
- `backend.py` - Implemented version of `datagateway_api.src.datagateway_api.backend`
- `filters.py` - Inherited versions of each filter defined in `datagateway_api.src.common.filters`
- `helpers.py` - Helper functions that are called in `backend.py`
A function inside datagateway_api.src.datagateway_api.backends
creates an instance of a
backend using input to that function to decide which backend to create. This function is
called in main.py
which uses the backend type set in config.yaml
, or a config value
in the Flask app if it's set (this config option is only used in the tests however). The
backend object is then passed into the endpoint classes so the correct backend can be
used.
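A hedged sketch of this arrangement follows; the method names and the backend classes
below are simplified assumptions rather than the repo's actual definitions.

```python
# Hedged sketch of the Backend abstraction and the factory function described
# above; method names and the backend classes are simplified assumptions.
from abc import ABC, abstractmethod


class Backend(ABC):
    @abstractmethod
    def login(self, credentials):
        ...

    @abstractmethod
    def get_with_filters(self, session_id, entity_type, filters):
        ...


class DatabaseBackend(Backend):
    def login(self, credentials):
        ...

    def get_with_filters(self, session_id, entity_type, filters):
        ...


class PythonICATBackend(Backend):
    def login(self, credentials):
        ...

    def get_with_filters(self, session_id, entity_type, filters):
        ...


def create_backend(backend_type):
    # backend_type comes from config.yaml (or the Flask app config in tests)
    if backend_type == "db":
        return DatabaseBackend()
    if backend_type == "python_icat":
        return PythonICATBackend()
    raise ValueError(f"Invalid backend type: {backend_type}")
```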
The Database Backend uses SQLAlchemy to interface directly with the database for an instance of ICAT. This backend favours speed over thoroughness, allowing no control over which users can access a particular piece of data.
The classes mapped from the database (as described above) are stored in
/common/database/models.py
. Each model was automatically generated using sqlacodegen.
A class EntityHelper
is defined so that each model may inherit two methods to_dict()
and update_from_dict(dictionary)
, both used for returning entities and updating them,
in a form easily converted to JSON.
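The idea is sketched below, assuming SQLAlchemy declarative models; the column names and
the example model are illustrative only.

```python
# Illustrative sketch; the real models are generated with sqlacodegen and
# EntityHelper contains more logic (e.g. handling dates and related entities).
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class EntityHelper:
    def to_dict(self):
        # Dictionary of the mapped columns, in a form easily converted to JSON
        return {
            column.name: getattr(self, column.name)
            for column in self.__table__.columns
        }

    def update_from_dict(self, dictionary):
        for key, value in dictionary.items():
            setattr(self, key, value)
        return self.to_dict()


class Instrument(Base, EntityHelper):
    __tablename__ = "INSTRUMENT"

    ID = Column(Integer, primary_key=True)
    NAME = Column(String)
```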
Sometimes referred to as the ICAT Backend, this uses
python-icat to interact with ICAT data.
The Python-based API wrapper allows ICAT Server to be accessed using the SOAP interface.
Python ICAT allows control over which users can access a particular piece of data, with
the API supporting multiple authentication mechanisms. Meta attributes such as modId
are dealt with by Python ICAT, rather than the API.
Python ICAT uses client objects to authenticate users and provide interaction with ICAT (e.g. querying icatdb). A client object has a high creation cost (often taking several seconds), so it's unsuitable to create a new client object at the start of each request. In a similar vein, it would also be unsuitable to use a single client object for the entire API due to collisions between different users.
Client objects are handled using an LRU cache, fetching clients from an object pool when a new client is requested for the cache.
The cache is extended from Cachetools' implementation (although the documentation for the builtin LRU cache is more detailed, hence that's linked above) to allow for a client object to be placed back into the object pool once it becomes 'least recently used' and therefore is removed from the cache (in place of another item). Each cache item is differentiated by the arguments of the function it's applied to, which in this case is the session ID. The client pool object is also passed into the function, but this is a singleton object (mandated by the library it's implemented from) so this won't change throughout the lifetime of the API.
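A hedged sketch of that eviction behaviour is shown below, using cachetools; the class
name and the pool's `release()` method are hypothetical, and the real extended cache and
client pool differ in detail.

```python
# Hedged sketch using cachetools; ClientReturningLRUCache and the pool's
# release() method are illustrative names, not the repo's actual API.
from cachetools import LRUCache


class ClientReturningLRUCache(LRUCache):
    def __init__(self, maxsize, client_pool):
        super().__init__(maxsize=maxsize)
        self.client_pool = client_pool

    def popitem(self):
        # Called when the cache is full and the least recently used item must
        # be evicted to make room for another
        session_id, client = super().popitem()
        # Rather than discarding the client, return it to the object pool so it
        # can be reused for a different session ID later on
        self.client_pool.release(client)
        return session_id, client
```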
The object pool has an initial pool size that will be created at startup, and a maximum
size that the pool can grow to if needed, where both values are configurable. The
clients within the pool do not expire and have unlimited reuses, so clients created at
startup can be used for the lifespan of the API. Python ICAT's Client
class is
extended (to ICATClient
) to rename cleanup()
to a function name that the object pool
will recognise to clean up resources and will disable the auto logout feature to prevent
sessions from being logged out when the client is reused.
Combining caching and pooling into one design gives the following high-level results. There is a 1 client to 1 session ID ratio, which will prevent collisions between users and doesn't require an excessive amount of resources (as a 1 client to 1 request ratio would). Since the object pool is created at startup, this design can cause the API to be slow to start as the pool of objects needs to be created. A rough guide would be to multiply the configured initial pool size by around 5 or 6 seconds to get a time estimate for pool creation.
When configuring the cache size and the client pool, the following should be considered. The pool's max size should be configured to the maximum number of concurrent users expected for the API. The cache size must not exceed the pool's maximum size. If this does happen, the cache could attempt to acquire a client from an empty pool that cannot grow, causing the request to never respond because the API will wait indefinitely. The pool's initial size should be configured to strike a balance of reasonable startup time and not slowing down requests when the pool grows beyond its initial size. NOTE: when the pool exceeds the initial size and a client is requested by the cache, a client is created on the fly, so that request (and any others sent before the client is created and in the cache) WILL be slow. For development, the following settings (as also set in the example config) would allow for an acceptable startup time but allow for multiple session IDs to be used if required.
"client_cache_size": 5,
"client_pool_init_size": 2,
"client_pool_max_size": 5,
The ICATQuery class is in datagateway_api.src.datagateway_api.icat.query
. This class
stores a query created with Python ICAT
(documentation). The
execute_query()
function executes the query and returns the results either in a JSON format or as a
list of Python ICAT entities (this
is defined using the return_json_formattable
flag). Other functions within that class
are used within execute_query()
.
While the search API shares some code from DataGateway API, there are also various differences in the functionality it offers and the way it goes about offering it.
Unlike DataGateway API, the search API does not contain any authentication or endpoints for session handling. This is because the search API only interacts with public data, so it can be assumed the anon user will be used. To deal with this, only a single client object is used for the API's lifecycle, a contrasting solution to DataGateway API. This object is logged in upon the first request of the API's lifecycle. For each new request, session expiry is checked; if the session has expired, the client will be logged in again so the same object can be used. Using the same client object between users and requests works because only one user (i.e. the anon user) is being used to query ICAT.
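The following is a minimal sketch of that single-client approach using python-icat; the
ICAT URL, the `checkCert` value and the exact expiry check are assumptions, and this is a
simplification of what the search API actually does.

```python
# Minimal sketch, not the search API's actual implementation: a single shared
# client logged in as anon, re-logged-in if the session has expired.
from icat.client import Client
from icat.exception import ICATSessionError

# URL and checkCert value are placeholders for your ICAT instance
client = Client("https://localhost:8181", checkCert=False)
client.login("anon", {})  # requires the anon authenticator to be installed


def ensure_valid_session(client):
    # Called per request; sharing one client is safe because only the anon
    # user is ever used to query (public) data
    try:
        client.getRemainingMinutes()
    except ICATSessionError:
        client.login("anon", {})
```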
The search API deals with user inputs (via query parameters) and outputs data in the format defined by the PaNOSC data model. To interface with ICAT, there needs to be a way of translating between this data model and the ICAT schema.
To map between each data model, there is a JSON file (search_api_mapping.json
) which
defines the mappings for each PaNOSC entity (and all the attributes within them). This
is configurable so these mappings can be changed as needed - each facility uses ICAT in
slightly different ways; the example file shows the mappings used for ISIS which should
give a good place to start.
Within the mapping file, each of the JSON objects represents a PaNOSC entity. Inside
each object, there is a base_icat_entity
which defines which ICAT entity the PaNOSC
entity links to. There are also key-value pairs of all of the fields which exist for the
PaNOSC entity, where the value is the ICAT field name. For fields which are related
entities, the value contains a JSON object instead of a string. The contents of this
object are the PaNOSC entity name that the field name relates to and also the ICAT field
name translation. Looking at the example file alongside the ICAT schema is a good way to
understand how the mappings work.
The only exceptions that exist in the mapping file are for unique mapping cases. When
mapping PaNOSC `pid` fields to ICAT, a list of ICAT field names is needed. This is so
that if a persistent identifier does not exist, an alternative field name can be used as
an identifier. Some facilities don't use persistent identifiers for all of their
metadata, so this solution is needed to prevent things from breaking. A similar case
exists for the `base_icat_entity` of the `Parameter` entity, where a list of ICAT entity
names is also needed. This is because a `Parameter` can either link to a document or a
dataset.
In ICAT, there are specific entities that are used to store parameters for
investigations and datasets (e.g. InvestigationParameter
and DatasetParameter
).
Since ICAT parameter types have three different places where values can be stored
(numericValue
, stringValue
, dateTimeValue
), these need to be specified in a list
too. Order is important in this case, so it is recommended to keep them in the same
order as shown in the example file.
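To make the structure more concrete, here is an illustrative fragment, written as a
Python dict mirroring the JSON; the entity and ICAT field names are examples, not a copy
of the ISIS mapping file.

```python
# Illustrative only; mirrors the JSON structure described above, the ICAT field
# names are assumptions and will differ per facility.
ILLUSTRATIVE_MAPPINGS = {
    "Dataset": {
        "base_icat_entity": "Dataset",
        # pid maps to a list of ICAT field names so an alternative identifier
        # can be used when a persistent identifier doesn't exist
        "pid": ["doi", "id"],
        "title": "name",
        # Related entities are objects: {PaNOSC entity name: ICAT field name}
        "parameters": {"Parameter": "parameters"},
    },
    "Parameter": {
        # Parameter can link to either a document or a dataset, so a list of
        # ICAT entity names is given
        "base_icat_entity": ["InvestigationParameter", "DatasetParameter"],
        "name": "type.name",
        # ICAT parameter values can be stored in one of three fields; order is
        # important, so keep the same order as the example file
        "value": ["numericValue", "stringValue", "dateTimeValue"],
    },
}
```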
Most of the query filters that exist in DataGateway API are also present in the search API. However, inside the query parameters of an incoming request, they are formatted differently (see query filter syntax) so a search API specific factory class to deal with the query parameters was needed.
The search API requires conditions to be
joined together using OR
,
something which isn't seen in DataGateway API. This is mainly because this isn't
directly supported by Python ICAT; its query builder class only supports the joining of
conditions by the AND
keyword. To solve this, when the query filter factory detects an
explicit joining of conditions (via the use of AND
or OR
), a NestedWhereFilters
object is created to store the conditions from the request. This class has the concept
of a left hand side and right hand side and will join them together when the object is
converted to a string - an action performed when the JPQL query is being built.
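The idea can be sketched as follows; the real `NestedWhereFilters` class works with
filter objects and handles more cases, so this is only an illustration of the left/right
hand side joining.

```python
# Hedged sketch of the lhs/rhs joining idea; the real class accepts filter
# objects (not just strings) and is more involved.
class NestedWhereFilters:
    def __init__(self, lhs, rhs, joining_operator):
        # lhs/rhs can themselves be NestedWhereFilters, allowing nested AND/OR
        self.lhs = lhs
        self.rhs = rhs
        self.joining_operator = joining_operator

    def __str__(self):
        # Joining happens when the JPQL query is built, i.e. on str conversion
        return f"({self.lhs} {self.joining_operator} {self.rhs})"


inner = NestedWhereFilters("ds.name = 'Dataset 1'", "ds.name = 'Dataset 2'", "OR")
outer = NestedWhereFilters(inner, "ds.complete = True", "AND")
print(outer)  # ((ds.name = 'Dataset 1' OR ds.name = 'Dataset 2') AND ds.complete = True)
```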
The class SearchAPIQuery
contains everything needed to build and handle a JPQL query
to be sent to an ICAT instance. ConditionSettingQuery
is a version of the Python ICAT
query class that allows the search API to set the conditions using a string, rather than
adding conditions via dictionaries. This is needed where queries are joined with AND
or OR
. This collates all the work from NestedWhereFilters
so all requires types of
conditions can be supported.
Search scoring allows for the results returned by the Search API to be scored in terms of
relevancy. The config option enabled
from the search_scoring
object in config.yaml
can be used to enable or disable the search scoring. When enabled, it handles the query
filter provided in the requests sent by the Federated Photon and Neutron Search Service;
otherwise, it returns an error to indicate that the query
filter is not supported.
For this functionality to work, an instance of the PaNOSC Search Scoring Service
is needed which has been configured and populated as per the instructions in its
repository and can return scores. The full URL to its /score
endpoint will need to be
provided to the config option api_url
from the search_scoring
object in config.yaml
so that the Search API knows where to send its results from ICAT along with the value from
the query
filter for scoring.
The European Photon and Neutron Open Data Search Portal requires all Search APIs that want to be integrated with the portal to support search scoring.
When the config option generate_swagger
is set to true in config.yaml
, a YAML
file defining the API using OpenAPI standards will be created at
src/swagger/openapi.yaml
. This option should be disabled in production to avoid any
issues with read-only directories.
apispec is used to help with this, with an
APISpec()
object created in src/main.py
which endpoint specifications are added to
(using APISpec.path()
) when the endpoints are created for Flask. These paths are
iterated over and ordered alphabetically, to ensure openapi.yaml
only changes if there
have been changes to the Swagger docs of the API; without that code, Git will detect
changes on that file every time startup occurs (preventing a clean development repo). The
contents of the APISpec
object are written to a YAML file and is used when the user
goes to the configured (root) page in their browser.
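The following is a hedged sketch of that flow with apispec; the title, paths and
operations are placeholders, and the `RestfulPlugin` wiring described below is omitted.

```python
# Hedged sketch of the apispec usage described above; paths and operations are
# placeholders and the plugin wiring is omitted.
import yaml
from apispec import APISpec

spec = APISpec(
    title="DataGateway API",
    version="1.0",
    openapi_version="3.0.3",
)

# In main.py, APISpec.path() is called as each Flask endpoint is registered
spec.path(path="/datafiles", operations={"get": {"summary": "Get datafiles"}})
spec.path(path="/investigations", operations={"get": {"summary": "Get investigations"}})

# Note: the repo additionally orders the paths alphabetically before writing,
# so that openapi.yaml only changes when the Swagger docs themselves change
with open("openapi.yaml", "w") as spec_file:
    yaml.dump(spec.to_dict(), spec_file)
```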
The endpoint related files in src/resources/
contain __doc__
which have the Swagger
docs for each type of endpoint. For non-entity and table endpoints, the Swagger docs are
contained in the docstrings. src/resources/swagger/
contains code to aid Swagger doc
generation, with a plugin (RestfulPlugin
) created for apispec
to extract Swagger
documentation from flask-restful
functions.
Within the repository, there are some useful files which can help with using the API.
There is a tool to generate mock data into ICAT's database. It is located in
util/icat_db_generator.py
. By default it will generate 20 years worth of data (approx
70,000 entities). The default arguments will match the data on SciGateway Preprod and
therefore this is usually a good starting point. The script makes use of random
and
Faker
and is seeded with a seed of 1. The seed and number of years of data generated
can be changed by using the arg flags -s
or --seed
for the seed, and -y
or
--years
for the number of years. For example:
python -m util.icat_db_generator -s 4 -y 10
Would set the seed to 4 and generate 10
years of data.
This uses code from the API's Database Backend, so a suitable db_url
should be
configured in config.yaml
.
When used on a machine that doesn't use UTC timezone, you may find there are a mix of
timezones when querying the API. This issue was found on SciGateway Preprod when using
BST and there would be a mix of +00:00 and +01:00 timezones
(more details with screenshots).
The current suggested workaround is to change your machine to use UTC. In the case of
SciGateway preprod, the JVM timezone was changed to UTC (in
/home/glassfish/[PAYARA_VERSION]/glassfish/domains/domain1/config/domain.xml
). This
was done to ensure the VM's system timezone wasn't changed back to BST by the automated
systems that maintain it.
With a handful of endpoints associated with each entity, there are hundreds of endpoints for this API. A Postman collection is stored in the root directory of this repository, containing over 300 requests, with each type of endpoint for every entity as well as the table and session endpoints. The exported collection is in v2.1 format and is currently the recommended export version for Postman.
This collection is mainly based around the Python ICAT Backend (request bodies for creating and updating data uses camelCase attribute names as accepted by that backend) but can easily be adapted for using the Database Backend if needed (changing attribute names to uppercase for example). The collection also contains a login request specially for the Database Backend, as logging in using that backend is slightly different to logging in via the Python ICAT Backend.
The repo's collection can be easily imported into your Postman installation by opening Postman and selecting File > Import... and choosing the Postman collection from your cloned DataGateway API repository.
This collection has not been updated for the search API endpoints, so can only be used to query DataGateway API.
This repository uses semantic versioning as the standard for version number
incrementing, with the version stored in pyproject.toml
. There is a GitHub Actions
workflow (release-build.yml
) which runs when main is updated (i.e. when a pull
request is merged). This uses
python-semantic-release to
determine whether a release needs to be made, and if so, whether a major, minor or patch
version bump should be made. This decision is made based on commit message content.
In a PR, at least one commit must follow the Angular commit message format and use one of the conventional commit types. Note, there are no scopes (part of the Angular message format) configured for this repo so there's no need to make use of this feature. Compliance to this format and use of standard types will be checked by semantic-pull-requests which is a GitHub app installed into this repo and runs alongside existing CI jobs for pull requests. For example, the following commit messages follow the conventional commit standard:
# Commit to edit a CI job
ci: edit linting job #issue-number
# Commit for a bug fix
fix: fix bug found with count endpoints #issue-number
# Commit for a new feature
feat: add endpoints for search API #issue-number
# Commit which introduces a breaking change for users
<commit-type>: change format of `config.yaml`, the previous version is no longer supported #issue-number
BREAKING CHANGE: this feature means X functionality has been removed
For each pull request, only one commit message in this format is required to satisfy the
semantic pull request checker. Requiring only one commit message in this format should
hopefully not impose this commit style on developers. However, it is encouraged to use it
where possible, as the types are also used to form CHANGELOG.md
.
New releases are only made when a fix:
(patch), feat:
(minor) or BREAKING CHANGE:
(major) commit type is found between the previous release and the most recent commit on
main. When the version is bumped, a GitHub tag and release is made which contains the
source code and the built versions of the API (sdist and wheel).
To check how the version number will be impacted before merging a pull request, use the following command to show the version which will be made when the GitHub Actions release build job runs (upon merging a branch/PR):
poetry run semantic-release print-version
Like the codebase, this README file follows a 88 character per line formatting approach.
This isn't always possible with URLs and codeblocks, but the vast majority of the file
should follow this approach. Most IDEs can be configured to include a guideline to show
where this point is. To do this in VS Code, insert the following line into
settings.json
:
"editor.rulers": [
88
]
Before a heading with a single hash, a four line gap should be given to easily indicate separation between two sections. Before every other heading (i.e. headings with two or more hashes), a two line gap should be given. This helps to denote a new heading rather than just a new paragraph. While sections can be easily distinguished in a colourful IDE, the multi-line spacing can be much easier to identify on an editor that doesn't use colours.