
Commit

Merge branch 'main' into DSEGOG-321-delete-records
MRichards99 committed Aug 13, 2024
2 parents 95321ba + 8ec8c57 commit ce27ee0
Showing 34 changed files with 358 additions and 207 deletions.
7 changes: 5 additions & 2 deletions .github/ci_config.yml
@@ -3,13 +3,16 @@ app:
   port: 8000
   # API will auto-reload when changes on code files are detected
   reload: true
+  url_prefix: ""
 images:
-  image_thumbnail_size: [50, 50]
-  waveform_thumbnail_size: [100, 100]
+  thumbnail_size: [50, 50]
   default_colour_map: viridis
   colourbar_height_pixels: 16
   upload_image_threads: 4
   preferred_colour_map_pref_name: PREFERRED_COLOUR_MAP
+waveforms:
+  thumbnail_size: [100, 100]
+  line_width: 0.3
 echo:
   url: http://127.0.0.1:9000
   username: operationsgateway
3 changes: 0 additions & 3 deletions .github/ci_ingest_echo_config.yml
@@ -7,9 +7,6 @@ script_options:
   # If you want the script to restart ingestion midway through, specify the last
   # successful file that ingested e.g. data/2023-06-04T1200.h5
   file_to_restart_ingestion: ""
-ssh:
-  enabled: false
-  ssh_connection_url: 127.0.0.1
 database:
   connection_uri: mongodb://localhost:27017/opsgateway
   remote_experiments_file_path: /tmp/experiments_for_mongoimport.json
50 changes: 2 additions & 48 deletions README.md
@@ -6,6 +6,8 @@
# OperationsGateway API
This is an API built using [FastAPI](https://fastapi.tiangolo.com/) to work with [MongoDB](https://www.mongodb.com/) and the data stored as part of the OperationsGateway project.

Assuming default configuration, the API will be available at 127.0.0.1:8000. You can visit `/docs` in a browser, which gives an OpenAPI interface detailing each of the endpoints and an option to send requests to them. Alternatively, you can send requests to the API using a platform such as Postman to construct and save specific requests.
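
As a quick sanity check that the API is up, you can request the docs page from the command line. This is a minimal sketch assuming the default host and port, and FastAPI's standard `/openapi.json` schema route (which applies unless it has been overridden):

```bash
# Fetch the Swagger UI page served by FastAPI at /docs
curl -i http://127.0.0.1:8000/docs

# Fetch the machine-readable OpenAPI schema (FastAPI's default route)
curl http://127.0.0.1:8000/openapi.json
```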


## Environment Setup
If not already present, you may need to install development tools for the desired Python version using the appropriate package manager for your OS. For example, for Python 3.8 on Fedora or RHEL:
@@ -54,51 +56,3 @@ Press enter twice when prompted for the password so as not to set one.
The key should be OpenSSH encoded - this format is generated by default in Rocky 8. You can check whether your key is in the correct format by checking the start of the private key; it should be `-----BEGIN OPENSSH PRIVATE KEY-----`.

Then edit the ```private_key_path``` and ```public_key_path``` settings in the ```auth``` section of the ```config.yml``` file to reflect the location where these keys have been created.

### Adding User Accounts

The authentication system requires any users of the system to have an account set up in the database. Two types of user login are currently supported: federal ID logins for "real" users, and "local" logins for functional accounts.

To add some test accounts to the system, use the user data stored in `util/users_for_mongoimport.json`. Use the following command to import those users into the database:

```bash
mongoimport --db='opsgateway' --collection='users' --mode='upsert' --file='util/users_for_mongoimport.json'
```

Using the `upsert` mode allows you to update existing users with any changes that are made (e.g. adding an authorised route to their entry), while any new users are inserted as normal. The command's output states the number of documents that have been added and how many have been updated.
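
To verify the import, you can query the `users` collection directly. A sketch assuming a local, no-auth MongoDB and the `mongosh` shell:

```bash
# Count the documents in the users collection of the opsgateway database
mongosh opsgateway --quiet --eval 'db.users.countDocuments({})'

# Inspect a single imported user
mongosh opsgateway --quiet --eval 'db.users.findOne()'
```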

## Echo Object Storage
Waveforms and images are stored in S3 object storage (in the same bucket), currently the STFC Echo instance. Lots of documentation online references the AWS offering, but as S3 is the underlying storage technology, we can interact with Echo in the same way that a user would interact with AWS S3.

Configuration to connect with Echo is stored in the `echo` section of the config file - credentials are stored in Keeper. This section includes a bucket name, which is the location on S3 storage where images & waveforms will be stored. For the API, we have multiple buckets, used for different purposes. For example, there's a bucket for the dev server, a bucket per developer for their development environment, and short-lived buckets created for specific testing. This ensures that we're not overwriting each other's data and causing issues. For GitHub Actions, each run will create a new bucket, ingest data for testing, and delete the bucket at the end of the run.

To manage buckets, [s4cmd](https://github.com/bloomreach/s4cmd) is a good command line utility. It provides a Unix-like interface to S3 storage; it is based on `s3cmd` but has higher performance when interacting with large files. It is a development dependency for this repository but can also be installed using `pip`. There's an example configuration file in `.github/ci_s3cfg` which can be placed in `~/.s3cfg` and used for your own development environment.

Here are a few useful example commands (the [s4cmd README](https://github.com/bloomreach/s4cmd/blob/master/README.md) provides useful information about all available commands):
```bash
# To make calling `s4cmd` easier when installed as a development dependency, I've added the following alias to `~/.bashrc`
# Change the path to the Poetry virtualenv as needed
alias s4cmd='/root/.cache/pypoetry/virtualenvs/operationsgateway-api-pfN98gKB-py3.8/bin/s4cmd --endpoint-url https://s3.echo.stfc.ac.uk'

# The following commands assume the alias has been made
# Create a bucket called 'og-my-test-bucket' on STFC's Echo S3
s4cmd mb s3://og-my-test-bucket

# List everything that the current user can see
s4cmd ls

# List everything inside 'og-my-test-bucket'
s4cmd ls s3://og-my-test-bucket

# Remove all objects in bucket
s4cmd del --recursive s3://og-my-test-bucket
```

## API Startup
To start the API, use the following command:

```bash
poetry run python -m operationsgateway_api.src.main
```

Assuming default configuration, the API will be available at 127.0.0.1:8000. You can visit `/docs` in a browser, which gives an OpenAPI interface detailing each of the endpoints and an option to send requests to them. Alternatively, you can send requests to the API using a platform such as Postman to construct and save specific requests.
16 changes: 7 additions & 9 deletions docs/dev_server.md
@@ -24,23 +24,23 @@
To control the API on the dev server, a `systemd` service is added to the machine by Ansible. This allows you to do all the typical things you'd be able to do with a systemd service (e.g. start/stop/restart) and you can check the logs using `journalctl -u og-api`.
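
For example (assuming the service is named `og-api`, as above):

```bash
# Check whether the API service is running
systemctl status og-api

# Restart the API after a config or code change
sudo systemctl restart og-api

# Follow the API logs live
journalctl -u og-api -f
```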

## Apache/Certificates
-The API is exposed to the outside world using a reverse proxy; the API lives on port 8000 but port 443 is used to access it. If a user accesses the API using port 80, it'll forward on their request to port 443. This works fine for GET requests but unusual things can happen for other request types (e.g. POST), particularly requests which contain a request body (see https://stackoverflow.com/a/21859641 for further information). The reverse proxy places the API on `/api`. This reserves `/` for the frontend, when it is deployed.
+The API is exposed to the outside world using a reverse proxy; the API lives on port 8000 but port 443 is used to access it. If a user accesses the API using port 80, it'll forward on their request to port 443. This works fine for GET requests but unusual things can happen for other request types (e.g. POST), particularly requests which contain a request body (see https://stackoverflow.com/a/21859641 for further information). The reverse proxy places the API on `/api`. The frontend is deployed on `/`.

Certificates are requested through the DI Service Desk, where the normal process applies - generate a CSR, submit a ticket containing the CSR to request the certificate, and download the files once they've been generated. Alan's [certificate cheatsheet](https://github.com/ral-facilities/dseg-docs/blob/master/certs-cheat-sheet.md) is a great resource for easily generating a CSR if you're not familiar with that process.
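
If you just need a reminder of the commands, a typical CSR generation looks something like the following. This is a sketch only; the key size, file names and subject are placeholders, and the cheatsheet above remains the authoritative reference:

```bash
# Generate a new private key and a CSR for the server's hostname (placeholder values)
openssl req -new -newkey rsa:2048 -nodes \
  -keyout og-dev-server.key \
  -out og-dev-server.csr \
  -subj "/CN=og-dev-server.example.ac.uk"
```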

When downloading the certificates, I click the following links:
- `cert` - "as Certificate only, PEM encoded"
-- `ca` (you need to remove the first certificate (our cert) from this file, so 2 remain) - "as Certificate (w/ issuer after), PEM encoded:"
+- `ca` - "as Certificate (w/ issuer after), PEM encoded:" (you need to remove the first certificate (our cert) from this file, so 2 remain)

The certificate files are stored in `/etc/httpd/certs/` and symlinks point to them, which allows easy swapping of files. When changes are made, run `systemctl restart httpd` to ensure any file/config changes take effect.
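
A minimal sketch of swapping in a new certificate, using hypothetical file names (the symlink names your Apache config actually references may differ):

```bash
# Copy the newly issued files alongside the existing ones
cp og-api-2024.crt og-api-2024.ca-bundle /etc/httpd/certs/

# Point the symlinks used by the Apache config at the new files
ln -sfn /etc/httpd/certs/og-api-2024.crt /etc/httpd/certs/og-api.crt
ln -sfn /etc/httpd/certs/og-api-2024.ca-bundle /etc/httpd/certs/og-api.ca-bundle

# Restart Apache so the new certificate takes effect
systemctl restart httpd
```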

-To open ports, use `firewall-cmd`; this is a Rocky 8 VM so this is different to older Centos 7 RIG VMs where `iptables` was used. To view current rules, use `firewall-cmd --list-all`. Ports 80 & 443 are opened when deployed using Ansible.
+To open ports, use `firewall-cmd` (use `--permanent` to keep the rule persistent across reboots); this is a Rocky 8 VM so this is different to older Centos 7 RIG VMs where `iptables` was used. To view current rules, use `firewall-cmd --list-all`. Ports 80 & 443 are opened when deployed using Ansible.
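
Ansible handles this on deployment, but if you ever need to open the ports by hand it looks something like this (a sketch; the default zone is assumed):

```bash
# Open HTTP and HTTPS permanently, then reload so the rules take effect
sudo firewall-cmd --permanent --add-port=80/tcp
sudo firewall-cmd --permanent --add-port=443/tcp
sudo firewall-cmd --reload

# Confirm the ports are now listed
sudo firewall-cmd --list-all
```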

## Storage
-The API is hooked up to a MongoDB database provided by Database Services containing simulated data as well as using Echo S3. Credentials for these resources are stored in the shared Keeper folder and a specific bucket is used for the dev server (`s3://og-dev-server`).
+The API is hooked up to a MongoDB database provided by Database Services containing simulated data as well as using Echo S3. Credentials for these resources are stored in the shared Keeper folder and a specific bucket is used for the dev server (`s3://OG-DEV-SERVER`).

### Simulated Data
-The dev server contains 12 months worth of simulated data (October 2022-October 2023) which is reproducible using HDF files stored in the `s3://OG-YEAR-OF-SIMULATED-DATA` bucket in Echo. There are cron jobs which control data generated each day, a test to mimic incoming load from EPAC in production. More detail about the inner workings of this mechanism can be found in `docs/epac_simulated_data.md`.
+The dev server contains 12 months worth of simulated data (October 2022-October 2023) which is reproducible using HDF files stored in the `s3://OG-YEAR-OF-SIMULATED-DATA` bucket in Echo. There are cron jobs which control data generated each day, functioning as a test to mimic incoming load from EPAC in production. More detail about the inner workings of this mechanism can be found in `docs/epac_simulated_data.md`.

This ingestion was done using a separate cloud VM. It has its own instance of the API but is connected to the same database and Echo bucket as the dev server. To replicate this environment, run OperationsGateway Ansible using the `simulated-data-ingestion-vm` host. The `ingest_echo_data.py` script was run on that machine using the following command:
```bash
@@ -49,11 +49,9 @@
nohup $HOME/.local/bin/poetry run python -u util/realistic_data/ingest_echo_data.py >> /var/log/operationsgateway-api/echo_ingestion_script.log 2>&1 &
```


### Local Database for Gemini Data
Before using simulated EPAC data, we used a small amount of Gemini data, stored in a local database; it is equivalent to the databases used in our development environments - local DBs, no auth, named `opsgateway`. There may be cases in the future where we need to switch back to the Gemini data, as it may allow us to test something that isn't so easy to test with the simulated data. To do this, the following things will need to be done:
- Change the API config to point to the local database - both the URL and database name are different
-- Point to a different Echo bucket - images were stored on disk when the Gemini data was last used so a new bucket should be created. The images used to be stored in `/epac_storage` but have since been deleted.
+- Point to a different Echo bucket - images were stored on disk when the Gemini data was last used so a new bucket should be created (this might require running `ingest_hdf.py` to put the data onto Echo). The images used to be stored in `/epac_storage` but have since been deleted.
- Re-ingest the data using `ingest_hdf.py` (required as waveforms are now stored on Echo rather than in the database)
- Restart the API using `systemctl restart og-api`

-An upcoming piece of work (as of Feb 2024) is to move waveforms to be stored in Echo (instead of the database). When this happens (and is deployed to the dev server), reingestion of the Gemini data might be required. Follow the instructions in `docs/test_data.md` for more info on this.
33 changes: 33 additions & 0 deletions docs/echo_object_storage.md
@@ -0,0 +1,33 @@
# Echo Object Storage
Waveforms and images are stored in S3 object storage (in the same bucket), currently the STFC Echo instance. Lots of documentation online references the AWS offering, but as S3 is the underlying storage technology, we can interact with Echo in the same way that a user would interact with AWS S3.

Configuration to connect with Echo is stored in the `echo` section of the config file - credentials are stored in Keeper. This section includes a bucket name, which is the location on S3 storage where images & waveforms will be stored. For the API, we have multiple buckets, used for different purposes. For example, there's a bucket for the dev server, a bucket per developer for their development environment, and short-lived buckets created for specific testing. This ensures that we're not overwriting each other's data and causing issues. For GitHub Actions, each run will create a new bucket, ingest data for testing, and delete the bucket at the end of the run.

To manage buckets, [s4cmd](https://github.com/bloomreach/s4cmd) is a good command line utility. It provides a Unix-like interface to S3 storage; it is based on `s3cmd` but has higher performance when interacting with large files. It is a development dependency for this repository but can also be installed using `pip`. There's an example configuration file in `.github/ci_s3cfg` which can be placed in `~/.s3cfg` and used for your own development environment.

Here are a few useful example commands (the [s4cmd README](https://github.com/bloomreach/s4cmd/blob/master/README.md) provides useful information about all available commands):
```bash
# To make calling `s4cmd` easier when installed as a development dependency, I've added the following alias to `~/.bashrc`
# Change the path to the Poetry virtualenv as needed
alias s4cmd='/root/.cache/pypoetry/virtualenvs/operationsgateway-api-pfN98gKB-py3.8/bin/s4cmd --endpoint-url https://s3.echo.stfc.ac.uk'

# The following commands assume the alias has been made
# Create a bucket called 'og-my-test-bucket' on STFC's Echo S3
s4cmd mb s3://og-my-test-bucket

# List everything that the current user can see
s4cmd ls

# List everything inside 'og-my-test-bucket'
s4cmd ls s3://og-my-test-bucket

# Remove all objects in bucket
s4cmd del --recursive s3://og-my-test-bucket
```

## API Startup
To start the API, use the following command:

```bash
poetry run python -m operationsgateway_api.src.main
```
3 changes: 1 addition & 2 deletions docs/epac_simulated_data.md
@@ -1,6 +1,5 @@
# EPAC Simulated Data

-`util/realistic_data` is a directory that contains a number of scripts (and assisting Python code) to generate and ingest simulated data. This will allow us to have a more effective test & demo platform as the data should be closer to real data than the small amount of Gemini data we have access to. We hope to have one year of simulated data, and data generated each day to simulate an incoming load from the facility.
+`util/realistic_data` is a directory that contains a number of scripts (and assisting Python code) to generate and ingest simulated data. This will allow us to have a more effective test & demo platform as the data should be closer to real data than the small amount of Gemini data we have access to. We have one year of simulated data, and data is generated & ingested each day to simulate an incoming load from the facility.

The data is originally generated by a tool made by CLF, [EPAC-DataSim](https://github.com/CentralLaserFacility/EPAC-DataSim). The outputs from this tool are used in the scripts which form the following process:
