forked from guardian/grid
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing
3 changed files
with
264 additions
and
0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
Extending the Grid using providers | ||
================================== | ||
|
||
The Grid can be customised to suit your organisation. In particular, alternative "providers" for the image ingest | ||
pipelines and authentication can be loaded dynamically at start time. | ||
|
||
We aim to let you modify your installation of the Grid, beyond what configuration changes alone allow, without having | ||
to alter the source code of the Grid itself. | ||
|
||
The two areas most commonly customised are image ingest processors (how metadata is extracted | ||
from images and how it is modified or cleaned) and authentication/authorisation (how the Grid identifies a user | ||
and what actions that user is allowed to take). | ||
|
||
The general process for creating a custom provider is to: | ||
|
||
* write an implementation of a provider interface | ||
* configure the Grid to use your provider implementation | ||
|
||
When you implement a provider interface you should ensure that your provider is compiled against the same version of the | ||
Grid as you intend to run. We will avoid making breaking changes to these interfaces unless absolutely necessary but | ||
bear in mind that we might need to do so. | ||
|
||
All provider interfaces use a common configuration loading mechanism. This can load companion objects, classes with | ||
no-arg constructors and classes with constructors that take one or two standard provider parameters. | ||
|
||
A configuration for a provider can be in one of two formats depending on whether the configuration contains provider | ||
specific configuration. | ||
|
||
If a provider doesn't need any custom configuration then you can provide just the object or class name: | ||
|
||
```hocon | ||
authentication.providers.user = "com.example.auth.MyUserProvider" | ||
``` | ||
|
||
If a provider does need custom configuration then you specify an object with `className` and `config` fields: | ||
|
||
```hocon | ||
authentication.providers.user { | ||
className = "com.example.auth.MyConfigurableUserProvider" | ||
config { | ||
systemName = "my-system-name" | ||
allowList = "s3://my-bucket/my-allow-list.json" | ||
} | ||
} | ||
``` | ||
|
||
As mentioned earlier, there are standard provider parameter types for the provider class constructors: | ||
|
||
* `play.api.Configuration` - this argument will contain the configuration in the `config` field of your provider | ||
* the provider specific resources class - this is defined by your resource type and typically provides access to AWS | ||
credentials, an execution context and a web client for making external calls |
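The constructor shapes described above can be sketched as follows. This is a minimal illustration only: `SketchConfig` and `SketchResources` are hypothetical stand-ins for `play.api.Configuration` and the provider specific resources class, and the parameter order in the two-argument constructor is an assumption rather than something confirmed by the Grid's loader.

```scala
// Stand-ins for this sketch only. The Grid passes a play.api.Configuration and a
// provider specific resources class; neither is reproduced here.
final case class SketchConfig(values: Map[String, String]) {
  def get(key: String): Option[String] = values.get(key)
}
final case class SketchResources(serviceName: String)

// Shape 1: a singleton object, for providers that need no configuration.
object StaticProvider {
  val description: String = "static provider"
}

// Shape 2: a class with a no-arg constructor.
class NoArgProvider {
  val description: String = "no-arg provider"
}

// Shape 3: a class taking only the contents of the `config` field.
class ConfiguredProvider(config: SketchConfig) {
  val systemName: String = config.get("systemName").getOrElse("unknown")
}

// Shape 4: a class taking both standard parameters (the order here is an assumption).
class FullProvider(resources: SketchResources, config: SketchConfig) {
  val description: String =
    s"${config.get("systemName").getOrElse("unknown")} running in ${resources.serviceName}"
}

// Hand-built equivalents of what the loader would pass in.
val sketchConfig = SketchConfig(Map("systemName" -> "my-system-name"))
val fullProvider = new FullProvider(SketchResources("example-service"), sketchConfig)
```

A provider that only needs the short, class-name-only configuration form would typically use one of the first two shapes, since there is no `config` block to pass in.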
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
Image Processor provider pipelines | ||
================================== | ||
|
||
When an image is loaded into the Grid (either via the user interface or automatically via an external API call) the Grid | ||
extracts source metadata (XMP, IPTC and others) from the image. This data is stored in the database and also assigned to | ||
primary metadata fields (a reduced subset of what the Grid considers to be the key metadata fields - these are displayed | ||
to users and many are editable). | ||
|
||
The way that these primary metadata fields are populated is currently hard-coded, but once they have been populated and | ||
before they are stored in the database the Grid runs the `Image` through a pipeline of `ImageProcessors` which are able | ||
to further process the image metadata to classify images and improve the metadata. | ||
|
||
For example, at the Guardian, we examine the metadata in an attempt to automatically determine whether an agency | ||
provided the image (and if so which agency). Importantly we can use metadata to automatically determine what usage | ||
rights should be applied to a picture. These rights allow the Grid to understand the contractual obligations of an | ||
image: whether it is free to use, subject to a quota or pay per use. The rights also determine how it can be | ||
used (perhaps an image is restricted for news reporting only). | ||
|
||
We can also use a set of rules to correctly set the `credit` of the image based on the agency and photographer metadata | ||
so that it can be displayed correctly when it is used. Finally, we apply a series of rules pertaining to our in house | ||
style, such as changing the capitalisation of place names and normalising the way initials are displayed (the Guardian | ||
stipulates that they shouldn't have full stops). | ||
|
||
## What is an `ImageProcessor` pipeline? | ||
|
||
A pipeline consists of a sequence of `ImageProcessor`s applied to an image. An `ImageProcessor` is an implementation of | ||
a Scala trait which, most importantly, has a function of `Image => Image`. `Image` is the main representation of a | ||
picture in the Grid and an `ImageProcessor` allows you to modify any part of it, although it is strongly recommended | ||
that only the contents of the `metadata` and `usageRights` fields are actually modified. | ||
|
||
The `ImageProcessor`s are executed in the order they are listed in the configuration. The output of the first processor | ||
is used as input to the second processor and so on. Each `Image` is immutable so your function will return a modified | ||
copy which is passed as input to the next processor. | ||
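As a rough sketch of the behaviour described above, a pipeline can be modelled as a left fold over the configured processors. The `Image`, `ImageMetadata` and `ImageProcessor` definitions below are heavily simplified stand-ins for the Grid's real types, and `TrimCredit`, `CapitaliseCity` and `runPipeline` are hypothetical names used only for illustration.

```scala
// Simplified stand-ins: the Grid's real Image and ImageProcessor have many more members.
final case class ImageMetadata(credit: Option[String], city: Option[String])
final case class Image(id: String, metadata: ImageMetadata)

trait ImageProcessor {
  def apply(image: Image): Image
  def description: String
}

// Two hypothetical processors. Each returns a modified copy and leaves its input untouched.
object TrimCredit extends ImageProcessor {
  def apply(image: Image): Image =
    image.copy(metadata = image.metadata.copy(credit = image.metadata.credit.map(_.trim)))
  val description: String = "Trim whitespace from the credit"
}

object CapitaliseCity extends ImageProcessor {
  def apply(image: Image): Image =
    image.copy(metadata = image.metadata.copy(city = image.metadata.city.map(_.capitalize)))
  val description: String = "Capitalise the city name"
}

// Run the processors in configuration order: each output is the next processor's input.
def runPipeline(processors: Seq[ImageProcessor], image: Image): Image =
  processors.foldLeft(image)((current, processor) => processor(current))

val original = Image("abc", ImageMetadata(Some("  Jane Doe "), Some("london")))
val processed = runPipeline(Seq(TrimCredit, CapitaliseCity), original)
```

Because every step works on an immutable copy, `original` is unchanged after the pipeline has run.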
|
||
### Image processor `description` | ||
|
||
The `ImageProcessor` trait also has a `description` field. This is a String which should be used to describe what the | ||
image processor does. This should include any configuration the processor uses. For example, if a processor uses an | ||
external data source such as a file from an S3 bucket then it should say in the description where it comes from. | ||
|
||
The order and description of each image processor is logged during startup to provide a record of how the Grid is | ||
configured. This can be useful for confirming that your configuration is right and for debugging when things are not | ||
working as expected. | ||
|
||
## What other components are there? | ||
|
||
There are a few helper traits which can be useful for building more complex image processors. | ||
|
||
### Metadata cleaner | ||
|
||
If you only want to modify `metadata` then you can instead implement `MetadataCleaner` which has a function of | ||
`ImageMetadata => ImageMetadata`. This is a lightweight wrapper to avoid boilerplate. | ||
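The wrapper idea can be sketched as below. The types are simplified stand-ins, and the method name `clean` is an assumption made for this example; check the trait in the codebase for the actual member to implement.

```scala
// Simplified stand-ins for the Grid's types.
final case class ImageMetadata(byline: Option[String])
final case class Image(id: String, metadata: ImageMetadata)

trait ImageProcessor {
  def apply(image: Image): Image
  def description: String
}

// Sketch of the wrapper: implement a metadata-only function and get `apply` for free.
// The name `clean` is hypothetical.
trait MetadataCleaner extends ImageProcessor {
  def clean(metadata: ImageMetadata): ImageMetadata
  override def apply(image: Image): Image = image.copy(metadata = clean(image.metadata))
}

// Hypothetical cleaner in the spirit of the house style mentioned earlier.
// It is deliberately crude: it removes every full stop in the byline, not only those in initials.
object StripFullStops extends MetadataCleaner {
  def clean(metadata: ImageMetadata): ImageMetadata =
    metadata.copy(byline = metadata.byline.map(_.replace(".", "")))
  val description: String = "Remove full stops from bylines"
}

val cleaned = StripFullStops(Image("1", ImageMetadata(Some("J.K. Example"))))
```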
|
||
### Composing image processors | ||
|
||
If you'd like to compose your image processors in code, which gives better compile-time safety than configuring them | ||
all individually at runtime, then you might be interested in the `ComposedImageProcessor` trait, which | ||
includes a field allowing access to the underlying processors. | ||
|
||
There is also a convenience method `ImageProcessor.compose`, or you can extend `ComposeImageProcessors`, which can be | ||
useful if you want to create a companion object. There are examples of these approaches being used in the codebase. | ||
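To illustrate what composing in code might look like, here is a self-contained sketch. The `compose` helper below is hypothetical and written in the spirit of `ImageProcessor.compose`; its signature, and the member name `processors`, are assumptions and may not match the real API.

```scala
// Simplified stand-ins for the Grid's types.
final case class Image(id: String, tags: List[String])

trait ImageProcessor {
  def apply(image: Image): Image
  def description: String
}

// Sketch of a composed processor that exposes the processors it wraps.
trait ComposedImageProcessor extends ImageProcessor {
  def processors: Seq[ImageProcessor]
  def apply(image: Image): Image = processors.foldLeft(image)((img, p) => p(img))
}

// Hypothetical helper; the real `ImageProcessor.compose` may take different parameters.
def compose(name: String, parts: ImageProcessor*): ComposedImageProcessor =
  new ComposedImageProcessor {
    val processors: Seq[ImageProcessor] = parts
    val description: String = s"$name: ${parts.map(_.description).mkString(", ")}"
  }

// A trivial processor factory so the composition has something to combine.
def tagger(tag: String): ImageProcessor = new ImageProcessor {
  def apply(image: Image): Image = image.copy(tags = image.tags :+ tag)
  val description: String = s"add tag '$tag'"
}

val houseStyle = compose("House style", tagger("classified"), tagger("cleaned"))
val tagged = houseStyle(Image("1", Nil))
```

Building the description from the wrapped processors keeps the startup log informative even when only the composed processor is listed in configuration.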
|
||
## Recommendations | ||
|
||
We would strongly recommend that you put classification processors ahead of cleaning processors. This is because the way | ||
in which you classify images might be broken by later changes to your cleaning processors if the cleaning is done ahead | ||
of classification. If you classify first then this will not be impacted by cleaning processors run later in the | ||
pipeline. |
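The following sketch shows how the ordering problem can arise. The agency name, the credit strings and both processors are invented for illustration; processors are modelled as plain `Image => Image` functions over a minimal stand-in `Image`.

```scala
// Minimal stand-in for an image: only the two fields this example needs.
final case class Image(credit: Option[String], supplier: Option[String])

type Processor = Image => Image

// Hypothetical classifier: recognises an agency from the raw credit string.
val classifyAgency: Processor = image =>
  if (image.credit.contains("EXAMPLE AGENCY/Pool")) image.copy(supplier = Some("Example Agency"))
  else image

// Hypothetical cleaner: rewrites the raw credit into house style.
val cleanCredit: Processor = image =>
  image.copy(credit = image.credit.map(_.replace("EXAMPLE AGENCY/Pool", "Example Agency")))

def run(pipeline: Seq[Processor], image: Image): Image =
  pipeline.foldLeft(image)((img, p) => p(img))

val raw = Image(Some("EXAMPLE AGENCY/Pool"), None)

// Classification sees the raw credit and succeeds.
val classifyFirst = run(Seq(classifyAgency, cleanCredit), raw)

// The cleaner has already rewritten the credit, so the classifier no longer matches.
val cleanFirst = run(Seq(cleanCredit, classifyAgency), raw)
```

Both orderings produce the same cleaned credit, but only the classify-first pipeline sets the supplier.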
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,145 @@ | ||
Authentication and authorisation providers | ||
========================================== | ||
|
||
Authentication and authorisation allow the Grid to identify who is using it and what they are allowed to do. | ||
|
||
Overview | ||
-------- | ||
|
||
### Authentication | ||
|
||
We distinguish between two types of identity: | ||
|
||
* human users represented by a `UserPrincipal` (people directly using the Grid via the UI) | ||
* machine users represented by a `MachinePrincipal` (automated ingest, batch processing etc. done by another | ||
application) | ||
|
||
These two types of users are usually identified using different strategies. For example, at the Guardian we identify | ||
machine users by an API key in the header of an API request. Human users are typically identified by looking for a | ||
cookie that a user has in their browser. If they don't have the cookie, or the cookie is out of date, then we require | ||
them to authenticate in order to obtain a valid cookie before they can continue using the Grid. | ||
|
||
In either case, the Grid can make inter-microservice calls. In order to support this a mechanism is provided to call | ||
other services on behalf of a principal. | ||
|
||
### Authorisation | ||
|
||
_**Note:** Authorisation is currently a work in progress so this sketches out the current thinking only._ | ||
|
||
When the Grid receives certain API requests it decides whether the principal making the request has permission to do so. | ||
The data which is used to make this decision can come from the principal itself, from an external source of data or a | ||
combination of the two. | ||
|
||
This is essentially a function of `(Principal, Action) => Boolean`. `Action` can be a simple permission or it can have a | ||
parameter (for example image attributes such as `uploadedBy` or `organisation`), allowing images to be visible to only | ||
subsets of users. | ||
|
||
Any `Principal` (human or machine) has an `identity` (such as an email address) and an `attributes` field. The latter is | ||
a `TypedMap` which can be used to encapsulate any permission data obtained during the authentication process. This | ||
permission data can then be used in the function implemented. | ||
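Since authorisation is still a work in progress, the following is only a sketch of how such a function could look. The principals are simplified stand-ins (the `attributes` field is a plain `Map` rather than a `TypedMap`), and the actions and attribute keys (`canUpload`, `organisation`) are hypothetical.

```scala
// Simplified stand-ins: the Grid's principals carry a TypedMap, modelled here as a plain Map.
sealed trait Principal {
  def identity: String
  def attributes: Map[String, String]
}
final case class UserPrincipal(identity: String, attributes: Map[String, String]) extends Principal
final case class MachinePrincipal(identity: String, attributes: Map[String, String]) extends Principal

// Hypothetical actions: one simple permission and one parameterised by an image attribute.
sealed trait Action
case object UploadImages extends Action
final case class ViewImage(organisation: String) extends Action

// The decision function `(Principal, Action) => Boolean`, using only data held on the principal.
def isAuthorised(principal: Principal, action: Action): Boolean =
  action match {
    case UploadImages   => principal.attributes.get("canUpload").contains("true")
    case ViewImage(org) => principal.attributes.get("organisation").contains(org)
  }

val editor = UserPrincipal("editor@example.com", Map("canUpload" -> "true", "organisation" -> "news"))
val ingester = MachinePrincipal("ingester", Map.empty)
```

A real implementation might also consult an external data source, as noted above, rather than relying solely on the principal's attributes.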
|
||
## Implementation | ||
|
||
### Authentication | ||
|
||
There are separate providers for user and machine authentication which are configured using | ||
`authentication.providers.user` and `authentication.providers.machine` respectively. The provider configured at | ||
`authentication.providers.user` must implement `UserAuthenticationProvider` and that configured for | ||
`authentication.providers.machine` must implement `MachineAuthenticationProvider`. | ||
|
||
Both providers follow a similar shape, although the user authentication is more complicated due to the additional | ||
support for logging a user in if they are not currently authenticated. | ||
|
||
Both traits can be found in | ||
[AuthenticationProvider.scala](https://github.com/guardian/grid/blob/main/common-lib/src/main/scala/com/gu/mediaservice/lib/auth/provider/AuthenticationProvider.scala) | ||
which will have the most up-to-date documentation. You should read the following documentation as a companion to the | ||
Scaladoc. | ||
|
||
#### UserAuthenticationProvider | ||
|
||
There are a small number of anticipated user providers (in production we'd expect installations to use one of the last | ||
two options): | ||
|
||
* No-auth - we'll likely implement a no-op auth provider for the purpose of demonstrating the Grid via docker | ||
* Basic authentication - we might also implement a very simple basic auth provider for the purpose of evaluating the | ||
Grid | ||
* Federated auth - e.g. OIDC or SAML; this is similar to the original hardcoded authentication system in that a user is | ||
sent to a third party to authenticate and then returns with a token which can then be validated by the | ||
authentication provider | ||
* Proxy auth - in this case an HTTP proxy sits in front of the application, for example | ||
[oauth2-proxy](https://github.com/oauth2-proxy/oauth2-proxy), and the authentication provider parses a header forwarded by | ||
the proxy service | ||
|
||
##### Federated provider | ||
|
||
A federated authentication provider is likely to need to implement all provider methods. | ||
|
||
###### Example: PanDomainAuthenticationProvider | ||
|
||
The existing `PanDomainAuthenticationProvider` uses OIDC federated authentication with a cookie that sits on the "domain | ||
root" (note that each microservice currently sits on a separate subdomain, although it wouldn't take much effort to | ||
change this to have a single domain and route to individual microservices using different paths on that domain). | ||
Unfortunately the `PanDomainAuthenticationProvider` is tightly integrated into the Guardian's ecosystem so is unlikely | ||
to be useful as anything more than a starting point. | ||
|
||
If an unauthenticated user visits the Grid then they will be redirected to the OIDC service. They will return to a | ||
callback endpoint which validates the token from the OIDC service and sets a cryptographically signed cookie. Subsequent | ||
visits and API calls use the cookie to identify the user (until the cookie expires). | ||
|
||
###### Implementation | ||
|
||
In general a provider for a federated system will implement `authenticateRequest` to check for a value in | ||
the [Play session](https://www.playframework.com/documentation/2.8.4/ScalaSessionFlash#Storing-data-in-the-Session) | ||
<sup>1</sup> which avoids the need to deal with cookie signing concerns. This description assumes that this approach is | ||
being used. | ||
|
||
The `AuthenticationStatus` is used to signal to the Grid whether a user is authenticated (and if so, who they are) or | ||
not. A user can fail authentication for a number of reasons but in most cases the Grid will then send the user for | ||
authentication using the `sendForAuthentication` function. This will typically redirect a user to the federated | ||
authentication service with appropriate parameters (including the return URL). The user's browser will then take the | ||
user through authentication and eventually land back at the return URL on the Grid. That return URL will call the | ||
`sendForAuthenticationCallback` function which must validate the token returned by the federated authentication service | ||
prior to setting appropriate values in the Play session. | ||
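The flow above can be sketched in a framework-free way as follows. Everything here is a simplification: the Play session is modelled as a plain `Map`, the status cases and session keys are illustrative rather than the Grid's actual definitions, the identity service URL is a placeholder, and the real trait methods work with Play requests and results rather than these signatures.

```scala
import scala.util.Try

// Illustrative status type; the Grid's AuthenticationStatus may define different cases.
sealed trait AuthenticationStatus
final case class Authenticated(email: String) extends AuthenticationStatus
final case class Expired(email: String) extends AuthenticationStatus
case object NotAuthenticated extends AuthenticationStatus

// The Play session modelled as a plain Map; the key names are assumptions.
type Session = Map[String, String]

// Inspect the session to decide whether the user is currently authenticated.
def authenticateRequest(session: Session, nowEpochSeconds: Long): AuthenticationStatus =
  (session.get("email"), session.get("expires").flatMap(s => Try(s.toLong).toOption)) match {
    case (Some(email), Some(expires)) if expires > nowEpochSeconds => Authenticated(email)
    case (Some(email), Some(_))                                    => Expired(email)
    case _                                                         => NotAuthenticated
  }

// Build the redirect target for an unauthenticated user (placeholder URL and parameter name).
def sendForAuthentication(returnUrl: String): String =
  "https://idp.example.com/authorize?redirect_uri=" + java.net.URLEncoder.encode(returnUrl, "UTF-8")

// Handle the return from the identity service. Token validation is passed in as a function
// because it depends entirely on the federated system in use.
def sendForAuthenticationCallback(
    token: String,
    validate: String => Option[String],
    expiresEpochSeconds: Long
): Option[Session] =
  validate(token).map(email => Map("email" -> email, "expires" -> expiresEpochSeconds.toString))
```

Only a successfully validated token results in session values being set; an invalid token yields `None`, leaving the user unauthenticated.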
|
||
There are two other methods that must be implemented: the `flushToken` endpoint should remove the authentication data | ||
from the Play session and `onBehalfOf` must pass the whole cookie (with the name from the Play config key | ||
`session.cookieName`) on to the downstream requests. To achieve this you will likely want to push the cookie value into | ||
the `attributes` map and then pull it out in much the same way as is implemented for the | ||
`PanDomainAuthenticationProvider` described above. | ||
|
||
<sup>1</sup> notes: for this to work you'll also need to ensure that `play.http.secret.key` is configured to be the same | ||
across all services and `session.domain` is set to a shared domain root; whilst the session is tamper-proof, be aware | ||
that data stored in the session is visible to the user. | ||
|
||
##### Proxy authentication | ||
|
||
If the Grid is behind a proxy that is handling authentication then it is likely that the provider only needs to | ||
implement `authenticateRequest` and `onBehalfOf`. The former will extract and validate (if necessary) the HTTP header | ||
containing the authentication token. This header will need to be stored in the `attributes` field of the user. The | ||
latter method will simply need to add the header to outgoing requests. The remaining methods can simply be implemented | ||
with `None`. | ||
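A minimal sketch of the two methods for a proxy setup is shown below. Requests are modelled as header maps, the principal is a simplified stand-in, and the header name is an assumption: use whichever header your proxy actually forwards. The real trait methods have different signatures.

```scala
// Simplified stand-in for the Grid's UserPrincipal.
final case class UserPrincipal(identity: String, attributes: Map[String, String])

// Assumed header name; substitute the header your proxy forwards.
val ForwardedUserHeader = "X-Forwarded-Email"

// Extract the forwarded header, apply a very light sanity check and keep the header
// value in `attributes` so it can be replayed on downstream calls.
// Note: a plain Map is case sensitive, unlike real HTTP header lookups.
def authenticateRequest(headers: Map[String, String]): Option[UserPrincipal] =
  headers.get(ForwardedUserHeader).filter(_.contains("@")).map { email =>
    UserPrincipal(email, Map(ForwardedUserHeader -> email))
  }

// Add the stored header to an outgoing request made on behalf of the principal.
def onBehalfOf(principal: UserPrincipal, outgoingHeaders: Map[String, String]): Map[String, String] =
  principal.attributes.get(ForwardedUserHeader) match {
    case Some(value) => outgoingHeaders + (ForwardedUserHeader -> value)
    case None        => outgoingHeaders
  }
```

This approach is only safe if the Grid cannot be reached except through the proxy, since the header is otherwise trivially forgeable.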
|
||
In the case of using proxy authentication, there is no need to run the `auth` microservice. | ||
|
||
**Warning:** One remaining issue is how the authentication proxy deals with users who are not logged in or whose | ||
authentication has expired. When using a federated authentication service, the Grid signals to the kahuna single page | ||
application that the user session has expired by returning a `419` status code for any API calls. Kahuna might need to | ||
be modified to recognise other status codes and headers as a signal to re-authenticate the user. | ||
|
||
#### MachineAuthenticationProvider | ||
|
||
There are also a small number of anticipated `MachineAuthenticationProviders`: | ||
|
||
* A no-op provider to allow easy use via the docker demo | ||
* An API key provider (the current default with keys in an S3 bucket) | ||
* Alternative API key providers (possibly backed by a database or using a signing mechanism rather than a plain text | ||
key) | ||
|
||
In each case there are only two methods that need to be implemented. The first is `authenticateRequest`, which should | ||
validate the appropriate HTTP header and create the `MachinePrincipal` as appropriate (storing the auth header in the | ||
`attributes` map for downstream requests). The second is `onBehalfOf`, which allows | ||
downstream calls by appending the auth header to requests. | ||
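As an illustration of an API key provider, the sketch below validates a key against an in-memory lookup. The header name, the key store and the simplified `MachinePrincipal` are all assumptions for this example; the default provider described above reads its keys from an S3 bucket, which is not modelled here.

```scala
// Simplified stand-in for the Grid's MachinePrincipal.
final case class MachinePrincipal(identity: String, attributes: Map[String, String])

// Hypothetical header name.
val ApiKeyHeader = "X-Example-Api-Key"

// Hypothetical key store mapping each key to the application that owns it.
val knownKeys: Map[String, String] = Map("secret-key-1" -> "batch-ingester")

// Validate the key and, if it is known, create a principal that remembers the key
// so it can be forwarded on downstream requests.
def authenticateRequest(headers: Map[String, String]): Option[MachinePrincipal] =
  for {
    key         <- headers.get(ApiKeyHeader)
    application <- knownKeys.get(key)
  } yield MachinePrincipal(application, Map(ApiKeyHeader -> key))

// Append the stored key to an outgoing request made on behalf of the principal.
def onBehalfOf(principal: MachinePrincipal, outgoingHeaders: Map[String, String]): Map[String, String] =
  principal.attributes
    .get(ApiKeyHeader)
    .fold(outgoingHeaders)(key => outgoingHeaders + (ApiKeyHeader -> key))
```

Unlike the proxy case, the provider itself decides whether the credential is valid, so an unknown key is rejected rather than trusted.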
|
||
## Implementing an authorisation provider | ||
|
||
The authorisation provider was not merged at the time of writing these docs so the documentation doesn't yet exist. | ||