Add documentation for providers
sihil committed Feb 3, 2021
1 parent bf30fe9 commit d8e4f1d
Showing 3 changed files with 264 additions and 0 deletions.
51 changes: 51 additions & 0 deletions docs/07-extending/01-provider-interfaces.md
@@ -0,0 +1,51 @@
Extending the Grid using providers
==================================

The Grid can be customised to suit your organisation. In particular, alternative "providers" for the image ingest
pipelines and authentication can be loaded dynamically at start time.

We aim to allow modifications to your installation of the Grid without having to alter the source code of the Grid
itself over and above what might be possible with configuration changes.

The two areas that are most commonly desired to be customised are image ingest processors (how metadata is extracted
from images and how metadata is modified or cleaned) and authentication/authorisation (how does the Grid identify a user
and what actions a user is allowed to take).

The general process for creating a custom provider is to:

* write an implementation of a provider interface
* configure the Grid to use your provider implementation

When you implement a provider interface you should ensure that your provider is compiled against the same version of the
Grid as you intend to run. We will avoid making breaking changes to these interfaces unless absolutely necessary but
bear in mind that we might need to do so.

All provider interfaces use a common configuration loading mechanism. This can load companion objects, classes with
no-arg constructors and classes with constructors that take one or two standard provider parameters.

A configuration for a provider can be in one of two formats depending on whether the configuration contains provider
specific configuration.

If a provider doesn't need any custom configuration then you can provide just the object or class name:

```hocon
authentication.providers.user = "com.example.auth.MyUserProvider"
```

If a provider does need custom configuration then you specify an object with `className` and `config` fields:

```hocon
authentication.providers.user {
className = "com.example.auth.MyConfigurableUserProvider"
config {
systemName = "my-system-name"
allowList = "s3://my-bucket/my-allow-list.json"
}
}
```

As mentioned earlier, there are standard provider parameter types for the provider class constructors:

* `play.api.Configuration` - this argument will contain the configuration in the `config` field of your provider
* the provider specific resources class - this is defined by your resource type and typically provides access to AWS
credentials, an execution context and a web client for making external calls
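
The constructor shapes described above can be sketched roughly as follows. This is an illustration only: `Configuration` and `ProviderResources` are simplified stand-ins for `play.api.Configuration` and the provider specific resources class, `GreetingProvider` and the class names are hypothetical, and the parameter order shown for the two-argument constructor is an assumption.

```scala
object ProviderShapesSketch {
  // Stand-in for play.api.Configuration (assumption: the real type is richer).
  final case class Configuration(values: Map[String, String]) {
    def get(key: String): Option[String] = values.get(key)
  }

  // Stand-in for a provider specific resources class (hypothetical field).
  final case class ProviderResources(serviceName: String)

  // Hypothetical provider interface, used only for this sketch.
  trait GreetingProvider {
    def greet(user: String): String
  }

  // 1. An object: there is nothing to construct.
  object StaticProvider extends GreetingProvider {
    def greet(user: String): String = s"Hello $user"
  }

  // 2. A class with a no-arg constructor.
  class NoArgProvider extends GreetingProvider {
    def greet(user: String): String = s"Hi $user"
  }

  // 3. A class that receives the contents of the provider's `config` field.
  class ConfiguredProvider(config: Configuration) extends GreetingProvider {
    private val systemName = config.get("systemName").getOrElse("unknown")
    def greet(user: String): String = s"Hello $user from $systemName"
  }

  // 4. A class that receives both standard parameters (order is an assumption).
  class ResourcefulProvider(config: Configuration, resources: ProviderResources)
      extends GreetingProvider {
    private val systemName = config.get("systemName").getOrElse("unknown")
    def greet(user: String): String =
      s"Hello $user from $systemName via ${resources.serviceName}"
  }
}
```

The first two shapes pair with the plain class name form of configuration; the last two are the ones that would read values such as `systemName` from the `config` object.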
68 changes: 68 additions & 0 deletions docs/07-extending/02-image-processors.md
@@ -0,0 +1,68 @@
Image Processor provider pipelines
==================================

When an image is loaded into the Grid (either via the user interface or automatically via an external API call) the Grid
extracts source metadata (XMP, IPTC and others) from the image. This data is stored in the database and also assigned to
primary metadata fields (a reduced subset of what the Grid considers to be the key metadata fields - these are displayed
to users and many are editable).

The way that these primary metadata fields are populated is currently in code, but once they have been populated and
before they are stored in the database the Grid runs the `Image` through a pipeline of `ImageProcessors` which are able
to further process the image metadata to classify images and improve the metadata.

For example, at the Guardian, we examine the metadata in an attempt to automatically determine whether an agency
provided the image (and if so which agency). Importantly we can use metadata to automatically determine what usage
rights should be applied to a picture. These rights allow the Grid to understand the contractual obligations of an
image, whether it is free to use, usage is under a quota system or pay per use. The rights also determine how it can be
used (perhaps an image is restricted for news reporting only).

We can also use a set of rules to correctly set the `credit` of the image based on the agency and photographer metadata
so that it can be displayed correctly when it is used. Finally, we apply a series of rules pertaining to our in house
style, such as changing the capitalisation of place names and normalising the way initials are displayed (the Guardian
stipulates that they shouldn't have full stops).

## What is an `ImageProcessor` pipeline?

A pipeline consists of a sequence of `ImageProcessor`s applied to an image. An `ImageProcessor` is an implementation of
a Scala trait which, most importantly, has a function of `Image => Image`. `Image` is the main representation of a
picture in the Grid and an `ImageProcessor` allows you to modify any part of it, although it is strongly recommended
that only the contents of the `metadata` and `usageRights` fields are actually modified.

The `ImageProcessor`s are executed in the order they are listed in the configuration. The output of the first processor
is used as input to the second processor and so on. Each `Image` is immutable so your function will return a modified
copy which is passed as input to the next processor.
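
As a minimal sketch of this chaining, assuming heavily simplified stand-ins for `Image`, `ImageMetadata` and the `ImageProcessor` trait (the real types have many more fields, and `TrimCredit`, `DefaultCredit` and `runPipeline` are hypothetical names), a pipeline behaves like a left fold over the configured processors:

```scala
object PipelineSketch {
  // Simplified stand-ins; the Grid's real types are much richer.
  final case class ImageMetadata(credit: Option[String])
  final case class Image(id: String, metadata: ImageMetadata)

  trait ImageProcessor {
    def apply(image: Image): Image
    def description: String
  }

  // Removes leading and trailing whitespace from the credit.
  object TrimCredit extends ImageProcessor {
    def apply(image: Image): Image =
      image.copy(metadata = image.metadata.copy(credit = image.metadata.credit.map(_.trim)))
    val description: String = "Trim whitespace from the credit"
  }

  // Replaces a missing or empty credit with a placeholder.
  object DefaultCredit extends ImageProcessor {
    def apply(image: Image): Image = {
      val credit = image.metadata.credit.filter(_.nonEmpty).orElse(Some("Unknown"))
      image.copy(metadata = image.metadata.copy(credit = credit))
    }
    val description: String = "Default an empty credit to 'Unknown'"
  }

  // Run the processors in order, feeding each output into the next processor.
  def runPipeline(processors: Seq[ImageProcessor], image: Image): Image =
    processors.foldLeft(image)((current, processor) => processor(current))
}
```

Because every step returns a new copy, the original `Image` value is never changed, and swapping the order of the two processors can change the final result.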

### Image processor `description`

The `ImageProcessor` trait also has a `description` field. This is a String which should be used to describe what the
image processor does. This should include any configuration that the processor uses. For example, if a processor uses
an external data source such as a file from an S3 bucket then the description should say where that data comes from.

The order and description of each image processor is logged during startup to provide a record of how the Grid is
configured. This can be useful for confirming that your configuration is right and for debugging when things are not
working as expected.

## What other components are there?

There are a few helper traits which can be useful for building more complex image processors.

### Metadata cleaner

If you only want to modify `metadata` then you can instead implement `MetadataCleaner` which has a function of
`ImageMetadata => ImageMetadata`. This is a lightweight wrapper to avoid boilerplate.
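
The idea of such a wrapper can be sketched as below. The types are simplified stand-ins, and the method name `clean` and the `CapitaliseCity` example are assumptions made for illustration; check the real trait for its actual signature.

```scala
object MetadataCleanerSketch {
  // Simplified stand-ins for the Grid's types.
  final case class ImageMetadata(city: Option[String])
  final case class Image(id: String, metadata: ImageMetadata)

  trait ImageProcessor {
    def apply(image: Image): Image
    def description: String
  }

  // Lifts a metadata-only function into a full image processor.
  trait MetadataCleaner extends ImageProcessor {
    def clean(metadata: ImageMetadata): ImageMetadata
    override def apply(image: Image): Image =
      image.copy(metadata = clean(image.metadata))
  }

  // Only the metadata transformation needs to be written.
  object CapitaliseCity extends MetadataCleaner {
    def clean(metadata: ImageMetadata): ImageMetadata =
      metadata.copy(city = metadata.city.map(_.toLowerCase.capitalize))
    val description: String = "Capitalise the first letter of the city"
  }
}
```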

### Composing image processors

If you'd like to compose your image processors in code rather than configuring them all individually at runtime
(composing in code gives better compile-time safety) then you might be interested in the `ComposedImageProcessor` trait,
which includes a field allowing access to the underlying processors.

There is also a convenience method `ImageProcessor.compose`, or you can extend `ComposeImageProcessors`, which can be
useful if you want to create a companion object. There are examples of these approaches being used in the codebase.
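
A rough sketch of composition in code follows. The traits are simplified stand-ins, and the `compose` and `processor` helpers are hypothetical, written in the spirit of `ImageProcessor.compose` rather than copied from it.

```scala
object ComposeSketch {
  // Simplified stand-in for the Grid's Image.
  final case class Image(id: String, credit: String)

  trait ImageProcessor {
    def apply(image: Image): Image
    def description: String
  }

  // A processor that wraps others and exposes them for inspection.
  trait ComposedImageProcessor extends ImageProcessor {
    def processors: Seq[ImageProcessor]
  }

  // Hypothetical helper: build a processor from a description and a function.
  def processor(desc: String)(f: Image => Image): ImageProcessor =
    new ImageProcessor {
      def apply(image: Image): Image = f(image)
      val description: String = desc
    }

  // Hypothetical helper: combine processors, applying them left to right.
  def compose(name: String, parts: ImageProcessor*): ComposedImageProcessor =
    new ComposedImageProcessor {
      val processors: Seq[ImageProcessor] = parts
      def apply(image: Image): Image =
        processors.foldLeft(image)((current, p) => p(current))
      val description: String =
        processors.map(_.description).mkString(s"$name(", "; ", ")")
    }
}
```

Building the combined `description` from the parts keeps the startup log informative even when only the composed processor appears in the configuration.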

## Recommendations

We would strongly recommend that you put classification processors ahead of cleaning processors. This is because the way
in which you classify images might be broken by later changes to your cleaning processors if the cleaning is done ahead
of classification. If you classify first then this will not be impacted by cleaning processors run later in the
pipeline.
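
To illustrate why the order matters, here is a small sketch with two hypothetical processors: a classifier that matches the credit exactly as an imaginary agency supplies it, and a cleaner that applies a title-case house style. The `Image` type and all names here are assumptions for the sake of the example.

```scala
object OrderingSketch {
  // Simplified stand-in for the Grid's Image.
  final case class Image(credit: String, supplier: Option[String])
  type Processor = Image => Image

  // Classifier: relies on the credit exactly as the (hypothetical) agency sends it.
  val classifyAgency: Processor = image =>
    if (image.credit == "EXAMPLE AGENCY") image.copy(supplier = Some("Example Agency"))
    else image

  // Cleaner: applies a house style by title-casing each word of the credit.
  val cleanCredit: Processor = image =>
    image.copy(credit = image.credit.toLowerCase.split(" ").map(_.capitalize).mkString(" "))

  def run(pipeline: Seq[Processor], image: Image): Image =
    pipeline.foldLeft(image)((current, p) => p(current))
}
```

With the classifier first, the image is both classified and cleaned. With the cleaner first, the credit no longer matches what the classifier expects, so the supplier is silently left unset.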
145 changes: 145 additions & 0 deletions docs/07-extending/03-authentication.md
@@ -0,0 +1,145 @@
Authentication and authorisation providers
==========================================

Authentication and authorisation allow the Grid to identify who is using it and what they are allowed to do.

Overview
--------

### Authentication

We distinguish between two types of identity:

* human users represented by a `UserPrincipal` (people directly using the Grid via the UI)
* machine users represented by a `MachinePrincipal` (automated ingest, batch processing etc. done by another
application)

These two types of user are usually identified using different strategies. For example, at the Guardian we identify
machine users by an API key in the header of an API request. Human users are typically identified by looking for a
cookie that a user has in their browser. If they don't have the cookie, or the cookie is out of date, then we require
them to authenticate in order to obtain a valid cookie before they can continue using the Grid.

In either case, the Grid can make inter-microservice calls. In order to support this a mechanism is provided to call
other services on behalf of a principal.

### Authorisation

_**Note:** Authorisation is currently a work in progress so this sketches out the current thinking only._

When the Grid receives certain API requests it decides whether the principal making the request has permission to do so.
The data which is used to make this decision can come from the principal itself, from an external source of data or a
combination of the two.

This is essentially a function of `(Principal, Action) => Boolean`. `Action` can be a simple permission or it can have a
parameter (for example an image attribute such as `uploadedBy` or `organisation`), allowing images to be visible to only
subsets of users.

Any `Principal` (human or machine) has an `identity` (such as an email address) and an `attributes` field. The latter is
a `TypedMap` which can be used to encapsulate any permission data obtained during the authentication process. This
permission data can then be used in the function implemented.
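
Since authorisation is still a work in progress, the following is only a sketch of the shape of such a function. A plain `Map[String, String]` stands in for the `TypedMap`, and the `Action` cases and attribute keys are hypothetical.

```scala
object AuthorisationSketch {
  sealed trait Principal {
    def identity: String
    def attributes: Map[String, String] // stand-in for the TypedMap
  }
  final case class UserPrincipal(identity: String, attributes: Map[String, String])
      extends Principal
  final case class MachinePrincipal(identity: String, attributes: Map[String, String])
      extends Principal

  sealed trait Action
  // A simple permission with no parameter (hypothetical).
  case object UploadImage extends Action
  // A permission parameterised by an image attribute (hypothetical).
  final case class ViewImage(organisation: String) extends Action

  // Decide using only data captured on the principal during authentication.
  def isAuthorised(principal: Principal, action: Action): Boolean = action match {
    case UploadImage    => principal.attributes.get("canUpload").contains("true")
    case ViewImage(org) => principal.attributes.get("organisation").contains(org)
  }
}
```

An implementation that consults an external data source would have the same signature but would look up the permission data elsewhere instead of, or as well as, reading `attributes`.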

## Implementation

### Authentication

There are separate providers for user and machine authentication which are configured using
`authentication.providers.user` and `authentication.providers.machine` respectively. The provider configured at
`authentication.providers.user` must implement `UserAuthenticationProvider` and that configured for
`authentication.providers.machine` must implement `MachineAuthenticationProvider`.

Both providers follow a similar shape, although the user authentication is more complicated due to the additional
support for logging a user in if they are not currently authenticated.

Both traits can be found in
[AuthenticationProvider.scala](https://github.com/guardian/grid/blob/main/common-lib/src/main/scala/com/gu/mediaservice/lib/auth/provider/AuthenticationProvider.scala)
which will have the most up-to-date documentation. You should read the following documentation as a companion to the
Scaladoc.

#### UserAuthenticationProvider

There are a small number of anticipated user providers (in production we'd expect installations to use one of the last
two options):

* No-auth - we'll likely implement a no-op auth provider for the purpose of demonstrating the Grid via docker
* Basic authentication - we might also implement a very simple basic auth provider for the purpose of evaluating the
Grid
* Federated auth - e.g. OIDC or SAML; this is similar to the original hardcoded authentication system in that a user is
sent to a third party to authenticate and then a token is returned by the user which can then be validated by the
authentication provider
* Proxy auth - in this case an HTTP proxy sits in front of the application, for example
[oauth2-proxy](https://github.com/oauth2-proxy/oauth2-proxy), and the authentication provider parses a header
forwarded by the proxy service

##### Federated provider

A federated authentication provider is likely to need to implement all provider methods.

###### Example: PanDomainAuthenticationProvider

The existing `PanDomainAuthenticationProvider` uses OIDC federated authentication with a cookie that sits on the "domain
root" (note that each microservice currently sits on a separate subdomain, although it wouldn't take much effort to
change this to have a single domain and route to individual microservices using different paths on that domain).
Unfortunately the `PanDomainAuthenticationProvider` is tightly integrated into the Guardian's ecosystem so is unlikely
to be useful as anything more than a starting point.

If an unauthenticated user visits the Grid then they will be redirected to the OIDC service. They will return to a
callback endpoint which validates the token from the OIDC service and sets a cryptographically signed cookie. Subsequent
visits and API calls use the cookie to identify the user (until the cookie expires).

###### Implementation

In general a provider for a federated system will implement `authenticateRequest` to check for a value in
the [Play session](https://www.playframework.com/documentation/2.8.4/ScalaSessionFlash#Storing-data-in-the-Session)
<sup>1</sup> which avoids the need to deal with cookie signing concerns. This description assumes that this approach is
being used.

The `AuthenticationStatus` is used to signal to the Grid whether a user is authenticated (and if so, who they are) or
not. A user can fail authentication for a number of reasons but in most cases the Grid will then send the user for
authentication using the `sendForAuthentication` function. This will typically redirect a user to the federated
authentication service with appropriate parameters (including the return URL). The user's browser will then take the
user through authentication and eventually land back at the return URL on the Grid. That return URL will call the
`sendForAuthenticationCallback` function which must validate the token returned by the federated authentication service
prior to setting appropriate values in the Play session.

There are two other methods that must be implemented: the `flushToken` endpoint should remove the authentication data
from the Play session and `onBehalfOf` must pass the whole cookie (with the name from the Play config key
`session.cookieName`) on to the downstream requests. To achieve this you will likely want to push the cookie value into
the `attributes` map and then pull it out in much the same way as is implemented for the
`PanDomainAuthenticationProvider` described above.

<sup>1</sup> notes: for this to work you'll also need to ensure that `play.http.secret.key` is configured to be the same
across all services and `session.domain` is set to a shared domain root; whilst the session is tamper-proof, be aware
that data stored in the session is visible to the user.

##### Proxy authentication

If the Grid is behind a proxy that is handling authentication then it is likely that the provider only needs to
implement `authenticateRequest` and `onBehalfOf`. The former will extract and validate (if necessary) the HTTP header
containing the authentication token. This header will need to be stored in the `attributes` field of the user. The
latter method will simply need to add the header to outgoing requests. The remaining methods can simply be implemented
with `None`.
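
The two methods might look roughly like this. It is a sketch under several assumptions: `Request`, `UserPrincipal` and `AuthenticationStatus` are simplified stand-ins rather than the Grid's or Play's real types, the header name is just an example of what a proxy might forward, and header lookup is treated as case-sensitive for brevity.

```scala
object ProxyAuthSketch {
  // Simplified stand-ins for the real request and principal types.
  final case class Request(headers: Map[String, String])
  final case class UserPrincipal(identity: String, attributes: Map[String, String])

  sealed trait AuthenticationStatus
  final case class Authenticated(principal: UserPrincipal) extends AuthenticationStatus
  case object NotAuthenticated extends AuthenticationStatus

  // Example header name; use whatever your proxy actually forwards.
  val EmailHeader = "X-Forwarded-Email"

  // Extract the forwarded header, do a basic sanity check, and keep the raw
  // value in `attributes` so it can be replayed on downstream calls.
  def authenticateRequest(request: Request): AuthenticationStatus =
    request.headers.get(EmailHeader).filter(_.contains("@")) match {
      case Some(email) => Authenticated(UserPrincipal(email, Map(EmailHeader -> email)))
      case None        => NotAuthenticated
    }

  // Copy the stored header onto an outgoing inter-service request.
  def onBehalfOf(principal: UserPrincipal, outgoing: Request): Request =
    principal.attributes.get(EmailHeader) match {
      case Some(value) => outgoing.copy(headers = outgoing.headers + (EmailHeader -> value))
      case None        => outgoing
    }
}
```

This approach trusts the forwarded header, so it is only safe when the Grid cannot be reached other than through the proxy.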

In the case of using proxy authentication, there is no need to run the `auth` microservice.

**Warning:** One remaining issue is how the authentication proxy deals with users who are not logged in or whose
authentication has expired. When using a federated authentication service, the Grid signals to the kahuna single page
application that the user session has expired by returning a `419` status code for any API calls. Kahuna might need to
be modified to recognise other status codes and headers as a requirement for re-authenticating the user.

#### MachineAuthenticationProvider

There are also a small number of anticipated `MachineAuthenticationProvider`s:

* A no-op provider to allow easy use via the docker demo
* An API key provider (the current default with keys in an S3 bucket)
* Alternative API key providers (possibly backed by a database or using a signing mechanism rather than a plain text
key)

In each case there are only two methods that need to be implemented. The first is `authenticateRequest`, which should
validate the appropriate HTTP header and create the `MachinePrincipal` as appropriate (storing the auth header in the
`attributes` map for downstream requests). The second is `onBehalfOf`, which allows downstream calls by appending the
auth header to requests.
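
A minimal sketch of an API key provider along these lines is shown below. The types are simplified stand-ins, the header name is hypothetical, and the in-memory key map is an assumption standing in for wherever the keys are actually stored; returning an `Option` is also a simplification of the real return type.

```scala
object ApiKeyAuthSketch {
  // Simplified stand-ins for the real request and principal types.
  final case class Request(headers: Map[String, String])
  final case class MachinePrincipal(identity: String, attributes: Map[String, String])

  // Hypothetical header name.
  val ApiKeyHeader = "X-Example-Api-Key"

  // `keys` maps each API key to the name of the application that owns it.
  class ApiKeyProvider(keys: Map[String, String]) {
    // Validate the header and keep the key in `attributes` for downstream calls.
    def authenticateRequest(request: Request): Option[MachinePrincipal] =
      for {
        key  <- request.headers.get(ApiKeyHeader)
        name <- keys.get(key)
      } yield MachinePrincipal(name, Map(ApiKeyHeader -> key))

    // Append the stored key to an outgoing inter-service request.
    def onBehalfOf(principal: MachinePrincipal, outgoing: Request): Request =
      principal.attributes
        .get(ApiKeyHeader)
        .fold(outgoing)(key => outgoing.copy(headers = outgoing.headers + (ApiKeyHeader -> key)))
  }
}
```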

## Implementing an authorisation provider

The authorisation provider was not merged at the time of writing these docs so the documentation doesn't yet exist.
