Add documentation for providers

bbc · Feb 3, 2021 · d8e4f1d · d8e4f1d
1 parent bf30fe9
commit d8e4f1d

Showing 3 changed files with 264 additions and 0 deletions.
diff --git a/docs/07-extending/01-provider-interfaces.md b/docs/07-extending/01-provider-interfaces.md
@@ -0,0 +1,51 @@
+Extending the Grid using providers
+==================================
+
+The Grid can be customised to suit your organisation. In particular, alternative "providers" for the image ingest
+pipelines and authentication can be loaded dynamically at start time.
+
+We aim to let you customise your installation of the Grid beyond what configuration changes alone allow, without
+having to alter the source code of the Grid itself.
+
+The two areas that people most commonly want to customise are image ingest processors (how metadata is extracted from
+images and how it is modified or cleaned) and authentication/authorisation (how the Grid identifies a user and what
+actions that user is allowed to take).
+
+The general process for creating a custom provider is to:
+
+* write an implementation of a provider interface
+* configure the Grid to use your provider implementation
+
+When you implement a provider interface you should ensure that your provider is compiled against the same version of the
+Grid as you intend to run. We will avoid making breaking changes to these interfaces unless absolutely necessary but
+bear in mind that we might need to do so.
+
+All provider interfaces use a common configuration loading mechanism. This can load companion objects, classes with
+no-arg constructors and classes with constructors that take one or two standard provider parameters.
+
+A configuration for a provider can be in one of two formats, depending on whether the provider needs
+provider-specific configuration.
+
+If a provider doesn't need any custom configuration then you can provide just the object or class name:
+
+```hocon
+authentication.providers.user = "com.example.auth.MyUserProvider"
+```
+
+If a provider does need custom configuration then you specify an object with `className` and `config` fields:
+
+```hocon
+authentication.providers.user {
+  className = "com.example.auth.MyConfigurableUserProvider"
+  config {
+    systemName = "my-system-name"
+    allowList = "s3://my-bucket/my-allow-list.json"
+  }
+}
+```
+
+As mentioned earlier, there are standard provider parameter types for the provider class constructors:
+
+* `play.api.Configuration` - this argument will contain the configuration in the `config` field of your provider
+* the provider specific resources class - this is defined by your resource type and typically provides access to AWS
+  credentials, an execution context and a web client for making external calls
diff --git a/docs/07-extending/02-image-processors.md b/docs/07-extending/02-image-processors.md
@@ -0,0 +1,68 @@
+Image Processor provider pipelines
+==================================
+
+When an image is loaded into the Grid (either via the user interface or automatically via an external API call) the Grid
+extracts source metadata (XMP, IPTC and others) from the image. This data is stored in the database and also assigned to
+primary metadata fields (a reduced subset of what the Grid considers to be the key metadata fields - these are displayed
+to users and many are editable).
+
+The way that these primary metadata fields are populated is currently hard-coded, but once they have been populated and
+before they are stored in the database the Grid runs the `Image` through a pipeline of `ImageProcessors` which are able
+to further process the image metadata to classify images and improve the metadata.
+
+For example, at the Guardian, we examine the metadata in an attempt to automatically determine whether an agency
+provided the image (and if so which agency). Importantly, we can use metadata to automatically determine what usage
+rights should be applied to a picture. These rights tell the Grid about the contractual obligations attached to an
+image: whether it is free to use, subject to a quota or pay-per-use. The rights also determine how the image can be
+used (perhaps it is restricted to news reporting only).
+
+We can also use a set of rules to correctly set the `credit` of the image based on the agency and photographer metadata
+so that it can be displayed correctly when it is used. Finally, we apply a series of rules pertaining to our in-house
+style, such as changing the capitalisation of place names and normalising the way initials are displayed (the Guardian
+stipulates that they shouldn't have full stops).
+
+## What is an `ImageProcessor` pipeline?
+
+A pipeline consists of a sequence of `ImageProcessor`s applied to an image. An `ImageProcessor` is an implementation of
+a Scala trait which, most importantly, has a function of `Image => Image`. `Image` is the main representation of a
+picture in the Grid and an `ImageProcessor` allows you to modify any part of it, although it is strongly recommended
+that only the contents of the `metadata` and `usageRights` fields are actually modified.
+
+The `ImageProcessor`s are executed in the order they are listed in the configuration. The output of the first processor
+is used as input to the second processor and so on. Each `Image` is immutable so your function will return a modified
+copy which is passed as input to the next processor.
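+
+As a minimal sketch of this ordering, the following folds an image through a list of processors. The `Image`,
+`ImageMetadata` and `ImageProcessor` definitions here are simplified stand-ins written for this example (the Grid's
+real types carry many more fields), and `DefaultCredit`, `TrimCity` and `PipelineSketch` are hypothetical names:
+
+```scala
+// Simplified stand-ins for illustration only, not the Grid's real types.
+case class ImageMetadata(credit: Option[String], city: Option[String])
+case class Image(id: String, metadata: ImageMetadata)
+
+trait ImageProcessor {
+  def apply(image: Image): Image
+  def description: String
+}
+
+// Hypothetical processor: fills in a default credit when none is present.
+object DefaultCredit extends ImageProcessor {
+  def apply(image: Image): Image = {
+    val credit = image.metadata.credit.orElse(Some("Unknown"))
+    image.copy(metadata = image.metadata.copy(credit = credit))
+  }
+  val description: String = "Sets the credit to 'Unknown' when it is missing"
+}
+
+// Hypothetical processor: trims whitespace around the city name.
+object TrimCity extends ImageProcessor {
+  def apply(image: Image): Image =
+    image.copy(metadata = image.metadata.copy(city = image.metadata.city.map(_.trim)))
+  val description: String = "Trims whitespace from the city field"
+}
+
+object PipelineSketch {
+  // Processors run in list order; each one receives the previous processor's output.
+  def run(processors: Seq[ImageProcessor], image: Image): Image =
+    processors.foldLeft(image)((current, processor) => processor(current))
+}
+```
+
+Calling `PipelineSketch.run(Seq(DefaultCredit, TrimCity), image)` applies `DefaultCredit` first and hands its result to
+`TrimCity`. Because each step returns a copy, the original `image` value is left untouched.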
+
+### Image processor `description`
+
+The `ImageProcessor` trait also has a `description` field. This is a `String` that describes what the image processor
+does, including any configuration the processor uses. For example, if a processor uses an external data source such
+as a file from an S3 bucket then the description should say where that data comes from.
+
+The order and description of each image processor are logged during startup to provide a record of how the Grid is
+configured. This can be useful for confirming that your configuration is right and for debugging when things are not
+working as expected.
+
+## What other components are there?
+
+There are a few helper traits which can be useful for building more complex image processors.
+
+### Metadata cleaner
+
+If you only want to modify `metadata` then you can instead implement `MetadataCleaner` which has a function of
+`ImageMetadata => ImageMetadata`. This is a lightweight wrapper to avoid boilerplate.
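+
+The idea behind such a wrapper can be sketched as follows. This is not the Grid's actual `MetadataCleaner` source: the
+types are simplified stand-ins, the method name `clean` is an assumption, and `StripInitialFullStops` is a hypothetical
+cleaner modelled on the Guardian's rule that initials should not have full stops:
+
+```scala
+// Simplified stand-ins for illustration only, not the Grid's real types.
+case class ImageMetadata(byline: Option[String])
+case class Image(id: String, metadata: ImageMetadata)
+
+trait ImageProcessor {
+  def apply(image: Image): Image
+  def description: String
+}
+
+// Lifts a metadata-only function into a full Image => Image processor.
+// The method name `clean` is assumed for this sketch.
+trait MetadataCleaner extends ImageProcessor {
+  def clean(metadata: ImageMetadata): ImageMetadata
+  def apply(image: Image): Image = image.copy(metadata = clean(image.metadata))
+}
+
+// Hypothetical cleaner: removes the full stop after single capital initials.
+object StripInitialFullStops extends MetadataCleaner {
+  def clean(metadata: ImageMetadata): ImageMetadata =
+    metadata.copy(byline = metadata.byline.map(_.replaceAll("\\b([A-Z])\\.", "$1")))
+  val description: String = "Removes full stops from initials in the byline"
+}
+```
+
+The cleaner only has to say how the metadata changes; the wrapper takes care of copying the rest of the `Image`.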
+
+### Composing image processors
+
+If you'd like to compose your image processors in code (which gives better compile-time safety) rather than
+configuring them all individually at runtime, then you might be interested in the `ComposedImageProcessor` trait,
+which includes a field allowing access to the underlying processors.
+
+There is also a convenience method `ImageProcessor.compose`, or you can extend `ComposeImageProcessors`, which can be
+useful if you want to create a companion object. There are examples of these approaches being used in the codebase.
+
+## Recommendations
+
+We would strongly recommend that you put classification processors ahead of cleaning processors. This is because the way
+in which you classify images might be broken by later changes to your cleaning processors if the cleaning is done ahead
+of classification. If you classify first then this will not be impacted by cleaning processors run later in the
+pipeline.
diff --git a/docs/07-extending/03-authentication.md b/docs/07-extending/03-authentication.md
@@ -0,0 +1,145 @@
+Authentication and authorisation providers
+==========================================
+
+Authentication and authorisation allow the Grid to identify who is using it and what they are allowed to do.
+
+Overview
+--------
+
+### Authentication
+
+We distinguish between two types of identity:
+
+* human users represented by a `UserPrincipal` (people directly using the Grid via the UI)
+* machine users represented by a `MachinePrincipal` (automated ingest, batch processing etc. done by another
+  application)
+
+These two types of user are usually identified using different strategies. For example, at the Guardian we identify
+machine users by an API key in the header of an API request. Human users are typically identified by looking for a
+cookie that a user has in their browser. If they don't have the cookie, or the cookie is out of date, then we require
+them to authenticate in order to obtain a valid cookie before they can continue using the Grid.
+
+In either case, the Grid can make inter-microservice calls. In order to support this a mechanism is provided to call
+other services on behalf of a principal.
+
+### Authorisation
+
+_**Note:** Authorisation is currently a work in progress so this sketches out the current thinking only._
+
+When the Grid receives certain API requests it decides whether the principal making the request has permission to do so.
+The data which is used to make this decision can come from the principal itself, from an external source of data or a
+combination of the two.
+
+This is essentially a function of `(Principal, Action) => Boolean`. An `Action` can be a simple permission or it can
+carry a parameter (for example image attributes such as `uploadedBy` or `organisation`), allowing images to be visible
+only to subsets of users.
+
+Any `Principal` (human or machine) has an `identity` (such as an email address) and an `attributes` field. The latter is
+a `TypedMap` which can be used to encapsulate any permission data obtained during the authentication process. This
+permission data can then be used by the authorisation function you implement.
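+
+Since authorisation is still a work in progress, the following is only a sketch of what a function of this shape
+could look like. Everything in it is an assumption made for illustration: the types are simplified stand-ins, a plain
+`Map[String, String]` replaces the `TypedMap`, and the `UploadImage` and `ViewImage` actions, the `role` attribute and
+the rules themselves are hypothetical:
+
+```scala
+// Simplified stand-ins for illustration only, not the Grid's real types.
+sealed trait Principal {
+  def identity: String
+  def attributes: Map[String, String] // the Grid uses a TypedMap here
+}
+case class UserPrincipal(identity: String, attributes: Map[String, String]) extends Principal
+case class MachinePrincipal(identity: String, attributes: Map[String, String]) extends Principal
+
+// Hypothetical actions: one simple permission and one carrying a parameter.
+sealed trait Action
+case object UploadImage extends Action
+case class ViewImage(uploadedBy: String) extends Action
+
+object AuthorisationSketch {
+  // A function of (Principal, Action) => Boolean using invented rules.
+  def isAllowed(principal: Principal, action: Action): Boolean =
+    (principal, action) match {
+      // Machine users are trusted for every action in this sketch.
+      case (_: MachinePrincipal, _) => true
+      // Uploading needs a role captured during authentication.
+      case (user: UserPrincipal, UploadImage) =>
+        user.attributes.get("role").contains("uploader")
+      // Users can only see images that they uploaded themselves.
+      case (user: UserPrincipal, ViewImage(uploadedBy)) =>
+        user.identity == uploadedBy
+    }
+}
+```
+
+The decision for `UploadImage` comes from data attached to the principal, while the decision for `ViewImage` also
+depends on the parameter carried by the action.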
+
+## Implementation
+
+### Authentication
+
+There are separate providers for user and machine authentication which are configured using
+`authentication.providers.user` and `authentication.providers.machine` respectively. The provider configured at
+`authentication.providers.user` must implement `UserAuthenticationProvider` and that configured for
+`authentication.providers.machine` must implement `MachineAuthenticationProvider`.
+
+Both providers follow a similar shape, although the user authentication is more complicated due to the additional
+support for logging a user in if they are not currently authenticated.
+
+Both traits can be found in
+[AuthenticationProvider.scala](https://github.com/guardian/grid/blob/main/common-lib/src/main/scala/com/gu/mediaservice/lib/auth/provider/AuthenticationProvider.scala)
+which will have the most up-to-date documentation. You should read the following documentation as a companion to the
+Scaladoc.
+
+#### UserAuthenticationProvider
+
+There are a small number of anticipated user providers (in production we'd expect installations to use one of the last
+two options):
+
+* No-auth - we'll likely implement a no-op auth provider for the purpose of demonstrating the Grid via docker
+* Basic authentication - we might also implement a very simple basic auth provider for the purpose of evaluating the
+  Grid
+* Federated auth - e.g. OIDC or SAML; this is similar to the original hardcoded authentication system in that a user is
+  sent to a third party to authenticate and then a token is returned by the user which can then be validated by the
+  authentication provider
+* Proxy auth - in this case an HTTP proxy, for example [oauth2-proxy](https://github.com/oauth2-proxy/oauth2-proxy),
+  sits in front of the application and the authentication provider parses a header forwarded by the proxy service
+
+##### Federated provider
+
+A federated authentication provider is likely to need to implement all provider methods.
+
+###### Example: PanDomainAuthenticationProvider
+
+The existing `PanDomainAuthenticationProvider` uses OIDC federated authentication with a cookie that sits on the "domain
+root" (note that each microservice currently sits on a separate subdomain, although it wouldn't take much effort to
+change this to have a single domain and route to individual microservices using different paths on that domain).
+Unfortunately the `PanDomainAuthenticationProvider` is tightly integrated into the Guardian's ecosystem so is unlikely
+to be useful as anything more than a starting point.
+
+If an unauthenticated user visits the Grid then they will be redirected to the OIDC service. They will return to a
+callback endpoint which validates the token from the OIDC service and sets a cryptographically signed cookie. Subsequent
+visits and API calls use the cookie to identify the user (until the cookie expires).
+
+###### Implementation
+
+In general a provider for a federated system will implement `authenticateRequest` to check for a value in
+the [Play session](https://www.playframework.com/documentation/2.8.4/ScalaSessionFlash#Storing-data-in-the-Session)
+<sup>1</sup> which avoids the need to deal with cookie signing concerns. This description assumes that this approach is
+being used.
+
+The `AuthenticationStatus` is used to signal to the Grid whether a user is authenticated (and if so, who they are) or
+not. A user can fail authentication for a number of reasons but in most cases the Grid will then send the user for
+authentication using the `sendForAuthentication` function. This will typically redirect a user to the federated
+authentication service with appropriate parameters (including the return URL). The user's browser will then take the
+user through authentication and eventually land back at the return URL on the Grid. That return URL will call the
+`sendForAuthenticationCallback` function which must validate the token returned by the federated authentication service
+prior to setting appropriate values in the Play session.
+
+There are two other methods that must be implemented: the `flushToken` endpoint should remove the authentication data
+from the Play session and `onBehalfOf` must pass the whole cookie (with the name from the Play config key
+`session.cookieName`) on to the downstream requests. To achieve this you will likely want to push the cookie value into
+the `attributes` map and then pull it out in much the same way as is implemented for the
+`PanDomainAuthenticationProvider` described above.
+
+<sup>1</sup> notes: for this to work you'll also need to ensure that `play.http.secret.key` is configured to be the same
+across all services and `session.domain` is set to a shared domain root; whilst the session is tamper-proof, be aware
+that data stored in the session is visible to the user.
+
+##### Proxy authentication
+
+If the Grid is behind a proxy that is handling authentication then it is likely that the provider only needs to
+implement `authenticateRequest` and `onBehalfOf`. The former will extract and validate (if necessary) the HTTP header
+containing the authentication token. This header will need to be stored in the `attributes` field of the user. The
+latter method will simply need to add the header to outgoing requests. The remaining methods can simply be implemented
+with `None`.
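+
+The two methods can be sketched as below. This does not use the real `UserAuthenticationProvider` signatures: the
+request, principal and status types are simplified stand-ins (including the `Authenticated` and `NotAuthenticated`
+cases), a plain `Map` replaces the `attributes` `TypedMap`, and the `X-Forwarded-Email` header name is an assumption
+that should be replaced with whichever header your proxy forwards:
+
+```scala
+// Simplified stand-ins for illustration only, not Play's or the Grid's real types.
+case class Request(headers: Map[String, String])
+case class UserPrincipal(identity: String, attributes: Map[String, String])
+
+sealed trait AuthenticationStatus
+case class Authenticated(principal: UserPrincipal) extends AuthenticationStatus
+case object NotAuthenticated extends AuthenticationStatus
+
+object ProxyAuthSketch {
+  // Assumed header name; use the header that your proxy actually forwards.
+  val HeaderName = "X-Forwarded-Email"
+
+  // Extract the forwarded header and keep it in the principal's attributes
+  // so that it is available later for downstream calls.
+  def authenticateRequest(request: Request): AuthenticationStatus =
+    request.headers.get(HeaderName) match {
+      case Some(email) if email.nonEmpty =>
+        Authenticated(UserPrincipal(email, Map(HeaderName -> email)))
+      case _ => NotAuthenticated
+    }
+
+  // Add the stored header to an outgoing request made on behalf of the user.
+  def onBehalfOf(principal: UserPrincipal, outgoing: Request): Request =
+    principal.attributes.get(HeaderName) match {
+      case Some(value) => outgoing.copy(headers = outgoing.headers + (HeaderName -> value))
+      case None        => outgoing
+    }
+}
+```
+
+A real provider should also validate the header where necessary, for example by only trusting it when the request
+has come through the proxy.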
+
+In the case of using proxy authentication, there is no need to run the `auth` microservice.
+
+**Warning:** One remaining issue is how the authentication proxy deals with users who are not logged in or whose
+authentication has expired. When using a federated authentication service, the Grid signals to the Kahuna single-page
+application that the user session has expired by returning a `419` status code for any API calls. Kahuna might need to
+be modified to recognise other status codes and headers as a signal that the user must re-authenticate.
+
+#### MachineAuthenticationProvider
+
+There are also a small number of anticipated `MachineAuthenticationProviders`:
+
+* A no-op provider to allow easy use via the docker demo
+* An API key provider (the current default with keys in an S3 bucket)
+* Alternative API key providers (possibly backed by a database or using a signing mechanism rather than a plain text
+  key)
+
+In each case there are only two methods that need to be implemented. The first is `authenticateRequest`, which should
+validate the appropriate HTTP header and create the `MachinePrincipal` as appropriate (storing the auth header in the
+`attributes` map for downstream requests). The second is `onBehalfOf`, which allows downstream calls by appending the
+auth header to requests.
+
+## Implementing an authorisation provider
+
+The authorisation provider was not merged at the time of writing these docs so the documentation doesn't yet exist.