
Matt Paget edited this page Apr 28, 2023 · 39 revisions

ODC-EP 10 - Replace configuration layer

Overview

A ground-up, compatibility-breaking rewrite of the configuration layer.

Proposed By

Paul Haesler (@SpacemanPaul)

State

  • In draft
  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

The existing configuration layer is complex, inconsistent and poorly documented. Further details on the behaviour in 1.8.x can be found in Issue #1258. Any effort to fully and accurately document the existing behaviour would likely result in confusing and unreadable documentation.

1.9 (and 2.0) is a good opportunity to retire accumulated technical debt and replace the existing code with something more consistent and maintainable without being weighed down by backwards compatibility.

Configuration features/quirks in v1.8.x

  1. One or more ODC configuration files in INI File Format, implemented using the Python configparser library from the Python standard library.
  2. Configuration files can contain:
  • A special "user" section specifying a default environment.
  • A named section per environment, where each environment can specify (1) which index driver to use and (2) any required connection information required for the database backend.
  3. Ability to merge environments from multiple configuration files. (Inconsistently exposed: available through the CLI but not directly through the Datacube() constructor.)
  4. Default config search path and environments are defined if the user supplies neither. The exact fall-back rules are convoluted.
  5. Config can be injected directly with environment variables. This behaviour is poorly documented and interacts inconsistently and/or unexpectedly with 3 and 4 above, and some configuration items cannot be set with environment variables (in particular, selecting an index driver other than the default).
  6. The $DATACUBE_CONFIG_PATH environment variable allows setting a single file location, which sits at a fixed place in the search path.
  7. The configuration layer is only used for configuring the index backend. Other ODC configuration (e.g. AWS/S3/rasterio configuration) is handled separately.

Design Concerns

Single file vs multiple/merged config files

A multi-file implementation provides some desirable features for large centrally managed installations, e.g. the NCI and (to a lesser extent) the DEA Sandbox. However, it can lead to confusion about where the current configuration is actually coming from, and makes the interaction between configuration from files and from environment variables complex and confusing.

Given that the confusing and complex nature of the current implementation is a driving force behind this EP, a single file solution is preferred. Large centrally managed installations should advise users to make a copy of the default configuration file and modify it, rather than creating a new configuration file that is read in conjunction with the default file.

Config file format

The current Windows INI style config format only supports a single layer of hierarchy, which places limits on what other (i.e. non-index-layer-specific) configuration can be added to the configuration layer.

Given the existing heavy use of YAML in ODC, a switch to a YAML-based configuration file format is worth considering.

Advantages of a switch to YAML include:

  • Config can be packaged in a string without \n newlines everywhere.
  • Arbitrary-depth nested hierarchies are supported.

Nested hierarchy is not needed for simply configuring index connections, which is all the config layer is currently used for. However, we currently have only one global config for cloud access (e.g. AWS/S3) settings. It is not unreasonable to want to store data requiring different AWS/S3 settings in the same index. STAC already supports this, and we will need to support it to enable tighter STAC/ODC integration, so allowing per-index-environment settings would be an improvement. STAC stores these per-"dataset" (equivalent to storing them with the data uri/location in ODC), but some sort of per-provider/bucket configuration option seems preferable. Either would be extremely unwieldy to implement in an INI-based deployment.
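
As a purely hypothetical illustration of the kind of nesting a YAML format would permit (the field names and structure below are invented for this sketch, not part of the proposal):

```yaml
prod:
  index_driver: postgis
  db_hostname: prod.dbs.example.net
  # Hypothetical nested per-provider cloud settings, impossible in flat INI:
  cloud:
    defaults:
      aws_region: ap-southeast-2
    providers:
      some-public-bucket:
        aws_unsigned: true
```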

This EP proposes supporting both INI and (non-nested) YAML in 1.9, with the INI format deprecated, then YAML-only config from 2.0.
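
A dual-format loader could be sketched as follows. This is illustrative only: the function name is invented, the real ODC loader may detect formats differently, and the YAML path assumes PyYAML is installed.

```python
import configparser


def load_config_text(text: str) -> dict:
    """Parse config text as INI, falling back to YAML on failure.

    Hypothetical sketch of the proposed 1.9 dual-format support;
    not the actual ODC implementation.
    """
    try:
        parser = configparser.ConfigParser()
        parser.read_string(text)
        return {section: dict(parser[section]) for section in parser.sections()}
    except configparser.Error:
        import yaml  # PyYAML; only required for the YAML path
        return yaml.safe_load(text)
```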

N.B. Config file examples below currently still use INI format.

Interaction Between Environment Variables and Config Files.

Configuration via environment variables is essential in e.g. cloud-deployed environments where leaking of credentials is a serious risk, and is therefore a required feature.

The existing interaction is quite complex and surprising. E.g. environment variables are not used at all if a config file is explicitly specified, but are merged on top of default config files.

It is important to consider that we now need to allow for multiple indexes to be in use at once.

Proposal

A. Contents of configuration

A config file consists of environments. An environment may be configured independently, or can be defined as an alias to another existing environment.

The "user" section no longer has a special meaning (as it is no longer relevant when config files are not merged.)

[default]
   alias: prod

[prod]
   db_hostname: prod.dbs.example.net
   db_database: odc_prod
   db_username: cube
   db_password: secret_squirrel
   
[dev]
   index_driver: postgis
   db_hostname: dev.dbs.example.net
   db_database: odc_dev
   db_username: cube
   db_port: 5432
   iam_authentication: y
   iam_timeout: 300

[temp]
   index_driver: memory

Restrictions on environment names:

  • all alphabetic characters must be all lower case
  • must not contain an underscore

Restrictions on configuration fields:

  • all alphabetic characters must be all lower case

(The reasons for these restrictions will be explained in section 4 below.)
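
The restrictions could be checked with something like the sketch below. The exact character classes are an assumption (the EP only constrains alphabetic case and underscores; digits are assumed permitted here).

```python
import re

# Assumed patterns, not taken verbatim from the EP:
# environment names: lower-case letters/digits, no underscores;
# field names: lower-case letters/digits, underscores allowed.
ENV_NAME = re.compile(r"^[a-z0-9]+$")
FIELD_NAME = re.compile(r"^[a-z0-9_]+$")


def valid_env_name(name: str) -> bool:
    return bool(ENV_NAME.match(name))


def valid_field_name(name: str) -> bool:
    return bool(FIELD_NAME.match(name))
```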

Configuring database details as a single database url instead of separate hostname, database, username and password.

Some index drivers (initially the postgres and postgis index drivers) will support supplying connection details as a single connection url. If a url is provided, it overrides any individual db_* fields provided for that environment. The format of the database url will depend on the index driver, but for both postgres and postgis drivers will be:

postgresql://[username]:[password]@[hostname]:[port]/[database]

Or for passwordless access to a database on localhost:

postgresql:///[database]

E.g.

[myenv]
    index_driver: postgis
    url: postgresql://user:insecure_password@hostname.domain:5432/mydb
    db_database: will_be_overridden
    db_password: this_is_not_used_either

is equivalent to

[myenv]
    index_driver: postgis
    db_hostname: hostname.domain
    db_database: mydb
    db_username: user
    db_port: 5432
    db_password: insecure_password

The url can also be supplied in a generic environment variable (See step 4 below).
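
The equivalence between the url and the individual db_* fields can be sketched with the standard library's urllib.parse (the helper name and returned field names are illustrative, following the examples above):

```python
from urllib.parse import urlparse


def url_to_fields(url: str) -> dict:
    """Split a postgresql:// connection url into the equivalent db_* fields.

    Hypothetical sketch; not the actual ODC implementation.
    """
    parts = urlparse(url)
    return {
        "db_hostname": parts.hostname or "",
        "db_port": parts.port,                    # None if omitted
        "db_database": parts.path.lstrip("/"),
        "db_username": parts.username or "",
        "db_password": parts.password or "",
    }
```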

Question: Should we deprecate (in 1.9, removing in 2.0) the db_* config entries in favour of the single url approach, or continue supporting both?

B. Config loading/reading process

1. Bypassing all configuration files (explicit config text)

Configuration file text may be supplied directly, without an actual on-disk config file. If configuration is supplied using these methods, no further config processing is performed, i.e. steps 2-4 below are skipped.

  • In Python: dc = Datacube(config_text="[default]\ndb_hostname....")
  • Via CLI: datacube --config "`config_file_generator --option blah`"
  • Via Environment variable: ODC_CONFIG="`config_file_generator --option blah`"

The CLI option or Datacube argument overrides the $ODC_CONFIG environment variable. If none of the above are provided, on-disk files and/or environment variables are read, as per the steps described below.
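
The precedence could be resolved as in the sketch below (the function name and argument names are invented for illustration; the real resolution lives inside the ODC config machinery):

```python
import os


def resolve_config_text(api_text=None, cli_text=None):
    """Return explicit config text if any was supplied, else None.

    None means: proceed to the file-finder step (Step 2).
    Precedence per the EP: Python argument / CLI option over $ODC_CONFIG.
    Hypothetical sketch only.
    """
    if api_text is not None:
        return api_text
    if cli_text is not None:
        return cli_text
    return os.environ.get("ODC_CONFIG")
```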

2. File Finder

If explicit config text was not provided, we need to find a config file in the file system.

This design is one-file-only, but could fairly readily be modified to merge multiple files. The main change would be to reverse the order of path loading: single-file means high-priority locations are read first, stopping after the first file is found; multi-file means low-priority locations are read first, reading through all locations. Either way, merging with other locations should never be performed if config was passed in explicitly in Step 1 above.

2a. Explicit file locations

Either as a single path:

  • In Python: dc = Datacube(config="/path/to/configfile")
  • Via CLI: datacube -C /path/to/configfile
  • Via Environment Variable: ODC_CONFIG_FILE=/path/to/configfile
  • Via Legacy Environment Variable: DATACUBE_CONFIG_PATH (with deprecation and behaviour change warning)

Or a priority list of paths:

  • In Python: dc = Datacube(config=['/path/to/override_config', '/path/to/default_config']) NEW
  • Via CLI: datacube -C /path/to/override_config -C /path/to/default_config
  • Via Environment Variable (like a UNIX PATH): ODC_CONFIG_PATH=/path/to/override_config:/path/to/default_config NEW

The possible locations are searched in the order provided and the first to exist in the file system is used. No merging is performed.

If config locations are provided and none of the files exist, an error is raised.

2b. Default file locations.

If no config file locations are provided, the following default priority path list is used. (The first in the list found is used, again no merging is performed.)

  • datacube.conf in the current working directory.
  • ~/.datacube.conf
  • /etc/default/datacube.conf NEW
  • /etc/datacube.conf

If no config file locations are provided, and none of the above exist, a minimal default config (datacube.config.DEFAULT_CONFIG) is used.
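
The file-finder behaviour described in 2a and 2b can be sketched as follows (function name is illustrative; `None` stands in for falling back to datacube.config.DEFAULT_CONFIG):

```python
import os

# Default priority path list from section 2b.
DEFAULT_SEARCH_PATH = [
    "datacube.conf",                          # current working directory
    os.path.expanduser("~/.datacube.conf"),
    "/etc/default/datacube.conf",
    "/etc/datacube.conf",
]


def find_config_file(paths=None):
    """Return the first existing path; no merging is performed.

    With explicit paths, finding none is an error; with the default
    search path, the caller falls back to the built-in minimal config
    (represented here by returning None). Hypothetical sketch only.
    """
    explicit = paths is not None
    for path in (paths if explicit else DEFAULT_SEARCH_PATH):
        if os.path.exists(path):
            return path
    if explicit:
        raise FileNotFoundError("No config file found at any supplied location")
    return None  # caller uses datacube.config.DEFAULT_CONFIG
```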

3. Choosing which environment to use.

3a. Explicitly provided environment

The user may explicitly specify an environment:

  • In Python: dc = Datacube(env="dev")
  • Via CLI: datacube -E dev
  • Via Environment Variable: ODC_ENVIRONMENT=dev
  • Via Legacy Environment Variable: DATACUBE_ENVIRONMENT (with deprecation warning)

Environment variables are only read if the environment is not explicitly passed in via Python or the CLI.

3b. Default behaviour when no environment is explicitly specified.

  1. The default environment is "default".
  2. If there is no environment (or environment alias) called "default", then the "datacube" environment is used if it exists (with a deprecation warning.)

If neither the default nor the datacube environment exists (and no environment is explicitly specified), an error is raised.
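
The selection rules in 3a and 3b amount to the following sketch (the function name and signature are invented for illustration):

```python
import warnings


def choose_environment(requested, available):
    """Pick the environment to use.

    `requested` is an explicitly specified environment name (or None);
    `available` is the set of environment names (and aliases) defined
    in the active config. Hypothetical sketch only.
    """
    if requested is not None:
        return requested
    if "default" in available:
        return "default"
    if "datacube" in available:
        warnings.warn("Falling back to legacy 'datacube' environment",
                      DeprecationWarning)
        return "datacube"
    raise KeyError("No environment specified and no 'default' or "
                   "'datacube' environment defined")
```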

I have removed the "default_environment" setting in the "user" section of the config file because it doesn't make sense in the absence of file merging (and removing it makes the contents of the config file simpler and more consistent), but this pathway for specifying an environment could be restored if we go back to a file-merging approach.

4. Config via Generic Config Environment Variables

Any configuration field not in the active config file can be supplied by (or any field in the active config file overridden by) a generic config environment variable named:

$ODC_[environment_name_or_alias]_[field_name]

Both names/aliases are converted to upper case for the environment variable name.
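
Constructing the variable name is then trivial, which also shows why environment names may not contain underscores: `_` is the separator between the environment and field parts of the name. (The helper name below is illustrative.)

```python
def config_env_var(env_name: str, field_name: str) -> str:
    """Build the generic config environment variable name for an
    environment/field pair. Sketch only; because '_' separates the
    two parts, environment names themselves must not contain it."""
    return f"ODC_{env_name.upper()}_{field_name.upper()}"
```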

E.g. Given the following contents of the active config file:

[default]
   alias: prod

[prod]
   db_hostname: prod.dbs.example.net
   db_database: odc_prod
   db_username: odc
   db_password: insecure_passwd1

[dev]
   db_hostname: dev.dbs.example.net
   db_database: odc_dev
   db_username: odc

[temp]
   index_driver: memory

AND the following environment variable values:

# This could be specified as ODC_DEFAULT_DB_PASSWORD or ODC_PROD_DB_PASSWORD
# If both are supplied the non-alias one (ODC_PROD_DB_PASSWORD) takes precedence.
ODC_DEFAULT_DB_PASSWORD=secret_and_secure
ODC_PROD_DB_HOSTNAME=production.dbs.internal

ODC_DEV_IAM_AUTHENTICATION=y
ODC_DEV_IAM_TIMEOUT=3600

ODC_DYNENV_DB_HOSTNAME=another.dbs.example.com
ODC_DYNENV_DB_USERNAME=odc
ODC_DYNENV_DB_PASSWORD=secure_and_secret
ODC_DYNENV_DB_DATABASE=other

Then the effective value of the configuration is:

[default]
   alias: prod

[prod]
   db_hostname: production.dbs.internal
   db_database: odc_prod
   db_username: odc
   db_password: secret_and_secure

[dev]
   db_hostname: dev.dbs.example.net
   db_database: odc_dev
   db_username: odc
   iam_authentication: y
   iam_timeout: 3600

[temp]
   index_driver: memory

[dynenv]
   db_hostname: another.dbs.example.com
   db_username: odc
   db_password: secure_and_secret
   db_database: other

Notes:

  • Operationally the config layer will only know about the dynenv environment if the user explicitly requests it.
  • Although new environments can be defined dynamically with environment variables, creating or overriding aliases with environment variables will be forbidden as it creates too many implementation-specific corner-cases in behaviour.
  • The legacy $DB_DATABASE, $DB_HOSTNAME, $DB_PASSWORD, etc. environment variables are explicit aliases for $ODC_DEFAULT_DB_DATABASE, $ODC_DEFAULT_DB_HOSTNAME, $ODC_DEFAULT_DB_PASSWORD, etc. (Note this is not exactly the same as the current implementation, but is similar enough that no deprecation warning is necessary.)
  • The database url (as discussed above) can be passed in by environment variable: ODC_MYENV_URL=postgresql://user:password@hostname.domain:5432/mydb. The legacy $DATACUBE_DB_URL environment variable is an explicit alias for $ODC_DEFAULT_URL.
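
The overlay behaviour in the worked example above can be sketched as follows. This is illustrative only: the function name is invented, and dynamically defined environments (like dynenv) and the legacy variable aliases are omitted for brevity.

```python
import os


def overlay_env_vars(config: dict, aliases: dict, environ=None) -> dict:
    """Overlay generic ODC_<ENV>_<FIELD> environment variables onto a
    parsed config (env name -> {field: value}).

    `aliases` maps an alias to the real environment it points at.
    Alias-named variables are applied first, so variables using the
    real environment name win, per the precedence note above.
    Hypothetical sketch; not the actual ODC implementation.
    """
    if environ is None:
        environ = os.environ
    result = {env: dict(fields) for env, fields in config.items()}
    # Alias prefixes first, then real environment names (which override).
    prefixes = list(aliases.items()) + [(e, e) for e in result if e not in aliases]
    for prefix, env in prefixes:
        marker = f"ODC_{prefix.upper()}_"
        for var, value in environ.items():
            if var.startswith(marker):
                field = var[len(marker):].lower()
                result.setdefault(env, {})[field] = value
    return result
```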

Feedback

Damien Ayers (2023-04-21)

Paul and I have discussed this EP prior to its drafting. Given the complexity and limitations of the current configuration system, my feeling is that we should scrap the implementation of the current system, and clearly define a simpler system before implementing it.

On the table for discussion,

  • Should we look for configuration files in multiple places? I think that this is worth having, so yes.
  • For a simple system, I think we're much better off with INI style than YAML.

Specific points:

  • I think we should ditch the multiple file overlay system. It's too hard to reason about.
  • Requirements: we must allow configuration via Environment Variables as well as via a file.

Matt Paget (2023-04-28)

Looks great! Some comments:

  • The environment variables could potentially get messy with lots of variants for different ODC environments. For a system/deployment admin, the new /etc/default/datacube.conf could be more suitable (e.g., a file managed by puppet etc).
    • I might suggest that the docs could present the ODC_DEFAULT_* env vars as an available fallback (as noted above). Then mention that other ODC_[environment]_* env vars can be used too but with a note of caution that /etc/default/datacube.conf might be more suitable for administrators.
  • It would be helpful to expose the datacube config reconciling function(s) so that the resulting (db) values can be used by ODC repos and custom code. Perhaps the "API" aspect of the config reconciling could be described above as well?

Voting

Enhancement Proposal Team

  • Paul Haesler (@SpacemanPaul)
