Datacube Alchemist is a command-line application for performing Dataset-to-Dataset transformations in the context of an Open Data Cube system.
It uses a configuration file that specifies an input Product or Products, a Transformation to perform, and output parameters and destination.
Features
- Writes output to Cloud Optimised GeoTIFFs
- Easily runs within a Docker Container
- Parallelism using AWS SQS queues and Kubernetes
- Writes output data to S3 or a file system
- Generates `eo3` format dataset metadata, along with processing information
- Generates STAC 1.0.0-beta.2 dataset metadata
- Configurable thumbnail generation
- Pass any command line options as Environment Variables (see the sketch after this list)
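For example, if the CLI follows the common Click auto-envvar convention, an option such as `--config-file` can also be supplied via an environment variable. The prefix below is a hypothetical placeholder, not taken from this repository; check the CLI source for the exact variable names:

```bash
# Hypothetical: assumes a Click auto-envvar prefix of ALCHEMIST; verify
# against the datacube-alchemist source before relying on this
export ALCHEMIST_CONFIG_FILE=./examples/c3_config_wo.yaml
datacube-alchemist run-one --uuid 7b9553d4-3367-43fe-8e6f-b45999c5ada6
```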
You can build the Docker image locally with Docker or Docker Compose. The commands are `docker build --tag opendatacube/datacube-alchemist .` or `docker-compose build`.
There's a Python setup file, so you can do `pip3 install .` in the root folder. You will need to ensure that the Open Data Cube and all its dependencies happily install, though.
To run some example processes you can use the Docker Compose file to create a local workspace. To start the workspace and run an example, you can do the following:
- Export the environment variables `ODC_ACCESS_KEY` and `ODC_SECRET_KEY` with valid AWS credentials
- Run `make up` or `docker-compose up` to start the postgres and datacube-alchemist Docker containers
- `make initdb` to initialise the ODC database (or see the Makefile for the specific command)
- `make metadata` will add the metadata that the Landsat example product needs
- `make product` will add the Landsat product definitions
- `make index` will index a range of Landsat scenes to test processing with
- `make wofs-one` or `make fc-one` will process a single Water Observations from Space or Fractional Cover scene and output the results to the ./examples folder in this project directory
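Putting those steps together, a first run looks like this (angle brackets are placeholders for your own credentials; targets as defined in the Makefile):

```bash
export ODC_ACCESS_KEY=<your-aws-access-key>
export ODC_SECRET_KEY=<your-aws-secret-key>
make up        # start the postgres and datacube-alchemist containers
make initdb    # initialise the ODC database
make metadata  # add the metadata the Landsat example product needs
make product   # add the Landsat product definitions
make index     # index a range of Landsat scenes to test with
make wofs-one  # process one WOfS scene; results land in ./examples
```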
Datacube Alchemist is used in production by the Digital Earth Australia and Digital Earth Africa programs.
Note that the `--config-file` can be a local path or a URI.
```
Usage: datacube-alchemist run-one [OPTIONS]

  Run with the config file for one input_dataset (by UUID)

Options:
  -c, --config-file TEXT  The path (URI or file) to a config file to use for
                          the job  [required]
  -u, --uuid TEXT         UUID of the scene to be processed  [required]
  --dryrun, --no-dryrun   Don't actually do real work
  --help                  Show this message and exit.
```
Note that `--dryrun` is optional, and will run a 1/10 scale load and will not write output to the final destination.
```bash
datacube-alchemist run-one \
  --config-file ./examples/c3_config_wo.yaml \
  --uuid 7b9553d4-3367-43fe-8e6f-b45999c5ada6 \
  --dryrun
```
Note that for `run-many` the final argument is a datacube expression; see the Datacube Search documentation.
```
Usage: datacube-alchemist run-many [OPTIONS] [EXPRESSIONS]...

  Run Alchemist with the config file on all the Datasets matching an ODC
  query expression

  EXPRESSIONS

  Select datasets using [EXPRESSIONS] to filter by date, product type,
  spatial extents or other searchable fields.

      FIELD = VALUE
      FIELD in DATE-RANGE
      FIELD in [START, END]
      TIME < DATE
      TIME > DATE

  START and END can be either numbers or dates
  Dates follow YYYY, YYYY-MM, or YYYY-MM-DD format

  FIELD: x, y, lat, lon, time, product, ...

  eg. 'time in [1996-01-01, 1996-12-31]'
      'time in 1996'
      'time > 2020-01'
      'lon in [130, 140]' 'lat in [-40, -30]'
      product=ls5_nbar_albers

Options:
  -c, --config-file TEXT  The path (URI or file) to a config file to use for
                          the job  [required]
  -l, --limit INTEGER     For testing, limit the number of tasks to create or
                          process.
  --dryrun, --no-dryrun   Don't actually do real work
  --help                  Show this message and exit.
```
Example
```bash
datacube-alchemist run-many \
  --config-file ./examples/c3_config_wo.yaml \
  --limit=2 \
  --dryrun \
  time in 2020-01
```
Notes on queues: to run jobs from an SQS queue, it is good practice to create a dead-letter queue as well as a main queue. Jobs (messages) get picked up off the main queue and are deleted when they succeed. If they fail, they are not deleted, and they become visible on the main queue again after a defined amount of time. If this happens more than a defined number of times, the message is moved to the dead-letter queue. In this way, you can track work completion.
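For instance, the main/dead-letter pair described above can be wired up with the AWS CLI as sketched below (queue names and the receive count are illustrative placeholders):

```bash
# Create the dead-letter queue and look up its ARN
aws sqs create-queue --queue-name example-queue-name-deadletter
DLQ_URL=$(aws sqs get-queue-url --queue-name example-queue-name-deadletter \
  --query QueueUrl --output text)
DLQ_ARN=$(aws sqs get-queue-attributes --queue-url "$DLQ_URL" \
  --attribute-names QueueArn --query Attributes.QueueArn --output text)

# Create the main queue with a redrive policy: after 5 failed receives,
# a message is moved to the dead-letter queue
aws sqs create-queue --queue-name example-queue-name \
  --attributes RedrivePolicy="{\"deadLetterTargetArn\":\"$DLQ_ARN\",\"maxReceiveCount\":\"5\"}"
```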
```
Usage: datacube-alchemist run-from-queue [OPTIONS]

  Process messages from the given queue

Options:
  -c, --config-file TEXT       The path (URI or file) to a config file to use
                               for the job  [required]
  -q, --queue TEXT             Name of an AWS SQS Message Queue  [required]
  -l, --limit INTEGER          For testing, limit the number of tasks to
                               create or process.
  -s, --queue-timeout INTEGER  The SQS message Visibility Timeout in seconds,
                               default is 600, or 10 minutes.
  --dryrun, --no-dryrun        Don't actually do real work
  --sns-arn TEXT               Publish resulting STAC document to an SNS topic
  --help                       Show this message and exit.
```
Example
```bash
datacube-alchemist run-from-queue \
  --config-file ./examples/c3_config_wo.yaml \
  --queue example-queue-name \
  --limit=1 \
  --queue-timeout=600 \
  --dryrun
```
Search for Datasets and enqueue Tasks into an AWS SQS Queue for later processing. The `--limit` is the total number of datasets to limit to, whereas the `--product-limit` is the number of datasets per product, in the case that you have multiple input products.
```
Usage: datacube-alchemist add-to-queue [OPTIONS] [EXPRESSIONS]...

  Search for Datasets and enqueue Tasks into an AWS SQS Queue for later
  processing.

  EXPRESSIONS

  Select datasets using [EXPRESSIONS] to filter by date, product type,
  spatial extents or other searchable fields.

      FIELD = VALUE
      FIELD in DATE-RANGE
      FIELD in [START, END]
      TIME < DATE
      TIME > DATE

  START and END can be either numbers or dates
  Dates follow YYYY, YYYY-MM, or YYYY-MM-DD format

  FIELD: x, y, lat, lon, time, product, ...

  eg. 'time in [1996-01-01, 1996-12-31]'
      'time in 1996'
      'time > 2020-01'
      'lon in [130, 140]' 'lat in [-40, -30]'
      product=ls5_nbar_albers

Options:
  -c, --config-file TEXT       The path (URI or file) to a config file to use
                               for the job  [required]
  -q, --queue TEXT             Name of an AWS SQS Message Queue  [required]
  -l, --limit INTEGER          For testing, limit the number of tasks to
                               create or process.
  -p, --product-limit INTEGER  For testing, limit the number of datasets per
                               product.
  --dryrun, --no-dryrun        Don't actually do real work
  --help                       Show this message and exit.
```
Example
```bash
datacube-alchemist add-to-queue \
  --config-file ./examples/c3_config_wo.yaml \
  --queue example-queue-name \
  --limit=300 \
  --product-limit=100
```
Redrives messages from an SQS queue.
All the messages in the specified queue are re-transmitted to either their original queue or the specified TO-QUEUE.
Be careful when manually specifying TO-QUEUE, as it's easy to mistakenly push tasks to the wrong queue, e.g. one that will process them with an incorrect configuration file.
```
Usage: datacube-alchemist redrive-to-queue [OPTIONS]

  Redrives all the messages from the given sqs queue to their source, or the
  target queue

Options:
  -q, --queue TEXT       Name of an AWS SQS Message Queue  [required]
  -l, --limit INTEGER    For testing, limit the number of tasks to create or
                         process.
  -t, --to-queue TEXT    Url of SQS Queue to move to
  --dryrun, --no-dryrun  Don't actually do real work
  --help                 Show this message and exit.
```
Example
```bash
datacube-alchemist redrive-to-queue \
  --queue example-from-queue \
  --to-queue example-to-queue
```
Search for datasets that don't have a target product dataset and add them to the queue.
If a predicate is supplied, datasets which do not match are filtered out.
The predicate is a Python expression that should return True or False, with the dataset available as the variable `d`.
Example predicates:

```python
d.metadata.gqa_iterative_mean_xy <= 1
d.metadata.gqa_iterative_mean_xy and ('2022-06-30' <= str(d.center_time.date()) <= '2023-07-01')
d.metadata.dataset_maturity == "final"
```
```
Usage: datacube-alchemist add-missing-to-queue [OPTIONS]

  Search for datasets that don't have a target product dataset and add them
  to the queue

  If a predicate is supplied, datasets which do not match are filtered out.

  Example predicate: - 'd.metadata.gqa_iterative_mean_xy <= 1'

Options:
  --predicate TEXT        Python predicate to filter datasets. Dataset is
                          available as "d"
  -c, --config-file TEXT  The path (URI or file) to a config file to use for
                          the job  [required]
  -q, --queue TEXT        Name of an AWS SQS Message Queue  [required]
  --dryrun, --no-dryrun   Don't actually do real work
  --help                  Show this message and exit.
```
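Example (constructed to follow the pattern of the other commands, reusing the config file, queue name and a predicate from above):

```bash
datacube-alchemist add-missing-to-queue \
  --config-file ./examples/c3_config_wo.yaml \
  --queue example-queue-name \
  --predicate 'd.metadata.dataset_maturity == "final"'
```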
A YAML file with 3 sections:

- `specification` - Define the inputs and algorithm
- `output` - Output location and format options
- `processing` - Optimise CPU/Memory requirements
Datacube Alchemist requires a configuration file in YAML format to set up the Algorithm or Transformation, the input Dataset(s), and the details of the outputs, including metadata, destination and preview image generation.
The configuration file has 3 sections. `specification` sets up the input ODC product, data bands and the configured algorithm to run. `output` sets where the output files will be written, how the preview image will be created, and what extra metadata to include. `processing` can help configure the task's memory and CPU requirements.
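At the top level, the file therefore has this shape (a skeleton only, with the section contents elided; the `specification` keys are documented below):

```yaml
specification:
  # inputs and the transform to run (see below)
output:
  # output location, format options and preview image settings
processing:
  # CPU/memory tuning for tasks
```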
The `specification` section defines the input data and the algorithm to process it:
- `product` or `products`: [string or list] Names of the ODC product(s) to use as input
- `measurements`: [list] of measurement names to load from the input products
- `measurement_renames`: [map] rename measurements from the input data before passing to the transform
- `transform`: [string] fully qualified name of a Python class implementing the transform
- `transform_url`: [string] Reference URL for the Transform, to record in the output metadata
- `override_product_family`: Override part of the metadata (should be in )
- `basis`: ????
- `transform_args`: [map] Named arguments to pass to the Transformer class
```yaml
specification:
  products:
    - ga_ls5t_ard_3
    - ga_ls7e_ard_3
    - ga_ls8c_ard_3
  measurements: ['nbart_blue', 'nbart_green', 'nbart_red', 'nbart_nir', 'nbart_swir_1', 'nbart_swir_2', 'oa_fmask']
  measurement_renames:
    oa_fmask: fmask
  aws_unsigned: False
  transform: wofs.virtualproduct.WOfSClassifier
  transform_url: 'https://github.com/GeoscienceAustralia/wofs/'
  override_product_family: ard
  basis: nbart_green
  transform_args:
    dsm_path: 's3://dea-non-public-data/dsm/dsm1sv1_0_Clean.tiff'
```
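For reference, a transform class like `wofs.virtualproduct.WOfSClassifier` above follows the ODC virtual-product `Transformation` interface. The sketch below is illustrative only (an invented `NDVITransform`, not part of this repository), assuming that interface:

```python
# Illustrative sketch of a custom transform (hypothetical NDVITransform),
# assuming the ODC virtual-product Transformation interface.
from datacube.model import Measurement
from datacube.virtual import Transformation


class NDVITransform(Transformation):
    """Compute NDVI from the red and NIR bands listed in the specification."""

    def measurements(self, input_measurements):
        # Declare the output band(s) this transform produces
        return {
            "ndvi": Measurement(
                name="ndvi", dtype="float32", nodata=float("nan"), units="1"
            )
        }

    def compute(self, data):
        # `data` is an xarray.Dataset containing the configured measurements
        # (after any measurement_renames have been applied)
        ndvi = (data.nbart_nir - data.nbart_red) / (data.nbart_nir + data.nbart_red)
        return ndvi.astype("float32").to_dataset(name="ndvi")
```

Referencing such a class by its fully qualified name in `transform`, with any constructor arguments in `transform_args`, is how it gets wired into a config file.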
Apache License 2.0
© 2021, Open Data Cube Community