layout | title | permalink | redirect_from |
---|---|---|---|
post | ETL | /docs/cli/etl | |
This section documents ETL management operations with `ais etl`. But first, note:
As with global rebalance, dSort, and download, all ETL management commands can also be executed via `ais job` and `ais show`, the commands that, by definition, support all AIS xactions, including AIS-ETL.
For background on AIS-ETL, getting-started steps, working examples, and tutorials, please refer to the AIS-ETL documentation. This section covers:
- Init ETL with spec
- Init ETL with code
- List ETLs
- View ETL Logs
- Stop ETL
- Transform object on-the-fly with given ETL
- Transform a bucket offline with the given ETL
ais etl init spec --from-file=SPEC_FILE --name=ETL_NAME [--comm-type=COMMUNICATION_TYPE] [--wait-timeout=TIMEOUT] [--arg-type=ARGUMENT_TYPE]
or `ais start etl init`
Initializes an ETL from a Pod YAML specification file. The `--name` parameter assigns a user-defined, unique name to the ETL (the rules for valid ETL names are described in the AIS-ETL documentation).
Initialize ETL that computes MD5 of the object.
$ cat spec.yaml
apiVersion: v1
kind: Pod
metadata:
  name: transformer-md5
spec:
  containers:
    - name: server
      image: aistore/transformer_md5:latest
      ports:
        - name: default
          containerPort: 80
      command: ['/code/server.py', '--listen', '0.0.0.0', '--port', '80']
$ ais etl init spec --from-file=spec.yaml --name=transformer-md5 --comm-type=hpull:// --wait-timeout=1m
transformer-md5
ais etl init code --name=ETL_NAME --from-file=CODE_FILE --runtime=RUNTIME [--chunk-size=NUM_OF_BYTES] [--transform=TRANSFORM_FUNC] [--before=BEFORE_FUNC] [--after=AFTER_FUNC] [--deps-file=DEPS_FILE] [--comm-type=COMMUNICATION_TYPE] [--wait-timeout=TIMEOUT] [--arg-type=ARGUMENT_TYPE]
Initializes an ETL from the provided `CODE_FILE`, which must contain a transformation function named `transform(input_bytes)` or `transform(input_bytes, context)`. The file may also define two optional functions: `before(context)`, executed prior to the transform function to initialize the variables that `transform(input_bytes, context)` needs, and `after(context)`, executed after the transform to consolidate the results and return the transformed `output_bytes` to the user.
The `--name` parameter assigns a user-defined, unique name to the ETL (the rules for valid ETL names are described in the AIS-ETL documentation).
Depending on the communication type used, there are multiple ways to initialize the `transform(input_bytes, context)`, `before(context)`, and `after(context)` functions. Check ETL Init Code Docs for more info.
All available runtimes are listed here.
Note:
- Default value of --transform is "transform".
Initialize ETL with code that computes MD5 of the object.
$ cat code.py
import hashlib

def transform(input_bytes):
    md5 = hashlib.md5()
    md5.update(input_bytes)
    return md5.hexdigest().encode()
$ ais etl init code --from-file=code.py --runtime=python3.11v2 --name=transformer-md5 --comm-type hpull
transformer-md5
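Before deploying, the transform function can be exercised locally. The sketch below repeats the `transform` function from `code.py` and checks it against a well-known MD5 digest. It runs in plain Python and does not involve the AIS runtime, so it validates only the function's logic, not the deployment.

```python
import hashlib


def transform(input_bytes):
    """Same function as in code.py: return the hex MD5 digest as bytes."""
    md5 = hashlib.md5()
    md5.update(input_bytes)
    return md5.hexdigest().encode()


# Local sanity checks (no AIS cluster involved).
empty_digest = transform(b"")
hello_digest = transform(b"hello")
print(empty_digest.decode())
print(hello_digest.decode())
```

The function returns bytes rather than a string, which is why the digest is encoded before being returned.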
With `before(context)` and `after(context)` functions and streaming (`CHUNK_SIZE` > 0):
$ cat code.py
import hashlib

def before(context):
    context["before"] = hashlib.md5()
    return context

def transform(input_bytes, context):
    context["before"].update(input_bytes)

def after(context):
    return context["before"].hexdigest().encode()
$ ais etl init code --name=etl-md5 --from-file=code.py --runtime=python3.11v2 --chunk-size=32768 --before=before --after=after --comm-type hpull
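To see how the three functions cooperate, the following sketch drives them locally over fixed-size chunks. The `run_streaming` driver is hypothetical and is not the AIS runtime: it assumes `before` is called once with an empty dict, `transform` once per chunk, and `after` once at the end, which is how the description above reads.

```python
import hashlib


def before(context):
    context["before"] = hashlib.md5()
    return context


def transform(input_bytes, context):
    context["before"].update(input_bytes)


def after(context):
    return context["before"].hexdigest().encode()


def run_streaming(data, chunk_size):
    """Hypothetical local driver (not the AIS runtime):
    before once, transform per chunk, after once."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive for streaming")
    context = before({})
    for offset in range(0, len(data), chunk_size):
        transform(data[offset:offset + chunk_size], context)
    return after(context)


payload = b"x" * 100_000
streamed = run_streaming(payload, chunk_size=32768)
whole = hashlib.md5(payload).hexdigest().encode()
print(streamed == whole)
```

Because MD5 is computed incrementally, the chunked result equals the digest of the whole payload regardless of the chunk size.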
ais etl show
or, same, ais job show etl
Lists all available ETLs.
ais etl view-logs ETL_NAME [TARGET_ID]
Outputs the logs produced by the given ETL.
Optionally, pass a `TARGET_ID` to retrieve the logs from one particular target.
ais etl stop ETL_NAME
or, same, ais stop etl
Stops the ETL with the specified name.
ais etl start ETL_NAME
or, same, ais start etl
Starts the ETL with the specified name.
ais etl object ETL_NAME BUCKET/OBJECT_NAME OUTPUT
Gets an object transformed by the ETL identified by `ETL_NAME`.
Transform the `shards/shard-0.tar` object with the `transformer-md5` ETL (which computes the MD5 of the object) and print the output to STDOUT.
$ ais etl object transformer-md5 ais://shards/shard-0.tar -
393c6706efb128fbc442d3f7d084a426
Transform the `shards/shard-0.tar` object with the `transformer-md5` ETL (which computes the MD5 of the object) and save the output to the `output.txt` file.
$ ais etl object transformer-md5 ais://shards/shard-0.tar output.txt
$ cat output.txt
393c6706efb128fbc442d3f7d084a426
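If a local copy of the object is available, the saved digest can be cross-checked without the cluster. The sketch below is a minimal illustration: `md5_hex_of_file` and `matches_etl_output` are hypothetical helper names, and the demo uses temporary stand-in files rather than a real shard.

```python
import hashlib
import os
import tempfile


def md5_hex_of_file(path, block_size=1 << 20):
    """Compute the hex MD5 of a local file, reading it in fixed-size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_etl_output(local_copy, etl_output_file):
    """Compare a local copy's MD5 with the digest stored in the ETL output file."""
    with open(etl_output_file, "r", encoding="ascii") as f:
        reported = f.read().strip()
    return md5_hex_of_file(local_copy) == reported


# Self-contained demo with temporary stand-ins for the shard and output.txt.
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "shard-0.tar")
    output = os.path.join(tmp, "output.txt")
    with open(shard, "wb") as f:
        f.write(b"hello")
    with open(output, "w", encoding="ascii") as f:
        f.write(hashlib.md5(b"hello").hexdigest())
    demo_ok = matches_etl_output(shard, output)
print(demo_ok)
```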
ais etl bucket ETL_NAME SRC_BUCKET DST_BUCKET
Transform all or selected objects and put them into another bucket.
Flag | Type | Description |
---|---|---|
`--list` | string | Comma-separated list of object names, e.g., 'obj1,obj2' |
`--template` | string | Template for matching object names, e.g., 'obj-{000..100}.tar' |
`--ext` | string | Mapping from old to new extensions of transformed objects, e.g., {jpg:txt}, "{ in1 : out1, in2 : out2 }" |
`--prefix` | string | Prefix added to every new object name |
`--wait` | bool | Wait until operation is finished |
`--requests-timeout` | duration | Timeout for a single object transformation |
`--dry-run` | bool | Don't actually transform the bucket, only display what would happen |
Flags `--list` and `--template` are mutually exclusive. If neither of them is set, the command transforms the whole bucket.
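When the command is assembled from a script, the mutual-exclusion rule can be enforced before the CLI is invoked. The helper below is a hypothetical sketch, not part of AIS: it uses only flags from the table above and assumes flag values can be passed as separate, space-delimited arguments.

```python
def build_etl_bucket_args(etl_name, src_bucket, dst_bucket,
                          obj_list=None, template=None,
                          wait=False, dry_run=False):
    """Hypothetical helper: assemble an `ais etl bucket` argument list."""
    if obj_list is not None and template is not None:
        raise ValueError("--list and --template are mutually exclusive")
    args = ["ais", "etl", "bucket", etl_name, src_bucket, dst_bucket]
    if obj_list is not None:
        # --list takes comma-separated object names
        args += ["--list", ",".join(obj_list)]
    if template is not None:
        args += ["--template", template]
    if wait:
        args.append("--wait")
    if dry_run:
        args.append("--dry-run")
    return args


# Neither --list nor --template: the whole bucket is transformed.
whole_bucket = build_etl_bucket_args(
    "transformer-md5", "ais://src_bucket", "ais://dst_bucket", wait=True)
print(" ".join(whole_bucket))
```

The resulting list can be handed to `subprocess.run` as-is, which avoids shell quoting issues with templates that contain braces.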
Transform every object from `src_bucket` with the ETL and put the new objects into `dst_bucket`.
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket
MMi9l8Z11
$ ais wait xaction MMi9l8Z11
The same as above, but wait for the bucket transformation to finish.
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket --wait
Transform objects `shard-10.tar`, `shard-11.tar`, and `shard-12.tar` from `src_bucket` with the ETL and put the new objects into `dst_bucket`.
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket --template "shard-{10..12}.tar"
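The template in this example selects three object names. As a rough illustration of that expansion, the sketch below handles a single inclusive `{start..end}` numeric range. Preserving the zero-padding width of the start value is an assumption based on the `obj-{000..100}.tar` example in the flag table; the template syntax AIS actually supports may be richer than this.

```python
import re

_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_template(template):
    """Expand one inclusive `{start..end}` numeric range.

    Assumption: numbers are zero-padded to the width of the start value.
    """
    match = _RANGE.search(template)
    if match is None:
        return [template]  # no range: the template is a literal name
    start_text, end_text = match.group(1), match.group(2)
    width = len(start_text)
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError("range start must not exceed range end")
    head, tail = template[:match.start()], template[match.end():]
    return [f"{head}{n:0{width}d}{tail}" for n in range(start, end + 1)]


names = expand_template("shard-{10..12}.tar")
print(names)
```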
Transform every object, but give the new objects the `etl-` prefix and map the extensions: objects with the `.in1` extension get the `.out1` extension, and objects with the `.in2` extension get the `.out2` extension.
$ ais ls ais://src_bucket --props=name
NAME
obj1.in1
obj2.in2
(...)
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket --ext="{in1:out1, in2:out2}" --prefix="etl-" --wait
$ ais ls ais://dst_bucket --props=name
NAME
etl-obj1.out1
etl-obj2.out2
(...)
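The naming seen in this listing can be modeled in a few lines. The sketch below is an illustrative model only, not the AIS implementation: `parse_ext_map` and `destination_name` are hypothetical names, and the order of operations (swap the extension, then prepend the prefix) is inferred from the example output.

```python
def parse_ext_map(spec):
    """Parse a mapping such as "{in1:out1, in2:out2}" into a dict."""
    body = spec.strip().strip("{}")
    mapping = {}
    for pair in body.split(","):
        if not pair.strip():
            continue
        old, new = pair.split(":", 1)
        mapping[old.strip()] = new.strip()
    return mapping


def destination_name(name, ext_map, prefix=""):
    """Illustrative model: swap a mapped extension, then prepend the prefix."""
    stem, dot, ext = name.rpartition(".")
    if dot and ext in ext_map:
        name = f"{stem}.{ext_map[ext]}"
    return prefix + name


ext_map = parse_ext_map("{in1:out1, in2:out2}")
renamed = [destination_name(n, ext_map, prefix="etl-")
           for n in ("obj1.in1", "obj2.in2")]
print(renamed)
```

In this model, names whose extension is not in the mapping keep their extension and receive only the prefix.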
A dry run performs no actions; it only shows what would be transformed if the bucket were actually transformed. This is useful for preparing the actual run.
$ ais ls ais://src_bucket --props=name,size
NAME SIZE
obj1.in1 10MiB
obj2.in2 10MiB
(...)
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket --dry-run --wait
[DRY RUN] No modifications on the cluster
2 objects (20MiB) would have been put into bucket ais://dst_bucket