This page specifies the configurations available in Envelope.
As an illustration, a typical Envelope batch application that reads HDFS JSON files, extracts a subset of the data, and writes the results to S3 in Parquet might have the following configuration.
```
application {
  name = Envelope configuration example
  executor.instances = 3
  executor.memory = 4G
}
steps {
  exampleInput {
    input {
      type = filesystem
      path = "hdfs://..."
      format = json
    }
  }
  exampleStep {
    dependencies = [exampleInput]
    deriver {
      type = sql
      query.literal = "SELECT MY_UPPER(foo) AS foo FROM exampleInput WHERE MY_LOWER(bar) = 'blag'"
    }
    planner {
      type = append
    }
    output {
      type = filesystem
      path = "s3a://..."
      format = parquet
    }
  }
}
udfs : [
  {
    name = my_upper
    class = com...
  },
  {
    name = my_lower
    class = com...
  }
]
```
Application-level configurations have the `application.` prefix.
Configuration suffix | Description |
---|---|
name | The application name in YARN. |
executor.instances | The number of executors to be requested for the application. If not specified then Spark dynamic allocation will be used. |
executor.initial.instances | The initial number of executors to be requested for the application when using Spark dynamic allocation. This can help more quickly warm up the job’s resources without using a static allocation. |
executor.cores | The number of cores per executor. Default is 1. |
executor.memory | The amount of memory per executor. Default is 1G. |
batch.milliseconds | The length of the micro-batch in milliseconds. Default is 1000. Ignored if the application does not have a streaming input. |
pipeline.threads | The number of threads that Envelope will use to run pipeline steps. This is effectively a limit on the number of outputs that can be writing at once. Default is 20. |
spark.conf.* | Used to pass configurations directly to Spark. The |
hive.enabled | Enables Hive support. Default is true. Must be enabled before reading and writing data stored in Apache Hive. Setting the value to false when Hive integration is not required avoids the associated overhead. |
configuration.validation.enabled | Enables upfront validation of the provided Envelope configuration. Default is true. |
driver.memory | The amount of memory allocated for the Spark driver. Note that this configuration only applies when the application is deployed in cluster mode, and will cause an exception if the deployment mode is client. To set driver memory for applications running in client mode, use Spark’s command line argument --driver-memory. |
security.check-interval | How often the security token manager in the driver checks if tokens need refreshing. Default is "60s". Accepts any Typesafe duration string. Recommended to leave at default. |
security.renew-factor | At what proportion of a token’s lifetime to request a new token. Defaults to 0.8. Recommended to leave at default. |
config-loader | The config loader object that will provide configurations to be merged into the base configuration at the start of the pipeline and for every micro-batch. Note this is typically only required when streaming pipelines require dynamic refreshing of configurations. |
Step configurations have the `steps.[stepname].` prefix. All steps can have the below configurations.
Configuration suffix | Description |
---|---|
type | The step type. Envelope supports |
dependencies | The list of step names that Envelope will submit before submitting this step. |
In addition to the common step configurations, data steps can have the below configurations.
Configuration suffix | Description |
---|---|
cache.enabled | If |
cache.storage.level | If specified then Envelope will change the step’s DataFrame cache storage level to the specified value. Available storage levels are |
hint.small | If |
print.schema.enabled | If |
print.data.enabled | If |
print.data.limit | The maximum number of records to print when |
repartition.partitions | The number of DataFrame partitions to repartition the step data by. In Spark this will run |
repartition.columns | A list of DataFrame columns to repartition the step data by. In Spark this will run |
coalesce.partitions | The number of DataFrame partitions to coalesce the step data by. In Spark this will run |
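As an illustration of how these step-level settings can be combined, the sketch below shows a hypothetical data step; the step and dependency names are placeholders, and the boolean flags are assumed from the (truncated) descriptions above.

```
steps {
  exampleStep {
    dependencies = [exampleInput]   // hypothetical upstream step
    cache.enabled = true            // assumed boolean toggle
    print.schema.enabled = true     // assumed boolean toggle
    print.data.enabled = true       // assumed boolean toggle
    print.data.limit = 10
    repartition.partitions = 50
    // deriver, planner, and output omitted for brevity
  }
}
```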
In addition to the common step configurations, loop steps can have the below configurations. For more information on loop steps see the looping guide.
Configuration suffix | Description |
---|---|
mode | The mode for Envelope to run the iterations of the loop in. If |
parameter | The parameter that Envelope will replace in strings in the configuration of the steps that are dependent on the loop step. For a parameter value |
source | The source of the iteration values for the loop. Envelope supports |
range.start | If using the |
range.end | If using the |
list | If using the |
step | If using the |
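A minimal sketch of a loop step follows. The step name and parameter name are placeholders, and the mode and source values are illustrative assumptions; see the looping guide for the supported values.

```
steps {
  loopOverPartitions {
    type = loop
    mode = serial               // assumed mode value
    source = range              // assumed source value
    parameter = partition_id    // replaced in dependent step configurations
    range.start = 1
    range.end = 5
  }
}
```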
In addition to the common step configurations, decision steps can have the below configurations. For more information on decision steps see the decisions guide.
Configuration suffix | Description |
---|---|
if-true-steps | Required. The list of dependent step names that will be kept if the decision result is true. The steps listed must directly depend on the decision step. The remaining directly dependent steps of the decision step will be kept if the decision result is false. Any steps subsequently dependent on the removed steps will also be removed. |
method | Required. The method by which the decision step will make the decision. Envelope supports |
result | Required if |
step | Required if |
key | Required if |
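A minimal sketch of a decision step is below. The step names are hypothetical, and the method value is an assumption; see the decisions guide for the supported decision methods.

```
steps {
  decideToLoad {
    type = decision
    if-true-steps = [loadToWarehouse]   // hypothetical directly dependent step
    method = literal                    // assumed method value
    result = true                       // assumed to apply when the decision comes from a literal result
  }
}
```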
In addition to the common step configurations, task steps can have the below configurations. For more information on task steps see the tasks guide.
Configuration suffix | Description |
---|---|
class | Required. The alias or fully qualified class name of the |
Input configurations belong to data steps, and have the `steps.[stepname].input.` prefix. For more information on inputs see the inputs guide.
Configuration suffix | Description |
---|---|
type | The input type to be used. Envelope provides |
Input type = `filesystem`.
Configuration suffix | Description |
---|---|
path | The Hadoop filesystem path to read as the input. Typically a Cloudera EDH will point to HDFS by default. Use |
format | The file format of the files of the input directory. Envelope supports formats |
schema | Optional. Applies to |
separator | (csv) Spark option |
encoding | (csv) Spark option |
quote | (csv) Spark option |
escape | (csv) Spark option |
comment | (csv) Spark option |
header | (csv) Spark option |
infer-schema | (csv) Spark option |
ignore-leading-ws | (csv) Spark option |
ignore-trailing-ws | (csv) Spark option |
null-value | (csv) Spark option |
nan-value | (csv) Spark option |
positive-infinity | (csv) Spark option |
negative-infinity | (csv) Spark option |
date-format | (csv) Spark option |
timestamp-format | (csv) Spark option |
max-columns | (csv) Spark option |
max-chars-per-column | (csv) Spark option |
max-malformed-logged | (csv) Spark option |
mode | (csv) Spark option (default |
format-class | (input-format) The |
translator | (input-format, text) The Translator class to use to convert the InputFormat’s Key/Value pairs into Dataset Rows. See Translators for details. This is optional for |
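For example, a filesystem input reading delimited text might look like the sketch below. The path and option values are hypothetical; the option keys are those listed above.

```
steps {
  ordersInput {
    input {
      type = filesystem
      path = "hdfs:///data/landing/orders"   // hypothetical path
      format = csv
      header = true
      separator = "|"
      // an explicit schema can also be supplied; see the Schema sections below
    }
  }
}
```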
Input type = `hive`.
Configuration suffix | Description |
---|---|
table | The Hive metastore table name (including database prefix, if required) to read as the input. |
Input type = `jdbc`.
Configuration suffix | Description |
---|---|
url | The JDBC URL for the remote database. |
tablename | The name of the table of the remote database to be read as the input. |
username | The username to use to connect to the remote database. |
password | The password to use to connect to the remote database. |
Input type = `kafka`.
Configuration suffix | Description |
---|---|
brokers | The hosts and ports of the brokers of the Kafka cluster, in the form |
topics | The list of Kafka topics to be consumed. |
group.id | The Kafka consumer group ID for the input. When offset management is enabled use a unique group ID for each pipeline so that Envelope can track one execution of the pipeline to the next. If not provided Envelope will use a random UUID for each pipeline execution. |
window.enabled | If |
window.milliseconds | The duration in milliseconds of the Spark Streaming window for the input. |
window.slide.milliseconds | The interval in milliseconds at which the Spark Streaming window operation is performed if using sliding windows. |
offsets.manage | If |
offsets.output | If |
parameter.* | Used to pass configurations directly to Kafka. The |
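A sketch of a streaming Kafka input is below. Broker, topic, and group values are placeholders, and the boolean toggles are assumed from the (truncated) descriptions above; the nested translator block is described in the next section.

```
steps {
  kafkaOrders {
    input {
      type = kafka
      brokers = "broker1:9092,broker2:9092"   // hypothetical brokers
      topics = [orders]                       // hypothetical topic
      group.id = orders-pipeline
      offsets.manage = true                   // assumed boolean toggle
      window.enabled = false                  // assumed boolean toggle
      translator {
        // see the translator configurations below
      }
    }
  }
}
```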
Translator configurations belong to data steps, and have the `steps.[stepname].input.translator.` prefix. For more information on translators, see the Translators section of the Inputs Guide.
Configuration suffix | Description |
---|---|
type | The translator type to be used. Envelope provides |
append.raw.enabled | If |
Translator type = `avro`.
Configuration suffix | Description |
---|---|
schema | The schema to translate to. Refer to the Schema documentation |
Translator type = `delimited`.
Configuration suffix | Description |
---|---|
delimiter | The delimiter that separates the fields of the message. |
delimiter-regex | If |
schema | The schema to translate to. Refer to the Schema documentation |
timestamp.formats | Optional list of timestamp format patterns. For the timestamp field type, one or more patterns may be supplied in Joda timestamp format. If this configuration is supplied, a timestamp must conform to one of these patterns to be considered valid. For performance-sensitive processing, list patterns in order of probability of occurrence. If this configuration is not supplied, timestamp data must conform to ISO 8601 date, time or datetime format. |
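A delimited translator might be configured as in the sketch below, here combined with the flat schema type described later on this page (assuming the flat schema fields nest under the schema object as shown); the field names and format pattern are hypothetical.

```
translator {
  type = delimited
  delimiter = ","
  schema {
    type = flat
    field.names = [id, amount, updated_at]      // hypothetical fields
    field.types = [long, double, timestamp]
  }
  timestamp.formats = ["yyyy-MM-dd HH:mm:ss"]   // optional Joda-style pattern
}
```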
Translator type = `kvp`.
Configuration suffix | Description |
---|---|
delimiter.kvp | The delimiter that separates the key-value pairs of the message. |
delimiter.field | The delimiter that separates the key and value of each key-value pair. |
schema | The schema to translate to. Refer to the Schema documentation |
timestamp.formats | Optional list of timestamp format patterns. For the timestamp field type, one or more patterns may be supplied in Joda timestamp format. If this configuration is supplied, a timestamp must conform to one of these patterns to be considered valid. For performance-sensitive processing, list patterns in order of probability of occurrence. If this configuration is not supplied, timestamp data must conform to ISO 8601 date, time or datetime format. |
Translator type = `morphline`.
Configuration suffix | Description |
---|---|
encoding.key | The character set of the incoming key, which is stored in the Record field, |
encoding.message | The character set of the incoming message, which is stored in the Record field, |
morphline.file | The filename of the Morphline configuration found in the local directory of the executor. See the |
morphline.id | The optional identifier of the Morphline pipeline within the configuration file. |
schema | The schema to translate to. Refer to the Schema documentation |
error.on.empty | If |
Translator type = `protobuf`.
Configuration suffix | Description |
---|---|
schema | The schema to translate to. Refer to the Schema documentation (Currently only the Protobuf schema type is supported for the Protobuf translator, |
Deriver configurations belong to data steps, and have the `steps.[stepname].deriver.` prefix. For more information on derivers see the derivers guide.
Configuration suffix | Description |
---|---|
type | The deriver type to be used. Envelope provides |
Deriver type = `morphline`.
Configuration suffix | Description |
---|---|
step.name | The name of the dependency step whose records will be run through the Morphline pipeline. |
morphline.file | The filename of the Morphline configuration found in the local directory of the executor. See the |
morphline.id | The optional identifier of the Morphline pipeline within the configuration file. |
schema | The schema definition. Refer to the Schema documentation |
Deriver type = `nest`.
Configuration suffix | Description |
---|---|
nest.into | The name of the step whose records will be appended with the nesting of |
nest.from | The name of the step whose records will be nested into |
key.field.names | The list of field names that make up the common key of the two steps. This key will be used to determine which |
nested.field.name | The name to be given to the appended field that contains the nested records. |
Deriver type = `sql`.
Configuration suffix | Description |
---|---|
query.literal | The literal query to be submitted to Spark SQL. Previously submitted steps can be referenced as tables by their step name. |
query.file | The path to the file containing the query to be submitted to Spark SQL. |
parameter.parameter_name (or any parameter.*) | All references to '${parameter_name}' within the query string will be replaced with the value of this configuration. For more information see the derivers guide. |
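For example, a SQL deriver with a parameter substitution might be sketched as below; the step, query, and parameter values are hypothetical.

```
steps {
  filteredOrders {
    dependencies = [rawOrders]   // hypothetical upstream step
    deriver {
      type = sql
      query.literal = "SELECT * FROM rawOrders WHERE region = '${region_code}'"
      parameter.region_code = "EMEA"   // substituted for ${region_code} in the query
    }
  }
}
```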
Deriver type = `pivot`.
Configuration suffix | Description |
---|---|
step.name | The name of the dependency step that will be pivoted. |
entity.key.field.names | The list of field names that represents the entity key to group on. The derived DataFrame will contain one record per distinct entity key. |
pivot.key.field.name | The field name of the key to pivot on. It is expected that there will only be one of each pivot key per entity key. The derived DataFrame will contain one additional column per distinct pivot key. |
pivot.value.field.name | The field name of the value to be pivoted. |
pivot.keys.source | The source of the keys to pivot into additional columns. If |
pivot.keys.list | The list of keys to pivot into additional columns. Only used if |
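A pivot deriver might look like the sketch below. The field names are placeholders and the pivot.keys.source value is an assumption, since the allowed values are truncated in the table above.

```
steps {
  pivotedMeasurements {
    dependencies = [measurements]
    deriver {
      type = pivot
      step.name = measurements
      entity.key.field.names = [patient_id]   // hypothetical entity key
      pivot.key.field.name = measurement_name
      pivot.value.field.name = measurement_value
      pivot.keys.source = dynamic             // assumed value
    }
  }
}
```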
Deriver type = `exclude`.
Configuration suffix | Description |
---|---|
compare | The name of the dataset whose records will be compared and, if matched, excluded from the output of the current step. |
with | The name of the dataset whose records will supply the matching patterns for the comparison. The records are not modified; this step only queries the dataset. |
field.names | The names of the fields used to match between the two datasets. The field names must be identical in name and type. A row is excluded if all of the fields are equal between the datasets. |
Deriver type = `select`.
Configuration suffix | Description |
---|---|
step | The name of the dependency step from which to select columns as output of the current step. |
include-fields | List of column names that are required in the output for the current step. If the input dataset schema doesn’t contain the column name(s) then the deriver will generate a runtime error. |
exclude-fields | List of column names that are not required in the output for the current step. If the input dataset schema doesn’t contain the column name(s) then the deriver will generate a runtime error. include-fields and exclude-fields cannot both be provided at the same time. |
Deriver type = `dq`.
Configuration suffix | Description |
---|---|
scope | Required. The scope at which to apply the DQ deriver. |
rules | Required. A nested object of rules. Each defined object should contain a field |
checknulls | |
fields | Required. The list of fields to check. The contents should be a list of strings. |
enum | |
fields | Required. String list of field names. |
fieldtype | Optional. Type of the field to check for defined values: must be |
values | Required. List of values. For strings and decimals define the values using string literals. For integral types use number literals. |
case-sensitive | Optional. For string values, whether the value matches should be case-sensitive. Defaults to true. |
range | |
fields | Required. List of field names on which to apply the range checks. |
fieldtype | Optional. The field type to use when doing range checks. Range values will be interpreted as this type. Must be numeric: allowed values are |
range | Required. Two element list of numeric literals, e.g. |
ignore-nulls | Optional. If |
regex | |
fields | Required. String list of field names, which should all have type |
regex | Required. Regular expression with which to match field values. Note that extra escape parameters are not required. For example to match any number up to 999 you could use: |
count | |
expected.literal | Either this or |
expected.dependency | Either this or |
checkschema | |
schema | The schema definition. Refer to the Schema documentation |
exactmatch | Optional. Whether the schema of the Rows must exactly match the specified schema. If false the actual row can contain other fields not specified in the |
Deriver type = `distinct`.
Configuration suffix | Description |
---|---|
step | The name of the dataset whose records will be deduplicated. Only required if there is more than one dependency, otherwise optional. |
Deriver type = `in-list`.
Configuration suffix | Description |
---|---|
step | The name of the dataset whose records will be filtered based on the supplied list of values. Only required if there is more than one dependency, otherwise optional. If provided, the dataset must be present in the list of dependencies. |
field | The name of the field in the dataset’s schema whose values will be compared with the supplied list of values. Only required if the dataset schema contains more than one field, otherwise optional. |
values.literal | A list of values that will be used as a filter against designated |
values.reference.step | Step whose records will be used to generate a set of values to filter records against. Can only be specified when literal list ( |
values.reference.field | The name of the field in |
values.reference.batch-size | The size of the filter batch when generating the values of the |
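For example, an in-list deriver filtering one dependency by the values found in another might be sketched as below; all step and field names are hypothetical.

```
steps {
  ordersForActiveCustomers {
    dependencies = [orders, activeCustomers]
    deriver {
      type = in-list
      step = orders
      field = customer_id
      values.reference.step = activeCustomers
      values.reference.field = customer_id
    }
  }
}
```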
Deriver type = `hash`.
Configuration suffix | Description |
---|---|
step | The name of the dataset whose records will be hashed. Only required if there is more than one dependency, otherwise optional. |
hash-field | The name of the field that will be added with the hash string. Default 'hash'. |
delimiter | The delimiter that the deriver will use to concatenate the field values of a row. Default empty string. |
null-string | The string that the deriver will use in place of NULLs when concatenating the field values of a row. Default '__NULL__'. |
include-fields | The list of field names that will contribute to the hash. Default is all fields are included. Can not be used with |
exclude-fields | The list of field names that will not contribute to the hash. Default is no fields are excluded. Can not be used with |
Deriver type = `latest`.
Configuration suffix | Description |
---|---|
step | The name of the dataset whose records will be filtered. Only required if there is more than one dependency, otherwise optional. |
key-fields | The list of field names that make up the key of the dataset. The result of this deriver will be exactly one record per unique key in the dependency step dataset. |
timestamp-field | The name of the field used to order the records for an individual key. Only the record with the highest value of this field for a key will be included in the deriver result. |
Deriver type = `translate`.
Configuration suffix | Description |
---|---|
step | The name of the dataset that contains the field to be translated. |
field | The name of the field to be translated. |
translator | The configuration object for the translator that will translate the field. See the derivers guide for more information on this syntax. |
Deriver type = `sparkml`.
Configuration suffix | Description |
---|---|
step | The name of the dataset that the pipeline model will be executed over. Only required if there is more than one dependency, otherwise optional. |
model-path | The path to the pipeline model directory that was created by the Spark ML pipeline model save. |
Deriver type = `parse-json`.
Configuration suffix | Description |
---|---|
step | The name of the dependency step that contains the field that contains the JSON strings. |
field | The name of the field that contains the JSON strings. |
as-struct | Whether to place the parsed fields within their own struct field, instead of directly on the record. Default false. |
struct-field | If |
schema | The schema of the JSON records. Refer to the Schema documentation |
option.* | Passes through Spark-specific JSON options to Spark. For example, |
Partitioner configurations belong to data steps, and have the `steps.[stepname].partitioner.` prefix.
Configuration suffix | Description |
---|---|
type | The partitioner type to be used. Envelope provides |
Planner configurations belong to data steps, and have the `steps.[stepname].planner.` prefix. For more information on planners see the planners guide.
Configuration suffix | Description |
---|---|
type | The planner type to be used. Envelope provides |
Planner type = `append`.
Configuration suffix | Description |
---|---|
fields.key | The list of field names that make up the natural key of the record. Only required if |
field.last.updated | The field name for the last updated attribute. If specified then Envelope will add this field and populate it with the system timestamp string. |
uuid.key.enabled | If |
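An append planner paired with an output might be sketched as below; the key field, table name, and output type are illustrative only.

```
steps {
  writeOrders {
    dependencies = [filteredOrders]
    planner {
      type = append
      fields.key = [order_id]           // hypothetical natural key
      field.last.updated = last_updated
    }
    output {
      type = hive
      table = "prod.orders"             // hypothetical table
    }
  }
}
```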
Planner type = `bitemporal`.
Configuration suffix | Description |
---|---|
fields.key | The list of field names that make up the natural key of the record. |
fields.values | The list of field names that are used to determine if an arriving record is different to an existing record. |
fields.timestamp | The list of field names of the event time of the record. |
fields.event.time.effective.from | The list of field names of the event-time effective-from timestamp attribute on the output. |
fields.event.time.effective.to | The list of field names of the event-time effective-to timestamp attribute on the output. |
fields.system.time.effective.from | The list of field names of the system-time effective-from timestamp attribute on the output. |
fields.system.time.effective.to | The list of field names of the system-time effective-to timestamp attribute on the output. |
field.surrogate.key | The field name of the surrogate key string attribute on the output. If this configuration is set the planner will populate the field with a UUID string for new records. |
field.current.flag | The field name of the current flag attribute on the output. |
current.flag.value.yes | The flag indicating a current record. Overrides the default value (Y). |
current.flag.value.no | The flag indicating a non-current record. Overrides the default value (N). |
carry.forward.when.null | If |
time.model.event | The time model for interpreting the event time of the arriving and existing records, and for generating the event time effective from/to values. |
time.model.system | The time model for interpreting the system time of the existing records, and for generating the system time effective from/to values. |
Planner type = `eventtimeupsert`.
Configuration suffix | Description |
---|---|
fields.key | The list of field names that make up the natural key of the record. |
field.last.updated | The field name for the last updated attribute. If specified then Envelope will add this field and populate it with the system timestamp. |
fields.timestamp | The list of field names of the event time of the record. |
fields.values | The list of field names that are used to determine if an arriving record is different to an existing record. |
field.surrogate.key | The field name of the surrogate key string attribute on the output. If this configuration is set the planner will populate the field with a UUID string for new records. |
time.model.event | The time model for interpreting the event time of the arriving and existing records. |
time.model.last.updated | The time model for generating the last updated values. |
Planner type = `history`.
Configuration suffix | Description |
---|---|
fields.key | The list of field names that make up the natural key of the record. |
fields.values | The list of field names that are used to determine if an arriving record is different to an existing record. |
fields.timestamp | The list of field names of the event time of the record. |
fields.effective.from | The list of field names of the event-time effective-from timestamp attribute on the output. |
fields.effective.to | The list of field names of the event-time effective-to timestamp attribute on the output. |
field.current.flag | The field name of the current flag attribute on the output. |
current.flag.value.yes | The flag indicating a current record. Overrides the default value (Y). |
current.flag.value.no | The flag indicating a non-current record. Overrides the default value (N). |
fields.last.updated | The list of field names for the last updated attribute. If specified then Envelope will add this field and populate it with the system timestamp. |
field.surrogate.key | The field name of the surrogate key string attribute on the output. If this configuration is set the planner will populate the field with a UUID string for new records. |
carry.forward.when.null | If |
time.model.event | The time model for interpreting the event time of the arriving and existing records, and for generating the effective from/to values. |
time.model.last.updated | The time model for generating the last updated values. |
Planner type = `upsert`.
Configuration suffix | Description |
---|---|
field.last.updated | The field name for the last updated attribute. If specified then Envelope will add this field and populate it with the system timestamp string. |
Time model configurations belong to planners, and have the `steps.[stepname].planner.time.model.[timename]` prefix. For more information on time models see the planners guide.
Configuration suffix | Description |
---|---|
type | The time model type to be used. Envelope provides |
Time model type = `nanoswithseqnum`.
This time model has no custom configurations.
Time model type = `stringdate`.
Configuration suffix | Description |
---|---|
format | The Java SimpleDateFormat format of the date values. Default "yyyy-MM-dd". |
Time model type = `stringdatetime`.
Configuration suffix | Description |
---|---|
format | The Java SimpleDateFormat format of the date-time values. Default "yyyy-MM-dd HH:mm:ss.SSS". |
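Time models are nested under the planner using the prefix above. As an illustrative sketch, a history planner might declare its event and last-updated time models as below; all field names are hypothetical.

```
steps {
  writeCustomerDim {
    planner {
      type = history
      fields.key = [customer_id]               // hypothetical fields throughout
      fields.values = [name, address]
      fields.timestamp = [event_ts]
      fields.effective.from = [effective_from]
      fields.effective.to = [effective_to]
      fields.last.updated = [last_updated]
      time.model.event {
        type = stringdatetime
        format = "yyyy-MM-dd HH:mm:ss.SSS"
      }
      time.model.last.updated {
        type = stringdate
        format = "yyyy-MM-dd"
      }
    }
  }
}
```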
Output configurations belong to data steps, and have the `steps.[stepname].output.` prefix.
Configuration suffix | Description |
---|---|
type | The output type to be used. Envelope provides |
Output type = `filesystem`.
Configuration suffix | Description |
---|---|
path | The Hadoop filesystem path to write as the output. Typically a Cloudera EDH will point to HDFS by default. Use |
format | The file format for the files of the output directory. Envelope supports formats |
partition.by | The list of columns to partition the write output. Optional. |
separator | (csv) Spark option |
quote | (csv) Spark option |
escape | (csv) Spark option |
escape-quotes | (csv) Spark option |
quote-all | (csv) Spark option |
header | (csv) Spark option |
null-value | (csv) Spark option |
compression | (csv) Spark option |
date-format | (csv) Spark option |
timestamp-format | (csv) Spark option |
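A filesystem output writing partitioned CSV might be sketched as below; the path, partition column, and option values are hypothetical.

```
steps {
  exampleStep {
    output {
      type = filesystem
      path = "s3a://example-bucket/output/orders"   // hypothetical path
      format = csv
      header = true
      separator = ","
      partition.by = [ingest_date]                  // hypothetical partition column
    }
  }
}
```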
Output type = `hive`.
Configuration suffix | Description |
---|---|
table | The name of the Hive table targeted for write. The name can include the database prefix, e.g. |
location | Optional. The HDFS location for the underlying files of a table. Typically only defined during table creation, during which the table is created as |
partition.by | Optional. The list of Hive table partition names to dynamically partition the write by. |
align.columns | If |
options | Used to pass additional configuration parameters. The parameters are set as a Map object and passed directly to the Spark DataFrameWriter. |
Output type = `jdbc`.
Configuration suffix | Description |
---|---|
url | The JDBC URL for the remote database. |
tablename | The name of the table of the remote database to write as the output. |
username | The username to use to connect to the remote database. |
password | The password to use to connect to the remote database. |
Output type = `kafka`.
Configuration suffix | Description |
---|---|
brokers | Required. The hosts and ports of the brokers of the Kafka cluster, in the form |
topic | Required. The Kafka topic to write to. |
serializer.type | Required. The type of serialization to use for writing the row in to the topic. Valid types are |
serializer.field.delimiter | Required if |
serializer.use.for.null | Used if |
serializer.schema.path | Required if |
parameter.* | Used to pass configurations directly to the Kafka client. The |
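A Kafka output might be sketched as below. The broker and topic values are placeholders, and the serializer type value is an assumption since the valid types are truncated in the table above.

```
steps {
  publishOrders {
    output {
      type = kafka
      brokers = "broker1:9092,broker2:9092"   // hypothetical brokers
      topic = enriched-orders                 // hypothetical topic
      serializer.type = delimited             // assumed serializer type
      serializer.field.delimiter = ","
    }
  }
}
```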
Output type = `kudu`.
Configuration suffix | Description |
---|---|
connection | The hosts and ports of the masters of the Kudu cluster, in the form "host1:port1,host2:port2,…,hostn:portn". |
table.name | The name of the Kudu table to write to. |
insert.ignore | Ignore duplicate rows in Kudu (default: true). |
ignore.missing.columns | Ignore writing columns that do not exist in the Kudu schema (default: false). |
secure | Is the target Kudu cluster secured by Kerberos? This must be set to |
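A Kudu output might be sketched as below; the master addresses and table name are placeholders, and the secure flag is assumed to be a boolean toggle.

```
steps {
  writeToKudu {
    output {
      type = kudu
      connection = "kudu-master1:7051,kudu-master2:7051"   // hypothetical masters
      table.name = "impala::prod.orders"                   // hypothetical table
      insert.ignore = true
      ignore.missing.columns = false
      secure = true   // assumed boolean toggle
    }
  }
}
```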
Output type = `log`.
Configuration suffix | Description |
---|---|
delimiter | The delimiter string to separate the field values with. Default is |
level | The log4j level for the written logs. Default is |
Output type = `hbase`.
Configuration suffix | Description |
---|---|
table.name | Required. The table for the output, specified in the format |
zookeeper | Optional. In non-secure setups it is not a strict requirement to supply an hbase-site.xml file on the classpath, so the ZooKeeper quorum can be specified with this property with the usual HBase configuration syntax. Note that this will supersede any quorum specified in any hbase-site.xml file on the classpath. |
hbase.conf.* | Optional. Pass-through options to set on the HBase connection. For example: `hbase { conf { hbase.client.retries.number = 5 hbase.client.operation.timeout = 30000 } }`. Note that non-String parameters are automatically cast to Strings, but the underlying HBase code will do any required conversions from String. |
mapping.serde | Optional. The fully qualified class name of the implementation to use when converting Spark |
mapping.rowkey.columns | Required for |
mapping.rowkey.separator | Optional. The separator to use when constructing the row key. This is interpreted as a Unicode string, so for binary separators use the |
mapping.columns | Required for … For example: `mapping.columns { symbol { cf = "rowkey" col = "symbol" type = "string" } transacttime { cf = "rowkey" col = "transacttime" type = "long" } clordid { cf = "cf1" col = "clordid" type = "string" } orderqty { cf = "cf1" col = "orderqty" type = "int" } }` |
batch.size | Optional. An integer value with default 1000. The number of mutations to accumulate before making an HBase RPC call. For larger cell sizes you may want to reduce this number or increase the relevant client buffers. |
Output type = `zookeeper`.
Configuration suffix | Description |
---|---|
connection | The ZooKeeper quorum to connect to, in the format |
schema | The schema definition. Refer to the Schema documentation |
key.field.names | The list of field names that constitute the unique key of the output. Must be a subset of |
znode.prefix | The znode path prefix that the data will be stored under. Used to isolate the use of the output from other uses of the output, and from non-Envelope paths in ZooKeeper. Default |
session.timeout.millis | The client session timeout in milliseconds. Default |
connection.timeout.millis | The client connection timeout in milliseconds. Default |
For more information on tasks see the tasks guide.
Task configurations are provided at the task step level (i.e. alongside `type` and `class`).
Task class = `exception`.
Configuration suffix | Description |
---|---|
message | The message that will be included on the exception. Mandatory. |
Task class = `impala_ddl`.
Configuration suffix | Description |
---|---|
host | (Required.) The Impala daemon or load balancer fully qualified hostname to which to connect. |
port | (Optional.) Port on which to connect to the Impala daemon. Defaults to 21050. |
auth | (Optional.) Authentication method to use. Allowed values are |
debug | (Optional.) Display debug information about authentication. |
krb-keytab | (Required if using |
krb-user-principal | (Required if using |
krb-realm | (Optional.) If using a non-default realm, specify it in this parameter. Otherwise the default realm is extracted from |
krb-ticket-renew-interval | (Optional.) Time in seconds in which to re-obtain a Kerberos TGT. If not specified it is derived from the default ticket lifetime of TGTs from |
ssl | (Optional.) Whether TLS is enabled on the JDBC connection to Impala. Defaults to |
ssl-truststore | (Optional.) JKS truststore to use when validating the Impala TLS server certificate. Defaults to the in-built JRE truststore. |
ssl-truststore-password | (Optional.) If the supplied truststore requires a password to read certificates, supply it here. Defaults to empty. |
username | (Required if using |
password | (Required if using |
query.type | (Required.) The DDL operation to perform. Currently supported: |
query.table | (Required.) The table name for the DDL operation. |
query.partition.spec | (Required if operation is |
query.partition.location | (Optional.) A location of a partition on HDFS to be specified in the DDL operation. |
query.partition.range | (Required if operation is |
query.partition.range.value | (Required if using range and boundaries not supplied.) An absolute numeric value for the lower bound of a Kudu range partition. |
query.partition.range.start | (Required if using range and value not supplied.) An absolute numeric value for the lower bound of a Kudu range partition. Defaults to inclusive. |
query.partition.range.end | (Required if using range and value not supplied.) An absolute numeric value for the upper bound of a Kudu range partition. Defaults to exclusive. |
query.partition.range.inclusivity | (Optional.) A string indicating the range operator of the lower and upper bound, "i" for inclusive and "e" for exclusive. Allowed values are "ie", "ii", "ei", "ee". |
For more information on repetitions see the repetitions guide.
The general configuration parameters for repetitions are:
Configuration suffix | Description |
---|---|
type | Required. The repetition type to be used. Envelope provides |
min-repeat-interval | Optional. To prevent steps being reloaded too frequently, this represents the minimum interval between repetitions. The value is interpreted as a Typesafe Config duration, e.g. |
Repetition type = `scheduled`.
Configuration suffix | Description |
---|---|
every | Required. The interval between repetitions. The value is interpreted as a Typesafe Config duration, e.g. |
Repetition type = `flagfile`.
Configuration suffix | Description |
---|---|
file | Required. The path to the flag file. Accepts a fully qualified URI (recommended). If not qualified with a filesystem scheme, the default filesystem implementation will be used (usually HDFS). |
trigger | Optional. The mode of the trigger functionality. Can either be |
poll-interval | Optional. How often the flag file will be checked. The value is interpreted as a Typesafe Config duration, e.g. |
fail-after | To prevent intermittent failures to contact the filesystem from killing the job, the repetition will only raise an exception after this many consecutive failures. Defaults to 10. |
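As an illustration, a flagfile repetition might be configured as below. The block is shown on its own; where it nests within a step is described in the repetitions guide. The block name and file path are hypothetical, and the trigger value is an assumption.

```
reloadReferenceData {
  type = flagfile
  file = "hdfs:///flags/reload_reference_data"   // hypothetical flag file
  trigger = present                              // assumed trigger mode
  poll-interval = 30s
  fail-after = 10
  min-repeat-interval = 15m
}
```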
Envelope automatically validates the pipeline configuration before starting execution. This feature can be disabled by setting `configuration.validation.enabled = false` either at the top-level for the whole pipeline, or within any scope that would be validated.
The configurations of a custom Envelope plugin (e.g. a custom deriver) can also be validated by implementing the ProvidesValidations
interface. In the less common case that the plugin has its own plugins (similarly to how the data quality deriver has pluggable rules) then the higher-level plugin can implement the InstantiatesComponents
interface to provide its own plugins to Envelope for configuration validation. For both of these interfaces see the Envelope code for various examples of their implementations.
Spark SQL user-defined functions (UDFs) are provided as a list of UDF specifications under `udfs`, where each specification has the following:
Configuration suffix | Description |
---|---|
name | The name of the UDF that will be used in SQL queries. |
class | The fully qualified class name of the UDF implementation. |
Envelope provides a number of ways to define the schema for components. Data type mappings are outlined in the Data Type Support section below.
Schema type = `flat`.
Configuration suffix | Description |
---|---|
field.names | The list of field names in the schema. |
field.types | The list of field types in the schema. |
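Wherever a component accepts a schema (for example a translator or the filesystem input), a flat schema might be sketched as below, using types from the Data Type Support list further down; the field names are hypothetical.

```
schema {
  type = flat
  field.names = [symbol, price, as_of]
  field.types = [string, double, timestamp]
}
```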
Schema type = `avro`.
Configuration suffix | Description |
---|---|
filepath | The path to a file containing the Avro schema definition. |
literal | The literal JSON string defining the Avro schema. |
Envelope supports the following Spark data types when defining a schema in-line (for example, using `schema.type = flat`):
- string
- byte
- short
- int
- long
- float
- double
- decimal(precision,scale) or (decimal, which defaults to (10,0) per DataTypes.createDecimalType)
- boolean
- binary
- date
- timestamp
When using an Avro schema to define the Spark schema (for example, `schema.type = avro`), either via an inline Avro literal or a supporting Avro file, the following Spark data types are supported:
Avro Type | Data Type |
---|---|
record | StructType |
array | Array |
map | Map (note: keys must be Strings) |
union | StructType (each column representing the union elements, named |
bytes, fixed | Binary |
string, enum | String |
int | Integer |
long | Long |
float | Float |
double | Double |
boolean | Boolean |
null | Null |
date (LogicalType, as | Date |
timestamp-millis (LogicalType, as | Timestamp |
decimal (LogicalType, as | Decimal |
When using a Protobuf schema to define the Spark schema (for example, `schema.type = protobuf`), the following Spark data types are supported:
Protobuf FieldDescriptor Type | Data Type |
---|---|
BOOLEAN | Boolean |
BYTE_STRING | Binary |
DOUBLE | Double |
ENUM | String |
FLOAT | Float |
INT | Integer |
LONG | Long |
MESSAGE | StructType |
STRING | String |