Skip to content

Latest commit

 

History

History
335 lines (256 loc) · 14.2 KB

MultiTableDatabase-batchsource.md

File metadata and controls

335 lines (256 loc) · 14.2 KB

MultiTableDatabase Batch Source

Description

Reads from multiple tables within a database using JDBC. Often used in conjunction with the DynamicMultiFileset sink to perform dumps from multiple tables to HDFS files in a single pipeline. The source will output a record for each row in the tables it reads, with each record containing an additional field that holds the name of the table the record came from. In addition, for each table that will be read, this plugin will set pipeline arguments where the key is 'multisink.[tablename]' and the value is the schema of the table. This is to make it work with the DynamicMultiFileset.

Properties

Reference Name: This will be used to uniquely identify this source for lineage, annotating metadata, etc.

JDBC Connection String: The JDBC connection string to the database. For example: jdbc:mysql://HOST/DATABASE.

Database User Name: The username to use when connecting to the database.

Database User Password: The password to use when connecting to the database.

JDBC Plugin Name: The name of the JDBC plugin to use.

Enable Auto Commit: Whether to enable auto-commit for queries run by this source. In most cases, set to false. If you use a JDBC driver that results in an error when the commit operation is run, set to 'true'.

Data Selection Mode: The operation mode for this plugin. There are 3 possible values:

  1. Allow List: Define the list of database tables to include.
  2. Block List: Define the list of database tables to exclude.
  3. SQL Statements: Manually define each of the SQL Statement to execute.

Depending on the selected Data Selection Mode, the following options will be available:

Table Name Pattern: A pattern that defines which tables should be read from. Any table whose name matches the pattern will read. If not specified, all tables will be read. Pattern syntax is specific to the type of database that is being connected to. This option is only displayed when the Data Selection Mode is either Allow List or Block List.

Schema Name Pattern: A pattern that defines which schemas should be used to list the tables. Any schema whose name matches the pattern will read. If not specified, all schema will be read. Pattern syntax is specific to the type of database that is being connected to. This option is only displayed when the Data Selection Mode is either Allow List or Block List.

Where Clause: Filters which records needs to be consumed from each table: i.e. where updated_at > '2018-08-20 00:00:00'. The where clause will be applied to every table that is being read. Therefore, all the columns that are mentioned in the where clause should be present in each table. This option is only displayed when the Data Selection Mode is either Allow List or Block List.

Allowlist of Table Names: Used in conjunction with tableNamePattern, this configuration specifies tables to be read. If no value is specified in the whiteList all tables matching the tableNamePattern will be read. By default reads all tables matching the tableNamePattern. This option is only displayed when the Data Selection Mode is Allow List.

Blocklist of Table Names: Used in conjunction with tableNamePattern, this configuration specifies the tables to be skipped. By default the black list is empty which means no tables will be skipped. This option is only displayed when the Data Selection Mode is Block List.

Splits Per Table: The number of splits per table. By default is 1. This option is only displayed when the Data Selection Mode is either Allow List or Block List.

SQL Statements: List of SQL statements to execute. Each statement will be handled as a different partition. When submitting this statements using the API, use a semicolon ; as a separator. If the SQL statement includes a semicolon ; character, you will need to escape it using \;. This option is only displayed when the Data Selection Mode is SQL Statements.

Table Aliases: List of aliases to use for the datasets generated by the supplied SQL statements. If supplied, the Table Alias for each SQL statement will be used as the Table Name value for each record. The SQL statements will be matched to a Table Alias based on the order in which they appear on the list. This option is only displayed when the Data Selection Mode is SQL Statements.

Table Name Field: The name of the field that holds the table name. Must not be the name of any table column that will be read. Defaults to 'tablename'.

Transaction Isolation Level: The transaction isolation level for queries run by this sink. Defaults to TRANSACTION_SERIALIZABLE. See java.sql.Connection#setTransactionIsolation for more details. The Phoenix jdbc driver will throw an exception if the Phoenix database does not have transactions enabled and this setting is set to true. For drivers like that, this should be set to TRANSACTION_NONE.

Fetch Size: The number of rows to fetch at a time per split. Larger fetch size can result in faster import, with the tradeoff of higher memory usage.

Custom SQL Statements

When using the Data Selection Mode called SQL Statements, the supplied list of SQL statements will be executed as supplied using the specified database connection.

If the query contains a semicolon ; character as part of the query, this character must be escaped using a backslash \;.

Every record that is generated will have a Table Name field (defined by the Table Name Field property, which defaults to tablename) that identifies the source of each record.

If a Table Alias for a given SQL statement is spacified, this value will be returned as the Table Name for the records generated using this SQL statement.

It a Table Alias for a given statement is not provided, the Table Names will be generated using a best-guess approach based on the JDBC API's Result Set Metadata. Please note that not all JDBC drivers offer this functionality. If multiple tables are used in one statement, the resulting table name will be a concatenation of all the distinct table names present in the returned rows for this table, in order of first appearance.

If the Table Name(s) cannot be derived from the Result Set Metadata (because the JDBC driver doesn't support this functionality, for example), the derived Table Name attribute will be in the pattern of sql_statement_<number> where the Number is the position of the SQL statement in the list of statements (starting at position 1).

See the Derived Table Name Examples section for more details.

Example

This example reads from all tables in the 'customers' database on host 'host123.example.net':

{
    "name": "MultiTableDatabase",
    "type": "batchsource",
    "properties": {
        "connectionString": "jdbc:mysql://host123.example.net/customers",
        "jdbcPluginName": "mysql",
        "splitsPerTable": "2"
    }
}

Suppose you have two tables in the 'customers' database, where ID column is the primary key in both tables. The first table is named 'accounts' and contains:

+-----+----------+------------------+
| ID  | name     | email            |
+-----+----------+------------------+
| 0   | Samuel   | [email protected] |
| 1   | Alice    | [email protected]    |
| 2   | Bob      | [email protected]    |
| 3   | John     | [email protected]    |
+-----+----------+------------------+

The second is named 'activity' and contains:

+-----+--------+----------+--------+
| ID  | userid | item     | action |
+-----+--------+----------+--------+
| 0   | 0      | shirt123 | view   |
| 1   | 0      | carxyz   | view   |
| 2   | 0      | shirt123 | buy    |
| 3   | 0      | coffee   | view   |
| 4   | 1      | cola     | buy    |
| 5   | 1      | pepsi    | buy    |
+-----+--------+----------+--------+

You will have 4 splits (2 per each table) with such queries:

SELECT * FROM accounts WHERE ( ID >= 0 ) AND ( ID < 1 )
SELECT * FROM accounts WHERE ( ID >= 2 ) AND ( ID <= 3 )
SELECT * FROM activity WHERE ( ID >= 0 ) AND ( ID < 3 )
SELECT * FROM activity WHERE ( ID >= 3 ) AND ( ID <= 5 )

The output of the the source will be the following records:

+-----+----------+------------------+-----------+
| ID  | name     | email            | tablename |
+-----+----------+------------------+-----------+
| 0   | Samuel   | [email protected] | accounts  |
| 1   | Alice    | [email protected]    | accounts  |
| 2   | Bob      | [email protected]    | accounts  |
| 3   | John     | [email protected]    | accounts  |
+-----+----------+------------------+-----------+
+-----+--------+----------+--------+-----------+
| ID  | userid | item     | action | tablename |
+-----+--------+----------+--------+-----------+
| 0   | 0      | shirt123 | view   | activity  |
| 1   | 0      | carxyz   | view   | activity  |
| 2   | 0      | shirt123 | buy    | activity  |
| 3   | 0      | coffee   | view   | activity  |
| 4   | 1      | cola     | buy    | activity  |
| 5   | 1      | pepsi    | buy    | activity  |
+-----+--------+----------+--------+-----------+

SQL Statements Example

Suppose you have two tables in the 'customers' database, where ID column is the primary key in both tables. The first table is named 'accounts' and contains:

+-----+----------+------------------+
| ID  | name     | email            |
+-----+----------+------------------+
| 0   | Samuel   | [email protected] |
| 1   | Alice    | [email protected]    |
| 2   | Bob      | [email protected]    |
| 3   | John     | [email protected]    |
+-----+----------+------------------+

The second is named 'activity' and contains:

+-----+--------+----------+--------+
| ID  | userid | item     | action |
+-----+--------+----------+--------+
| 0   | 0      | shirt123 | view   |
| 1   | 0      | carxyz   | view   |
| 2   | 0      | shirt123 | buy    |
| 3   | 0      | coffee   | view   |
| 4   | 1      | cola     | buy    |
| 5   | 1      | pepsi    | buy    |
+-----+--------+----------+--------+

The output of the following query:

SELECT
  accounts.ID AS account_id,
  accounts.name,
  accounts.email,
  activity.ID AS activity_id,
  activity.item,
  activity.action
FROM
  accounts JOIN
  activity ON activity.userid = accounts.ID

Will be the following records:

+------------+--------+------------------+-------------+----------+--------+-------------------+
| account_id | name   | email            | activity_id | item     | action | tablename         |
+------------+--------+------------------+-------------+----------+--------+-------------------+
| 0          | Samuel | [email protected] | 0           | shirt123 | view   | accounts_activity |
| 0          | Samuel | [email protected] | 1           | carxyz   | view   | accounts_activity |
| 0          | Samuel | [email protected] | 2           | shirt123 | buy    | accounts_activity |
| 0          | Samuel | [email protected] | 3           | coffee   | view   | accounts_activity |
| 1          | Alice  | [email protected]    | 4           | cola     | buy    | accounts_activity |
| 1          | Alice  | [email protected]    | 5           | pepsi    | buy    | accounts_activity |
+------------+--------+------------------+-------------+----------+--------+-------------------+

Derived Table Names

Only one table per query

Here are some example tables and the derived table names:

The resulting records for the following query:

SELECT 
  users.id, 
  users.name 
FROM 
  users 
WHERE 
  users.name = 'John'

Will have the table name users

Joined tables

If the query joins multiple tables, the order in which the columns are returned defines the derived table name:

The resulting records for the following query:

SELECT 
  u.id, 
  c.id 
FROM 
  users u JOIN 
  comments c ON u.id = c.user_id 
WHERE 
  u.name = 'John'

Will have the table name users_comments. However, the resulting records for the following query:

SELECT 
  c.id, 
  u.id 
FROM 
  users u JOIN 
  comments c ON u.id = c.user_id 
WHERE 
  u.name = 'John'

Will have the table name comments_users as the first returned column comes from the comments table.

Note that, as mentioned earlier, the table names are concatenated in order of first appereance, without duplicates:

SELECT 
   u.id, 
   s.upvotes, 
   s.downvotes, 
   c.id, 
   u.username,
   c.timestamp,
   s.id
FROM 
   users u JOIN 
   comments c ON u.id = c.user_id JOIN 
   scores s ON s.comment_id = c.id
WHERE 
  u.name = 'John'

Will have the table name users_scores_comments.

If table names cannot be derived

There are instances where a table name cannot be derived. In this case, table names will follow the follwing pattern:

sql_statement_<number>

The <number> in this statement will be the position of this query in the list of statements, starting with position 1.

Examples

For example, if the JDBC driver is able to provide the Table Name from a SQL statement ResultSet, the derived table name will look as follows:

SQL Statement Table Alias JDBC Driver support Derived Table Name
SELECT * FROM table1 my_alias_1 Yes my_alias_1
SELECT * FROM table2 Yes table2
SELECT * FROM users JOIN orders ON ... <empty? Yes users_orders

In case the JDBC driver is NOT able to provide the Table Name from a SQL statement ResultSet, the derived table name will look as follows:

SQL Statement Table Alias JDBC Driver support Derived Table Name
SELECT * FROM table1 my_alias_1 No my_alias_1
SELECT * FROM table2 No sql_statement_2
SELECT * FROM users JOIN orders ON ... <empty? No sql_statement_3