SqlTypeHandler - Proposing an additional way to transform assets stored in databases #17848

j-blackwell · 2023-11-09T13:01:05Z

full implementation: #17825

TLDR: transform assets using SQL queries rather than in-memory transformations

Explanation

For non-DBT users, there is currently no coherent way to manipulate assets via SQL whilst still taking advantage of the powerful functionality that IO managers bring, e.g.

deleting old records and inserting new records
generating select queries

This existing functionality is especially useful with partitioned assets, where these queries become non-trivial and interacting with the tables directly in the database (without this functionality) is pretty cumbersome.

Therefore, could we create a type handler that uses SQL queries to apply transformations directly on the database while taking advantage of existing functionality? Instead of loading and saving data via dataframes, it would send an INSERT INTO query to execute on the database, or construct a SELECT statement.
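To make the idea concrete, here is a minimal sketch of the two statements such a handler would need to run when an asset is stored: a DELETE for the rows being replaced, followed by an INSERT INTO that wraps the asset's SELECT. The function names `delete_statement` and `insert_statement` are hypothetical and are not taken from #17825, and the naive string quoting is only for illustration.

```python
from typing import Optional, Sequence


def delete_statement(
    table: str,
    partition_expr: Optional[str] = None,
    partition_keys: Sequence[str] = (),
) -> str:
    """Build the statement that clears the rows about to be replaced."""
    if partition_expr is None or not partition_keys:
        # Unpartitioned asset: the whole table is replaced.
        return f"DELETE FROM {table}"
    keys = ", ".join(f"'{key}'" for key in partition_keys)
    return f"DELETE FROM {table} WHERE {partition_expr} IN ({keys})"


def insert_statement(table: str, select_query: str) -> str:
    """Wrap the asset's SELECT so the data never leaves the database."""
    return f"INSERT INTO {table} {select_query}"
```

Running the delete before the insert is what makes re-materialising a partition idempotent, which is the behaviour the existing database IO managers already provide for dataframes.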

A lot of the code is already implemented with the TableSlice and its use within the dagster database clients. This keeps the asset code incredibly clean.

Much of the remaining code is adapted from this dagster blog post, but converting from a separate IO manager into a type handler for the existing database IO managers gives some additional great functionality.

Examples

Basic example

Define an asset, and refer to it in a downstream asset.

from dagster import asset

# SqlQuery is provided by the proposed implementation (#17825).

@asset(io_manager_key="sql_io_manager")
def test_sql_asset0():
    return SqlQuery("SELECT 1 AS test_col")

@asset(io_manager_key="sql_io_manager")
def test_sql_asset1(test_sql_asset0):
    return SqlQuery(
        "SELECT test_col + 1 AS new_test_col FROM $test_sql_asset0",
        test_sql_asset0=test_sql_asset0,
    )
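For readers wondering how the `$test_sql_asset0` placeholder could be resolved, the following is a rough sketch built on Python's standard `string.Template`. `SqlQuerySketch` is a hypothetical stand-in written for this illustration; it is not the `SqlQuery` class from #17825, whose internals may differ.

```python
from string import Template


class SqlQuerySketch:
    """Hypothetical stand-in for SqlQuery: a template plus named bindings."""

    def __init__(self, template: str, **bindings):
        self.template = template
        self.bindings = bindings

    def render(self) -> str:
        rendered = {}
        for name, value in self.bindings.items():
            if isinstance(value, SqlQuerySketch):
                # An upstream query is inlined as a parenthesised subquery.
                rendered[name] = f"({value.render()})"
            else:
                rendered[name] = str(value)
        return Template(self.template).substitute(rendered)


# The upstream query is what a type handler might hand to the downstream asset.
upstream = SqlQuerySketch("SELECT * FROM public.test_sql_asset0")
downstream = SqlQuerySketch(
    "SELECT test_col + 1 AS new_test_col FROM $test_sql_asset0",
    test_sql_asset0=upstream,
)
rendered = downstream.render()
```

Some databases require an alias on a subquery in the FROM clause, so a real implementation would likely need to add one.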

Static partitioned asset

We don't have to manually handle the partitions. When loading the input test_sql_asset0 in the downstream asset, the type handler returns a query that only concerns the partition we're running.
We also don't have to manually delete the old records or construct the INSERT INTO query from within the asset.

@asset(
    io_manager_key="sql_io_manager",
    partitions_def=StaticPartitionsDefinition(["a", "b"]),
    metadata={"partition_expr": "partition_col"}
)
def test_sql_asset0():
    # don't filter for partition key for purpose of example
    return SqlQuery(
        "SELECT * FROM "
        "(VALUES (1, 'a'), (2, 'a'), (3, 'b')) "
        "my_table(test_col, partition_col)"
    )

@asset(
    io_manager_key="sql_io_manager",
    partitions_def=StaticPartitionsDefinition(["a", "b"]),
    metadata={"partition_expr": "partition_col"}
)
def test_sql_asset1(test_sql_asset0: SqlQuery):
    return SqlQuery(
        "SELECT test_col + 1 AS new_test_col, partition_col FROM $test_sql_asset0",
        test_sql_asset0=test_sql_asset0
    )
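As an illustration of what "the query that only concerns the partition" could look like, here is a small sketch of the SELECT a handler might return when loading a static partition. The helper name `select_for_partitions` and the table name in the usage line are assumptions made for this example.

```python
from typing import Sequence


def select_for_partitions(
    table: str, partition_expr: str, partition_keys: Sequence[str]
) -> str:
    """Build a SELECT restricted to the partitions of the current run."""
    keys = ", ".join(f"'{key}'" for key in partition_keys)
    return f"SELECT * FROM {table} WHERE {partition_expr} IN ({keys})"


# Query handed to the downstream asset when partition "a" is running.
partition_a_query = select_for_partitions(
    "public.test_sql_asset0", "partition_col", ["a"]
)
```

The `partition_expr` value comes from the asset's metadata, which is why the asset body itself never mentions the partition key.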

Date partitioned asset

Works the same as above.

@asset(
    io_manager_key="sql_io_manager",
    partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"),
    metadata={"partition_expr": "partition_col"}
)
def test_sql_asset0():
    # don't filter for partition key
    return SqlQuery(
        "SELECT * FROM "
        "(VALUES (1, '2023-01-01 00:00:00'), (2, '2023-01-02 00:00:00')) "
        "my_table(test_col, partition_col)"
    )

@asset(
    io_manager_key="sql_io_manager",
    partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"),
    metadata={"partition_expr": "partition_col"}
)
def test_sql_asset1(test_sql_asset0: SqlQuery):
    return SqlQuery(
        "SELECT test_col + 1 AS new_test_col, partition_col FROM $test_sql_asset0",
        test_sql_asset0=test_sql_asset0
    )
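For a daily partition the filter becomes a half-open time window rather than an IN list. A minimal sketch, assuming the partition key is a `YYYY-MM-DD` string and that timestamps are compared as formatted literals; `daily_where_clause` is a hypothetical helper, not part of the proposed implementation.

```python
from datetime import datetime, timedelta


def daily_where_clause(partition_expr: str, partition_key: str) -> str:
    """Build a WHERE clause covering one day: [start, start + 1 day)."""
    start = datetime.strptime(partition_key, "%Y-%m-%d")
    end = start + timedelta(days=1)
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f"{partition_expr} >= '{start.strftime(fmt)}' "
        f"AND {partition_expr} < '{end.strftime(fmt)}'"
    )


clause = daily_where_clause("partition_col", "2023-01-01")
```

Using an exclusive upper bound avoids double-counting rows that fall exactly on midnight of the next day.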

Multi partitioned asset

Works the same as above.
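For a multi-partitioned asset, one plausible approach is to build one filter per dimension and join them with AND. The sketch below assumes each dimension's clause has already been built as a string; the helper name and the column names are hypothetical.

```python
from typing import Mapping


def multi_where_clause(clauses_by_dimension: Mapping[str, str]) -> str:
    """Combine per-dimension filters into a single WHERE clause body."""
    # Sort by dimension name so the generated SQL is deterministic.
    return " AND ".join(
        f"({clauses_by_dimension[name]})" for name in sorted(clauses_by_dimension)
    )


combined = multi_where_clause(
    {
        "date": "date_col >= '2023-01-01 00:00:00' AND date_col < '2023-01-02 00:00:00'",
        "color": "color_col IN ('a')",
    }
)
```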

Multiple type handlers

Until writing this handler, I wasn't completely familiar with the full power of type handlers for the database IO managers. But a great use-case emerges when you want to do some transformations in SQL and some in memory using a more powerful library.

An asset can be stored by one type handler and loaded by another: for example, stored from a pandas dataframe and loaded downstream as a SQL query, or the reverse. This gives you the option to use the best framework for the job at the asset level, whilst storing all assets in the same database.

@asset(io_manager_key="sql_io_manager")
def test_sql_asset0() -> pd.DataFrame:
    return pd.DataFrame({"a": [1, 2, 3]})

@asset(io_manager_key="sql_io_manager")
def test_sql_asset1(test_sql_asset0: SqlQuery) -> SqlQuery:
    query = SqlQuery("SELECT * FROM $test_sql_asset0 WHERE a > 1", test_sql_asset0=test_sql_asset0)
    return query

# e.g. duckdb
multi_duckdb_io_manager = build_duckdb_io_manager(
    type_handlers=[DuckDBSqlTypeHandler(), DuckDBPandasTypeHandler()],
    default_load_type=SqlQuery,
)
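The selection between handlers can be pictured as a lookup keyed on the type annotation, falling back to `default_load_type` when the input is not annotated. This is a simplified sketch of that idea and not dagster's actual dispatch code; the placeholder classes and `pick_handler` are invented for the illustration.

```python
from typing import Dict, Optional


class FakeDataFrame:
    """Placeholder for pd.DataFrame, to keep the sketch dependency-free."""


class FakeSqlQuery:
    """Placeholder for the proposed SqlQuery type."""


def pick_handler(
    handlers: Dict[type, str],
    annotation: Optional[type],
    default_load_type: Optional[type] = None,
) -> str:
    """Return the handler name for an annotated (or unannotated) input."""
    load_type = annotation if annotation is not None else default_load_type
    if load_type not in handlers:
        raise TypeError(f"No type handler registered for {load_type!r}")
    return handlers[load_type]


handlers = {FakeSqlQuery: "sql_handler", FakeDataFrame: "pandas_handler"}
annotated = pick_handler(handlers, FakeDataFrame, default_load_type=FakeSqlQuery)
unannotated = pick_handler(handlers, None, default_load_type=FakeSqlQuery)
```

With `default_load_type=SqlQuery`, unannotated inputs such as the one in the basic example would resolve to the SQL handler.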

Summary

This gives non-DBT users extra flexibility to execute SQL queries against assets within a database, while keeping the powerful functionality of IO managers and keeping a consistent workflow.

This helps us in 3 main areas:

Saving the effort of refactoring old SQL queries when migrating projects to dagster
Saving on network ingress/egress costs when loading data from GCP into dagster cloud memory
Enabling us to query datasets that will not fit (or should not be processed) in memory

Let me know what you think of the concept, as well as any comments on the full implementation: #17825
