SqlTypeHandler - Proposing an additional way to transform assets stored in databases #17848
j-blackwell
started this conversation in
Show and tell
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
full implementation: #17825
TLDR: transform assets using SQL queries rather than in-memory transformations
Explanation
For non-DBT users, there is not currently a coherent way to manipulate assets via SQL whilst taking advantage of a lot of powerful functionality that IO managers bring, e.g.
This existing functionality is especially useful when using partitioned assets as these queries become non-trivial and makes interacting with the tables directly in the database (without this functionality) pretty cumbersome.
Therefore, could we create a type handler that can use SQL queries to apply transformations directly on the database while taking advantage of existing functionality: Instead of loading/saving data via dataframes, could we send an
INSERT INTO
query to execute on the database, or construct aSELECT
statement.A lot of the code is already implemented with the
TableSlice
and its use within the dagster database clients. This keeps the asset code incredibly clean.Much of the remaining code is adapted from this dagster blog post, but converting from a separate IO manager into a type handler for the existing database IO managers gives some additional great functionality.
Examples
Basic example
Define an asset, and refer to it in a downstream asset.
Static partitioned asset
We don't have to manual handle the partitions. When loading the input of
test_sql_asset0
in the downstream asset, the type handler is able to return the query that only concerns the partition in we're running.We also don't have to manually delete the old records, and construct the
INSERT INTO
query from within the asset.Date partitioned asset
Works same as above.
Multi partitioned asset
Works the same as above.
Multiple type handlers
Until writing this handler, I wasn't completely familiar with the full power of type handlers for the database IO managers. But a great use-case emerges when you want to do some transformations in SQL and some in memory using a more powerful library.
Assets can be loaded as SQL jobs, and loaded in downstream assets as pandas dataframes, and so on. This gives you the option to use the best framework for the job on the asset level! Whilst storing all assets in the same database.
Summary
This gives non-DBT users extra flexibility to execute SQL queries against assets within a database, while keeping the powerful functionality of IO managers and keeping a consistent workflow.
This helps us in 3 main areas:
Let me know what you think of the concept, as well as any comments on the full implementation: #17825
Beta Was this translation helpful? Give feedback.
All reactions