How should the SDK deal with settings for `load_schema` and `schema_mappings` for SQL-based targets #1084

aaronsteers · 2022-10-18T23:22:07Z

Discussion on spec to implement here:

For SQL-based targets, add built-in handling for schema_mapping #1086 (comment)

Pipelinewise has a precedent of using default_target_schema for the default load schema, and schema_mapping to allow overrides/remappings based on the upstream source system name.

For ref: https://github.com/transferwise/pipelinewise-target-snowflake#configuration-settings

Other implementations use schema as the setting name instead of default_target_schema, but I don't prefer this because schema is a highly overloaded term that is often applied to JSON Schema (which this isn't). The names default_target_schema and load_schema both seem clearer to me.

I slightly prefer load_schema over schema or default_target_schema as the name of the default load schema setting, but I don't feel strongly about it, and I can see some benefit in using default_target_schema, which has significant precedent. I don't see any need to rename schema_mapping because it seems clear, concise, and relatively intuitive, although admittedly we don't need to add this feature immediately.

Now that we are building out the SQL target capabilities in the SDK, it would be helpful to put some thought into the naming here.

Does anyone else have strong feelings about this?

cc @tayloramurphy, @edgarrmondragon, @kgpayne

From the Pipelinewise docs about the behavior of each of their 2 settings:

default_target_schema - Name of the schema where the tables will be created, without database prefix. If schema_mapping is not defined then every stream sent by the tap is loaded into this schema.
schema_mapping - Useful if you want to load multiple streams from one tap to multiple Snowflake schemas. If the tap sends the stream_id in <schema_name>-<table_name> format then this option overwrites the default_target_schema value.
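
As a rough illustration of how these two settings interact, here is a minimal sketch of the lookup described in the Pipelinewise docs. The function name `resolve_target_schema` is hypothetical, and the shape of `schema_mapping` (source schema name mapped to a dict with a `target_schema` key) is an assumption for illustration; neither is confirmed SDK API.

```python
from typing import Optional


def resolve_target_schema(
    stream_id: str,
    default_target_schema: Optional[str] = None,
    schema_mapping: Optional[dict] = None,
) -> Optional[str]:
    """Pick the schema to load a stream into (hypothetical helper)."""
    # Assumption: stream IDs look like "<schema_name>-<table_name>".
    source_schema, sep, _table = stream_id.partition("-")
    if sep and schema_mapping and source_schema in schema_mapping:
        # A matching mapping entry overrides the default.
        return schema_mapping[source_schema]["target_schema"]
    # No mapping entry applies: fall back to the default (may be None).
    return default_target_schema


mapping = {"public": {"target_schema": "raw_public"}}
print(resolve_target_schema("public-users", "analytics", mapping))  # raw_public
print(resolve_target_schema("orders", "analytics", mapping))  # analytics
```

Streams without a matching mapping entry, or without a `-` in their ID, land in the default schema.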

kgpayne · 2022-10-19T15:08:15Z

I believe #1036 partially implements the schema_mapping case from Pipelinewise (PPW): if the stream_name can be split on a -, the schema from the split is used. Otherwise, the schema_name property returns None (as it does currently), relying on the default schema provided by the SQLAlchemy connection.
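
The splitting behavior described above can be sketched roughly as follows. This is a standalone illustration, not the code from #1036; the function name `schema_from_stream_name` is hypothetical, and treating a three-part name as `<db>-<schema>-<table>` is an assumption.

```python
from typing import Optional


def schema_from_stream_name(stream_name: str) -> Optional[str]:
    """Return the schema part of a '-'-delimited stream name, if any."""
    parts = stream_name.split("-")
    if len(parts) in (2, 3):
        # "<schema>-<table>" or (assumed) "<db>-<schema>-<table>":
        # the schema is the second-to-last part.
        return parts[-2]
    # Not splittable into a recognizable shape: return None so the
    # connection's default schema applies.
    return None
```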

Currently, the SDK does not provide a way to set the schema/search path on the connection. The only options are a snippet like the one @BuzzCutNorman suggests here, or passing connect_args to the SQLAlchemy create_engine() method, which is called by create_sqlalchemy_engine on the SQLConnector class in the SDK here.

To resolve this in #1036 I suggest we:

Support connect_args in settings for all SQL-based taps/targets in addition to the existing sqlalchemy_url for use with create_engine so that users can provide a search path via config (as well as any other options, such as timeouts etc.) without overriding any methods in the SDK.
Document its use for setting the source and load schema in taps/targets.

e.g.

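import sqlalchemy

# Inside the connector's engine-creation method: pass through any
# driver-specific connection arguments supplied in config.
engine = sqlalchemy.create_engine(
    self.sqlalchemy_url,
    connect_args=self.config.get("sqlalchemy_connect_args", {}),
    echo=False,
)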

with the config snippet:

sqlalchemy_url: 
sqlalchemy_connect_args:
  options: "-csearch_path=my_schema"
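
For context, the `options` value above uses PostgreSQL's `options` connection parameter, where `-c name=value` sets a runtime setting for the session. As far as I know this is specific to PostgreSQL drivers built on libpq, such as psycopg2, rather than a general DBAPI feature. A minimal sketch of building that value from a list of schemas follows; the helper name `search_path_connect_args` is hypothetical, and it assumes plain unquoted identifiers.

```python
import re
from typing import Dict, Sequence

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def search_path_connect_args(schemas: Sequence[str]) -> Dict[str, str]:
    """Build a connect_args dict that sets the PostgreSQL search path."""
    if not schemas:
        raise ValueError("At least one schema is required")
    for name in schemas:
        # Assumption: only simple identifiers, so no quoting is needed.
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Unsupported schema name: {name!r}")
    return {"options": "-csearch_path=" + ",".join(schemas)}


print(search_path_connect_args(["my_schema"]))
# {'options': '-csearch_path=my_schema'}
```

The resulting dict is what would be supplied as sqlalchemy_connect_args in the config shown above.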

We could then (additionally and in a future PR) support a load_schema on targets to override both the connection schema (if supplied) and the stream-name derived schema (if splittable) and therefore force the target schema.
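
The precedence described here could look something like the following sketch. `load_schema` is the setting name proposed in this thread; the function name `effective_schema` and its arguments are hypothetical.

```python
from typing import Optional


def effective_schema(
    load_schema: Optional[str],
    stream_schema: Optional[str],
) -> Optional[str]:
    """Choose the schema for a stream (hypothetical precedence sketch)."""
    if load_schema:
        # An explicit load_schema forces the target schema.
        return load_schema
    # Otherwise use the stream-name-derived schema; None means "defer to
    # the connection's schema or search path".
    return stream_schema
```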

@aaronsteers @BuzzCutNorman WDYT?

2 replies

BuzzCutNorman Oct 19, 2022

I like the addition of connect_args. I am not familiar with the -csearch_path option, will give it a try and let you know how it goes.

BuzzCutNorman Oct 19, 2022

Worked like a charm. @kgpayne where did you run into that? I couldn't find it in the SQLAlchemy documentation but did see it mentioned with the psycopg2 driver in Stack Overflow examples. Is this a DBAPI option that all compliant drivers would respond to?

tayloramurphy · 2022-10-23T18:26:22Z

Maintainer

@aaronsteers no super strong feelings. I like load_schema as well. I have no problems with schema_mapping.

3 replies

aaronsteers Oct 28, 2022
Author

Regarding the naming of these, I was originally inclined to remove the default_ prefix from default_[load|target]_schema. Then I realized that the default_ prefix is there in part to communicate that the 'default' will be overridden when schema_mapping is also provided. Without the word 'default' as context, it seems less clear how names will be resolved if load_schema and schema_mapping are both provided.

So, I'd be inclined to use either:

default_load_schema
schema_mapping

Or just stick with Pipelinewise precedent:

default_target_schema
schema_mapping

All other factors being equal, I don't think default_load_schema is enough of an improvement over default_target_schema. For perhaps a very minor improvement in phrasing, I don't think it is worth breaking from precedent and having both names floating around as common implementations of basically the same configuration spec.

@tayloramurphy, @kgpayne, @edgarrmondragon (and all):

Q: Given the above, what do you think of just keeping schema_mapping and default_target_schema as our SDK-default setting names when we implement this?

From the Pipelinewise docs, those definitions again would be:

default_target_schema - Name of the schema where the tables will be created, without database prefix. If schema_mapping is not defined then every stream sent by the tap is loaded into this schema.
schema_mapping - Useful if you want to load multiple streams from one tap to multiple Snowflake schemas. If the tap sends the stream_id in <schema_name>-<table_name> format then this option overwrites the default_target_schema value.

edgarrmondragon Oct 28, 2022
Maintainer

> Q: Given the above, what do you think of just keeping schema_mapping and default_target_schema as our SDK-default setting names when we implement this?

I like it. We stick to precedent, and users switching from another variant to an SDK-based one have fewer new things to learn.

tayloramurphy Nov 1, 2022
Maintainer

@aaronsteers thanks for the write-up. I'm in favor of your proposal.

kgpayne · 2022-11-02T16:27:01Z

Notes from Office Hours on Nov 2, 2022:

Suggestion from @visch that default_target_schema is i) a lighter lift to implement and ii) sufficient as a first pass to be useful.
Consensus on the call that schema_mapping sounds interesting, but is not a pressing requirement for this pass.

Outcome: deliver default_target_schema first and on its own, and tackle schema_mapping in its own issue.

0 replies
