Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support Singer Decimal as a config item #1890

Open
s7clarke10 opened this issue Aug 1, 2023 · 8 comments
Open

feat: Support Singer Decimal as a config item #1890

s7clarke10 opened this issue Aug 1, 2023 · 8 comments

Comments

@s7clarke10
Copy link
Contributor

s7clarke10 commented Aug 1, 2023

Feature scope

Taps (catalog, state, stream maps, tests, etc.)

Description

There have been discussions before about whether it is a best practice to emit Decimal, Integer, and Float datatypes as a string or in their native formats.

I have mentioned this in passing in this JSON dicussion #1046 . The author of msgspec strongly recommends sending numeric data as a string.

There was an informal standard called Singer Decimal where the tap would emit the Numeric data as a string (and if possible) try and turn it back into a numeric format in the target.

You may ask, why would you do this? The key reasons are :

  1. Rounding issues with float datatypes
  2. Inability to handle large numbers or decimal places
  3. Python not handling the numeric data well
  4. The tap does not have enough information in the database schema to hint the correct format. For example in Oracle it is a common practice to specify a datatype of number without supplying a precision or scale. The only accurate way to determine the correct precision and scale in the target is to sample the data. In this situation the only reasonable way to send this data is as a string, and then use a transform tool to cast the data.

For a number of database related taps we have implement a setting called use_singer_decimal. When this is set to true, it will send the data as a string but provides additional information about the scale and precision (if available) in the singer messages. When the target receives the message, if it has this additional singer decimal data, it can cast the data with the correct datatype. The good thing about this approach is that it is a setting, therefore based on your use case, you can emit numeric data in its native numeric format (default setting), or preferable enable the use_singer_decimal setting to gain more accurate numeric data.

We have found by using this methodology, there is no loss of precision. It would be great to introduce this ability into the SDK as it will ensure accurate numeric data being sent.

Examples of enhanced database taps which support this are:

Example enhanced target with support for decoding singer-decimal data:

More detail for tap-db2 written by @mjsqu .

DECIMALs from DB2 are given format property "singer.decimal" in the schema, with type property "number". The numeric scale & precision are stored as schema[additionalProperties][scale_precision], e.g. "additionalProperties":{"scale_precision":"(15,9)"}.

When the tap writes the RECORD message, "singer.decimal" informs the data value to be output into the message as a string
The snowflake target obtains the scale/precision from the schema message to create a column of type NUMBER(scale, precision).

@s7clarke10 s7clarke10 added kind/Feature New feature or request valuestream/SDK labels Aug 1, 2023
@edgarrmondragon
Copy link
Collaborator

So, I think there's two parts to this AFAICT.

Support at the tap

This is arguably already support by the typing helpers via CustomType:

jsonschema = PropertiesList(
  Property("my_number", CustomType({"type": "string", "format": "singer.decimal"})),
)

A nicer wrapper could also be added.

Support at the target

to_sql_type could be adapted to map the singer.decimal format to the appropriate SQL column type.

@s7clarke10
Copy link
Contributor Author

Hi @edgarrmondragon,

To show an example of my usage of Singer Decimal. I have a working copying along with the usage of msgspec on the traditional singer-python frameworks used by some non-SDK taps. I am seeing some good performance from doing this. https://github.com/s7clarke10/pipelinewise-singer-python.

For testing this I have created a branch on pipelinewise-tap-oracle to use msgspec.

Of note is the use of a boolean config setting use_singer_decimal in the tap. As a user of a tap, I have a choice whether I wish to have decimal and float data emitted as a string - "use_singer_decimal": true (to avoid precision loss), or in their native formats for backwards compatibility. It seems to work well, and produces the same results when I was using orjson.

If that setting is omitted it will out default to outputting decimals and floats in their lossy native format - important for backwards compatibility.

I should note however, there is logic in a tap around the schema definition to define the correct JSON data types when it is discovered so that the records matching the schema.

Here is an example of output with use_singer_decimal set to true.

(pipelinewise-tap-oracle) [ pipelinewise-tap-oracle]$ tap-oracle -c oracle_config_test.json --catalog oracle_properties_test.json
time=2023-08-17 10:12:04 name=singer level=INFO message=Selected streams: ['STEVE-STEVE_TEST'] 
time=2023-08-17 10:12:04 name=singer level=INFO message=No currently_syncing found
time=2023-08-17 10:12:04 name=singer level=INFO message=Beginning sync of stream(STEVE-STEVE_TEST) with sync method(full)
time=2023-08-17 10:12:04 name=singer level=INFO message=Stream STEVE-STEVE_TEST is using full_table replication
time=2023-08-17 10:12:04 name=singer level=INFO message=Singer Decimal Enabled! Floats and Decimals will be output as strings
time=2023-08-17 10:12:04 name=singer level=INFO message=Selected streams: ['STEVE-STEVE_TEST'] 
time=2023-08-17 10:12:04 name=singer level=INFO message=No currently_syncing found
time=2023-08-17 10:12:04 name=singer level=INFO message=Beginning sync of stream(STEVE-STEVE_TEST) with sync method(full)
time=2023-08-17 10:12:04 name=singer level=INFO message=Stream STEVE-STEVE_TEST is using full_table replication
time=2023-08-17 10:12:04 name=singer level=INFO message=Singer Decimal Enabled! Floats and Decimals will be output as strings
{"type":"SCHEMA","stream":"STEVE-STEVE_TEST","schema":{"properties":{"COL1":{"maxLength":100,"type":["null","string"]},"COL2":{"maxLength":50,"type":["null","string"]},"DECIMAL_FIELD":{"format":"singer.decimal","type":["null","string"],"additionalProperties":{"scale_precision":"(10,2)"}},"INTEGER_FIELD":{"type":["null","integer"]},"FLOAT_FIELD":{"format":"singer.decimal","type":["null","string"]}},"type":"object"},"key_properties":[]}
time=2023-08-17 10:12:04 name=singer level=INFO message=dsn: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.131.2.102)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=pndww2db.moh.govt.nz)))
{"type":"STATE","value":{"bookmarks":{"STEVE-STEVE_TEST":{"last_replication_method":"FULL_TABLE","version":1692223924745}},"currently_syncing":"STEVE-STEVE_TEST"}}
{"type":"ACTIVATE_VERSION","stream":"STEVE-STEVE_TEST","version":1692223924745}
time=2023-08-17 10:12:04 name=singer level=INFO message=select SELECT  "COL1" , "COL2" , "DECIMAL_FIELD" , "INTEGER_FIELD" , "FLOAT_FIELD" , NULL as ORA_ROWSCN
                                FROM STEVE.STEVE_TEST
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":"Decimal Test","COL2":"201.35","DECIMAL_FIELD":"201.35","INTEGER_FIELD":null,"FLOAT_FIELD":null},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":"Integer Test","COL2":"438437439439374974943","DECIMAL_FIELD":null,"INTEGER_FIELD":438437439439374974943,"FLOAT_FIELD":null},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":"Float Test","COL2":"3.14159265359","DECIMAL_FIELD":null,"INTEGER_FIELD":null,"FLOAT_FIELD":"3.14159265359"},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":null,"COL2":"Hello","DECIMAL_FIELD":null,"INTEGER_FIELD":null,"FLOAT_FIELD":null},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":"World","COL2":null,"DECIMAL_FIELD":null,"INTEGER_FIELD":null,"FLOAT_FIELD":null},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
time=2023-08-17 10:12:04 name=singer level=INFO message=METRIC: b'{"type":"counter","metric":"record_count","value":5,"tags":{"schema":"STEVE","table":"STEVE_TEST"}}'
{"type":"ACTIVATE_VERSION","stream":"STEVE-STEVE_TEST","version":1692223924745}
{"type":"STATE","value":{"bookmarks":{"STEVE-STEVE_TEST":{"last_replication_method":"FULL_TABLE","version":1692223924745,"ORA_ROWSCN":null}},"currently_syncing":null}}

Copy link

stale bot commented Aug 15, 2024

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@edgarrmondragon
Copy link
Collaborator

I've reopened cause someone brought it up in Slack1, and commenting here on minimal requirements for shipping this:

  1. New typing wrapper, e.g. SingerDecimal
  2. New capability enum member for taps and targets to communicate support for this

As for more implementation details, I'm not a fan of using additionalProperties for this, since it's not compliant with the JSON Schema spec:

The value of the additionalProperties keyword is a schema that will be used to validate any properties in the instance that are not matched by properties or patternProperties. Setting the additionalProperties schema to false means no additional properties will be allowed.2

... but I'm willing to compromise if its use is common enough in the Singer ecosystem.

Footnotes

  1. https://meltano.slack.com/archives/C069A0GE129/p1725897289536329

  2. https://json-schema.org/understanding-json-schema/reference/object#additionalproperties

@s7clarke10
Copy link
Contributor Author

s7clarke10 commented Sep 10, 2024

Adding the comments I added to the slack channel.

So when we transfer data from databases like SQL Server, Sybase, DB2, Oracle, we contributed to the Meltano Hub default variants of these taps to allow you to emit the numeric data e.g. decimal / float data as strings i.e. not scientific notation - this requires the singer_decimal config item to be set to true. When this is set, logic in the tap also provides additional information we require to convert it from a string back into a number in the target loader once it has been transferred without any loss in precision. This is achieved by the taps having logic built-in to also give some hints as to what the original column definition was in the original source system e.g. "additionalProperties":{"scale_precision":"(38,20)".

We have our own variant of the original pipelinewise target snowflake which has been patched and a few bugs removed - https://github.com/mjsqu/pipelinewise-target-snowflake . This variant has logic to recognize the singer decimal and make use of the additional properties. If you use this variant, it will automatically convert and store the data as a number with the correct scale and precision as per the source system.

This has been very important for us as we found during our testing that there is a loss of precision if the numeric based decimal and floats were not emitted as strings via the enablement of singer decimal. I have been meaning to raise a PR to the Meltanolabs version of target-snowflake to add our logic so it could be our preferred loader.

It is worth noting for Oracle Databases however that singer decimal is harder as Oracle doesn't really set a scale and precision. It is often set as just number and the user can store a mix of integers and decimals. If this is the case we will not emit any precision and scale and that data will be transferred as a string. It would be the responsibility of the data transform e.g. dbt to set an appropriate data type as it is too dangerous to guess what is correct. We will still transfer it as a string to avoid python and scientific notation leading rounding issues.

@s7clarke10
Copy link
Contributor Author

s7clarke10 commented Sep 10, 2024

This https://github.com/mjsqu/pipelinewise-target-snowflake/blob/e2a216c3cfa3d56fe66902e611812fe146528996/target_snowflake/db_sync.py#L99-L100 section in our variant of target-snowflake which will set the correct datatype in Snowflake for singer-decimal columns.

The actual PR is here : mjsqu/pipelinewise-target-snowflake@558a0d6

A couple of notable enhancements we made to the original pipelinewise version.

  1. Fixing an issue with NULL's in an INT leading to the conversion of a Float datatype causing rounding - Integers with NULL's can lead to rounding when using the Parquet file format transferwise/pipelinewise-target-snowflake#404
  2. Adding the ability to set no proxy to allow internal routing within AWS avoid going out via the public network - Feature/boto no proxy mjsqu/pipelinewise-target-snowflake#4
  3. Ability to set a retention period for the table - Feat/disable retention mjsqu/pipelinewise-target-snowflake#27
  4. Ability to retain the column order as per the source system - Remove ordering of columns mjsqu/pipelinewise-target-snowflake#3
  5. Ability to have append and truncate options for loaded data - Feature/truncate drop option mjsqu/pipelinewise-target-snowflake#7

@edgarrmondragon
Copy link
Collaborator

Got a draft PR built on top on recent JSON schema mapping improvements: #2786

@s7clarke10
Copy link
Contributor Author

Thank you @edgarrmondragon, I think this will be a very useful feature for people building database taps and targets to receive the singer_decimal hint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

When branches are created from issues, their pull requests are automatically linked.

2 participants