
Timezone Transformation/Casting #338

Open
joshuajohananorthos opened this issue Jul 22, 2024 · 8 comments
Labels: enhancement (New feature or request)


joshuajohananorthos commented Jul 22, 2024

Feature Description

Is there a way to force the timezone in a file? I am selecting out of Snowflake into Parquet (I have tried JSON Lines as well) and it always outputs the time in UTC. The logical type in the Parquet file is UTC, but the column is not in UTC, so I get the incorrect time when I use the Parquet file.

I know timezones are the bane of any sort of datetime field. I have looked through Discord and the GitHub issues, and I know I can change the column when selecting, or run a post_sql to change it after inserting.

My understanding is that Parquet always stores timestamps in UTC. I think I am asking for something like datetime_format, but only for the timezone: a source option that adds timezone info to the timestamp. It would be optional; if supplied, Sling would take the datetime as given and translate it into UTC using the supplied timezone. The output would still always be UTC, but with an optional translation step before it is written.

An example:

| Snowflake column: LASTUPDATED (TIMESTAMP_NTZ) | No timezone option | America/Los_Angeles option |
| --- | --- | --- |
| 2024-07-21 03:24:24.508 | 2024-07-21 03:24:24.508 UTC -> 2024-07-20 20:24:24.508 PDT | 2024-07-21 10:24:24.508 UTC -> 2024-07-21 03:24:24.508 PDT |
| No timezone info, just a point in time. | Datetime set to UTC in the Parquet file, then ingested into a database that assumed Pacific on insertion (Snowflake's default), moving the timestamp back 7 hours. | Datetime still set to UTC, but starting from the correct timezone. |

To summarize: a source option, defaulting to UTC (I believe this is the current behavior), that calculates the actual UTC timestamp without having to write custom SQL. It would apply to every datetime column that cannot be inferred to be in a specific timezone.

I can try to add this myself, but I wanted to get your thoughts before starting.

joshuajohananorthos added the enhancement label on Jul 22, 2024
flarco (Collaborator) commented Jul 27, 2024

Hi, thanks for the write-up and details. I've put some draft changes in the parquet.go file, but thinking it over, I'm not too sure how to provide the input. I'm reluctant to add yet another env var, or a new option just for the Parquet timezone. I also considered using the target option datetime_format, but what you're actually suggesting is somewhat of a transform, so it doesn't fit the "format" input type. A transform would work great, but it would need a parameter input for the offset, and currently transforms don't accept parameters... Let me think it over.

If you want to put those changes in your parquet.go and build a binary, it should theoretically work. You can set the offset with cast.ToInt(os.Getenv("SLING_PARQUET_TZ_OFFSET")) or something similar.

type ParquetWriter struct {
	Writer      *parquet.Writer
	WriterMap   *parquet.GenericWriter[map[string]any]
	tzOffset    int64 // timezone offset in seconds, applied to datetime columns
	columns     Columns
	decNumScale []*big.Rat
	recBuffer   []map[string]any
}

// SetTzOffsetValue stores the offset, given in hours, as seconds.
func (pw *ParquetWriter) SetTzOffsetValue(offset int64) {
	pw.tzOffset = offset * 60 * 60
}

func (pw *ParquetWriter) WriteRec(row []any) error {
	rec := map[string]any{}

	for i, col := range pw.columns {
		switch {
		case col.IsBool():
			rec[col.Name] = cast.ToBool(row[i]) // since it is stored as string
		case col.Type == FloatType:
			rec[col.Name] = cast.ToFloat64(row[i])
		case col.Type == DecimalType:
			rec[col.Name] = StringToDecimalByteArray(cast.ToString(row[i]), pw.decNumScale[i], arrowParquet.Types.FixedLenByteArray, 16)
		case col.IsDatetime() || col.IsDate():
			switch valT := row[i].(type) {
			case time.Time:
				if row[i] != nil {
					// write a unix timestamp at the column's precision, shifting
					// by the offset (seconds, scaled to the matching unit)
					switch col.DbPrecision {
					case 1, 2, 3:
						rec[col.Name] = valT.UnixMilli() + pw.tzOffset*1000
					case 4, 5, 6:
						rec[col.Name] = valT.UnixMicro() + pw.tzOffset*1000000
					case 7, 8, 9:
						rec[col.Name] = valT.UnixNano() + pw.tzOffset*1000000000
					default:
						rec[col.Name] = valT.UnixNano() + pw.tzOffset*1000000000
					}
				}
			default:
				rec[col.Name] = row[i]
			}
		default:
			rec[col.Name] = row[i]
		}
	}

	pw.recBuffer = append(pw.recBuffer, rec)

	// flush to the underlying parquet writer in batches of 100 records
	if len(pw.recBuffer) == 100 {
		err := pw.writeBuffer()
		if err != nil {
			return g.Error(err, "error writing buffer")
		}
	}

	return nil
}
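
As a usage sketch, the offset could then be wired in wherever the writer is constructed; SLING_PARQUET_TZ_OFFSET here is just the ad-hoc env var suggested above, not an existing Sling setting:

// hypothetical wiring, per the suggestion above (not an actual Sling env var)
pw.SetTzOffsetValue(cast.ToInt64(os.Getenv("SLING_PARQUET_TZ_OFFSET"))) // e.g. -7 for PDT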

joshuajohananorthos (Author) commented

I have thought about this more, and I agree that adding another env variable or option does not seem like the right way to go. What if it weren't Parquet-specific? My example was Parquet and I focused on it, but it seems like more of a column-definition problem than a file problem. The code above would fix Parquet, but a JSON Lines file would still have the incorrect date.

I checked the source and found a column type, TimestampzType. I only looked at it quickly, so I could be wrong, but it seems to be used when casting selects in most of the database connectors. I did not see it in parquet.go or json.go. Is there a way to use the columns definition in target options together with the timestampz type?

Something like: can Sling infer what timezone the timestamp column is in (DB-specific behavior)? If not, does the definition in the target options say (something like timestampz(-0400))? And if neither of the above, fall back to the current behavior and return it in UTC.
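
For concreteness, that proposal could look something like this in the columns option (purely hypothetical syntax, not something Sling supports today; the column name is made up):

columns:
  lastupdated: timestampz(-0400)  # hypothetical: naive column whose values are known to be at -0400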

flarco (Collaborator) commented Jul 29, 2024

I like timestampz(-0400) 👍.
But that's on the target side, which conflicts with the logical placement of the columns option, currently under source_options. I've been grappling with this one for a while, as well as with transforms, so I think it's time for those two to be moved into the middle. See #348.
A timezone for the target system as a whole (including file systems) should also be possible, so that column-level definitions are not needed.

flarco (Collaborator) commented Jul 29, 2024

I am also leaning towards transforms with parameters. This will open more possibilities for transforms in general. transforms: [to_timezone(-0400)] would work nicely as well, since it applies to all columns by default.

# all columns (only applies to timestamp/datetime columns)
transforms: [to_timezone(-0400)]

# column specific
transforms: 
  my_col_ts: [to_timezone(-0400)]

joshuajohananorthos (Author) commented Aug 15, 2024

I see that you have closed #348 and merged the changes, which allow defining a column type or transform at the root of the options. Does that open the door to what we discussed here?

I can take a look at adding the transform in the middle. Another thing I have been thinking about is Daylight Saving Time: with a static timezone modifier, the result will be incorrect for half the year.

I have -0700 in my examples, but only because it is Snowflake's default. I am in the Eastern timezone, where we are -0400 or -0500 depending on the time of year.

It looks like the Go standard library has a way to load timezone location info from a string: https://pkg.go.dev/time#LoadLocation. You can load America/New_York and always get the exact offset, whether it is daylight or standard time.
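
A minimal self-contained sketch of that approach (the naive value is made up for illustration):

package main

import (
	"fmt"
	"time"
)

func main() {
	// DST rules come from the IANA tz database, so the offset is
	// correct year-round (-0400 in summer, -0500 in winter)
	loc, err := time.LoadLocation("America/New_York")
	if err != nil {
		panic(err)
	}

	// a naive timestamp, as read from a TIMESTAMP_NTZ column
	naive, _ := time.Parse("2006-01-02 15:04:05", "2024-08-15 13:53:08")

	// reinterpret the wall-clock value in that location, then normalize to UTC
	localized := time.Date(naive.Year(), naive.Month(), naive.Day(),
		naive.Hour(), naive.Minute(), naive.Second(), 0, loc)
	fmt.Println(localized.UTC()) // 2024-08-15 17:53:08 +0000 UTC (EDT is -0400)
}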

And with columns, is there a way to specify that a column is just a timestamp and not a timestampz? That is really the root of the problem I had: UTC was applied to the timestamp, which later threw it off by 7 hours, because Snowflake saw UTC and then applied its default timezone.

I tried it with sling 1.2.15 and this is what it output:
2024-08-15 13:53:08 DBG using: {"columns":{"CREATEDDATETIME":"timestamp"},"transforms":null}

Here is the output of using Python to inspect the Parquet file:

from the schema:
optional int64 field_id=-1 CREATEDDATETIME (Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
from the pyarrow table:
CREATEDDATETIME: timestamp[ns, tz=UTC]

flarco (Collaborator) commented Aug 18, 2024

> Does that open the door to what we discussed here?

Yes, that's the plan.

Interesting, good point about DST. I agree that it should be a location name instead of the numerical offset.

So the transform will be something like to_timezone('America/New_York'). If the column is already timestampz, it will actually convert the value to the desired timezone. And if the column is timestamp (with no timezone), it will not change the value, but will set the timezone and change the column type to timestampz in the DDL of the target table.

About the timestamp type: yes, actually any of these values will work: https://github.com/slingdata-io/sling-cli/blob/main/core/dbio/iop/datatype.go#L51-L66, but that will not change the Parquet output, since that is hard-coded and writes a unix timestamp (which is always in UTC).

joshuajohananorthos (Author) commented

Yes, I think we are on the same page here. I am thinking that always keeping the internal type in UTC would make it easier to figure out how to output it. The transform function just tells Sling how to get the datetime to UTC if it is not stored that way in the DB to start with.

I work mainly with MS SQL Server, Postgres, and Snowflake. All of them have column types to store a datetime both with and without a timezone.

With Timezone info

Snowflake current time in Eastern, as TIMESTAMP_TZ: 2024-08-19 11:30:00 -0400. From here, either infer from the column type and value that it is Eastern (or at least that storing it in UTC means adding 4 hours), or have a transform stating it is in 'America/New_York', which would also add 4 hours. Either way we end with 2024-08-19 15:30:00 +0000. The DDL then outputs TIMESTAMP_TZ and inserts the UTC value, landing us with the same value on both sides. This also works for files, as you noted about Parquet.

I am not sure which of these is closer to how Sling currently works, but inferring or transforming should end in the same spot: a timestamp with a timezone that is correct in UTC. We just need a way to get from the column value to UTC, as in the Go sketch below.
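
In Go terms, the with-timezone case is a straightforward parse-and-normalize (a sketch using the example values above):

package main

import (
	"fmt"
	"time"
)

func main() {
	// parse the offset-bearing TIMESTAMP_TZ value, then normalize to UTC
	t, err := time.Parse("2006-01-02 15:04:05 -0700", "2024-08-19 11:30:00 -0400")
	if err != nil {
		panic(err)
	}
	fmt.Println(t.UTC()) // 2024-08-19 15:30:00 +0000 UTC
}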

Without Timezone info

On the other side, let's look at Snowflake with no timezone info and the current time, as TIMESTAMP_NTZ: 2024-08-19 11:30:00. I am assuming there is a Go type that can store a datetime without timezone info. This would be output in DDL as TIMESTAMP_NTZ, and the exact value given would be inserted. What I am not sure about is file representations. I think JSON would be alright, but I do not know about Parquet. Maybe Parquet always outputs UTC, and if you need to make sure it is correct, you need to add a timezone transform?

Previous example

I will just expand this into the example I started with to work through it.

In replication.yaml there would be a line transforms: [to_timezone('America/Los_Angeles')]. This would apply to every datetime column. Or, if it were needed for only one column, I could explicitly use the column name as the key, as sketched below.
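
A rough replication.yaml sketch of that (the stream and column names are made up, and to_timezone is the syntax proposed in this thread, not a released feature):

streams:
  public.my_table:
    transforms: [to_timezone('America/Los_Angeles')]  # all datetime columns
    # or, for a single column:
    # transforms:
    #   lastupdated: [to_timezone('America/Los_Angeles')]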

This would transform the Snowflake timestamp 2024-07-21 03:24:24.508 to be stored internally in Go as 2024-07-21 10:24:24.508 +0000 UTC. This is the value that would be output in the Parquet file.

When the Parquet file is ingested into Snowflake using COPY INTO, if the column is TIMESTAMP_TZ then the UTC time is saved and everything is correct. If the column is TIMESTAMP_NTZ and a timezone is not set in the session, Snowflake will subtract 7 hours and store 2024-07-21 03:24:24.508.

flarco (Collaborator) commented Aug 20, 2024

> Maybe parquet always outputs UTC and if you need to make sure it is correct, then you need to add a timezone transform

Correct: without the transform, Parquet will always output UTC. With the transform, it should be in the specified timezone.

I have added this feature for databases (not Parquet yet). The transform is actually named set_timezone (I found it more fitting).
Could you test it out?

$ ./sling conns exec postgres 'create table public.test6 (col1 timestamp, col2 timestamptz)'

$ ./sling conns exec postgres "insert into public.test6 values ('2024-01-01 13:00:00', '2024-01-01 13:13:13 +0400')"

$ ./sling run --src-conn postgres --src-stream public.test6 --stdout
col1,col2
2024-01-01 13:00:00.000000 +00,2024-01-01 09:13:13.000000 +00


$ ./sling run --src-conn postgres --src-stream public.test6 --stdout --transforms "[ set_timezone('America/New_York') ]"
col1,col2
2024-01-01 08:00:00.000000 -05,2024-01-01 04:13:13.000000 -05
