
Timezone Transformation/Casting #338

Open
joshuajohananorthos opened this issue Jul 22, 2024 · 8 comments
Labels: enhancement (New feature or request)


joshuajohananorthos commented Jul 22, 2024

Feature Description

Is there a way to force the timezone in a file? I am selecting out of Snowflake into Parquet (I have tried JSON Lines as well) and it always outputs the time in UTC. The logical type in the Parquet file is UTC, but the column is not in UTC, so I get the incorrect time when I use the Parquet file.

I know timezones are the bane of any sort of datetime field. I have looked through Discord and the GitHub issues, and I know I can change the column when selecting, or run a post_sql to change it after inserting.

My understanding is that Parquet always stores timestamps in UTC. I think I am asking for something like datetime_format, but only for the timezone: a source option that adds timezone info to the timestamp. It would be optional; if supplied, Sling would take the datetime as given and translate it into UTC using the supplied timezone. The output would still always be UTC, but with an optional translation step before it is written.

An example:

| Snowflake column: LASTUPDATED (TIMESTAMP_NTZ) | No timezone option | America/Los_Angeles option |
| --- | --- | --- |
| 2024-07-21 03:24:24.508 | 2024-07-21 03:24:24.508 UTC -> 2024-07-20 20:24:24.508 PDT | 2024-07-21 10:24:24.508 UTC -> 2024-07-21 03:24:24.508 PDT |
| No timezone info, just a point in time. | Datetime set to UTC in the Parquet file, then ingested into a database that assumed Pacific on insertion (Snowflake's default), moving the timestamp back 7 hours. | Datetime still set to UTC, but starting from the correct timezone. |

To summarize: a source option, defaulting to UTC (I believe this is the current behavior), that calculates the actual UTC timestamp without having to write custom SQL. It would apply to every datetime column that cannot be inferred to be in a specific timezone.

I can try to add this myself, but I wanted to get your thoughts before starting.

joshuajohananorthos added the enhancement label on Jul 22, 2024
flarco (Collaborator) commented Jul 27, 2024

Hi, thanks for the write-up and details. I've put some draft changes in the parquet.go file, but thinking it over, I'm not too sure how to provide the input. I'm reluctant to add yet another env var, or a new option just for the Parquet timezone. I also considered using the target option datetime_format, but what you're actually suggesting is somewhat of a transform, so it doesn't fit the "format" input type. A transform would work great, but it would need a parameter input for the offset, and currently transforms don't accept parameters... Let me think it over.

If you want to put those changes in your parquet.go and build a binary, it should theoretically work. You can set the offset with cast.ToInt(os.Getenv("SLING_PARQUET_TZ_OFFSET")) or something similar.

type ParquetWriter struct {
	Writer      *parquet.Writer
	WriterMap   *parquet.GenericWriter[map[string]any]
	tzOffset    int64 // timezone offset in seconds, applied to datetime columns
	columns     Columns
	decNumScale []*big.Rat
	recBuffer   []map[string]any
}

// SetTzOffsetValue stores the offset, given in hours, as seconds.
func (pw *ParquetWriter) SetTzOffsetValue(offset int64) {
	pw.tzOffset = offset * 60 * 60
}

func (pw *ParquetWriter) WriteRec(row []any) error {
	rec := map[string]any{}

	for i, col := range pw.columns {
		switch {
		case col.IsBool():
			rec[col.Name] = cast.ToBool(row[i]) // since it is stored as string
		case col.Type == FloatType:
			rec[col.Name] = cast.ToFloat64(row[i])
		case col.Type == DecimalType:
			rec[col.Name] = StringToDecimalByteArray(cast.ToString(row[i]), pw.decNumScale[i], arrowParquet.Types.FixedLenByteArray, 16)
		case col.IsDatetime() || col.IsDate():
			switch valT := row[i].(type) {
			case time.Time:
				if row[i] != nil {
					// write a unix timestamp at the column's precision, shifting
					// by the offset (seconds, scaled to the matching unit)
					switch col.DbPrecision {
					case 1, 2, 3:
						rec[col.Name] = valT.UnixMilli() + pw.tzOffset*1000
					case 4, 5, 6:
						rec[col.Name] = valT.UnixMicro() + pw.tzOffset*1000000
					case 7, 8, 9:
						rec[col.Name] = valT.UnixNano() + pw.tzOffset*1000000000
					default:
						rec[col.Name] = valT.UnixNano() + pw.tzOffset*1000000000
					}
				}
			default:
				rec[col.Name] = row[i]
			}
		default:
			rec[col.Name] = row[i]
		}
	}

	pw.recBuffer = append(pw.recBuffer, rec)

	// flush to the underlying parquet writer in batches of 100 records
	if len(pw.recBuffer) == 100 {
		err := pw.writeBuffer()
		if err != nil {
			return g.Error(err, "error writing buffer")
		}
	}

	return nil
}
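
As a usage sketch, the offset could then be wired in wherever the writer is constructed; SLING_PARQUET_TZ_OFFSET here is just the ad-hoc env var suggested above, not an existing Sling setting:

// hypothetical wiring, per the suggestion above (not an actual Sling env var)
pw.SetTzOffsetValue(cast.ToInt64(os.Getenv("SLING_PARQUET_TZ_OFFSET"))) // e.g. -7 for PDT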

joshuajohananorthos (Author) commented

I have thought about this more, and I agree that adding another env variable or option does not seem like the right way to go. What if it weren't Parquet-specific? My example was Parquet and I focused on it, but it seems like more of a column-definition problem than a file problem. The code above would fix Parquet, but a JSON Lines file would still have the incorrect date.

I checked the source and found a column type, TimestampzType. I only looked at it quickly, so I could be wrong, but it seems to be used when casting selects in most of the database connectors. I did not see it in parquet.go or json.go. Is there a way to use the columns definition in target options together with the timestampz type?

Something like: can Sling infer what timezone the timestamp column is in (DB-specific behavior)? If not, does the definition in the target options say (something like timestampz(-0400))? And if neither of the above, fall back to the current behavior and return it in UTC.
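
For concreteness, that proposal could look something like this in the columns option (purely hypothetical syntax, not something Sling supports today; the column name is made up):

columns:
  lastupdated: timestampz(-0400)  # hypothetical: naive column whose values are known to be at -0400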

flarco (Collaborator) commented Jul 29, 2024

I like timestampz(-0400) 👍.
But that's on the target side, which conflicts with the logical placement of the columns option, currently under source_options. I've been grappling with this one for a while, as well as with transforms, so I think it's time for those two to be moved into the middle. See #348.
A timezone for the target system as a whole (including file systems) should also be possible, so that column-level definitions are not needed.

flarco (Collaborator) commented Jul 29, 2024

I am also leaning towards transforms with parameters. This will open more possibilities for transforms in general. transforms: [to_timezone(-0400)] would work nicely as well, since it applies to all columns by default.

# all columns (only applies to timestamp/datetime columns)
transforms: [to_timezone(-0400)]

# column specific
transforms: 
  my_col_ts: [to_timezone(-0400)]

joshuajohananorthos (Author) commented Aug 15, 2024

I see that you have closed #348 and merged the changes, which allow defining a column type or transform at the root of the options. Does that open the door to what we discussed here?

I can take a look at adding the transform in the middle. Another thing I have been thinking about is Daylight Saving Time: with a static timezone modifier, the result will be incorrect for half the year.

I have -0700 in my examples, but only because it is Snowflake's default. I am in the Eastern timezone, where we are -0400 or -0500 depending on the time of year.

It looks like the Go standard library has a way to load timezone location info from a string: https://pkg.go.dev/time#LoadLocation. You can load America/New_York and always get the exact offset, whether it is daylight or standard time.
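
A minimal self-contained sketch of that approach (the naive value is made up for illustration):

package main

import (
	"fmt"
	"time"
)

func main() {
	// DST rules come from the IANA tz database, so the offset is
	// correct year-round (-0400 in summer, -0500 in winter)
	loc, err := time.LoadLocation("America/New_York")
	if err != nil {
		panic(err)
	}

	// a naive timestamp, as read from a TIMESTAMP_NTZ column
	naive, _ := time.Parse("2006-01-02 15:04:05", "2024-08-15 13:53:08")

	// reinterpret the wall-clock value in that location, then normalize to UTC
	localized := time.Date(naive.Year(), naive.Month(), naive.Day(),
		naive.Hour(), naive.Minute(), naive.Second(), 0, loc)
	fmt.Println(localized.UTC()) // 2024-08-15 17:53:08 +0000 UTC (EDT is -0400)
}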

And with columns, is there a way to specify that a column is just a timestamp and not a timestampz? That is really the root of the problem I had: UTC was applied to the timestamp, which later threw it off by 7 hours, because Snowflake saw UTC and then applied its default timezone.

I tried it with sling 1.2.15 and this is what it output:
2024-08-15 13:53:08 DBG using: {"columns":{"CREATEDDATETIME":"timestamp"},"transforms":null}

Here is the output of using Python to inspect the Parquet file:

from the schema:
optional int64 field_id=-1 CREATEDDATETIME (Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
from the pyarrow table:
CREATEDDATETIME: timestamp[ns, tz=UTC]

flarco (Collaborator) commented Aug 18, 2024

> Does that open the door to what we discussed here?

Yes, that's the plan.

Interesting, good point about DST. I agree that it should be a location name instead of the numerical offset.

So the transform will be something like to_timezone('America/New_York'). If the column is already timestampz, it will actually convert the value to the desired timezone. And if the column is timestamp (with no timezone), it will not change the value, but will set the timezone and change the column type to timestampz in the DDL of the target table.

About the timestamp type: yes, actually any of these values will work: https://github.com/slingdata-io/sling-cli/blob/main/core/dbio/iop/datatype.go#L51-L66, but that will not change the Parquet output, since that is hard-coded and writes a unix timestamp (which is always in UTC).

joshuajohananorthos (Author) commented

Yes, I think we are on the same page here. I am thinking that always keeping the internal type in UTC would make it easier to figure out how to output it. The transform function just tells Sling how to get the datetime to UTC if it is not stored that way in the DB to start with.

I work mainly with MS SQL Server, Postgres, and Snowflake. All of them have column types to store a datetime both with and without a timezone.

With Timezone info

Snowflake current time in Eastern, as TIMESTAMP_TZ: 2024-08-19 11:30:00 -0400. From here, either infer from the column type and value that it is Eastern (or at least that storing it in UTC means adding 4 hours), or have a transform stating it is in 'America/New_York', which would also add 4 hours. Either way we end with 2024-08-19 15:30:00 +0000. The DDL then outputs TIMESTAMP_TZ and inserts the UTC value, landing us with the same value on both sides. This also works for files, as you noted about Parquet.

I am not sure which of these is closer to how Sling currently works, but inferring or transforming should end in the same spot: a timestamp with a timezone that is correct in UTC. We just need a way to get from the column value to UTC, as in the Go sketch below.
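
In Go terms, the with-timezone case is a straightforward parse-and-normalize (a sketch using the example values above):

package main

import (
	"fmt"
	"time"
)

func main() {
	// parse the offset-bearing TIMESTAMP_TZ value, then normalize to UTC
	t, err := time.Parse("2006-01-02 15:04:05 -0700", "2024-08-19 11:30:00 -0400")
	if err != nil {
		panic(err)
	}
	fmt.Println(t.UTC()) // 2024-08-19 15:30:00 +0000 UTC
}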

Without Timezone info

On the other side, let's look at Snowflake with no timezone info and the current time, as TIMESTAMP_NTZ: 2024-08-19 11:30:00. I am assuming there is a Go type that can store a datetime without timezone info. This would be output in DDL as TIMESTAMP_NTZ, and the exact value given would be inserted. What I am not sure about is file representations. I think JSON would be alright, but I do not know about Parquet. Maybe Parquet always outputs UTC, and if you need to make sure it is correct, you need to add a timezone transform?

Previous example

I will just expand this into the example I started with to work through it.

In replication.yaml there would be a line transforms: [to_timezone('America/Los_Angeles')]. This would apply to every datetime column. Or, if it were needed for only one column, I could explicitly use the column name as the key, as sketched below.
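
A rough replication.yaml sketch of that (the stream and column names are made up, and to_timezone is the syntax proposed in this thread, not a released feature):

streams:
  public.my_table:
    transforms: [to_timezone('America/Los_Angeles')]  # all datetime columns
    # or, for a single column:
    # transforms:
    #   lastupdated: [to_timezone('America/Los_Angeles')]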

This would transform the Snowflake timestamp 2024-07-21 03:24:24.508 to be stored internally in Go as 2024-07-21 10:24:24.508 +0000 UTC. This is the value that would be output in the Parquet file.

When the Parquet file is ingested into Snowflake using COPY INTO, if the column is TIMESTAMP_TZ then the UTC time is saved and everything is correct. If the column is TIMESTAMP_NTZ and a timezone is not set in the session, Snowflake will subtract 7 hours and store 2024-07-21 03:24:24.508.

flarco (Collaborator) commented Aug 20, 2024

> Maybe parquet always outputs UTC and if you need to make sure it is correct, then you need to add a timezone transform

Correct: without the transform, Parquet will always output UTC. With the transform, it should be in the specified timezone.

I have added this feature for databases (not Parquet yet). The transform is actually named set_timezone (I found it more fitting).
Could you test it out?

$ ./sling conns exec postgres 'create table public.test6 (col1 timestamp, col2 timestamptz)'

$ ./sling conns exec postgres "insert into public.test6 values ('2024-01-01 13:00:00', '2024-01-01 13:13:13 +0400')"

$ ./sling run --src-conn postgres --src-stream public.test6 --stdout
col1,col2
2024-01-01 13:00:00.000000 +00,2024-01-01 09:13:13.000000 +00


$ ./sling run --src-conn postgres --src-stream public.test6 --stdout --transforms "[ set_timezone('America/New_York') ]"
col1,col2
2024-01-01 08:00:00.000000 -05,2024-01-01 04:13:13.000000 -05
