Hi, I'm new around here, please let me know if this request is better elsewhere.
I'd like to propose adding an optional type parameter called Offset to the TIMESTAMP logical type.
In my common use case for Parquet files, the data is a running log with many rows, such that any one row group is unlikely to span more than a few days.
The idea of the Offset parameter is to store, for each row group, an INT64 offset from the Unix epoch; the data would then be stored relative to that offset.
This provides a couple of benefits:
Row groups could be selectively downsized (when possible) to the INT32 physical type. This could save a significant amount of file size, if I understand correctly. At millisecond precision, INT32 could support row groups up to ~49 days long. [1]
The docs note that all TIMESTAMPs, but particularly those with NANOS precision, have range limitations because they are stored as INT64. Adding an Offset would allow practically unlimited ranges for TIMESTAMPs.
Footnotes
[1] This assumes the offset is set in the middle of the row group's values, given the signed nature of INT32.
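To make the proposal concrete, here is a minimal sketch of the offset arithmetic in plain Python. The function names (`encode_with_offset`, `decode_with_offset`) are hypothetical and not part of any Parquet library; the sketch only assumes millisecond timestamps and a midpoint offset, as described in the footnote.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
MS_PER_DAY = 24 * 60 * 60 * 1000

def encode_with_offset(timestamps_ms):
    """Return (offset, relative_values) for one hypothetical row group.

    The offset sits at the midpoint of the value range (rounded up) so that
    the signed INT32 range is used on both sides of it.
    """
    lo, hi = min(timestamps_ms), max(timestamps_ms)
    offset = lo + (hi - lo + 1) // 2
    relative = [t - offset for t in timestamps_ms]
    if any(r < INT32_MIN or r > INT32_MAX for r in relative):
        raise OverflowError("row group spans more than the INT32 range")
    return offset, relative

def decode_with_offset(offset, relative):
    """Recover the absolute timestamps from the stored relative values."""
    return [offset + r for r in relative]

# Widest span (in days) that fits in INT32 at millisecond precision.
max_span_days = 2**32 / MS_PER_DAY

# Usage: a row group covering one day round-trips through the INT32 form.
row_group = [1_700_000_000_000, 1_700_000_000_000 + MS_PER_DAY]
offset, relative = encode_with_offset(row_group)
restored = decode_with_offset(offset, relative)
```

The span works out to 2^32 ms, a little under 50 days, which is where the INT32 limit quoted above comes from.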
Thanks for opening the issue! I don't think file size is a problem here, since we already have delta encoding. The problems I can see so far with adding an offset to the row group metadata are:
If we have multiple timestamp columns, we have to add one field for each. Perhaps a map<string, long> for the mapping from column_id to offset.
This complicates the writer and reader, as they need to do extra arithmetic to deal with the offset.
Row group boundaries are usually transparent to users, which makes it harder to set an offset in the row group metadata.
We need to guarantee backward compatibility. Old readers that don't know about the offset would produce wrong values. Adding a new timestamp type for this looks like overkill to me.
@ryancasburn-KAI thank you for the feature request. As @wgtmac stated, it seems like delta encoding should be sufficient for this use case, with the exception of the nanoseconds issue. Have you had a chance to test it out for your use case?
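As a rough way to see why delta encoding already covers the file-size concern, the sketch below estimates how many bits each value needs once consecutive differences are taken and the smallest difference is subtracted. This is a simplified illustration of the idea behind delta encoding, not Parquet's actual DELTA_BINARY_PACKED layout, and the log data and the `delta_bit_width` helper are made up for the example.

```python
def delta_bit_width(values):
    """Bits per value needed for the deltas after subtracting the minimum delta."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return 0
    min_delta = min(deltas)
    return max(d - min_delta for d in deltas).bit_length()

# Hypothetical running log: one entry roughly every 250 ms to 1250 ms,
# starting at an arbitrary epoch-millisecond timestamp.
log_ms = [1_700_000_000_000]
for i in range(1, 1000):
    log_ms.append(log_ms[-1] + 250 + (i * 37) % 1000)

width = delta_bit_width(log_ms)
```

For this synthetic log, `width` is 10 bits per value, regardless of how large the absolute timestamps are, so an explicit per-row-group offset would add little on top.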
For nanoseconds, I think we probably need a broader discussion to support wider time ranges (there is int96 but that is considered deprecated).
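For reference, the nanosecond range limit can be worked out directly. The sketch below converts the INT64 extremes, interpreted as nanoseconds since the Unix epoch, into calendar dates; it truncates to microseconds because Python's `datetime` has no nanosecond field, and the helper name `nanos_to_datetime` is only for illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def nanos_to_datetime(ns):
    """Convert nanoseconds since the Unix epoch to a datetime (floored to microseconds)."""
    return EPOCH + timedelta(microseconds=ns // 1000)

earliest = nanos_to_datetime(INT64_MIN)
latest = nanos_to_datetime(INT64_MAX)

# Total representable span in (Julian) years.
span_years = 2**64 / (1e9 * 86400 * 365.25)
```

Here `earliest` falls in 1677 and `latest` in 2262, a span of roughly 584 years centred on 1970.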