Add new optional type parameters Offset to TIMESTAMP #458

Open
ryancasburn-KAI opened this issue Oct 18, 2024 · 2 comments

@ryancasburn-KAI

Describe the enhancement requested

Hi, I'm new around here, please let me know if this request is better elsewhere.

I'd like to propose an optional type parameter called Offset to TIMESTAMP logical types.

In my typical use of Parquet files, the data is a running log with many rows, so any one row group is unlikely to span more than a few days.

The idea of the Offset parameter is to store, for each row group, an INT64 offset from the Unix epoch; the data in that row group would then be stored relative to that offset.

This provides a couple of benefits:

  1. Row groups could be selectively downsized (when possible) to the INT32 physical type. If I understand correctly, this could significantly reduce file size. At millisecond precision, INT32 could support row groups spanning up to ~49 days (see footnote 1).
  2. The docs note that all TIMESTAMPs, particularly those with NANOS precision, have a limited range because they are stored as INT64. Adding an Offset would allow practically unlimited ranges for TIMESTAMPs.

Footnotes

  1. Assuming the offset is set in the middle of the row group's values, since INT32 is signed.
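
To make the proposal concrete, here is a minimal sketch of how a per-row-group offset could work. Nothing here is part of the Parquet format: `encode_row_group`, `decode_row_group`, and the midpoint choice are hypothetical names and assumptions used only to illustrate the arithmetic described above.

```python
# Hypothetical illustration of the proposed Offset parameter.
# This is NOT an existing Parquet feature or API.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
MS_PER_DAY = 86_400_000


def encode_row_group(timestamps_ms):
    """Pick an offset near the middle of the row group and store values relative to it."""
    lo, hi = min(timestamps_ms), max(timestamps_ms)
    # Round the midpoint up so a span of 2**32 - 1 still maps onto the signed INT32 range.
    offset = lo + (hi - lo + 1) // 2
    relative = [t - offset for t in timestamps_ms]
    fits_int32 = all(INT32_MIN <= r <= INT32_MAX for r in relative)
    return offset, relative, fits_int32


def decode_row_group(offset, relative):
    """Recover absolute epoch milliseconds from the stored relative values."""
    return [offset + r for r in relative]


# Widest span (in days) that fits in INT32 at millisecond precision: about 49.7.
max_span_days = (2**32 - 1) / MS_PER_DAY

# Example: three days of hourly log entries starting at 2024-10-18T00:00:00Z.
start_ms = 1_729_209_600_000
timestamps = [start_ms + hour * 3_600_000 for hour in range(72)]
offset, relative, fits_int32 = encode_row_group(timestamps)
roundtrip = decode_row_group(offset, relative)
```

In this sketch a reader that ignores the offset would see only the small relative values, which is the backward-compatibility concern raised later in the thread.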

@wgtmac
Member

wgtmac commented Oct 24, 2024

Thanks for opening the issue! I think the current file size is not an issue as we have delta encoding. The problems of adding offset to row group metadata I can see so far are:

  • If we have multiple timestamp columns, we have to add one field for each. Perhaps a map<string, long> that maps column_id to offset.
  • This complicates the writer and reader, as they need to do extra arithmetic to apply the offset.
  • Row group boundaries are usually invisible to users, which makes it harder to choose the offset stored in the row group metadata.
  • We need to guarantee backward compatibility. Old readers that do not know about the offset would produce wrong values. Adding a new timestamp type for this looks like overkill to me.
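
The point about delta encoding can be illustrated with a simplified sketch. Parquet's DELTA_BINARY_PACKED encoding stores the first value, then bit-packs each delta after subtracting the minimum delta; the real encoding works in blocks and miniblocks, which this sketch ignores. The function name and sample data are made up for illustration.

```python
def delta_bit_width(values):
    """Bits per value needed to store (delta - min_delta), treating all values as one block."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas)
    max_adjusted = max(d - min_delta for d in deltas)
    return max_adjusted.bit_length()


# Log-style timestamps in epoch milliseconds with irregular gaps of 37 ms to 1 s.
start_ms = 1_729_209_600_000
gaps_ms = [100, 250, 1000, 37, 500]
timestamps = [start_ms]
for i in range(1000):
    timestamps.append(timestamps[-1] + gaps_ms[i % len(gaps_ms)])

raw_bits = max(timestamps).bit_length()      # bits needed for the absolute values
packed_bits = delta_bit_width(timestamps)    # bits needed per delta after packing
```

Here the absolute values need 41 bits each while the packed deltas need 10, regardless of how far the timestamps are from the epoch. That is why an explicit offset is unlikely to save much space beyond what delta encoding already provides.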

@emkornfield
Contributor

@ryancasburn-KAI thank you for the feature request. As @wgtmac stated, it seems like delta encoding should be sufficient for this use case, with the exception of the nanoseconds issue. Have you had a chance to test it out for your use case?

For nanoseconds, I think we probably need a broader discussion to support wider time ranges (there is int96 but that is considered deprecated).
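
For reference, the nanosecond range limit follows directly from the INT64 width. The sketch below computes the approximate bounds for a signed 64-bit count of nanoseconds since the Unix epoch; it uses only the Python standard library and rounds to whole seconds.

```python
from datetime import datetime, timedelta, timezone

INT64_MAX = 2**63 - 1
NANOS_PER_SECOND = 1_000_000_000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Largest whole number of seconds representable as INT64 nanoseconds.
span = timedelta(seconds=INT64_MAX // NANOS_PER_SECOND)

earliest = EPOCH - span   # falls in the year 1677
latest = EPOCH + span     # falls in the year 2262
span_years = span.days / 365.2425  # roughly 292 years on either side of the epoch
```

At microsecond or millisecond precision the same calculation gives ranges of hundreds of thousands of years or more, so the practical limitation is specific to NANOS.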
