chore(dataobj): initial commit of value encoding #15606

Merged
merged 1 commit into from
Jan 7, 2025

@rfratto rfratto commented Jan 6, 2025

This commit introduces the dataobj package with initial utilities for encoding and decoding individual values within a "dataset." A dataset is the generic representation of columnar data, with a dataset.Value being one value in one page in one column. A dataset is one type of structure that will exist within a "data object."

This initial implementation includes two encodings:

  • Plain encoding (for string values), and
  • delta encoding (for signed integers).
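As a rough illustration of the delta encoding idea, here is a minimal sketch that stores each signed integer as the zig-zag varint of its difference from the previous value. The function names `encodeDelta` and `decodeDelta` are hypothetical, and the byte layout is an assumption for illustration; the PR's actual format may differ.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeDelta is a hypothetical helper: it appends, for each value, the
// zig-zag varint of (value - previous value). The first value is encoded
// as a delta from zero.
func encodeDelta(values []int64) []byte {
	var (
		buf  []byte
		prev int64
	)
	for _, v := range values {
		// Signed overflow wraps in Go, and decoding wraps back the same
		// way, so extreme values still round-trip.
		buf = binary.AppendVarint(buf, v-prev)
		prev = v
	}
	return buf
}

// decodeDelta reverses encodeDelta by accumulating the decoded deltas.
func decodeDelta(buf []byte) ([]int64, error) {
	var (
		out  []int64
		prev int64
	)
	for len(buf) > 0 {
		delta, n := binary.Varint(buf)
		if n <= 0 {
			return nil, fmt.Errorf("malformed varint in delta stream")
		}
		prev += delta
		out = append(out, prev)
		buf = buf[n:]
	}
	return out, nil
}

func main() {
	timestamps := []int64{1000, 1001, 1003, 1002, -5}
	enc := encodeDelta(timestamps)
	dec, err := decodeDelta(enc)
	fmt.Println(len(enc), dec, err)
}
```

Slowly changing sequences such as timestamps benefit most, since small deltas fit in a single varint byte.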

A follow-up commit will introduce bitmap encoding for efficiently bitpacking unsigned integers.

My initial prototype of dataobj used generics rather than the dataset.Value wrapper. However, generics made it difficult to write utilities that operate on multiple columns. While dataset.Value is slightly less type safe, it is significantly easier to work with within the scope of a dataset.
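To make the trade-off concrete, here is a minimal sketch of a tagged-union value type. All names here (`Value`, `ValueType`, `Column`, `formatRow`) are hypothetical stand-ins and not the actual dataset.Value API. With generics, an int64 column and a string column would be distinct types and could not share one slice; with a wrapper, a single `[]Column` can hold both, at the cost of checking the type at run time.

```go
package main

import (
	"fmt"
	"strings"
)

// ValueType identifies which kind of data a Value holds.
type ValueType uint8

const (
	TypeInt64 ValueType = iota + 1
	TypeString
)

// Value is a hypothetical wrapper holding exactly one typed value.
type Value struct {
	typ ValueType
	num int64
	str string
}

func Int64Value(v int64) Value   { return Value{typ: TypeInt64, num: v} }
func StringValue(s string) Value { return Value{typ: TypeString, str: s} }

// String renders the value; the type is inspected at run time rather
// than enforced by the compiler.
func (v Value) String() string {
	switch v.typ {
	case TypeInt64:
		return fmt.Sprintf("%d", v.num)
	case TypeString:
		return v.str
	default:
		return "<nil>"
	}
}

// Column is one named column whose values all share a type.
type Column struct {
	Name   string
	Values []Value
}

// formatRow works across columns of different value types, which is the
// kind of multi-column utility that is awkward to write with generics.
func formatRow(cols []Column, row int) string {
	parts := make([]string, 0, len(cols))
	for _, c := range cols {
		parts = append(parts, c.Name+"="+c.Values[row].String())
	}
	return strings.Join(parts, " ")
}

func main() {
	cols := []Column{
		{Name: "ts", Values: []Value{Int64Value(10), Int64Value(11)}},
		{Name: "line", Values: []Value{StringValue("a"), StringValue("b")}},
	}
	fmt.Println(formatRow(cols, 1))
}
```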

The encoding and decoding of values is implemented to support streaming as much as possible: individual values can be encoded and passed immediately to a compression block. Streaming values minimizes the number of rows that need to be stored in memory at once on both the write path and the read path. This contrasts with the design of parquet-go, which primarily intends for an entire page of values to be buffered in memory prior to encoding and compression. The streaming approach trades slightly slower performance for better memory efficiency.
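The streaming approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `plainEncoder`, `plainDecoder`, and `roundTrip` are hypothetical names, plain encoding is assumed to be a uvarint length prefix followed by the string bytes, and gzip stands in for whatever compression the data object actually uses. The point is that each value is written straight into the compressor, so no full page of values is buffered.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// plainEncoder writes each string immediately to the underlying writer
// as a uvarint length prefix followed by the raw bytes.
type plainEncoder struct {
	w       io.Writer
	scratch [binary.MaxVarintLen64]byte
}

func (e *plainEncoder) Encode(s string) error {
	n := binary.PutUvarint(e.scratch[:], uint64(len(s)))
	if _, err := e.w.Write(e.scratch[:n]); err != nil {
		return err
	}
	_, err := io.WriteString(e.w, s)
	return err
}

// plainDecoder reads one value at a time; it returns io.EOF at a clean
// end of stream.
type plainDecoder struct {
	r *bufio.Reader
}

func (d *plainDecoder) Decode() (string, error) {
	n, err := binary.ReadUvarint(d.r)
	if err != nil {
		return "", err
	}
	// A production decoder would bound n before allocating.
	buf := make([]byte, n)
	if _, err := io.ReadFull(d.r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

// roundTrip streams values through encoder -> gzip -> decoder.
func roundTrip(values []string) ([]string, error) {
	var compressed bytes.Buffer
	zw := gzip.NewWriter(&compressed)
	enc := &plainEncoder{w: zw}
	for _, v := range values {
		if err := enc.Encode(v); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}

	zr, err := gzip.NewReader(&compressed)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	dec := &plainDecoder{r: bufio.NewReader(zr)}
	var out []string
	for {
		s, err := dec.Decode()
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		out = append(out, s)
	}
}

func main() {
	out, err := roundTrip([]string{"hello", "", "world"})
	fmt.Println(out, err)
}
```

Only the encoder's small scratch buffer and the compressor's internal window are held in memory on the write path; the read path likewise materializes one value per `Decode` call.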

This PR is based off of the prototyping work done in my dataobj and dataobj-combined branches.

@rfratto rfratto requested a review from a team as a code owner January 6, 2025 14:39
@rfratto rfratto force-pushed the dataobj-dataset branch 5 times, most recently from 772aaec to f511078 Compare January 6, 2025 16:59
require.Equal(t, numbers, actual)
}

func Fuzz_delta(f *testing.F) {
TIL this is nice

@cyriltovena cyriltovena left a comment

LGTM

@cyriltovena cyriltovena merged commit 9a21590 into grafana:main Jan 7, 2025
59 checks passed
@rfratto rfratto deleted the dataobj-dataset branch January 7, 2025 14:26