chore(dataobj): initial commit of value encoding #15606

rfratto · 2025-01-06T14:39:35Z

This commit introduces the dataobj package with initial utilities for encoding and decoding individual values within a "dataset." A dataset is the generic representation of columnar data, with a dataset.Value being one value in one page in one column. A dataset is one type of structure that will exist within a "data object."

This initial implementation includes two encodings:

* Plain encoding (for string values), and
* delta encoding (for signed integers).

A follow-up commit will introduce bitmap encoding for efficiently bitpacking unsigned integers.

My initial prototype of dataobj used generics rather than the dataset.Value wrapper. However, generics made it difficult to write utilities that operate on multiple columns. While dataset.Value is slightly less type safe, it is significantly easier to work with within the scope of a dataset.
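
To illustrate the tradeoff, here is a hypothetical sketch of a value wrapper. The type and function names (`Value`, `Int64Value`, `StringValue`, `formatRow`) are assumptions for illustration and may not match the actual dataset.Value API. With a generic `Column[T]`, an int64 column and a string column have different types and cannot share a slice without falling back to `any`; with a wrapper, a utility can walk every column uniformly.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ValueType tags which field of a Value is populated.
type ValueType uint8

const (
	ValueTypeInt64 ValueType = iota + 1
	ValueTypeString
)

// Value is a small tagged union. Hypothetical stand-in for a wrapper
// type; the real dataset.Value may be laid out differently.
type Value struct {
	typ ValueType
	num int64
	str string
}

func Int64Value(v int64) Value   { return Value{typ: ValueTypeInt64, num: v} }
func StringValue(s string) Value { return Value{typ: ValueTypeString, str: s} }

// String renders the value according to its type tag.
func (v Value) String() string {
	switch v.typ {
	case ValueTypeInt64:
		return strconv.FormatInt(v.num, 10)
	case ValueTypeString:
		return v.str
	default:
		return "<nil>"
	}
}

// formatRow renders one row across columns of different types. All
// columns share the element type Value, so one loop handles them all.
func formatRow(columns [][]Value, row int) string {
	cells := make([]string, 0, len(columns))
	for _, col := range columns {
		cells = append(cells, col[row].String())
	}
	return strings.Join(cells, ",")
}

func main() {
	timestamps := []Value{Int64Value(10), Int64Value(12)}
	lines := []Value{StringValue("hello"), StringValue("world")}
	fmt.Println(formatRow([][]Value{timestamps, lines}, 1))
}
```

The cost is that a type mismatch surfaces at run time (through the type tag) rather than at compile time.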

The encoding and decoding of values is implemented to support streaming as much as possible: individual values can be encoded and passed immediately to a compression block. Streaming values minimizes the number of rows that need to be held in memory at once on both the write path and the read path. This contrasts with the design of parquet-go, which primarily intends for an entire page of values to be buffered in memory prior to encoding and compression. The streaming approach accepts slightly slower performance in exchange for memory efficiency.
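
The streaming idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `plainEncoder`, `plainDecoder`, and `roundTrip` are hypothetical names, the length-prefixed layout for plain-encoded strings is assumed, and `compress/gzip` from the standard library stands in for whatever compression block is actually used. Each value is written to the compressor as soon as it is encoded, so no page-sized buffer of raw values is kept.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
	"io"
)

// plainEncoder writes each string as a uvarint length followed by its
// bytes, directly into the underlying writer.
type plainEncoder struct{ w io.Writer }

func (e *plainEncoder) Encode(s string) error {
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(s)))
	if _, err := e.w.Write(lenBuf[:n]); err != nil {
		return err
	}
	_, err := io.WriteString(e.w, s)
	return err
}

// plainDecoder reads one length-prefixed string at a time.
type plainDecoder struct{ r *bufio.Reader }

func (d *plainDecoder) Decode() (string, error) {
	size, err := binary.ReadUvarint(d.r)
	if err != nil {
		return "", err // io.EOF here means a clean end of stream
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(d.r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

// roundTrip streams values through encoder and gzip, then reads them
// back one by one.
func roundTrip(values []string) ([]string, error) {
	var compressed bytes.Buffer
	zw := gzip.NewWriter(&compressed)
	enc := &plainEncoder{w: zw}
	for _, v := range values {
		if err := enc.Encode(v); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}

	zr, err := gzip.NewReader(&compressed)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	dec := &plainDecoder{r: bufio.NewReader(zr)}
	var out []string
	for {
		s, err := dec.Decode()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		out = append(out, s)
	}
	return out, nil
}

func main() {
	out, err := roundTrip([]string{"hello", "", "world"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", out)
}
```

The only per-value allocation on the read path is the decoded string itself; the encoder holds nothing beyond a small length buffer.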

This PR is based off of the prototyping work done in my dataobj and dataobj-combined branches.

cyriltovena · 2025-01-07T14:22:11Z

pkg/dataobj/internal/dataset/value_encoding_delta_test.go

+	require.Equal(t, numbers, actual)
+}
+
+func Fuzz_delta(f *testing.F) {


TIL this is nice

cyriltovena

LGTM

rfratto requested a review from a team as a code owner January 6, 2025 14:39

pull-request-size bot added the size/XL label Jan 6, 2025

rfratto force-pushed the dataobj-dataset branch 5 times, most recently from 772aaec to f511078 Compare January 6, 2025 16:59

rfratto force-pushed the dataobj-dataset branch from f511078 to 0a19305 Compare January 6, 2025 17:32

rfratto requested review from benclive and cyriltovena January 7, 2025 13:31

cyriltovena reviewed Jan 7, 2025

cyriltovena approved these changes Jan 7, 2025

cyriltovena merged commit 9a21590 into grafana:main Jan 7, 2025
59 checks passed

rfratto deleted the dataobj-dataset branch January 7, 2025 14:26
