feat(parquet): add experimental VECTOR repetition for Arrow FixedSizeList #854
rok wants to merge 1 commit
…List

Add an experimental Parquet VECTOR FieldRepetitionType and map Arrow FixedSizeList<T, N> onto it, opt-in via pqarrow.WithVectorEncoding(). VECTOR stores fixed-shape-list data (e.g. embeddings) without per-element rep/def levels, dropping the 3-level LIST overhead for dense vectors.

This implements leaf-only: a VECTOR column is a single primitive leaf carrying vector_length (vector <element-type> <name> [N]), not a nested group. Only dense, non-nullable, top-level FixedSizeList with a fixed-width primitive element is encoded as VECTOR; everything else falls back to LIST.

A VECTOR leaf adds no def/rep level, so the writer counts rows as values/vector_length, keeps each vector on a single page, and the reader rebuilds the FixedSizeList from the schema.

Format additions: FieldRepetitionType.VECTOR = 3 and SchemaElement.vector_length (field id 12), hand-applied to the generated parquet.go (Thrift 0.21.0 style) with parquet_vector.thrift as the IDL source of truth. Not yet in apache/parquet-format: files written with VECTOR are unreadable by readers that don't understand it.
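The writer-side rule in the commit message (rows counted as values/vector_length, each vector kept on a single page) can be illustrated with a minimal, standard-library-only sketch. The helper names `vectorRows` and `valuesPerPage` are hypothetical and are not taken from the PR; the actual writer may organize this differently.

```go
package main

import (
	"errors"
	"fmt"
)

// vectorRows is a hypothetical helper: it derives the row count of a VECTOR
// leaf from its value count, because the leaf stores no def/rep levels.
func vectorRows(numValues, vectorLength int64) (int64, error) {
	if vectorLength <= 0 {
		return 0, errors.New("vector_length must be positive")
	}
	if numValues < 0 || numValues%vectorLength != 0 {
		return 0, fmt.Errorf("num_values %d is not a whole multiple of vector_length %d",
			numValues, vectorLength)
	}
	return numValues / vectorLength, nil
}

// valuesPerPage is a hypothetical helper: it rounds a page's value budget down
// to a whole number of vectors so that no vector straddles two pages.
// It assumes vectorLength > 0; a budget smaller than one vector still yields
// one full vector.
func valuesPerPage(budget, vectorLength int64) int64 {
	if budget < vectorLength {
		return vectorLength
	}
	return budget - budget%vectorLength
}

func main() {
	rows, err := vectorRows(12, 4)
	fmt.Println(rows, err) // 12 values of length-4 vectors are 3 rows

	_, err = vectorRows(10, 4)
	fmt.Println(err) // 10 values cannot form whole length-4 vectors

	fmt.Println(valuesPerPage(10, 4)) // budget of 10 values is trimmed to 8
}
```

Rounding the budget down rather than up is a design choice made for this sketch only; it keeps pages at or under the budget except when a single vector is already larger than the budget.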
Review notes

Rebased locally onto current …

Correctness: looks right
Worth addressing before graduating from experimental
Minor
Format / spec

The Thrift choices ( …

Follow-up: format/spec conformance

I checked the hand-applied Thrift against the in-flight parquet-format proposal.

The enum value matches, but the …

Secondary, lower confidence: the Parquet C++ Option B prototype ( …
DO NOT MERGE. At this point this is a proposal meant to support discussion about a change to the Parquet format.
Rationale for this change
Arrow `FixedSizeList<T, N>` (embeddings, multidimensional-array scientific data, etc.) round-trips through Parquet today as a standard 3-level `LIST`, paying per-element repetition/definition levels for a shape that is fixed and known from the schema. On C++ we showed that a ~2-10x improvement in read performance is possible, which motivates a denser encoding.

This adds an experimental Parquet `VECTOR` repetition type, the "Option B" design from the Fixed-size list type for Parquet proposal, that stores fixed-shape data without those inner levels.

Closes #855.
What changes are included in this PR?
- `FieldRepetitionType.VECTOR = 3` and `SchemaElement.vector_length` (field id 12), hand-applied to the generated `parquet/internal/gen-go/parquet/parquet.go` in the existing Thrift 0.21.0 generator style; `parquet/parquet_vector.thrift` vendors the IDL fragment as the source of truth.
- `VECTOR` leaf node (`NewPrimitiveNodeLogicalVector`), `vector_length` plumbing, level computation (`VECTOR` adds no def/rep level), and `NewSchemaChecked`, which returns an error instead of panicking on a malformed `VECTOR` schema.
- The writer counts rows as `values / vector_length` and keeps every data page on a whole-vector boundary across all write paths (`WriteBatch`, `WriteBatchSpaced`, `WriteBitmapBatchSpaced`, dictionary indices, FLBA); the reader supports row-ordinal seeking by value stride and rejects malformed `VECTOR` chunks (`num_values` not a whole multiple of, or inconsistent with, the row count).
- Opt-in via `WithVectorEncoding()`; eligible top-level `FixedSizeList` columns are written as `VECTOR` and reconstructed on read without a stored Arrow schema, and ineligible ones fall back to `LIST`. Works alongside `WithStoreSchema` (element timezone / field metadata are restored).

A `VECTOR` column is a single primitive leaf (`vector <element-type> <name> [N]`), not a nested group: the leaf carries `vector_length` and adds no definition/repetition level, so a dense vector has no inner levels.

Scope: dense, non-nullable, top-level `FixedSizeList` with a fixed-width primitive element. Every other `FixedSizeList` transparently falls back to `LIST`; nothing that writes today changes unless the flag is set.

Are these changes tested?
Yes. New tests cover:
- `LIST` fallback for every ineligible case;
- … `VECTOR` data and of malformed `VECTOR` files;
- `WithStoreSchema` round-trip; and
- the schema/Thrift compact-protocol round-trip.

All new tests pass; the only failing tests in the suite are the pre-existing ones that need the `parquet-testing` data submodule / `PARQUET_TEST_DATA`.

Are there any user-facing changes?
Yes:
- `pqarrow.WithVectorEncoding()` (default off) and new schema helpers `schema.NewPrimitiveNodeLogicalVector` / `schema.NewSchemaChecked`.
- New `parquet.Repetitions.Vector` value; `parquet.Repetitions.Undefined` shifts from `3` to `4`.
- Files written with `VECTOR` are not readable by Parquet implementations that don't understand the `VECTOR` repetition type. This is the defining trade-off of Option B and the reason it is strictly opt-in and documented as experimental until `VECTOR` is standardized in apache/parquet-format.

Potential follow-up: nullable vectors (spaced leaf materialization + def-level→validity collapse), struct elements, and nested vectors.
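For readers weighing the trade-off, the reader-side behaviour described under "What changes are included in this PR?" (row-ordinal seeking by value stride, and rejection of chunks whose value count is inconsistent with the row count) might look roughly like the sketch below. It uses only the standard library; `checkChunk` and `vectorAt` are hypothetical names chosen for illustration, and the `float32` element type is an assumption, not the PR's reader API.

```go
package main

import "fmt"

// checkChunk is a hypothetical validation: a VECTOR chunk is treated as
// well-formed only if it holds exactly numRows * vectorLength values.
func checkChunk(numValues, numRows, vectorLength int64) error {
	if vectorLength <= 0 {
		return fmt.Errorf("invalid vector_length %d", vectorLength)
	}
	if numRows < 0 || numValues != numRows*vectorLength {
		return fmt.Errorf("num_values %d inconsistent with %d rows of length %d",
			numValues, numRows, vectorLength)
	}
	return nil
}

// vectorAt is a hypothetical accessor: row i occupies the half-open value
// range [i*vectorLength, (i+1)*vectorLength) of the flat leaf buffer, so
// seeking to a row is a multiplication by the stride.
func vectorAt(values []float32, row, vectorLength int) ([]float32, error) {
	if vectorLength <= 0 {
		return nil, fmt.Errorf("invalid vector_length %d", vectorLength)
	}
	start := row * vectorLength
	end := start + vectorLength
	if row < 0 || end > len(values) {
		return nil, fmt.Errorf("row %d out of range", row)
	}
	return values[start:end], nil
}

func main() {
	flat := []float32{1, 2, 3, 4, 5, 6} // two rows with vector_length 3

	if err := checkChunk(int64(len(flat)), 2, 3); err != nil {
		fmt.Println(err)
		return
	}
	v, err := vectorAt(flat, 1, 3)
	fmt.Println(v, err) // second row: the values 4, 5, 6
}
```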