feat(parquet): add experimental VECTOR repetition for Arrow FixedSizeList#854

Draft
rok wants to merge 1 commit into apache:main from rok:parquet-vector-fixed-size-list
@rok rok commented Jun 17, 2026

Member

DO NOT MERGE. At this point this is a proposal meant to support discussion about a change to the Parquet format.

Rationale for this change

Arrow FixedSizeList<T, N> (embeddings, multidimensional scientific array data, etc.) round-trips through Parquet today as a standard 3-level LIST, paying per-element repetition/definition levels for a shape that is fixed and known from the schema. In C++ we showed that a ~2-10x read performance improvement is possible, which motivates a denser encoding.

This adds an experimental Parquet VECTOR repetition type - the "Option B" design from the Fixed-size list type for Parquet proposal - that stores fixed-shape data without those inner levels.
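To make the overhead concrete, here is a minimal sketch that counts the level entries a writer logically produces for a fixed-size list column. It assumes a required 3-level LIST with required elements and a list size of at least 1, and it counts entries before RLE/bit-packing, so it says nothing about encoded bytes. The helper name `levelEntries` is hypothetical and not part of this PR.

```go
package main

import "fmt"

// levelEntries returns how many repetition and definition level entries are
// logically produced for `rows` rows of a fixed-size list with `n` elements.
// Hypothetical helper for illustration only; counts are pre-RLE/bit-packing.
func levelEntries(rows, n int64, vector bool) (rep, def int64) {
	if vector {
		// Per the PR description, a VECTOR leaf stores no inner levels.
		return 0, 0
	}
	// 3-level LIST: the repeated middle group contributes one repetition
	// level and one definition level per element value.
	return rows * n, rows * n
}

func main() {
	rep, def := levelEntries(1000, 768, false)
	fmt.Println("LIST   rep/def entries:", rep, def)
	rep, def = levelEntries(1000, 768, true)
	fmt.Println("VECTOR rep/def entries:", rep, def)
}
```

How much of the measured read speedup comes from skipping these levels versus other effects is not something this sketch tries to model.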

Closes #855.

What changes are included in this PR?

  • Format: FieldRepetitionType.VECTOR = 3 and SchemaElement.vector_length (field id 12), hand-applied to the generated parquet/internal/gen-go/parquet/parquet.go in the existing Thrift 0.21.0 generator style; parquet/parquet_vector.thrift vendors the IDL fragment as the source of truth.
  • schema: a primitive VECTOR leaf node (NewPrimitiveNodeLogicalVector), vector_length plumbing, level computation (VECTOR adds no def/rep level), and NewSchemaChecked, which returns an error instead of panicking on a malformed VECTOR schema.
  • file: the column writer counts rows as values / vector_length and keeps every data page on a whole-vector boundary across all write paths (WriteBatch, WriteBatchSpaced, WriteBitmapBatchSpaced, dictionary indices, FLBA); the reader supports row-ordinal seeking by value stride and rejects malformed VECTOR chunks (num_values not a whole multiple of, or inconsistent with, the row count).
  • pqarrow: opt-in encoding via WithVectorEncoding(); eligible top-level FixedSizeList columns are written as VECTOR and reconstructed on read without a stored Arrow schema, and ineligible ones fall back to LIST. Works alongside WithStoreSchema (element timezone / field metadata are restored).
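The row accounting and whole-vector page boundary described in the "file" item can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's code: `rowsForValues` and `alignToVector` are hypothetical names, and the sketch works on plain counts rather than on the column writer's state.

```go
package main

import (
	"errors"
	"fmt"
)

// rowsForValues converts a leaf value count into a row count for a VECTOR
// column (rows = values / vector_length). Hypothetical helper.
func rowsForValues(numValues, vectorLength int64) (int64, error) {
	if vectorLength <= 0 {
		return 0, errors.New("vector_length must be positive")
	}
	if numValues < 0 || numValues%vectorLength != 0 {
		return 0, fmt.Errorf("%d values is not a whole number of vectors of length %d",
			numValues, vectorLength)
	}
	return numValues / vectorLength, nil
}

// alignToVector rounds a batch size (in values) down to a whole-vector
// boundary so that a page cut at that point never splits a vector.
// It returns 0 when the batch is smaller than one vector.
func alignToVector(batchValues, vectorLength int64) int64 {
	if vectorLength <= 0 || batchValues < 0 {
		return 0
	}
	return batchValues - batchValues%vectorLength
}

func main() {
	// A 1024-value write batch with vector_length 3 is trimmed to 1023 values.
	aligned := alignToVector(1024, 3)
	rows, err := rowsForValues(aligned, 3)
	fmt.Println(aligned, rows, err)

	// The untrimmed batch is not a whole number of vectors.
	_, err = rowsForValues(1024, 3)
	fmt.Println(err)
}
```

A real writer would carry the trimmed remainder into the next batch rather than dropping it; that bookkeeping is left out here.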

A VECTOR column is a single primitive leaf (vector <element-type> <name> [N]), not a nested group — the leaf carries vector_length and adds no definition/repetition level, so a dense vector has no inner levels.
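As a sketch of the level computation this implies: in standard Parquet, an OPTIONAL node adds one definition level and a REPEATED node adds one definition and one repetition level, while REQUIRED adds none. The PR describes VECTOR as behaving like REQUIRED in this respect. The `Repetition` type and `maxLevels` function below are stand-ins written for this example, not the schema package's API.

```go
package main

import "fmt"

// Repetition is a stand-in enum for this example only.
type Repetition int

const (
	Required Repetition = iota
	Optional
	Repeated
	Vector
)

// maxLevels walks the repetition types from the root's child down to the
// leaf and returns the maximum definition and repetition levels.
func maxLevels(path []Repetition) (def, rep int) {
	for _, r := range path {
		switch r {
		case Optional:
			def++
		case Repeated:
			def++
			rep++
		case Required, Vector:
			// Neither contributes a level; for VECTOR this follows the PR text.
		}
	}
	return def, rep
}

func main() {
	// required group (LIST) > repeated group list > required element
	fmt.Println(maxLevels([]Repetition{Required, Repeated, Required}))
	// a single VECTOR leaf
	fmt.Println(maxLevels([]Repetition{Vector}))
}
```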

Scope: dense, non-nullable, top-level FixedSizeList with a fixed-width primitive element. Every other FixedSizeList transparently falls back to LIST; nothing that writes today changes unless the flag is set.
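The scope rule reads naturally as a single predicate. The sketch below is one reading of it, with "dense" interpreted as "elements are not nullable"; the struct, its fields, and `useVector` are hypothetical and do not mirror the pqarrow implementation.

```go
package main

import "fmt"

// fixedSizeListColumn summarizes the properties the scope rule depends on.
// Hypothetical type for illustration.
type fixedSizeListColumn struct {
	TopLevel                   bool // not nested inside a struct or list
	Nullable                   bool // the list field itself may be null
	ElementNullable            bool // individual elements may be null
	ElementFixedWidthPrimitive bool // e.g. int32, float32, fixed_size_binary
}

// useVector reports whether a column would be written as VECTOR; anything
// that fails the check is assumed to fall back to the 3-level LIST.
func useVector(optIn bool, c fixedSizeListColumn) bool {
	return optIn &&
		c.TopLevel &&
		!c.Nullable &&
		!c.ElementNullable &&
		c.ElementFixedWidthPrimitive
}

func main() {
	embedding := fixedSizeListColumn{TopLevel: true, ElementFixedWidthPrimitive: true}
	fmt.Println(useVector(true, embedding))  // eligible and opted in
	fmt.Println(useVector(false, embedding)) // flag off: LIST as before

	nullable := embedding
	nullable.Nullable = true
	fmt.Println(useVector(true, nullable)) // falls back to LIST
}
```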

Are these changes tested?

Yes. New tests cover:

  • write/read round-trips across element types (bool, int32/64, float32/64, float16, fixed_size_binary, decimal128/256, date32, timestamp) and list sizes that don't divide the write batch size;
  • whole-vector page boundaries and row accounting for every write path;
  • row-ordinal seeking on DataPageV1, DataPageV2, and via offset index;
  • compression (Snappy/Zstd) and dictionary-read;
  • the LIST fallback for every ineligible case;
  • rejection of nulls in non-nullable VECTOR data and of malformed VECTOR files;
  • WithStoreSchema round-trip; and the schema/Thrift compact-protocol round-trip.

All new tests pass; the only failing tests in the suite are the pre-existing ones that need the parquet-testing data submodule / PARQUET_TEST_DATA.

Are there any user-facing changes?

Yes:

  • New writer option pqarrow.WithVectorEncoding() (default off) and new schema helpers schema.NewPrimitiveNodeLogicalVector / schema.NewSchemaChecked.
  • A new parquet.Repetitions.Vector value; parquet.Repetitions.Undefined shifts from 3 to 4.
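The practical effect of the Undefined shift is limited to code that hard-codes the numeric value instead of using the named constant. The enum below is a simplified stand-in for illustration; it is not how parquet.Repetitions is actually declared.

```go
package main

import "fmt"

// Repetition is a simplified stand-in for the repetition enum after this PR.
type Repetition int8

const (
	Required  Repetition = iota // 0
	Optional                    // 1
	Repeated                    // 2
	Vector                      // 3 (new)
	Undefined                   // 4 (was 3 before this change)
)

func (r Repetition) String() string {
	switch r {
	case Required:
		return "REQUIRED"
	case Optional:
		return "OPTIONAL"
	case Repeated:
		return "REPEATED"
	case Vector:
		return "VECTOR"
	default:
		return "UNDEFINED"
	}
}

func main() {
	// Code that compared against the literal 3 to mean "undefined"
	// would now match VECTOR instead.
	legacy := Repetition(3)
	fmt.Println(legacy, legacy == Undefined)
}
```

Callers that only ever reference the named values are unaffected.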

⚠️ Compatibility: files written with VECTOR are not readable by Parquet implementations that don't understand the VECTOR repetition type. This is the defining trade-off of Option B and the reason it is strictly opt-in and documented as experimental until VECTOR is standardized in apache/parquet-format.

Potential follow-up: nullable vectors (spaced leaf materialization + def-level→validity collapse), struct elements, and nested vectors.

@rok rok force-pushed the parquet-vector-fixed-size-list branch 3 times, most recently from 3633118 to 1ce50db Compare June 18, 2026 20:21
@rok rok changed the title feat(parquet): add experimental VECTOR repetition for Arrow FixedSizeList (Option B) feat(parquet): add experimental VECTOR repetition for Arrow FixedSizeList Jun 18, 2026
@rok rok force-pushed the parquet-vector-fixed-size-list branch 4 times, most recently from 4256bc5 to 3c03e6d Compare June 18, 2026 23:09
…List

Add an experimental Parquet VECTOR FieldRepetitionType and map Arrow
FixedSizeList<T, N> onto it, opt-in via pqarrow.WithVectorEncoding(). VECTOR
stores fixed-shape-list data (e.g. embeddings) without per-element rep/def levels,
dropping the 3-level LIST overhead for dense vectors.

This implements the leaf-only variant: a VECTOR column is a single primitive leaf carrying
vector_length (vector <element-type> <name> [N]), not a nested group. Only dense,
non-nullable, top-level FixedSizeList with a fixed-width primitive element is
encoded as VECTOR; everything else falls back to LIST. A VECTOR leaf adds no
def/rep level, so the writer counts rows as values/vector_length, keeps each
vector on a single page, and the reader rebuilds the FixedSizeList from the
schema.

Format additions: FieldRepetitionType.VECTOR = 3 and SchemaElement.vector_length
(field id 12), hand-applied to the generated parquet.go (Thrift 0.21.0 style)
with parquet_vector.thrift as the IDL source of truth. Not yet in
apache/parquet-format: files written with VECTOR are unreadable by readers that
don't understand it.
@zeroshade

Member

Review notes

Rebased locally onto current main (past #852 / #856 / #857) — clean, no conflicts, the generated column_writer_types.gen.go still matches its template, and the full parquet/... test suite passes. For an experimental, opt-in proposal this is in good shape; nothing below blocks it landing as a proposal. The items worth resolving before VECTOR graduates from experimental are the two panic paths and the cross-version read behavior.

Correctness — looks right

  • Row accounting (values / vector_length) is consistent across every write path via rowsForLeafValues (file/column_writer.go:664), and the whole-vector-per-page guarantee holds (alignBatchToVector + the per-batch flush in commitWriteAndCheckPageLimit).
  • Composes cleanly with #852 (fix(parquet): add WriteBatchSpacedWithError to surface spaced-write failures): the misaligned-batch panic is caught by defer recover() at each public writer boundary and returned as an error.
  • Reader rejection is thorough — file/row_group_reader.go:97-110 (value count must be a whole, consistent multiple of the row count, with an overflow guard), plus the stride/overflow guards in seekToRowWithValueStride and the BuildArray divisibility check.
  • The Repetitions.Undefined 3→4 shift is safe: it is an internal-only sentinel (schema/reflection.go), never serialized; on-disk repetition_type was only ever 0–2.
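For readers following along, the reader-side rejection rule amounts to a check of roughly this shape. It is a sketch of the stated invariant (value count equals row count times vector_length, guarded against overflow), not a copy of the code at file/row_group_reader.go; `validateVectorChunk` is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// validateVectorChunk checks that a column chunk's value count is exactly
// numRows * vectorLength, without overflowing int64 while computing it.
func validateVectorChunk(numValues, numRows, vectorLength int64) error {
	if vectorLength <= 0 {
		return errors.New("vector_length must be positive")
	}
	if numValues < 0 || numRows < 0 {
		return errors.New("negative count in chunk metadata")
	}
	// Overflow guard: numRows * vectorLength must fit in int64.
	if numRows != 0 && vectorLength > math.MaxInt64/numRows {
		return errors.New("numRows * vector_length overflows int64")
	}
	if want := numRows * vectorLength; numValues != want {
		return fmt.Errorf("malformed VECTOR chunk: %d values, want %d (%d rows x %d)",
			numValues, want, numRows, vectorLength)
	}
	return nil
}

func main() {
	fmt.Println(validateVectorChunk(768000, 1000, 768)) // consistent
	fmt.Println(validateVectorChunk(768001, 1000, 768)) // not a whole multiple
	fmt.Println(validateVectorChunk(767232, 1000, 768)) // whole multiple, wrong row count
}
```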

Worth addressing before graduating from experimental

  1. Panic-as-validation on the public file writer. vectorLengthForBatch panics on a misaligned batch (file/column_writer.go:677); through the low-level WriteBatch API this surfaces as a recover()'d "unknown error type: …". pqarrow never triggers it, but the public writer should validate n % vector_length == 0 up front and return a clean typed error.
  2. NewSchemaChecked can still panic. effectiveVectorLength (schema/column.go:58) panics on int32 overflow, reachable via buildTreeNewColumn under NewSchemaChecked, which is documented to return an error instead of panicking. Unreachable today (groups can't be VECTOR), but a latent contract break once nested vectors exist — better to thread an error out.
  3. Cross-version read is a silent misread. A VECTOR (repetition type 3) file opened by a pre-this-change reader has no case for 3 and would be read as a flat required column with N×len values / wrong row count rather than failing loudly. This is inherent to introducing a new enum value with no feature guard — worth calling out in the WithVectorEncoding docs, and ideally a read-side guard before this is non-experimental.
  4. Statistics stay element-level (file/column_writer.go:660), so page/column min-max on a VECTOR column are not row-meaningful for predicate pushdown. A format-proposal decision worth stating explicitly.

Minor

  • DataPageV1 without an offset index: the seek does a throwaway page seek and then re-seeks to 0 (file/column_reader.go:644-683) — correct result, but two scans.
  • The V1 + offset-index seek relies on FirstRowIndex() being in parent-row units; it holds, but a test would lock it in (multi-page VECTOR, V1, page index enabled, seek into later pages).
  • No explicit test for the WriteBatchSpacedWithError + VECTOR error path, or for a malformed vector_length.
  • Read reconstruction without WithStoreSchema names the element "element" and drops element/list metadata (both restored when the schema is stored) — fine for Phase 1, worth a doc note.
  • pqarrow/encode_arrow.go now returns err instead of swallowing it on builder-creation failure — good catch; unrelated to VECTOR but correct.
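The dependence on FirstRowIndex() being in parent-row units is easiest to see in the arithmetic. A minimal sketch, assuming the page's first row index is given in rows and that each row is exactly vector_length values; `valuesToSkipInPage` is a hypothetical name and does not correspond to seekToRowWithValueStride's actual signature.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// valuesToSkipInPage returns how many leaf values to skip inside a page to
// land on targetRow, given the page's first row index in parent-row units.
func valuesToSkipInPage(targetRow, pageFirstRow, vectorLength int64) (int64, error) {
	if vectorLength <= 0 {
		return 0, errors.New("vector_length must be positive")
	}
	if pageFirstRow < 0 || targetRow < pageFirstRow {
		return 0, errors.New("target row precedes the page")
	}
	delta := targetRow - pageFirstRow
	// Overflow guard before multiplying by the stride.
	if delta > math.MaxInt64/vectorLength {
		return 0, errors.New("row offset * vector_length overflows int64")
	}
	return delta * vectorLength, nil
}

func main() {
	// Seeking to row 1500 in a page that starts at row 1000, 768 values per row.
	skip, err := valuesToSkipInPage(1500, 1000, 768)
	fmt.Println(skip, err)
}
```

If the first row index were in value units instead, the subtraction would mix units and the seek would land on the wrong vector, which is why a multi-page test is worth having.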

Format / spec

The Thrift choices (FieldRepetitionType.VECTOR = 3, SchemaElement.vector_length = field id 12) are hand-applied. Worth confirming these match the apache/parquet-format proposal exactly — if the standard settles on different values, files written now become silently incompatible.

@zeroshade

Member

Follow-up — format/spec conformance

I checked the hand-applied Thrift against the in-flight parquet-format proposal. VECTOR is not in apache/parquet-format master yet (FieldRepetitionType there is still just REQUIRED/OPTIONAL/REPEATED). Comparing against the current Option B draft (Antoine Pitrou's vector-repetition branch):

| | FieldRepetitionType.VECTOR | SchemaElement.vector_length field id |
|---|---|---|
| This PR (parquet/parquet_vector.thrift) | 3 | 12 |
| parquet-format draft (pitrou:vector-repetition) | 3 | 11 |

The enum value matches, but the vector_length Thrift field id differs (12 here vs 11 in the draft). A reader built to the draft would skip the unknown field 12, then see repetition_type = VECTOR with no vector_length and reject the file as malformed — so data written now wouldn't interop with a draft-conformant reader. Since cross-implementation compatibility is the whole crux of Option B, it'd be worth aligning the field id with the proposal (or calling out the divergence explicitly) before any files get written.
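The failure mode can be simulated without any Thrift machinery. The sketch below models a SchemaElement as a map from field id to value and a reader that ignores ids it does not know, which is how Thrift readers treat unknown fields. Field id 3 for repetition_type is taken from parquet.thrift; the `decode` function and the struct are hypothetical simplifications.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	repetitionTypeFieldID int16 = 3 // SchemaElement.repetition_type in parquet.thrift
	repetitionVector      int32 = 3 // FieldRepetitionType.VECTOR in both PR and draft
)

// schemaElement holds only the two fields relevant to this example.
type schemaElement struct {
	repetitionType  int32
	vectorLength    int32
	hasVectorLength bool
}

// decode mimics a reader that knows vector_length under vectorLengthFieldID
// and silently skips every other unknown field id.
func decode(fields map[int16]int32, vectorLengthFieldID int16) (schemaElement, error) {
	var el schemaElement
	for id, v := range fields {
		switch id {
		case repetitionTypeFieldID:
			el.repetitionType = v
		case vectorLengthFieldID:
			el.vectorLength = v
			el.hasVectorLength = true
		default:
			// Unknown field: skipped, as a Thrift reader would.
		}
	}
	if el.repetitionType == repetitionVector && !el.hasVectorLength {
		return el, errors.New("malformed schema: VECTOR element without vector_length")
	}
	return el, nil
}

func main() {
	// As written by this PR: repetition_type = VECTOR, vector_length at id 12.
	written := map[int16]int32{3: 3, 12: 768}

	_, err := decode(written, 12) // reader built from this PR
	fmt.Println("PR reader:   ", err)

	_, err = decode(written, 11) // reader built to the draft
	fmt.Println("draft reader:", err)
}
```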

Secondary, lower confidence: the Parquet C++ Option B prototype (rok/arrow#51) appears to model VECTOR as a three-level group carrying a dedicated Vector logical type, whereas this PR uses the primitive-leaf "reduced Option B" with no logical annotation — another representational difference to reconcile as the proposal converges.

@zeroshade zeroshade force-pushed the parquet-vector-fixed-size-list branch from 3c03e6d to 6bce5b7 Compare June 19, 2026 02:38
Development

Successfully merging this pull request may close these issues.

Support an experimental Parquet VECTOR repetition type for Arrow FixedSizeList

2 participants