
feat(vector): Support writing VECTOR to parquet and avro formats using Spark #18328

Open
rahil-c wants to merge 13 commits into apache:master from rahil-c:rahil/vector-schema-spark-converters-parquet

Conversation


@rahil-c rahil-c commented Mar 17, 2026

Describe the issue this Pull Request addresses

Builds on #18146 (VECTOR type in HoodieSchema) and #18190 (Spark↔HoodieSchema converters) to complete the full read/write pipeline for vector columns in Apache Hudi backed by Parquet.

Vectors are stored as Parquet FIXED_LEN_BYTE_ARRAY (little-endian, IEEE-754) rather than repeated groups.

Summary and Changelog

Write path

  • HoodieRowParquetWriteSupport: detects ArrayType columns annotated with hudi_type=VECTOR(dim, elementType) metadata and serialises them as FIXED_LEN_BYTE_ARRAY instead of a Parquet list. Dimension mismatch at write time throws HoodieException to prevent silent data corruption.
  • Handles the FLOAT32, FLOAT64, and INT8 element types.
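The write-path behaviour described above can be sketched as follows. This is an illustrative miniature, not the actual HoodieRowParquetWriteSupport code: the class and method names are assumptions, and the real write path throws HoodieException rather than IllegalStateException.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of the FLOAT32 case of the write-path encoding:
// a vector becomes a little-endian IEEE-754 FIXED_LEN_BYTE_ARRAY payload.
public class VectorWriteSketch {

  static byte[] encodeFloat32(float[] values, int declaredDim) {
    if (values.length != declaredDim) {
      // Mirrors the PR's fail-fast dimension check (the real code throws
      // HoodieException) to prevent silent data corruption.
      throw new IllegalStateException(
          "Vector dimension mismatch: expected " + declaredDim + ", got " + values.length);
    }
    ByteBuffer buf = ByteBuffer.allocate(declaredDim * Float.BYTES)
                               .order(ByteOrder.LITTLE_ENDIAN);
    for (float v : values) {
      buf.putFloat(v);
    }
    return buf.array();
  }
}
```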

Read path

  • HoodieSparkParquetReader and SparkFileFormatInternalRowReaderContext: detect FIXED_LEN_BYTE_ARRAY columns carrying hudi_type metadata and deserialise them back to Spark ArrayData.
  • HoodieFileGroupReaderBasedFileFormat: propagates vector column metadata through the file-group reader so the vector schema is not lost during Spark's internal schema resolution.
  • VectorConversionUtils (new): shared utility extracted to eliminate duplicated byte-buffer decode logic across the two reader paths.
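The shared decode logic can be sketched in miniature as below. This is an assumption-laden illustration, not the actual VectorConversionUtils API; it shows only the FLOAT32 case of turning a FIXED_LEN_BYTE_ARRAY payload back into a typed array.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch of the read-path decode: interpret the fixed-length
// payload as little-endian IEEE-754 floats, one per vector dimension.
public class VectorReadSketch {

  static float[] decodeFloat32(byte[] bytes, int dim) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    float[] out = new float[dim];
    for (int i = 0; i < dim; i++) {
      out[i] = buf.getFloat();
    }
    return out;
  }
}
```

In the real reader paths the decoded values are wrapped in Spark ArrayData before being handed back to the query engine.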

Schema / compatibility

  • InternalSchemaConverter: maps VectorType to/from Avro bytes with hudi_type prop, preserving dimension and element-type metadata through the Avro layer.
  • HoodieSchemaCompatibilityChecker: rejects illegal vector evolution (e.g. dimension change) rather than silently coercing.
  • HoodieSchemaComparatorForSchemaEvolution: treats vector columns as incompatible when dimension or element type differs.
  • HoodieTableMetadataUtil: skips column statistics for vector columns (min/max on raw bytes is meaningless).
  • AvroSchemaConverterWithTimestampNTZ: passes through hudi_type property on bytes fields so vector metadata survives Avro↔Spark schema round-trips.
  • Types.VectorType: adds byteSize() helper used by the write path to compute FIXED_LEN_BYTE_ARRAY length.
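A byteSize() helper of the kind the last bullet describes might look like the sketch below. The enum and method names here are hypothetical; the overflow guard via Math.multiplyExact reflects a hardening commit later in this PR.

```java
// Hedged sketch of computing the FIXED_LEN_BYTE_ARRAY length from the
// vector dimension and element type; the actual Types.VectorType API may differ.
public class VectorByteSizeSketch {

  enum ElementType { FLOAT32, FLOAT64, INT8 }

  static int byteSize(ElementType type, int dimension) {
    int elementSize;
    switch (type) {
      case FLOAT32: elementSize = 4; break;
      case FLOAT64: elementSize = 8; break;
      case INT8:    elementSize = 1; break;
      default: throw new IllegalArgumentException("Unsupported element type: " + type);
    }
    // Guard against int overflow for pathological dimensions,
    // as added in a later commit on this PR.
    return Math.multiplyExact(dimension, elementSize);
  }
}
```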

Tests

  • TestVectorDataSource (808 lines): end-to-end Spark functional tests covering FLOAT32, FLOAT64, INT8 across COPY_ON_WRITE and MERGE_ON_READ table types; includes column projection, schema evolution rejection, and multi-batch upsert round-trips.
  • TestHoodieSchemaCompatibility, TestHoodieSchemaComparatorForSchemaEvolution, TestHoodieTableMetadataUtil: unit tests for schema-layer changes.
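The round-trip property those functional tests exercise can be condensed into a self-contained sketch: encode a FLOAT64 vector to little-endian bytes and decode it back unchanged. All names here are illustrative, not taken from TestVectorDataSource.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Miniature of the encode/decode round-trip for the FLOAT64 element type.
public class DoubleRoundTripSketch {

  static byte[] encode(double[] v) {
    ByteBuffer buf = ByteBuffer.allocate(v.length * Double.BYTES)
                               .order(ByteOrder.LITTLE_ENDIAN);
    for (double d : v) {
      buf.putDouble(d);
    }
    return buf.array();
  }

  static double[] decode(byte[] bytes, int dim) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    double[] out = new double[dim];
    for (int i = 0; i < dim; i++) {
      out[i] = buf.getDouble();
    }
    return out;
  }
}
```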

Impact

  • New feature — no existing behaviour is changed for non-vector columns.
  • Parquet files written with this change store vector columns as FIXED_LEN_BYTE_ARRAY. Reading those files with an older Hudi version will surface raw bytes rather than a float array; users should upgrade readers alongside writers.
  • No public Java/Scala API changes; vector behaviour is opt-in via schema metadata.

Risk Level

Low. All changes are gated behind hudi_type=VECTOR(...) metadata presence. Tables that do not use vector columns are unaffected. New paths are covered by functional tests across both table types.

Documentation Update

A follow-up website doc page covering vector column usage (schema annotation, supported element types, Parquet layout) will be raised separately. Config changes: none.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Mar 17, 2026
@rahil-c rahil-c force-pushed the rahil/vector-schema-spark-converters-parquet branch from 79398b2 to 8adeccb Compare March 17, 2026 17:53
@rahil-c rahil-c requested review from yihua March 17, 2026 18:41
@rahil-c
Collaborator Author

rahil-c commented Mar 17, 2026

@rahil-c to update pr overview

rahil-c and others added 10 commits March 18, 2026 16:16
…tion test

- Write path (HoodieRowParquetWriteSupport.makeWriter) now switches on
  VectorElementType (FLOAT/DOUBLE/INT8) instead of hardcoding float,
  matching the read paths
- Remove redundant detectVectorColumns call in readBaseFile by reusing
  vectorCols from requiredSchema for requestedSchema
- Add testColumnProjectionWithVector covering 3 scenarios: exclude vector,
  vector-only, and all columns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use VectorLogicalType.VECTOR_BYTE_ORDER instead of hardcoded
  ByteOrder.LITTLE_ENDIAN in all 4 locations (write support, reader,
  Scala reader context, file group format)
- Add Math.multiplyExact overflow guard for dimension * elementSize
  in HoodieRowParquetWriteSupport
- Remove unnecessary array clone in HoodieSparkParquetReader
- Add clarifying comment on non-vector column else branch
- Fix misleading "float arrays" comment to "typed arrays"
- Move inline JavaConverters import to top-level in
  SparkFileFormatInternalRowReaderContext
- Import Metadata at top level instead of fully-qualified reference
- Consolidate duplicate detectVectorColumns, replaceVectorColumnsWithBinary,
  and convertBinaryToVectorArray into SparkFileFormatInternalRowReaderContext
  companion object; HoodieFileGroupReaderBasedFileFormat now delegates
- Add Javadoc on VectorType explaining it's needed for InternalSchema
  type hierarchy (cannot reuse HoodieSchema.Vector)
- Clean up unused imports (ByteOrder, ByteBuffer, GenericArrayData,
  StructField, BinaryType, HoodieSchemaType)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e types

New tests added to TestVectorDataSource:

- testDoubleVectorRoundTrip: DOUBLE element type end-to-end (64-dim)
- testInt8VectorRoundTrip: INT8/byte element type end-to-end (256-dim)
- testMultipleVectorColumns: two vector columns (float + double) in
  same schema with selective nulls and per-column projection
- testMorTableWithVectors: MOR table type with insert + upsert,
  verifying merged view returns correct vectors
- testCowUpsertWithVectors: COW upsert (update existing + insert new)
  verifying vector values after merge
- testLargeDimensionVector: 1536-dim float vectors (OpenAI embedding
  size) to exercise large buffer allocation
- testSmallDimensionVector: 2-dim vectors with edge values
  (Float.MaxValue) to verify boundary handling
- testVectorWithNonVectorArrayColumn: vector column alongside a
  regular ArrayType(StringType) to ensure non-vector arrays are
  not incorrectly treated as vectors
- testMorWithMultipleUpserts: MOR with 3 successive upsert batches
  of DOUBLE vectors, verifying the latest value wins per key

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ix hot-path allocation

- Create shared VectorConversionUtils utility class to eliminate duplicated
  vector conversion logic across HoodieSparkParquetReader,
  SparkFileFormatInternalRowReaderContext, and HoodieFileGroupReaderBasedFileFormat
- Add explicit dimension validation in HoodieRowParquetWriteSupport to prevent
  silent data corruption when array length doesn't match declared vector dimension
- Reuse GenericInternalRow in HoodieSparkParquetReader's vector post-processing
  loop to reduce GC pressure on large scans
…eSchema.Vector] to fix Scala 2.12 type inference error
@rahil-c rahil-c force-pushed the rahil/vector-schema-spark-converters-parquet branch from 52f6db8 to 959bcd8 Compare March 18, 2026 23:17
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@rahil-c rahil-c changed the title Rahil/vector schema spark converters parquet feat(vector): Support writing VECTOR to parquet and avro formats using Spark Mar 18, 2026
@rahil-c rahil-c requested review from bvaradar and voonhous March 18, 2026 23:28
@rahil-c rahil-c force-pushed the rahil/vector-schema-spark-converters-parquet branch from 3f7e2d0 to f8ce228 Compare March 18, 2026 23:31
@rahil-c rahil-c marked this pull request as ready for review March 18, 2026 23:32
@rahil-c rahil-c requested a review from vinothchandar March 18, 2026 23:32
rahil-c and others added 2 commits March 18, 2026 17:21
- Move VectorConversionUtils import into hudi group (was misplaced in 3rdParty)
- Add blank line between hudi and 3rdParty import groups
- Add blank line between java and scala import groups

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter

Codecov Report

❌ Patch coverage is 68.87967% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.41%. Comparing base (14a549f) to head (106fb29).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
...i/io/storage/row/HoodieRowParquetWriteSupport.java 6.66% 27 Missing and 1 partial ⚠️
...ache/hudi/io/storage/HoodieSparkParquetReader.java 26.66% 9 Missing and 2 partials ⚠️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala 82.45% 3 Missing and 7 partials ⚠️
.../apache/hudi/io/storage/VectorConversionUtils.java 83.33% 3 Missing and 6 partials ⚠️
...in/java/org/apache/hudi/internal/schema/Types.java 63.63% 4 Missing and 4 partials ⚠️
...hudi/SparkFileFormatInternalRowReaderContext.scala 80.00% 0 Missing and 5 partials ⚠️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java 33.33% 1 Missing and 1 partial ⚠️
...hema/HoodieSchemaComparatorForSchemaEvolution.java 83.33% 0 Missing and 1 partial ⚠️
...ommon/schema/HoodieSchemaCompatibilityChecker.java 92.30% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18328      +/-   ##
============================================
- Coverage     68.48%   68.41%   -0.07%     
- Complexity    27362    27446      +84     
============================================
  Files          2420     2424       +4     
  Lines        132127   132671     +544     
  Branches      15909    16028     +119     
============================================
+ Hits          90491    90772     +281     
- Misses        34627    34827     +200     
- Partials       7009     7072      +63     
Flag Coverage Δ
common-and-other-modules 44.33% <25.31%> (-0.04%) ⬇️
hadoop-mr-java-client 45.06% <3.33%> (-0.06%) ⬇️
spark-client-hadoop-common 48.19% <1.63%> (-0.15%) ⬇️
spark-java-tests 48.86% <68.87%> (-0.05%) ⬇️
spark-scala-tests 44.94% <19.08%> (-0.18%) ⬇️
utilities 38.55% <18.25%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...ain/java/org/apache/hudi/internal/schema/Type.java 80.32% <100.00%> (+0.32%) ⬆️
...ternal/schema/convert/InternalSchemaConverter.java 89.22% <100.00%> (+0.43%) ⬆️
...quet/avro/AvroSchemaConverterWithTimestampNTZ.java 76.02% <100.00%> (+0.45%) ⬆️
...hema/HoodieSchemaComparatorForSchemaEvolution.java 88.76% <83.33%> (-0.40%) ⬇️
...ommon/schema/HoodieSchemaCompatibilityChecker.java 65.85% <92.30%> (+1.47%) ⬆️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java 82.21% <33.33%> (-0.13%) ⬇️
...hudi/SparkFileFormatInternalRowReaderContext.scala 80.14% <80.00%> (+0.50%) ⬆️
...in/java/org/apache/hudi/internal/schema/Types.java 77.40% <63.63%> (-1.09%) ⬇️
.../apache/hudi/io/storage/VectorConversionUtils.java 83.33% <83.33%> (ø)
...parquet/HoodieFileGroupReaderBasedFileFormat.scala 87.33% <82.45%> (+0.02%) ⬆️
... and 2 more

... and 34 files with indirect coverage changes


@rahil-c
Collaborator Author

rahil-c commented Mar 19, 2026

@yihua @voonhous @balaji-varadarajan-ai will need a review from one of you guys if possible
