
feat(vector): Support writing VECTOR to parquet and avro formats using Spark #18328

Open
rahil-c wants to merge 13 commits into apache:master from rahil-c:rahil/vector-schema-spark-converters-parquet

Conversation


@rahil-c rahil-c commented Mar 17, 2026

Describe the issue this Pull Request addresses

Builds on #18146 (VECTOR type in HoodieSchema) and #18190 (Spark↔HoodieSchema converters) to complete the full read/write pipeline for vector columns in Apache Hudi backed by Parquet.

Vectors are stored as Parquet FIXED_LEN_BYTE_ARRAY (little-endian, IEEE-754) rather than repeated groups.

Summary and Changelog

Write path

  • HoodieRowParquetWriteSupport: detects ArrayType columns annotated with hudi_type=VECTOR(dim, elementType) metadata and serialises them as FIXED_LEN_BYTE_ARRAY instead of a Parquet list. Dimension mismatch at write time throws HoodieException to prevent silent data corruption.
  • Handles the FLOAT32, FLOAT64, and INT8 element types.
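The write-path behaviour described above can be sketched as follows. This is an illustrative miniature, not the actual HoodieRowParquetWriteSupport code: the class and method names are assumptions, and the real write path throws HoodieException rather than IllegalStateException.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of the FLOAT32 case of the write-path encoding:
// a vector becomes a little-endian IEEE-754 FIXED_LEN_BYTE_ARRAY payload.
public class VectorWriteSketch {

  static byte[] encodeFloat32(float[] values, int declaredDim) {
    if (values.length != declaredDim) {
      // Mirrors the PR's fail-fast dimension check (the real code throws
      // HoodieException) to prevent silent data corruption.
      throw new IllegalStateException(
          "Vector dimension mismatch: expected " + declaredDim + ", got " + values.length);
    }
    ByteBuffer buf = ByteBuffer.allocate(declaredDim * Float.BYTES)
                               .order(ByteOrder.LITTLE_ENDIAN);
    for (float v : values) {
      buf.putFloat(v);
    }
    return buf.array();
  }
}
```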

Read path

  • HoodieSparkParquetReader and SparkFileFormatInternalRowReaderContext: detect FIXED_LEN_BYTE_ARRAY columns carrying hudi_type metadata and deserialise them back to Spark ArrayData.
  • HoodieFileGroupReaderBasedFileFormat: propagates vector column metadata through the file-group reader so the vector schema is not lost during Spark's internal schema resolution.
  • VectorConversionUtils (new): shared utility extracted to eliminate duplicated byte-buffer decode logic across the two reader paths.
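The shared decode logic can be sketched in miniature as below. This is an assumption-laden illustration, not the actual VectorConversionUtils API; it shows only the FLOAT32 case of turning a FIXED_LEN_BYTE_ARRAY payload back into a typed array.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch of the read-path decode: interpret the fixed-length
// payload as little-endian IEEE-754 floats, one per vector dimension.
public class VectorReadSketch {

  static float[] decodeFloat32(byte[] bytes, int dim) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    float[] out = new float[dim];
    for (int i = 0; i < dim; i++) {
      out[i] = buf.getFloat();
    }
    return out;
  }
}
```

In the real reader paths the decoded values are wrapped in Spark ArrayData before being handed back to the query engine.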

Schema / compatibility

  • InternalSchemaConverter: maps VectorType to/from Avro bytes with hudi_type prop, preserving dimension and element-type metadata through the Avro layer.
  • HoodieSchemaCompatibilityChecker: rejects illegal vector evolution (e.g. dimension change) rather than silently coercing.
  • HoodieSchemaComparatorForSchemaEvolution: treats vector columns as incompatible when dimension or element type differs.
  • HoodieTableMetadataUtil: skips column statistics for vector columns (min/max on raw bytes is meaningless).
  • AvroSchemaConverterWithTimestampNTZ: passes through hudi_type property on bytes fields so vector metadata survives Avro↔Spark schema round-trips.
  • Types.VectorType: adds byteSize() helper used by the write path to compute FIXED_LEN_BYTE_ARRAY length.
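A byteSize() helper of the kind the last bullet describes might look like the sketch below. The enum and method names here are hypothetical; the overflow guard via Math.multiplyExact reflects a hardening commit later in this PR.

```java
// Hedged sketch of computing the FIXED_LEN_BYTE_ARRAY length from the
// vector dimension and element type; the actual Types.VectorType API may differ.
public class VectorByteSizeSketch {

  enum ElementType { FLOAT32, FLOAT64, INT8 }

  static int byteSize(ElementType type, int dimension) {
    int elementSize;
    switch (type) {
      case FLOAT32: elementSize = 4; break;
      case FLOAT64: elementSize = 8; break;
      case INT8:    elementSize = 1; break;
      default: throw new IllegalArgumentException("Unsupported element type: " + type);
    }
    // Guard against int overflow for pathological dimensions,
    // as added in a later commit on this PR.
    return Math.multiplyExact(dimension, elementSize);
  }
}
```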

Tests

  • TestVectorDataSource (808 lines): end-to-end Spark functional tests covering FLOAT32, FLOAT64, INT8 across COPY_ON_WRITE and MERGE_ON_READ table types; includes column projection, schema evolution rejection, and multi-batch upsert round-trips.
  • TestHoodieSchemaCompatibility, TestHoodieSchemaComparatorForSchemaEvolution, TestHoodieTableMetadataUtil: unit tests for schema-layer changes.
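The round-trip property those functional tests exercise can be condensed into a self-contained sketch: encode a FLOAT64 vector to little-endian bytes and decode it back unchanged. All names here are illustrative, not taken from TestVectorDataSource.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Miniature of the encode/decode round-trip for the FLOAT64 element type.
public class DoubleRoundTripSketch {

  static byte[] encode(double[] v) {
    ByteBuffer buf = ByteBuffer.allocate(v.length * Double.BYTES)
                               .order(ByteOrder.LITTLE_ENDIAN);
    for (double d : v) {
      buf.putDouble(d);
    }
    return buf.array();
  }

  static double[] decode(byte[] bytes, int dim) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    double[] out = new double[dim];
    for (int i = 0; i < dim; i++) {
      out[i] = buf.getDouble();
    }
    return out;
  }
}
```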

Impact

  • New feature — no existing behaviour is changed for non-vector columns.
  • Parquet files written with this change store vector columns as FIXED_LEN_BYTE_ARRAY. Reading those files with an older Hudi version will surface raw bytes rather than a float array; users should upgrade readers alongside writers.
  • No public Java/Scala API changes; vector behaviour is opt-in via schema metadata.

Risk Level

Low. All changes are gated behind hudi_type=VECTOR(...) metadata presence. Tables that do not use vector columns are unaffected. New paths are covered by functional tests across both table types.

Documentation Update

A follow-up website doc page covering vector column usage (schema annotation, supported element types, Parquet layout) will be raised separately. Config changes: none.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Mar 17, 2026
@rahil-c rahil-c force-pushed the rahil/vector-schema-spark-converters-parquet branch from 79398b2 to 8adeccb Compare March 17, 2026 17:53
@rahil-c rahil-c requested review from yihua March 17, 2026 18:41
@rahil-c
Collaborator Author

rahil-c commented Mar 17, 2026

@rahil-c to update pr overview

rahil-c and others added 10 commits March 18, 2026 16:16
…tion test

- Write path (HoodieRowParquetWriteSupport.makeWriter) now switches on
  VectorElementType (FLOAT/DOUBLE/INT8) instead of hardcoding float,
  matching the read paths
- Remove redundant detectVectorColumns call in readBaseFile by reusing
  vectorCols from requiredSchema for requestedSchema
- Add testColumnProjectionWithVector covering 3 scenarios: exclude vector,
  vector-only, and all columns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use VectorLogicalType.VECTOR_BYTE_ORDER instead of hardcoded
  ByteOrder.LITTLE_ENDIAN in all 4 locations (write support, reader,
  Scala reader context, file group format)
- Add Math.multiplyExact overflow guard for dimension * elementSize
  in HoodieRowParquetWriteSupport
- Remove unnecessary array clone in HoodieSparkParquetReader
- Add clarifying comment on non-vector column else branch
- Fix misleading "float arrays" comment to "typed arrays"
- Move inline JavaConverters import to top-level in
  SparkFileFormatInternalRowReaderContext
- Import Metadata at top level instead of fully-qualified reference
- Consolidate duplicate detectVectorColumns, replaceVectorColumnsWithBinary,
  and convertBinaryToVectorArray into SparkFileFormatInternalRowReaderContext
  companion object; HoodieFileGroupReaderBasedFileFormat now delegates
- Add Javadoc on VectorType explaining it's needed for InternalSchema
  type hierarchy (cannot reuse HoodieSchema.Vector)
- Clean up unused imports (ByteOrder, ByteBuffer, GenericArrayData,
  StructField, BinaryType, HoodieSchemaType)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e types

New tests added to TestVectorDataSource:

- testDoubleVectorRoundTrip: DOUBLE element type end-to-end (64-dim)
- testInt8VectorRoundTrip: INT8/byte element type end-to-end (256-dim)
- testMultipleVectorColumns: two vector columns (float + double) in
  same schema with selective nulls and per-column projection
- testMorTableWithVectors: MOR table type with insert + upsert,
  verifying merged view returns correct vectors
- testCowUpsertWithVectors: COW upsert (update existing + insert new)
  verifying vector values after merge
- testLargeDimensionVector: 1536-dim float vectors (OpenAI embedding
  size) to exercise large buffer allocation
- testSmallDimensionVector: 2-dim vectors with edge values
  (Float.MaxValue) to verify boundary handling
- testVectorWithNonVectorArrayColumn: vector column alongside a
  regular ArrayType(StringType) to ensure non-vector arrays are
  not incorrectly treated as vectors
- testMorWithMultipleUpserts: MOR with 3 successive upsert batches
  of DOUBLE vectors, verifying the latest value wins per key

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ix hot-path allocation

- Create shared VectorConversionUtils utility class to eliminate duplicated
  vector conversion logic across HoodieSparkParquetReader,
  SparkFileFormatInternalRowReaderContext, and HoodieFileGroupReaderBasedFileFormat
- Add explicit dimension validation in HoodieRowParquetWriteSupport to prevent
  silent data corruption when array length doesn't match declared vector dimension
- Reuse GenericInternalRow in HoodieSparkParquetReader's vector post-processing
  loop to reduce GC pressure on large scans
…eSchema.Vector] to fix Scala 2.12 type inference error
@rahil-c rahil-c force-pushed the rahil/vector-schema-spark-converters-parquet branch from 52f6db8 to 959bcd8 Compare March 18, 2026 23:17
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@rahil-c rahil-c changed the title Rahil/vector schema spark converters parquet feat(vector): Support writing VECTOR to parquet and avro formats using Spark Mar 18, 2026
@rahil-c rahil-c requested review from bvaradar and voonhous March 18, 2026 23:28
@rahil-c rahil-c force-pushed the rahil/vector-schema-spark-converters-parquet branch from 3f7e2d0 to f8ce228 Compare March 18, 2026 23:31
@rahil-c rahil-c marked this pull request as ready for review March 18, 2026 23:32
@rahil-c rahil-c requested a review from vinothchandar March 18, 2026 23:32
rahil-c and others added 2 commits March 18, 2026 17:21
- Move VectorConversionUtils import into hudi group (was misplaced in 3rdParty)
- Add blank line between hudi and 3rdParty import groups
- Add blank line between java and scala import groups

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter

Codecov Report

❌ Patch coverage is 68.87967% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.41%. Comparing base (14a549f) to head (106fb29).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
...i/io/storage/row/HoodieRowParquetWriteSupport.java 6.66% 27 Missing and 1 partial ⚠️
...ache/hudi/io/storage/HoodieSparkParquetReader.java 26.66% 9 Missing and 2 partials ⚠️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala 82.45% 3 Missing and 7 partials ⚠️
.../apache/hudi/io/storage/VectorConversionUtils.java 83.33% 3 Missing and 6 partials ⚠️
...in/java/org/apache/hudi/internal/schema/Types.java 63.63% 4 Missing and 4 partials ⚠️
...hudi/SparkFileFormatInternalRowReaderContext.scala 80.00% 0 Missing and 5 partials ⚠️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java 33.33% 1 Missing and 1 partial ⚠️
...hema/HoodieSchemaComparatorForSchemaEvolution.java 83.33% 0 Missing and 1 partial ⚠️
...ommon/schema/HoodieSchemaCompatibilityChecker.java 92.30% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18328      +/-   ##
============================================
- Coverage     68.48%   68.41%   -0.07%     
- Complexity    27362    27446      +84     
============================================
  Files          2420     2424       +4     
  Lines        132127   132671     +544     
  Branches      15909    16028     +119     
============================================
+ Hits          90491    90772     +281     
- Misses        34627    34827     +200     
- Partials       7009     7072      +63     
Flag Coverage Δ
common-and-other-modules 44.33% <25.31%> (-0.04%) ⬇️
hadoop-mr-java-client 45.06% <3.33%> (-0.06%) ⬇️
spark-client-hadoop-common 48.19% <1.63%> (-0.15%) ⬇️
spark-java-tests 48.86% <68.87%> (-0.05%) ⬇️
spark-scala-tests 44.94% <19.08%> (-0.18%) ⬇️
utilities 38.55% <18.25%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...ain/java/org/apache/hudi/internal/schema/Type.java 80.32% <100.00%> (+0.32%) ⬆️
...ternal/schema/convert/InternalSchemaConverter.java 89.22% <100.00%> (+0.43%) ⬆️
...quet/avro/AvroSchemaConverterWithTimestampNTZ.java 76.02% <100.00%> (+0.45%) ⬆️
...hema/HoodieSchemaComparatorForSchemaEvolution.java 88.76% <83.33%> (-0.40%) ⬇️
...ommon/schema/HoodieSchemaCompatibilityChecker.java 65.85% <92.30%> (+1.47%) ⬆️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java 82.21% <33.33%> (-0.13%) ⬇️
...hudi/SparkFileFormatInternalRowReaderContext.scala 80.14% <80.00%> (+0.50%) ⬆️
...in/java/org/apache/hudi/internal/schema/Types.java 77.40% <63.63%> (-1.09%) ⬇️
.../apache/hudi/io/storage/VectorConversionUtils.java 83.33% <83.33%> (ø)
...parquet/HoodieFileGroupReaderBasedFileFormat.scala 87.33% <82.45%> (+0.02%) ⬆️
... and 2 more

... and 34 files with indirect coverage changes


@rahil-c
Collaborator Author

rahil-c commented Mar 19, 2026

@yihua @voonhous @balaji-varadarajan-ai will need a review from one of you guys if possible
