Arrow: Vectorized reads of decimal columns with default values fail with IllegalArgumentException #16502

@harperjiang

Apache Iceberg version

main (development)

Query engine

Spark

Please describe the bug 🐞

Issue Summary

When the vectorized Arrow reader is used to read a v3 Iceberg table that has a decimal column carrying an initialDefault or writeDefault, vector allocation fails with:

java.lang.IllegalArgumentException: Cannot cast default value to FIXED[9]: 12345.6789
  at org.apache.iceberg.types.Types$NestedField.castDefault(Types.java:892)
  at org.apache.iceberg.types.Types$NestedField.<init>(Types.java:881)
  at org.apache.iceberg.types.Types$NestedField$Builder.build(Types.java:850)
  at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.getPhysicalType(VectorizedArrowReader.java:255)
  at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:228)
  at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:151)

The message varies with the underlying Parquet physical encoding:

  • FIXED_LEN_BYTE_ARRAY-backed decimal → Cannot cast default value to fixed[N]: <default>
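Which physical type backs a given decimal column usually follows from its precision. The sketch below is only an illustration, assuming the file was written with the precision-based mapping from the Iceberg spec (precision up to 9 as int, up to 18 as long, otherwise the smallest fixed length that fits). The class and method names are hypothetical and not part of Iceberg, and files produced by other writers may use a different encoding.

```java
import java.math.BigInteger;

/** Illustrative helper; the class and method names are hypothetical, not Iceberg APIs. */
final class DecimalPhysicalTypeSketch {
  private DecimalPhysicalTypeSketch() {}

  /** Smallest number of bytes whose signed two's-complement range holds 10^precision - 1. */
  static int requiredBytes(int precision) {
    BigInteger maxUnscaled = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
    // bitLength() excludes the sign bit, so add one bit and round up to whole bytes.
    return maxUnscaled.bitLength() / 8 + 1;
  }

  /** Assumed mapping: precision <= 9 -> int, <= 18 -> long, otherwise fixed[N]. */
  static String physicalType(int precision) {
    if (precision <= 9) {
      return "int";
    }
    if (precision <= 18) {
      return "long";
    }
    return "fixed[" + requiredBytes(precision) + "]";
  }

  public static void main(String[] args) {
    for (int precision : new int[] {5, 9, 10, 18, 19, 38}) {
      System.out.println("decimal(" + precision + ", _) -> " + physicalType(precision));
    }
  }
}
```

Under this assumption, the DECIMAL(5, 2) column from the repro would be int-backed, while a fixed[9] target such as the one in the stack trace corresponds to a precision of 19 to 21.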

The same read path completes without errors when vectorization is disabled:

spark.sql.iceberg.vectorization.enabled=false

Repro

  1. Create a v3 Iceberg table with a decimal column that has a default value:
CREATE TABLE local.db.t (
  id INT,
  amount DECIMAL(5, 2) DEFAULT 0.00
) USING iceberg TBLPROPERTIES ('format-version' = '3');

INSERT INTO local.db.t VALUES (1, 1.23), (2, 4.56), (3, 7.89);
  2. Read with vectorization enabled (the default):
SET spark.sql.iceberg.vectorization.enabled=true;
SELECT * FROM local.db.t;

The query fails with the stack trace above. The failure is deterministic only when the column is not dictionary-encoded. With dictionary encoding, allocation goes through allocateDictEncodedVector and bypasses the buggy path, so small or highly repetitive data sets may appear to read successfully.

Root cause

VectorizedArrowReader#getPhysicalType rewrites a decimal Iceberg field to its underlying physical type (int / long / fixed[N]) so the right Arrow vector class can be allocated:

physicalType = Types.NestedField.from(logicalType).ofType(type).build();

Types.NestedField.from(field) returns a builder that copies the field's initialDefault and writeDefault. NestedField's constructor then calls castDefault(literal, type) against the new physical type. For a decimal default this delegates to DecimalLiteral.to(LongType | IntegerType | FixedType); no conversion is defined for those target types, so it returns null, which trips the Preconditions.checkArgument in castDefault.

Conceptually, the defaults belong to the logical (decimal) view of the column and should not flow to the physical representation — the physical type is an internal detail used only to size the Arrow vector. The non-vectorized readers (BaseParquetReaders, SparkParquetReaders, FlinkParquetReaders) all apply defaults at the logical-type layer and are unaffected.
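The mechanism and the conceptual fix can be modeled in a few lines. This is a minimal, self-contained sketch and not Iceberg code: FieldSketch and PhysicalTypeSketch are hypothetical stand-ins for Types.NestedField and the reader's type rewrite, types are plain strings, and the rule that a decimal default casts only to "decimal" is an assumption of the model. It shows one possible way to keep defaults out of the physical field and is not necessarily what #16501 does.

```java
import java.math.BigDecimal;

/** Hypothetical, simplified stand-in for a schema field; this is NOT Iceberg's Types.NestedField. */
final class FieldSketch {
  final String name;
  final String type; // "decimal", "int", "long", or "fixed"
  final BigDecimal initialDefault; // null when the field has no default

  FieldSketch(String name, String type, BigDecimal initialDefault) {
    // Mirrors the behavior described above: the constructor validates the default
    // against the field's own type and rejects it when no cast exists.
    if (initialDefault != null && castDefault(initialDefault, type) == null) {
      throw new IllegalArgumentException(
          "Cannot cast default value to " + type + ": " + initialDefault);
    }
    this.name = name;
    this.type = type;
    this.initialDefault = initialDefault;
  }

  /** Assumption for this model: a decimal default can only be cast to "decimal". */
  static BigDecimal castDefault(BigDecimal value, String type) {
    return "decimal".equals(type) ? value : null;
  }
}

final class PhysicalTypeSketch {
  private PhysicalTypeSketch() {}

  /** Failing pattern: copy every attribute, including the default, then swap the type. */
  static FieldSketch copyingDefaults(FieldSketch logical, String physicalType) {
    return new FieldSketch(logical.name, physicalType, logical.initialDefault);
  }

  /** One possible alternative: carry over only what vector allocation needs. */
  static FieldSketch withoutDefaults(FieldSketch logical, String physicalType) {
    return new FieldSketch(logical.name, physicalType, null);
  }

  public static void main(String[] args) {
    FieldSketch amount = new FieldSketch("amount", "decimal", new BigDecimal("0.00"));
    try {
      copyingDefaults(amount, "int");
    } catch (IllegalArgumentException e) {
      System.out.println("copy with defaults failed: " + e.getMessage());
    }
    FieldSketch physical = withoutDefaults(amount, "int");
    System.out.println("copy without defaults: " + physical.name + " as " + physical.type);
  }
}
```

In the model, the logical field keeps its default untouched, so a reader that applies defaults at the logical layer still sees it. Only the throwaway physical field used for sizing the vector is built without one.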

Proposed PR for the fix: #16501

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working)
