Apache Iceberg version
main (development)
Query engine
Spark
Please describe the bug 🐞
Issue Summary
When the vectorized Arrow reader is used to read a v3 Iceberg table that has a decimal column carrying an initialDefault or writeDefault, vector allocation fails with:
java.lang.IllegalArgumentException: Cannot cast default value to FIXED[9]: 12345.6789
at org.apache.iceberg.types.Types$NestedField.castDefault(Types.java:892)
at org.apache.iceberg.types.Types$NestedField.<init>(Types.java:881)
at org.apache.iceberg.types.Types$NestedField$Builder.build(Types.java:850)
at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.getPhysicalType(VectorizedArrowReader.java:255)
at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:228)
at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:151)
The message varies with the underlying Parquet physical encoding:
FIXED_LEN_BYTE_ARRAY-backed decimal → Cannot cast default value to fixed[N]: <default>
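As background for why the message names different types: the Iceberg spec stores decimal(P, S) in Parquet as int32 when P <= 9, as int64 when P <= 18, and otherwise as a fixed-length byte array of the minimal width that can hold P digits. The sketch below reproduces that mapping in plain Java. `DecimalPhysicalTypeSketch` and its methods are hypothetical names used for illustration only; they are not Iceberg API.

```java
import java.math.BigInteger;

/** Illustration only: hypothetical names, not Iceberg classes. */
final class DecimalPhysicalTypeSketch {

  /** Smallest byte width whose signed two's-complement range holds 10^precision - 1. */
  static int minFixedBytes(int precision) {
    BigInteger maxUnscaled = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
    int bitsWithSign = maxUnscaled.bitLength() + 1; // one extra bit for the sign
    return (bitsWithSign + 7) / 8;                  // round up to whole bytes
  }

  /** Physical type name that would appear in the "Cannot cast default value to ..." message. */
  static String physicalTypeFor(int precision) {
    if (precision <= 9) {
      return "int";
    }
    if (precision <= 18) {
      return "long";
    }
    return "fixed[" + minFixedBytes(precision) + "]";
  }

  public static void main(String[] args) {
    for (int precision : new int[] {5, 18, 20, 38}) {
      System.out.println("decimal(" + precision + ", _) -> " + physicalTypeFor(precision));
    }
  }
}
```

Under this mapping, the FIXED[9] in the trace above corresponds to a decimal with precision 19 to 21, whereas a low-precision column such as DECIMAL(5, 2) would be int-backed and report a different type in the message.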
The same read path succeeds when vectorization is disabled:
spark.sql.iceberg.vectorization.enabled=false
Repro
- Create a v3 Iceberg table with a decimal column that has a default value:
CREATE TABLE local.db.t (
id INT,
amount DECIMAL(5, 2) DEFAULT 0.00
) USING iceberg TBLPROPERTIES ('format-version' = '3');
INSERT INTO local.db.t VALUES (1, 1.23), (2, 4.56), (3, 7.89);
- Read with vectorization enabled (the default):
SET spark.sql.iceberg.vectorization.enabled=true;
SELECT * FROM local.db.t;
The query fails with a stack trace like the one above; the physical type and default value named in the message depend on the column. The failure is deterministic only when the column is not dictionary-encoded. With dictionary encoding, allocation goes through allocateDictEncodedVector and bypasses the buggy path, so small or highly repetitive data sets may appear to read successfully.
Root cause
VectorizedArrowReader#getPhysicalType rewrites a decimal Iceberg field to its underlying physical type (int / long / fixed[N]) so the right Arrow vector class can be allocated:
physicalType = Types.NestedField.from(logicalType).ofType(type).build();
Types.NestedField.Builder.from(field) copies the field's initialDefault and writeDefault onto the builder. NestedField's constructor then calls castDefault(literal, type) against the new physical type. For a decimal default this delegates to DecimalLiteral.to(LongType | IntegerType | FixedType), a conversion that is not defined and therefore returns null, which trips the Preconditions.checkArgument in castDefault.
Conceptually, the defaults belong to the logical (decimal) view of the column and should not flow to the physical representation — the physical type is an internal detail used only to size the Arrow vector. The non-vectorized readers (BaseParquetReaders, SparkParquetReaders, FlinkParquetReaders) all apply defaults at the logical-type layer and are unaffected.
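The mechanism described above can be modeled without Iceberg on the classpath. The following is a minimal, self-contained sketch under stated assumptions: `DefaultCastSketch`, `Kind`, `Field`, and the helper methods are all hypothetical stand-ins, and the "without defaults" variant is only one possible shape for a fix, not necessarily what the proposed PR does.

```java
import java.math.BigDecimal;

/** Toy model of the failure mode; every name here is hypothetical, not Iceberg API. */
final class DefaultCastSketch {

  enum Kind { DECIMAL, INT, LONG, FIXED }

  /** A field whose constructor validates its default against its type. */
  static final class Field {
    final String name;
    final Kind kind;
    final BigDecimal initialDefault; // null means "no default"

    Field(String name, Kind kind, BigDecimal initialDefault) {
      if (initialDefault != null && castDefault(initialDefault, kind) == null) {
        throw new IllegalArgumentException(
            "Cannot cast default value to " + kind + ": " + initialDefault);
      }
      this.name = name;
      this.kind = kind;
      this.initialDefault = initialDefault;
    }
  }

  /** Models a decimal literal that converts only to decimal; other targets yield null. */
  static BigDecimal castDefault(BigDecimal value, Kind target) {
    return target == Kind.DECIMAL ? value : null;
  }

  /** Mirrors the buggy path: the logical default is carried onto the physical type. */
  static Field physicalCopyingDefaults(Field logical, Kind physical) {
    return new Field(logical.name, physical, logical.initialDefault);
  }

  /** One possible remedy: the physical field only sizes the vector, so drop the default. */
  static Field physicalWithoutDefaults(Field logical, Kind physical) {
    return new Field(logical.name, physical, null);
  }

  /** Returns true when copying the default onto the physical type is rejected. */
  static boolean failsWhenCopyingDefaults(Field logical, Kind physical) {
    try {
      physicalCopyingDefaults(logical, physical);
      return false;
    } catch (IllegalArgumentException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Field logical = new Field("amount", Kind.DECIMAL, new BigDecimal("12345.6789"));
    System.out.println("copying defaults fails: " + failsWhenCopyingDefaults(logical, Kind.FIXED));
    System.out.println("built without defaults: " + physicalWithoutDefaults(logical, Kind.FIXED).kind);
  }
}
```

In this model the logical field keeps its default untouched; only the throwaway physical field used for allocation is built without one, which matches the idea that defaults belong to the logical view.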
Proposed PR for the fix: #16501
Willingness to contribute