Umbrella issue grouping everything needed for queries over Variant data to perform well.
This is part of #10392
- Auto generation of shredded fields.
- Unmarshalling performance.
- Rowgroup and file skipping based on shredded field stats.
- Benchmarks to evaluate the above.
Iceberg query performance relies on Spark passing variant_get() calls down to the rowgroup filter, so the Iceberg and Spark changes are interrelated. This work will have to target Spark 4.2 only.
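
To illustrate the kind of skipping this is aiming for, here is a minimal sketch of a min/max check on a shredded field for an equality predicate such as `variant_get(v, '$.id', 'int') = 25`. It is not the Iceberg or Parquet implementation; `ShreddedStats` and `may_contain_eq` are hypothetical names, and the sketch assumes integer bounds only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShreddedStats:
    """Hypothetical min/max bounds for one shredded variant field in one rowgroup."""
    lower: Optional[int]
    upper: Optional[int]


def may_contain_eq(stats: Optional[ShreddedStats], literal: int) -> bool:
    """Return False only when the bounds prove that no row can equal `literal`."""
    if stats is None or stats.lower is None or stats.upper is None:
        # No usable stats (field not shredded, or bounds missing): must read.
        return True
    return stats.lower <= literal <= stats.upper


# Three rowgroups: two with bounds on the shredded field, one without stats.
row_groups = [ShreddedStats(1, 10), ShreddedStats(20, 30), None]
to_read = [i for i, s in enumerate(row_groups) if may_contain_eq(s, 25)]
```

Only the first rowgroup can be skipped here; the one without stats has to be read, which is why auto generation of shredded fields matters for skipping to be effective.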
Proposal Document
Iceberg
#14297 Spark: Support writing shredded variant in Iceberg-Spark
#15510 Parquet Rowgroup skipping for variant predicate
#15384 Api: Support variant extract and fix manifest bounds byte order
#15385 Spark: Support variant_get predicate pushdown for file skipping
#15628 Core, Spark: Add JMH benchmarks for Variants
- Skip files based on Iceberg stats, if possible.
Spark
- 54598 Enable Parquet rowgroup skipping for variant filters
- 54394 Support variant_get predicate for DSv2 filter pushdown
Parquet: better unmarshalling
Query engine
Spark