Core, Spark: Performant queries over (shredded) Variant data #16172

@steveloughran

Description

Issue to group together everything needed for queries over Variant data to work well.

This is part of #10392.

  1. Auto generation of shredded fields.
  2. Unmarshalling performance.
  3. Rowgroup and file skipping based on shredded field stats.
  4. Benchmarks to evaluate all of the above.

Iceberg query performance relies on Spark pushing variant_get() calls down to the rowgroup filter, so these changes are interrelated. This work will have to target Spark 4.2 only.
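
To make the skipping idea concrete, here is a minimal sketch of how min/max stats on a shredded field could rule out rowgroups for a pushed-down predicate such as `variant_get(v, '$.price', 'int') > 100`. It is plain Python rather than Iceberg or Spark code, and every name in it (`ColumnStats`, `RowGroup`, `may_contain_greater_than`, `prune`, the `fully_shredded` flag) is hypothetical. It assumes that stats can only be trusted when every value for the path landed in the typed shredded column; otherwise the rowgroup has to be read.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ColumnStats:
    """Min/max stats for one shredded field in one rowgroup (hypothetical type)."""
    min_value: int
    max_value: int
    # Assumption: True only if no row kept this field in the unshredded residual.
    fully_shredded: bool


@dataclass
class RowGroup:
    """A rowgroup with stats keyed by variant path, e.g. '$.price' (hypothetical type)."""
    name: str
    shredded_stats: Dict[str, ColumnStats]


def may_contain_greater_than(group: RowGroup, path: str, literal: int) -> bool:
    """Return False only when the stats prove that no row satisfies value > literal."""
    stats: Optional[ColumnStats] = group.shredded_stats.get(path)
    if stats is None or not stats.fully_shredded:
        # Missing or partial stats prove nothing, so the rowgroup must be read.
        return True
    return stats.max_value > literal


def prune(groups: List[RowGroup], path: str, literal: int) -> List[RowGroup]:
    """Keep only the rowgroups that might contain a matching row."""
    return [g for g in groups if may_contain_greater_than(g, path, literal)]


groups = [
    RowGroup("rg0", {"$.price": ColumnStats(1, 50, True)}),     # max <= 100: skipped
    RowGroup("rg1", {"$.price": ColumnStats(40, 200, True)}),   # may match: kept
    RowGroup("rg2", {}),                                        # not shredded: kept
    RowGroup("rg3", {"$.price": ColumnStats(1, 50, False)}),    # partial shredding: kept
]
print([g.name for g in prune(groups, "$.price", 100)])  # ['rg1', 'rg2', 'rg3']
```

The same check would apply one level up for file skipping, provided file-level bounds exist for the shredded field.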

Proposal Document

Iceberg

#14297 Spark: Support writing shredded variant in Iceberg-Spark
#15510 Parquet Rowgroup skipping for variant predicate
#15384 Api: Support variant extract and fix manifest bounds byte order
#15385 Spark: Support variant_get predicate pushdown for file skipping
#15628 Core, Spark: Add JMH benchmarks for Variants

  • Skip files based on Iceberg stats, if possible.

Spark

  • 54598 Enable Parquet rowgroup skipping for variant filters
  • 54394 Support variant_get predicate for DSv2 filter pushdown

Parquet: better unmarshalling

Query engine

Spark

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time

Metadata

Assignees

No one assigned

Labels

improvement: PR that improves existing functionality
