HIVE-25948: Iceberg: Enable cost-based selection between Fanout and Clustered writers using column stats NDV #6389

Open
deniskuzZ wants to merge 3 commits into apache:master from deniskuzZ:HIVE-25948

@deniskuzZ deniskuzZ commented Mar 24, 2026

What changes were proposed in this pull request?

Cost-based selection between Fanout and Clustered writers

Why are the changes needed?

Perf optimization

Does this PR introduce any user-facing change?

No

How was this patch tested?

mvn test -Dtest=TestIcebergCliDriver -Dqfile=dynamic_partition_writes.q

┌───────────────────────┬───────────────────────────┐
│       Scenario        │         Expected          │
├───────────────────────┼───────────────────────────┤
│ threshold=0 (default) │ no sort (NDV<MAX_WRITERS) │
├───────────────────────┼───────────────────────────┤
│ threshold=-1          │ no sort                   │
├───────────────────────┼───────────────────────────┤
│ threshold=1           │ sort                      │
├───────────────────────┼───────────────────────────┤
│ threshold=2           │ sort (NDV>2)              │
├───────────────────────┼───────────────────────────┤
│ threshold=100         │ no sort (NDV<=100)        │
├───────────────────────┼───────────────────────────┤
│ fanout=false          │ sort                      │
└───────────────────────┴───────────────────────────┘
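The scenarios in the table can be read as a small decision function. The following is only a sketch of that reading, not the code in this patch: the class, method, and parameter names are hypothetical, `maxWriters` stands in for MAX_WRITERS (whose value is not stated here), and two behaviours are assumptions: `fanout=false` takes precedence over the threshold, and an unknown NDV falls back to sorting.

```java
/** Hypothetical sketch of the scenarios in the table; names are illustrative, not from the patch. */
final class WriterSelectionSketch {

  private WriterSelectionSketch() {
  }

  /**
   * Returns true when the input should be sorted (clustered writer),
   * false when unsorted input can go to the fanout writer.
   *
   * @param threshold     sort threshold as in the table: -1, 0, 1, or n > 1
   * @param fanoutEnabled whether the fanout writer is allowed at all
   * @param ndv           estimated number of distinct partition values, or -1 if unknown
   * @param maxWriters    assumed stand-in for MAX_WRITERS
   */
  static boolean shouldSort(int threshold, boolean fanoutEnabled, long ndv, long maxWriters) {
    if (!fanoutEnabled) {
      return true;              // assumption: without fanout, only the clustered writer remains
    }
    if (threshold < 0) {
      return false;             // threshold=-1: never sort
    }
    if (threshold == 1) {
      return true;              // threshold=1: always sort
    }
    if (ndv < 0) {
      return true;              // assumption: unknown NDV falls back to the sorted path
    }
    if (threshold == 0) {
      return ndv >= maxWriters; // cost-based: no sort while NDV < MAX_WRITERS
    }
    return ndv > threshold;     // threshold=n: sort only when NDV > n
  }
}
```

With an illustrative NDV of 3 and `maxWriters` of 100, `shouldSort(2, true, 3, 100)` selects the sorted path and `shouldSort(100, true, 3, 100)` does not, matching the last two threshold rows.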

@deniskuzZ deniskuzZ changed the title HIVE-25948: Enable cost-based selection between FanoutWriter and ClusteredWriter for Iceberg tables based on column stats NDV HIVE-25948: Iceberg: Enable cost-based selection between FanoutWriter and ClusteredWriter based on column stats NDV Mar 24, 2026
@deniskuzZ deniskuzZ changed the title HIVE-25948: Iceberg: Enable cost-based selection between FanoutWriter and ClusteredWriter based on column stats NDV HIVE-25948: Iceberg: Enable cost-based selection between Fanout and Clustered writers using column stats NDV Mar 24, 2026
sonarqubecloud bot commented Apr 3, 2026

// (e.g. Iceberg) knows the input is ordered and can use a clustered writer.
if (!customPartitionExprs.isEmpty()) {
  dpCtx.setHasCustomPartitionOrSortExpression(true);
}

Should this be placed after all the bailouts have been evaluated? That is, we could mutate the context only after the if (!removeRSInsertedByEnforceBucketing(fsOp)) { check has finished.

if (partStats == null) {
  return -1;
}
cardinality *= partStats.getCountDistint();
@okumin okumin Apr 6, 2026

I might be wrong. I presume this evaluates iceberg_bucket(ndv_100, 8 (=num buckets)) as 100. As iceberg_* always narrows the source column, the current implementation is likely to degrade performance for any workload.
This also means we have a chance to further optimize the cost-based optimizer. Since the cardinality of iceberg_bucket(x, 8) is obviously min(cardinality(x), 8), we can enable the optimization for more cases.

I guess we can achieve this optimization by implementing the following API in the iceberg_* UDFs and resolving a more accurate cardinality in this computePartCardinality method.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/StatEstimator.java
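To make the arithmetic in this comment concrete, here is a minimal standalone sketch. It does not use the StatEstimator interface or any Hive class; the class and method names are hypothetical. It shows the bound min(cardinality(x), numBuckets) for a bucket transform and a product over partition expressions that returns -1 when any input is unknown, mirroring the -1 return in the quoted snippet. Treating an unknown source NDV as bounded by the bucket count, and saturating on overflow, are assumptions of this sketch.

```java
/** Hypothetical sketch of the cardinality reasoning above; not Hive or Iceberg code. */
final class PartitionCardinalitySketch {

  private PartitionCardinalitySketch() {
  }

  /** A bucket transform yields at most numBuckets distinct values, and never more than its input. */
  static long bucketCardinality(long sourceNdv, int numBuckets) {
    if (sourceNdv < 0) {
      return numBuckets; // assumption: unknown source NDV is still bounded by the bucket count
    }
    return Math.min(sourceNdv, numBuckets);
  }

  /** Multiplies per-expression cardinalities; returns -1 if any of them is unknown (negative). */
  static long combinedCardinality(long... perExprNdv) {
    long cardinality = 1;
    for (long ndv : perExprNdv) {
      if (ndv < 0) {
        return -1;
      }
      try {
        cardinality = Math.multiplyExact(cardinality, ndv);
      } catch (ArithmeticException overflow) {
        return Long.MAX_VALUE; // saturate rather than wrap around
      }
    }
    return cardinality;
  }
}
```

For the example in the comment, `bucketCardinality(100, 8)` gives 8 rather than 100, so a table partitioned only by that bucket transform would be estimated at 8 partitions.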
