HIVE-25948: Iceberg: Enable cost-based selection between Fanout and Clustered writers using column stats NDV #6389
deniskuzZ wants to merge 3 commits into apache:master
Conversation
```java
// (e.g. Iceberg) knows the input is ordered and can use a clustered writer.
if (!customPartitionExprs.isEmpty()) {
  dpCtx.setHasCustomPartitionOrSortExpression(true);
}
```
Should this be located after all the bailouts are evaluated? As written, we may mutate the context even when `if (!removeRSInsertedByEnforceBucketing(fsOp))` subsequently bails out.
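To make the effect of this flag concrete, here is a hedged, self-contained sketch (class and method names are hypothetical stand-ins, not Hive's actual `DynamicPartitionCtx` API) of how a flag like this could gate the downstream writer choice: a clustered writer needs the input grouped by partition key, while a fanout writer does not.

```java
// Hypothetical sketch, not Hive code. If the table format (e.g. Iceberg)
// supplied custom partition/sort expressions, the input arrives clustered,
// so a clustered writer (one open file at a time) is safe; otherwise a
// fanout writer keeps a file open per partition seen.
class DynamicPartitionCtx {
    private boolean hasCustomPartitionOrSortExpression;

    void setHasCustomPartitionOrSortExpression(boolean v) {
        hasCustomPartitionOrSortExpression = v;
    }

    boolean hasCustomPartitionOrSortExpression() {
        return hasCustomPartitionOrSortExpression;
    }
}

public class WriterSelectionSketch {
    // Clustered writer requires rows grouped by partition key;
    // fall back to fanout when no such guarantee exists.
    static String chooseWriter(DynamicPartitionCtx dpCtx) {
        return dpCtx.hasCustomPartitionOrSortExpression() ? "clustered" : "fanout";
    }

    public static void main(String[] args) {
        DynamicPartitionCtx dpCtx = new DynamicPartitionCtx();
        dpCtx.setHasCustomPartitionOrSortExpression(true);
        System.out.println(chooseWriter(dpCtx)); // clustered
    }
}
```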
```java
if (partStats == null) {
  return -1;
}
cardinality *= partStats.getCountDistint();
```
I might be wrong, but I presume this evaluates `iceberg_bucket(ndv_100, 8)` (8 = num buckets) as cardinality 100. Since the `iceberg_*` transforms always narrow the source column, the current implementation is likely to overestimate cardinality and degrade performance for any such workload.
This also means we have a chance to further improve the cost-based selection. Since the cardinality of `iceberg_bucket(x, 8)` is obviously `min(cardinality(x), 8)`, we can enable the optimization in more cases.
I guess we can achieve this if we implement the following API in the `iceberg_*` UDFs and resolve a more accurate cardinality in this `computePartCardinality` method:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/StatEstimator.java
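The gap the review points out can be sketched numerically. The following is a hedged, self-contained illustration (class and method names are hypothetical, not Hive's `computePartCardinality` or `StatEstimator` implementations): multiplying raw NDVs overestimates the partition count for bucket transforms, whereas capping each term at the bucket count reflects that `iceberg_bucket(x, N)` can emit at most `min(NDV(x), N)` distinct values.

```java
// Hedged sketch, not Hive's actual implementation. Shows why capping
// per-column NDV at the bucket count gives a tighter partition-cardinality
// estimate for iceberg_bucket(x, N) than using the raw column NDV.
public class PartCardinalitySketch {

    /** Naive estimate: multiply raw NDVs (what the snippet above does). */
    static long naiveCardinality(long[] ndvs) {
        long cardinality = 1;
        for (long n : ndvs) {
            if (n <= 0) {
                return -1; // stats missing: bail out, mirroring the PR
            }
            cardinality *= n;
        }
        return cardinality;
    }

    /**
     * Bucket-aware estimate: iceberg_bucket(x, buckets) yields at most
     * min(NDV(x), buckets) distinct values; buckets[i] <= 0 means the
     * column has no bucket transform applied.
     */
    static long bucketAwareCardinality(long[] ndvs, int[] buckets) {
        long cardinality = 1;
        for (int i = 0; i < ndvs.length; i++) {
            if (ndvs[i] <= 0) {
                return -1;
            }
            long capped = buckets[i] > 0 ? Math.min(ndvs[i], buckets[i]) : ndvs[i];
            cardinality *= capped;
        }
        return cardinality;
    }

    public static void main(String[] args) {
        long[] ndvs = {100};   // column with NDV = 100
        int[] buckets = {8};   // iceberg_bucket(col, 8)
        System.out.println(naiveCardinality(ndvs));                // 100
        System.out.println(bucketAwareCardinality(ndvs, buckets)); // 8
    }
}
```

With the naive estimate the writer selection sees 100 expected partitions; the bucket-aware estimate sees at most 8, which could flip the choice toward the cheaper clustered writer in more cases, as the review suggests.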



What changes were proposed in this pull request?
Cost-based selection between Fanout and Clustered writers
Why are the changes needed?
Performance optimization: pick the cheaper writer based on the partition cardinality estimated from column stats NDV.
Does this PR introduce any user-facing change?
No
How was this patch tested?
mvn test -Dtest=TestIcebergCliDriver -Dqfile=dynamic_partition_writes.q