HIVE-25948: Iceberg: Enable cost-based selection between Fanout and Clustered writers using column stats NDV #6389
deniskuzZ wants to merge 3 commits into apache:master
Conversation
```java
// (e.g. Iceberg) knows the input is ordered and can use a clustered writer.
if (!customPartitionExprs.isEmpty()) {
  dpCtx.setHasCustomPartitionOrSortExpression(true);
}
```
Should this be located after all the bailouts are evaluated? As written, we may mutate the context even when `if (!removeRSInsertedByEnforceBucketing(fsOp))` subsequently bails out.
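To make the effect of this flag concrete, here is a hedged, self-contained sketch (class and method names are hypothetical stand-ins, not Hive's actual `DynamicPartitionCtx` API) of how a flag like this could gate the downstream writer choice: a clustered writer needs the input grouped by partition key, while a fanout writer does not.

```java
// Hypothetical sketch, not Hive code. If the table format (e.g. Iceberg)
// supplied custom partition/sort expressions, the input arrives clustered,
// so a clustered writer (one open file at a time) is safe; otherwise a
// fanout writer keeps a file open per partition seen.
class DynamicPartitionCtx {
    private boolean hasCustomPartitionOrSortExpression;

    void setHasCustomPartitionOrSortExpression(boolean v) {
        hasCustomPartitionOrSortExpression = v;
    }

    boolean hasCustomPartitionOrSortExpression() {
        return hasCustomPartitionOrSortExpression;
    }
}

public class WriterSelectionSketch {
    // Clustered writer requires rows grouped by partition key;
    // fall back to fanout when no such guarantee exists.
    static String chooseWriter(DynamicPartitionCtx dpCtx) {
        return dpCtx.hasCustomPartitionOrSortExpression() ? "clustered" : "fanout";
    }

    public static void main(String[] args) {
        DynamicPartitionCtx dpCtx = new DynamicPartitionCtx();
        dpCtx.setHasCustomPartitionOrSortExpression(true);
        System.out.println(chooseWriter(dpCtx)); // clustered
    }
}
```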
```java
if (partStats == null) {
  return -1;
}
cardinality *= partStats.getCountDistint();
```
I might be wrong, but I presume this evaluates `iceberg_bucket(ndv_100, 8)` (8 = num buckets) as cardinality 100. Since the `iceberg_*` transforms always narrow the source column, the current implementation is likely to overestimate cardinality and degrade performance for any such workload.
This also means we have a chance to further improve the cost-based selection. Since the cardinality of `iceberg_bucket(x, 8)` is obviously `min(cardinality(x), 8)`, we can enable the optimization in more cases.
I guess we can achieve this if we implement the following API in the `iceberg_*` UDFs and resolve a more accurate cardinality in this `computePartCardinality` method:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/StatEstimator.java
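The gap the review points out can be sketched numerically. The following is a hedged, self-contained illustration (class and method names are hypothetical, not Hive's `computePartCardinality` or `StatEstimator` implementations): multiplying raw NDVs overestimates the partition count for bucket transforms, whereas capping each term at the bucket count reflects that `iceberg_bucket(x, N)` can emit at most `min(NDV(x), N)` distinct values.

```java
// Hedged sketch, not Hive's actual implementation. Shows why capping
// per-column NDV at the bucket count gives a tighter partition-cardinality
// estimate for iceberg_bucket(x, N) than using the raw column NDV.
public class PartCardinalitySketch {

    /** Naive estimate: multiply raw NDVs (what the snippet above does). */
    static long naiveCardinality(long[] ndvs) {
        long cardinality = 1;
        for (long n : ndvs) {
            if (n <= 0) {
                return -1; // stats missing: bail out, mirroring the PR
            }
            cardinality *= n;
        }
        return cardinality;
    }

    /**
     * Bucket-aware estimate: iceberg_bucket(x, buckets) yields at most
     * min(NDV(x), buckets) distinct values; buckets[i] <= 0 means the
     * column has no bucket transform applied.
     */
    static long bucketAwareCardinality(long[] ndvs, int[] buckets) {
        long cardinality = 1;
        for (int i = 0; i < ndvs.length; i++) {
            if (ndvs[i] <= 0) {
                return -1;
            }
            long capped = buckets[i] > 0 ? Math.min(ndvs[i], buckets[i]) : ndvs[i];
            cardinality *= capped;
        }
        return cardinality;
    }

    public static void main(String[] args) {
        long[] ndvs = {100};   // column with NDV = 100
        int[] buckets = {8};   // iceberg_bucket(col, 8)
        System.out.println(naiveCardinality(ndvs));                // 100
        System.out.println(bucketAwareCardinality(ndvs, buckets)); // 8
    }
}
```

With the naive estimate the writer selection sees 100 expected partitions; the bucket-aware estimate sees at most 8, which could flip the choice toward the cheaper clustered writer in more cases, as the review suggests.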



What changes were proposed in this pull request?
Cost-based selection between Fanout and Clustered writers
Why are the changes needed?
Performance optimization: pick the cheaper writer based on the partition cardinality estimated from column stats NDV.
Does this PR introduce any user-facing change?
No
How was this patch tested?
mvn test -Dtest=TestIcebergCliDriver -Dqfile=dynamic_partition_writes.q