HIVE-28363: Improve heuristics of FilterStatsRule without column stats #5337

okumin · 2024-07-08T08:26:04Z

What changes were proposed in this pull request?

Make FilterStatsRule halve the estimated number of rows passing an IN filter when the number of distinct values is not available.
https://issues.apache.org/jira/browse/HIVE-28363

Why are the changes needed?

Simply put, col IN (5) currently halves the estimated row count, col IN (5, 10) keeps 100% of the rows, col IN (5, 10, 15) also keeps 100%, and so on. I expect col IN ({arbitrary number of constants}) to filter out rows to some extent in almost all cases.

FilterStatsRule roughly estimates the number of rows that pass an IN filter as {Original # of rows} * {1 / cardinality} * {# of values in IN}. The second term is estimated as 0.5 when column stats are unavailable. So the estimate always equals the original row count once IN contains two or more constants, as in col IN (1, 3).

FilterStatsRule may have behaved the same way as this PR in the past, but an earlier change slightly altered the formula to cover a special case. We will likely prefer the original formula when column stats are not available.
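The arithmetic above can be sketched as follows. This is only an illustration of the formula as described in this PR, not the actual StatsRulesProcFactory code: the class and method names are hypothetical, and the cap at the original row count is an assumption inferred from the "keeps 100%" behavior described above.

```java
// Illustrative sketch only; names are hypothetical, not Hive APIs.
final class InFilterEstimateSketch {
    // Fallback used for {1 / cardinality} when column stats are unavailable.
    static final double NO_STATS_SELECTIVITY = 0.5;

    // Formula described above: rows * (1 / cardinality) * (# of values in IN).
    // Assumption: the result is capped at the original row count.
    static long beforeChange(long numRows, int inListSize) {
        double factor = Math.min(1.0, NO_STATS_SELECTIVITY * inListSize);
        return Math.round(numRows * factor);
    }

    // Behavior proposed here for a single column without stats: halve the
    // estimate regardless of how many constants the IN list contains.
    static long afterChange(long numRows) {
        return Math.round(numRows * NO_STATS_SELECTIVITY);
    }

    public static void main(String[] args) {
        long numRows = 10;
        for (int inListSize = 1; inListSize <= 3; inListSize++) {
            System.out.println("IN list size " + inListSize
                + ": before=" + beforeChange(numRows, inListSize)
                + ", after=" + afterChange(numRows));
        }
    }
}
```

Under these assumptions and with 10 rows, the sketch yields 5, 10, and 10 rows before the change for one, two, and three constants, and 5 rows in every case after it.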

Does this PR introduce any user-facing change?

No.

Is the change a dependency upgrade?

No.

How was this patch tested?

CREATE TABLE users (id INT);
INSERT INTO users VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
set hive.fetch.task.conversion=none;
set hive.stats.fetch.column.stats=false;
EXPLAIN SELECT * FROM users WHERE id IN (1);
EXPLAIN SELECT * FROM users WHERE id IN (1, 2);

sonarcloud · 2024-07-09T00:37:46Z

Quality Gate passed

Issues
4 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

kgyrtkirk · 2024-07-17T12:28:15Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java

@@ -557,14 +557,17 @@ private long evaluateInExpr(Statistics stats, ExprNodeDesc pred, long currNumRow
      }
      for (int i = 0; i < columnStats.size(); i++) {
        long dvs = columnStats.get(i) == null ? 0 : columnStats.get(i).getCountDistint();
-        long intersectionSize = estimateIntersectionSize(aspCtx.getConf(), columnStats.get(i), values.get(i));
+        if (dvs == 0) {
+          factor *= 0.5;


I think you could possibly introduce a hiveconf key for setting this parameter as well; I guess a default of .1 would be even better...

okumin · Jul 17, 2024

Can we use this property to tune the factor? It may not directly support your original intention, but it will likely work as well or even better.

kgyrtkirk · Jul 24, 2024

Sure, you could use that as well; that's also better than burning in 0.5.

okumin · Jul 25, 2024

@kgyrtkirk Can I keep the current revision? The parameter is multiplied in at the end of the overall calculation, which I think is more desirable than multiplying an arbitrary parameter inside this loop. If we used 0.1 in this loop and the size of columnStats were 3, the final factor would be 0.1 * 0.1 * 0.1. That may be overkill.
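To make the compounding concern in the comment above concrete, here is a minimal sketch that compares multiplying a fallback factor once per column inside the loop with applying a single factor once at the end. The class and method names and the row count of 1,000,000 are hypothetical and chosen only for illustration; nothing here reflects the actual Hive implementation or any specific configuration property.

```java
// Illustrative sketch only; names and numbers are hypothetical.
final class FactorPlacementSketch {
    // Multiply the fallback factor once for every column that lacks stats,
    // as a per-column multiplication inside the loop would do.
    static double perColumn(double fallback, int columnsWithoutStats) {
        double factor = 1.0;
        for (int i = 0; i < columnsWithoutStats; i++) {
            factor *= fallback;
        }
        return factor;
    }

    // Turn a selectivity factor into an estimated row count.
    static long estimate(long numRows, double factor) {
        return Math.round(numRows * factor);
    }

    public static void main(String[] args) {
        long numRows = 1_000_000L;
        int columns = 3;
        System.out.println("0.5 per column: " + estimate(numRows, perColumn(0.5, columns)));
        System.out.println("0.1 per column: " + estimate(numRows, perColumn(0.1, columns)));
        System.out.println("0.1 applied once: " + estimate(numRows, 0.1));
    }
}
```

With three columns, a per-column factor of 0.1 shrinks the estimate to roughly a thousandth of the input, whereas the same value applied once keeps a tenth, which is the gap the comment describes.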

okumin · Sep 10, 2024

@kgyrtkirk Would you mind accepting the proposal and merging this PR? Thanks.

okumin · 2024-07-24T09:24:15Z

@kgyrtkirk Can we merge this one, or is there something I need to fix? Thanks!

sonarcloud · 2024-09-19T11:54:19Z

Quality Gate passed

Issues
4 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

okumin · 2024-09-20T09:32:38Z

I rebased this branch on the current master yesterday.

zhangbutao

LGTM +1
Based on the provided and changed test cases, I think this change is reasonable, especially for external tables without column stats.

IMO, statistics are only a rough estimate, and we cannot make them perfect. So we can get this change in first and then continue to optimize it.

BTW, we can also consider how to evaluate Iceberg table stats to optimize queries in StatsRulesProcFactory. This can be done in future work.

Hi, @kgyrtkirk, do you have any other comments? :)
Thanks.

zhangbutao · 2024-09-28T02:47:21Z

@kgyrtkirk I just merged the change. If you have further comments, we can optimize it later. :)
Thanks.

okumin · 2024-09-29T06:31:42Z

Thank you, both!

asf-ci-hive added the tests pending label Jul 8, 2024

okumin changed the title ~~HIVE-28363: Improve heuristics of FilterStatsRule without column stats~~ [WIP] HIVE-28363: Improve heuristics of FilterStatsRule without column stats Jul 8, 2024

okumin marked this pull request as draft July 8, 2024 09:45

asf-ci-hive added tests unstable and removed tests pending labels Jul 8, 2024

okumin force-pushed the HIVE-28363-filter-stats branch from 4347f79 to d99a94a Compare July 8, 2024 15:08

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Jul 8, 2024

okumin force-pushed the HIVE-28363-filter-stats branch from d99a94a to 18355a2 Compare July 8, 2024 23:29

asf-ci-hive added tests pending and removed tests unstable labels Jul 8, 2024

asf-ci-hive added tests passed and removed tests pending labels Jul 9, 2024

okumin changed the title ~~[WIP] HIVE-28363: Improve heuristics of FilterStatsRule without column stats~~ HIVE-28363: Improve heuristics of FilterStatsRule without column stats Jul 9, 2024

okumin marked this pull request as ready for review July 9, 2024 09:28

kgyrtkirk approved these changes Jul 17, 2024

HIVE-28363: Improve heuristics of FilterStatsRule without column stats

6f0af03

okumin force-pushed the HIVE-28363-filter-stats branch from 18355a2 to 6f0af03 Compare September 19, 2024 09:14

asf-ci-hive added tests pending and removed tests passed labels Sep 19, 2024

asf-ci-hive added tests passed and removed tests pending labels Sep 19, 2024

zhangbutao approved these changes Sep 26, 2024

zhangbutao merged commit b058e3d into apache:master Sep 28, 2024
6 checks passed

okumin deleted the HIVE-28363-filter-stats branch September 29, 2024 06:31
