Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28363: Improve heuristics of FilterStatsRule without column stats #5337

Merged
merged 1 commit into from
Sep 28, 2024

Conversation

okumin
Copy link
Contributor

@okumin okumin commented Jul 8, 2024

What changes were proposed in this pull request?

Make FilterStatsRule reduce the number of filtered rows by half when # of distinct values is empty.
https://issues.apache.org/jira/browse/HIVE-28363

Why are the changes needed?

Simply, col IN (5) makes the estimated amount half, col IN (5, 10) keeps 100% of stats, col IN (5, 10, 15) keeps 100% of stats, and so on. I expect col IN ({arbitrary number of constants}) to filter out rows to some extent in almost all cases.

FilterStatsRule roughly estimates the number of rows filtered by IN to be {Original # of rows} * {1 / cardinality} * {# of values in IN}. The second term is estimated as 0.5 when column stats are unavailable. So, it always returns the original number when IN retains two constant values like col IN (1, 3).

Maybe, FilterStatsRule had behaved in the same way as this PR before, but this change slightly changed the formula to cover a special case. We will likely prefer the original formula when columns stats are not available.

Does this PR introduce any user-facing change?

No.

Is the change a dependency upgrade?

No.

How was this patch tested?

CREATE TABLE users (id INT);
INSERT INTO users VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
set hive.fetch.task.conversion=none;
set hive.stats.fetch.column.stats=false;
EXPLAIN SELECT * FROM users WHERE id IN (1);
EXPLAIN SELECT * FROM users WHERE id IN (1, 2);

@okumin okumin changed the title HIVE-28363: Improve heuristics of FilterStatsRule without column stats [WIP] HIVE-28363: Improve heuristics of FilterStatsRule without column stats Jul 8, 2024
@okumin okumin marked this pull request as draft July 8, 2024 09:45
Copy link

sonarcloud bot commented Jul 9, 2024

@okumin okumin changed the title [WIP] HIVE-28363: Improve heuristics of FilterStatsRule without column stats HIVE-28363: Improve heuristics of FilterStatsRule without column stats Jul 9, 2024
@okumin okumin marked this pull request as ready for review July 9, 2024 09:28
@@ -557,14 +557,17 @@ private long evaluateInExpr(Statistics stats, ExprNodeDesc pred, long currNumRow
}
for (int i = 0; i < columnStats.size(); i++) {
long dvs = columnStats.get(i) == null ? 0 : columnStats.get(i).getCountDistint();
long intersectionSize = estimateIntersectionSize(aspCtx.getConf(), columnStats.get(i), values.get(i));
if (dvs == 0) {
factor *= 0.5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could possibly introduce a hiveconf key for setting this parameter as well; I guess a default of .1 would be even better...

Copy link
Contributor Author

@okumin okumin Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use this property to tune the factor? It could not directly support your original intention, but it will likely work as well or even better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure you could use that as well - that's also better than burning in.5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgyrtkirk Can I keep the current revision? The parameter is multiplied at the end of the overall calculation. I guess it is more desirable than multiplying the arbitrary parameter in this loop. If we use 0.1 in this loop, and if the size of columnStats is 3, the final factor will be 0.1 * 0.1 * 0.1. Maybe, it is overkilling

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgyrtkirk Would you mind accepting the proposal and merging this PR? Thanks.

@okumin
Copy link
Contributor Author

okumin commented Jul 24, 2024

@kgyrtkirk Can we merge this one or do I need something to fix? Thanks!

Copy link

sonarcloud bot commented Sep 19, 2024

@okumin
Copy link
Contributor Author

okumin commented Sep 20, 2024

I rebased this branch on the current master yesterday

Copy link
Contributor

@zhangbutao zhangbutao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1
Based on the provided&changed test cases, i think this change is reasonable, especially for the external tables without column stats.

IMO, statistics are only a rough estimate, and we can not make it perfect. So we can get this change in first, and then continue to optimize it.

BTW, we can also consider to how to evaluate Iceberg table stats to optimize the queries in the StatsRulesProcFactory. This can be done in the future work.

Hi, @kgyrtkirk, do you have any other comments? :)
Thanks.

@zhangbutao zhangbutao merged commit b058e3d into apache:master Sep 28, 2024
6 checks passed
@zhangbutao
Copy link
Contributor

@kgyrtkirk I just merged the change. If you have further comments. we can optimize it later. :)
Thanks.

@okumin okumin deleted the HIVE-28363-filter-stats branch September 29, 2024 06:31
@okumin
Copy link
Contributor Author

okumin commented Sep 29, 2024

Thank you, both!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants