-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28363: Improve heuristics of FilterStatsRule without column stats #5337
Conversation
4347f79
to
d99a94a
Compare
d99a94a
to
18355a2
Compare
Quality Gate passedIssues Measures |
@@ -557,14 +557,17 @@ private long evaluateInExpr(Statistics stats, ExprNodeDesc pred, long currNumRow | |||
} | |||
for (int i = 0; i < columnStats.size(); i++) { | |||
long dvs = columnStats.get(i) == null ? 0 : columnStats.get(i).getCountDistint(); | |||
long intersectionSize = estimateIntersectionSize(aspCtx.getConf(), columnStats.get(i), values.get(i)); | |||
if (dvs == 0) { | |||
factor *= 0.5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could possibly introduce a hiveconf key for setting this parameter as well; I guess a default of .1
would be even better...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use this property to tune the factor? It could not directly support your original intention, but it will likely work as well or even better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure you could use that as well - that's also better than burning in.5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgyrtkirk Can I keep the current revision? The parameter is multiplied at the end of the overall calculation. I guess it is more desirable than multiplying the arbitrary parameter in this loop. If we use 0.1 in this loop, and if the size of columnStats
is 3, the final factor will be 0.1 * 0.1 * 0.1. Maybe, it is overkilling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgyrtkirk Would you mind accepting the proposal and merging this PR? Thanks.
@kgyrtkirk Can we merge this one or do I need something to fix? Thanks! |
18355a2
to
6f0af03
Compare
Quality Gate passedIssues Measures |
I rebased this branch on the current master yesterday |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
Based on the provided&changed test cases, i think this change is reasonable, especially for the external tables without column stats.
IMO, statistics are only a rough estimate, and we can not make it perfect. So we can get this change in first, and then continue to optimize it.
BTW, we can also consider to how to evaluate Iceberg table stats to optimize the queries in the StatsRulesProcFactory
. This can be done in the future work.
Hi, @kgyrtkirk, do you have any other comments? :)
Thanks.
@kgyrtkirk I just merged the change. If you have further comments. we can optimize it later. :) |
Thank you, both! |
What changes were proposed in this pull request?
Make FilterStatsRule reduce the number of filtered rows by half when # of distinct values is empty.
https://issues.apache.org/jira/browse/HIVE-28363
Why are the changes needed?
Simply,
col IN (5)
makes the estimated amount half,col IN (5, 10)
keeps 100% of stats,col IN (5, 10, 15)
keeps 100% of stats, and so on. I expectcol IN ({arbitrary number of constants})
to filter out rows to some extent in almost all cases.FilterStatsRule roughly estimates the number of rows filtered by IN to be
{Original # of rows} * {1 / cardinality} * {# of values in IN}
. The second term is estimated as 0.5 when column stats are unavailable. So, it always returns the original number whenIN
retains two constant values likecol IN (1, 3)
.Maybe, FilterStatsRule had behaved in the same way as this PR before, but this change slightly changed the formula to cover a special case. We will likely prefer the original formula when columns stats are not available.
Does this PR introduce any user-facing change?
No.
Is the change a dependency upgrade?
No.
How was this patch tested?