Introduce a way to represent constrained statistics / bounds on values in Statistics #8078
Comments
A further wrinkle that may be worth thinking about, even if the solution is not to do anything, is that some formats, including parquet, special-case certain floating point values. There is some discussion on apache/parquet-format#196 to try to make the statistics actually usable, and one potential outcome might be to just use totalOrder, so it may even be that no action is required, but just an FYI.
I have two proposals. If there are cases where the lower bound of a range might be inexact (or perhaps absent) while the upper bound is exact, or vice versa, we could define the following:
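One hypothetical shape such a representation could take (the names below are illustrative, not necessarily the two proposals referenced here):

// Hypothetical sketch: each bound of a range can independently be exact,
// an inexact estimate, or unknown.
#[derive(Debug, Clone, PartialEq)]
pub enum BoundEstimate<T> {
    /// The bound is known precisely
    Exact(T),
    /// The bound is only a best-effort estimate
    Inexact(T),
    /// Nothing is known about this bound
    Absent,
}

#[derive(Debug, Clone, PartialEq)]
pub struct RangeEstimate<T> {
    pub lower: BoundEstimate<T>,
    pub upper: BoundEstimate<T>,
}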
@berkaysynnada under what circumstances would we use `Inexact`?
Actually, upon more thought I think the best next step here is for me to prototype what this would look like. I'll do that and report back.
You are right, I cannot find any case to use Inexact. Any exact value can be relaxed toward the positive or negative side using intervals.
In proposal two, you still need
On the contrary: If you have
BTW @berkaysynnada's first proposal is the more general one: It allows range estimations where either bound can be an exact or an inexact estimate. His second proposal is slightly simpler to implement (and leverages the interval library), but it constrains range estimations to exact bounds. When we decide to support slightly more complicated filters, say
I think there is a mismatch between the current use of
Specifically,
For example the comments on
However, it is used to skip processing files when the (inexact) row count is above the fetch/limit:
I am pretty sure this is only valid if the estimate of the number of rows is conservative (aka the real value is at least as large as the statistics), but the code uses the value even for
Another example is
Another challenge I found with using
I will try prototyping option 1 suggested by @berkaysynnada #8078 (comment) and see what I find.
I think the documentation is correct but you have found 2 bugs in the code:
I will open a PR now fixing these issues.
Just so you know, I have completed the interval refactor and submitted it for our internal review. After that, interval arithmetic will be in
Any thoughts on @berkaysynnada's proposal, i.e.
I think extending the
BTW this exercise revealed that the name
I really like it. My plan was to spend time trying to code up some of these proposals and see if it can fit in the existing framework (and thus potentially reveal in what cases Inexact would be used), etc. However, I spent much more of my time this week on apache/arrow-rs#5050 and related discussions around #7994 and #8017 than I had planned, so I am sadly behind. I hope to make progress tomorrow.
It is unlikely I will have time to work on this today -- I will pick it up next week.
I have been playing with and studying this code. While the suggestion from @ozankabak and @berkaysynnada in #8078 (comment) is very general and can represent many types of uncertainty in statistics, I haven't found cases yet where that full generality is important. For example, I can't find (nor think of) an important case where the lower bound would be known with certainty but the upper bound is uncertain (vs TYPE::MAX). Another example would be a use case where distinguishing between ranges like
Thus I am going to prototype what adding
I also plan to encapsulate more of the checks into
I'll report back here with how it goes shortly.
Do we want to support something like this in the short term? Then we would need inexact bounds on both sides.
Ok. Thank you for this point. I am still working through exactly what "inexact" means / when it would be used; I probably misunderstand. I plan to work on this ticket in several PRs. In the first few I plan to centralize the use of
After some thought, I think the key information we are trying to capture in statistics is:
I don't think the proposal (at least as I understand it) in #8078 (comment) can capture the known min/max values. However, perhaps something like this could:

pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
/// The exact value is known
Exact(T),
/// The exact value is not known, but the real value is **known** to be within
/// the specified range: `lower <= value <= upper` and furthermore is estimated
/// to be `estimate`. However, the real value may fall anywhere in the range
/// TODO: we could use `Interval` here instead, which could represent more complex cases (like
/// open/closed bounds rather than using T::MIN/T::MAX)
Bounded { lower: T, estimate: T, upper: T },
/// Nothing is known about the value
#[default]
Absent,
}

Maybe estimate should be
Example 1: Filter
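As a hedged illustration (the filter predicate and all values are invented, not the original example), the Bounded variant proposed above might be populated like this after applying a filter such as `x < 100` to a column whose statistics were exactly 0..=1000 beforehand:

// Illustrative only: the post-filter value statistic of column `x`,
// using the `Precision` enum proposed above.
let post_filter_x_value: Precision<i64> = Precision::Bounded {
    lower: 0,     // guaranteed: the filter cannot introduce values below the old minimum
    estimate: 50, // a rough point guess, e.g. the midpoint of the surviving range
    upper: 99,    // guaranteed: no value >= 100 passes the filter `x < 100`
};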
Thank you for investigating this. There are some parts I don't quite follow. Particularly,
I don't think this is entirely accurate. In addition to the two configurations you identified, it can also be:

Precision::Range(
PointEstimate::Exact(0), // min
PointEstimate::Exact(N), // max
)

which captures the known min/max values. However, this loses our inexact guess
Example 2
This has come up again recently -- I started a discussion here
Suremarc has a PR here:
Now that the interval library is in

pub enum Estimate {
Range { bounds: Interval, value: ScalarValue },
/// Nothing is known about the value
#[default]
Absent,
}

This would be great to have and would support all use cases as far as I can tell.
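For instance, a value known to lie in [0, 1000] with a best single-point guess of 500 could be built roughly like this (a hedged sketch; it assumes DataFusion's `Interval::try_new` and `ScalarValue::from` constructors together with the Estimate enum above):

// Illustrative only: `bounds` is a guarantee, `value` is the best point guess.
// Written inside a function returning a Result so the `?` operator applies.
let estimate = Estimate::Range {
    bounds: Interval::try_new(ScalarValue::from(0i64), ScalarValue::from(1000i64))?,
    value: ScalarValue::from(500i64),
};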
There was a bunch of discussion on #13293 that I would like to try and summarize: Trying to combine the proposals from @crepererum
Changes to
I like the incremental approach and the ultimate desiderata on functionality, but I have some concerns on the "how" part. I will share my thoughts below; hopefully we will get to a neat design that we will not need to change/mess with again relatively soon.

Precision

There are two related, but independent concepts we need to track here:
(1) an estimate (best guess) of a quantity, and
(2) bounds the quantity is guaranteed to lie within.
(1) and (2) have different use cases. We typically use (1) for optimization (join algorithm and/or side selection is a good example), and we end up with performance gains when estimations are accurate. If not, we may end up with bad performance, but we shouldn't get incorrect results. However, we use (2) for more critical decisions, like pruning data structures (e.g. trimming hash tables because we know a match has become impossible for a section of the table). A problem in (2) causes incorrect results to be generated.

(1) and (2) are not mutually exclusive. Sometimes we have info on both, sometimes only one but not the other, sometimes we don't know anything about either. The type definition as stated currently doesn't accommodate this, but this is an easy fix.

Column statistics vs Expression evaluation

Evaluating an expression for bounds is a "low-level" concept - a slight generalization of normal expression evaluation where one uses interval arithmetic (or another range computation technique) instead of ordinary arithmetic. There are many use cases where this is what is required, and other statistical concepts such as distinct count and/or null count are not even applicable. We actually have use cases such as this at Synnada. One can discuss what the range of an expression is without any dataset at hand -- just ranges of each symbol is enough.

On the other hand, computing column statistics is a "higher-level" task where computing the range is just one component of what is sought after.

All in all, I don't currently think merging these things is a good idea. I see that a mechanism for "stats in -> stats out" kind of computations seems to be necessary, but IMO any such mechanism should use
BTW, we would be happy to devote some person-hours to contribute to this project and get it over the finish line. We happen to have thought (and worked) quite intensively on this subject and would be happy to help contribute some of those understandings/code back to the upstream DF community.
I think this is a great insight. Maybe the problem is that we are trying to overload
The
What is the easy fix you envision?

One thing I could think of would be to add a structure similar to

/// Known bounds for a set of columns
///
/// This information is guaranteed to be correct and is used for
/// optimization and correctness transformations. See [`Statistics`] for
///
struct Bounds {
/// Interval, if known, for each output column
inner: Vec<Option<Interval>>,
}

However,
This also makes sense (that statistics propagation is a higher level concept that naturally builds on
One way maybe we could support the use case of code that called the existing evaluate_bounds function with the new API could be to make it easy to a
For example, given existing code like

let input_intervals: Vec<&Interval> = ....;
let output_interval = expr.evaluate_bounds(&input_intervals)?;
...

Might look something like

let input_intervals: Vec<&Interval> = ....;
// wrap input intervals with Statistics
let temp_statistics = Statistics::new_from_bounds(&input_intervals);
// compute output column statistics
let output_column_statistics = expr.column_statistics(&temp_statistics)?;
// use the output value if it was known
let output_interval = match output_column_statistics.value() {
    Precision::Absent | Precision::PointEstimation => None,
    Precision::Interval(interval) => Some(interval),
};
Having attempted to implement a new

struct Precision<T> {
lower: Option<T>,
upper: Option<T>,
point_estimate: Option<T>,
}

I did indeed notice that just about all of the code dealing with cardinality estimates (
Notwithstanding, I had already taken a stab at implementing this layout for

pub struct ColumnStatistics {
pub null_count: Precision<usize>,
/// The value (or range) that this column takes on
pub value: Precision<ScalarValue>,
pub distinct_count: Precision<usize>,
}

As far as I can gather from looking at the codebase, current use cases for column statistics center around cardinality estimates (
The other option is to keep
let input_intervals: Vec<&Interval> = ....;
// wrap input intervals with Statistics
let temp_statistics = Statistics::new_from_bounds(&input_intervals);
// compute output column statistics
let output_column_statistics = expr.column_statistics(&temp_statistics)?;
// use the output value if it was known
let output_interval = match output_column_statistics.value() {
    Precision::Absent | Precision::PointEstimation => None,
    Precision::Interval(interval) => Some(interval),
};

This usage is weird in contexts where the concept of statistics isn't even applicable. I agree that it will work, but only because we are forcing it. I think the right pattern is to have
So I see something like
FWIW there is a current move to add statistics into the arrow format itself: I actually think we could standardize on converting to/from that format. I don't think the arrow proposal handles the subtlety about intervals, known facts, etc., but we should at least be aware of them (thanks @edmondop for pointing this out to me).
Definitely. We will probably leverage that information to construct our own
BTW we will start working on this tomorrow.
Is your feature request related to a problem or challenge?
This has come up a few times, most recently in discussions with @berkaysynnada on apache/arrow-rs#5037 (comment)
Use case 1 is that for large binary/string columns, formats like parquet allow storing a truncated value that does not actually appear in the data. Given that values are stored in the min/max metadata, storing truncated values keeps the size of the metadata down.
For example, for a string column that has very long values, it requires much less space to store a short value slightly lower than the actual minimum as the "minimum" statistics value, and one that is slightly higher than the actual maximum as the "maximum" statistics value.
For example:

actual minimum value: aaa......z
actual maximum value: qqq......q
truncated "minimum" statistics value: a
truncated "maximum" statistics value: r
There is a similar use case when applying a Filter, as described by @korowa on #5646 (comment), and we have a similar one in IOx where the operator may remove values but won't decrease the minimum value or increase the maximum value in any column.
Currently `Precision` only represents `Exact` and `Inexact`; there is no way to represent "inexact, but bounded above/below".

Describe the solution you'd like
Per @berkaysynnada I propose changing `Precision::Inexact` to a new variant `Precision::Between` which would store an `Interval` of the known min/maxes of the value. This is a quite general formulation, and it can describe "how" inexact the values are. This would have the benefit of being very expressive (Intervals can represent open/closed bounds, etc.)
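A hedged sketch of what this could look like (the variant shape below is illustrative, not an agreed-upon API; `Interval` refers to DataFusion's interval type):

// Illustrative sketch of the proposed change: `Inexact` is replaced by a
// variant that carries the interval the true value is known to lie within.
pub enum Precision<T> {
    /// The exact value is known
    Exact(T),
    /// The exact value is unknown, but it is guaranteed to lie within this interval
    Between(Interval),
    /// Nothing is known about the value
    Absent,
}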
Describe alternatives you've considered
There is also a possibility of introducing a simpler, but more limited version of these statistics, like:
Additional context
No response