Introduce a way to represent constrained statistics / bounds on values in Statistics #8078

Open · alamb opened this issue Nov 7, 2023 · 32 comments · May be fixed by #13293

Labels: enhancement (New feature or request)
@alamb commented Nov 7, 2023

Is your feature request related to a problem or challenge?

This has come up a few times, most recently in discussions with @berkaysynnada on apache/arrow-rs#5037 (comment)

Use case 1: for large binary/string columns, formats like Parquet allow storing a truncated value that does not actually appear in the data. Since the values are stored in the min/max metadata, storing truncated values keeps the size of the metadata down.

For example, for a string column that has very long values, it requires much less space to store a short value slightly lower than the actual minimum as the "minimum" statistics value, and one that is slightly higher than the actual maximum as the "maximum" statistics value.

For example:

| actual min in data | actual max in data | "min" value in statistics | "max" value in statistics |
|---|---|---|---|
| aaa......z | qqq......q | a | r |
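As a sketch of the kind of logic a writer might use to produce such truncated bounds (the helper names and the byte-wise approach here are illustrative, not any specific writer's code):

fn truncated_min(value: &[u8], len: usize) -> Vec<u8> {
    // Any prefix sorts <= the original value in byte order, so a prefix
    // is always a valid (conservative) lower bound.
    value[..value.len().min(len)].to_vec()
}

fn truncated_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        return Some(value.to_vec());
    }
    let mut prefix = value[..len].to_vec();
    // Increment the last incrementable byte so the result sorts strictly
    // greater than the original value; drop any trailing 0xFF bytes first.
    while let Some(&last) = prefix.last() {
        if last < 0xFF {
            *prefix.last_mut().unwrap() = last + 1;
            return Some(prefix);
        }
        prefix.pop();
    }
    None // e.g. the value starts with `len` 0xFF bytes
}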

There is a similar use case when applying a Filter, as described by @korowa on #5646 (comment). We have a similar one in IOx, where an operator may remove values but won't decrease the minimum value or increase the maximum value of any column.

Currently Precision only represents Exact and Inexact; there is no way to represent "inexact, but bounded above/below".

Describe the solution you'd like

Per @berkaysynnada, I propose changing Precision::Inexact to a new variant Precision::Between, which would store an Interval of the known min/max of the value.

enum Precision {
  ...
  /// The value is known to be in the specified interval
  Between(Interval)
}

This is a quite general formulation, and it can describe "how" inexact the values are.

This would have the benefit of being very expressive (Intervals can represent open/closed bounds, etc)

Describe alternatives you've considered

There is also a possibility of introducing a simpler, but more limited version of these statistics, like:

enum Precision {
  // The value is known to be within the range (it is at most this large for Max,
  // or at least this large for Min), but the actual value may be lower/higher.
  Bounded(ScalarValue)
}

Additional context

No response

@alamb added the enhancement (New feature or request) label on Nov 7, 2023
@alamb changed the title from "Introduce a way to represent constrained staitistics" to "Introduce a way to represent constrained statistics / bounds on values in Statistics" on Nov 7, 2023
@tustvold commented Nov 7, 2023

A further wrinkle that may be worth thinking about, even if the solution is to do nothing, is that some formats, including parquet, special-case certain floating point values. There is some discussion on apache/parquet-format#196 about making those statistics actually usable, and one potential outcome might be to just use totalOrder, in which case no action would be required. Just an FYI.

@berkaysynnada

I have two proposals:

  1. If we have cases where the lower bound of a range might be inexact (or perhaps absent) while the upper bound is exact, or vice versa, we could define the following:

enum Estimation {
    Exact(T),
    Inexact(T),
    Absent,
}
enum Precision {
    Point(Estimation),
    // min - max
    Range(Estimation, Estimation),
}

  2. If we don't need to consider the exactness of bounds and a simple interval suffices, this alternative could be used:
enum Precision {
    Exact(T),
    Inexact(T),
    Absent,
    Between(Interval)
}

@alamb commented Nov 7, 2023

enum Precision {
    Exact(T),
    Inexact(T),
    Absent,
    Between(Interval)
}

@berkaysynnada under what circumstances would we use Inexact? All the cases I can think of Inexact now would be handled by Between (potentially with unbounded intervals)

@alamb commented Nov 7, 2023

Actually, upon more thought I think the best next step here is for me to prototype what this would look like. I'll do that and report back

@berkaysynnada

enum Precision {
    Exact(T),
    Inexact(T),
    Absent,
    Between(Interval)
}

@berkaysynnada under what circumstances would we use Inexact? All the cases I can think of Inexact now would be handled by Between (potentially with unbounded intervals)

You are right, I cannot find any case to use Inexact. Any exact value can be relaxed toward the positive or negative side using intervals.

@ozankabak commented Nov 7, 2023

In proposal two, you still need Inexact; e.g. cases involving selectivity. Assume that you have a column x whose bounds are [1, 100]. After applying a filter of x <= 10, the row count estimate is an inexact estimate of 0.1 * N where N was the estimate prior to the filter.

On the other hand, if you have Between, you do not need Exact anymore: an Exact value is simply a Between value with a singleton interval inside.
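For illustration, with the hypothetical Between variant an exact value is just a degenerate interval (Interval::try_new is the existing interval constructor; the rest is a sketch):

// `Exact(42)` expressed as a singleton interval
let exact = Precision::Between(Interval::try_new(
    ScalarValue::from(42i64),
    ScalarValue::from(42i64),
)?);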

BTW @berkaysynnada's first proposal is the more general one: it allows range estimations where either bound can be an exact or inexact estimate. His second proposal is slightly simpler to implement (and leverages the interval library), but it constrains range estimations to exact bounds.

When we decide to support slightly more complicated filters, say x <= y, we will probably need range estimations where both sides could be inexact. Therefore, I think it makes sense to keep things flexible and go with the first proposal. Obv. happy to discuss other perspectives.

@alamb commented Nov 7, 2023

I think there is a mismatch between the current use of Precision::Inexact and its documentation

Specifically, Precision::Inexact appears to be treated as a "conservative estimate" in several places (which is actually what we need in IOx), so perhaps another option would be to rename Precision::Inexact to Precision::Conservative and document that it is a conservative estimate of the actual value 🤔

For example the comments on Precision::Inexact say

https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L32-L33

However, it is used to skip processing files when the (inexact) row count is above the fetch/limit:

https://github.com/apache/arrow-datafusion/blob/87aeef5aa4d8f38a8328f8e51e530e6c9cd9afa9/datafusion/core/src/datasource/statistics.rs#L69-L73

I am pretty sure this is only valid if the estimate of the number of rows is conservative (aka the real value is at least as large as the statistics), but the code uses the value even for Precision::Inexact:

https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L42-L46

Another example is ColumnStatistics::is_singleton, which I think is also only correct if the statistics are conservative (aka the actual min is no lower than the reported min and the actual max is no larger than the reported max)

https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L281-L286

@alamb commented Nov 7, 2023

Another challenge I found with using Interval as proposed in option 2 of #8078 (comment) is more mechanical but real: Interval is in the datafusion_physical_expr crate but Precision is in datafusion_common, meaning I can't use Interval in Precision without moving code around. Not impossible to do, but a data point.

I will try prototyping option 1 suggested by @berkaysynnada in #8078 (comment) and see what I find

@berkaysynnada

I think there is a mismatch between the current use of Precision::Inexact and its documentation

Specifically, Precision::Inexact appears to be treated as a "conservative estimate" in several places (which is actually what we need in IOx), so perhaps another option would be to rename Precision::Inexact to Precision::Conservative and document that it is a conservative estimate of the actual value 🤔

I think the documentation is correct but you have found 2 bugs in the code:

  1. Here, we must ensure the values are exact.
    https://github.com/apache/arrow-datafusion/blob/87aeef5aa4d8f38a8328f8e51e530e6c9cd9afa9/datafusion/core/src/datasource/statistics.rs#L69-L73

  2. Singleton check must be for exact values only.
    https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L281-L286

I will open a PR now fixing these issues.
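For reference, a minimal sketch of what fix (2) could look like, using the current ColumnStatistics field names (not necessarily the exact code in the PR):

impl ColumnStatistics {
    /// Only exact, equal, non-null min/max values prove the column holds
    /// a single value; inexact bounds prove nothing of the sort.
    pub fn is_singleton(&self) -> bool {
        match (&self.min_value, &self.max_value) {
            (Precision::Exact(min), Precision::Exact(max)) => {
                !min.is_null() && min == max
            }
            _ => false,
        }
    }
}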

@berkaysynnada

Another challenge I found with using Interval as proposed in option 2 of #8078 (comment) is more mechanical but real: Interval is in the datafusion_physical_expr crate but Precision is in datafusion_common, meaning I can't use Interval in Precision without moving code around. Not impossible to do, but a data point.

I will try prototyping option 1 suggested by @berkaysynnada in #8078 (comment) and see what I find

Just so you know, I have completed the interval refactor and submitted it for our internal review. After that, interval arithmetic will be in datafusion_expr, and there will be no significant obstacles to moving it to datafusion_common.

@ozankabak commented Nov 9, 2023

Any thoughts on @berkaysynnada's proposal, i.e.

enum PointEstimate {
    Exact(T),
    Inexact(T),
    Absent,
}
enum Precision {
    PointEstimate(PointEstimate),
    Range(PointEstimate, PointEstimate),
}

I think extending the Precision type like this, along with bugfixes like #8094, will address #8099.

BTW this exercise revealed that Precision is maybe not the best name. Maybe we should just use Estimate, which could be either a PointEstimate or a Range, which is a tuple of PointEstimates for the bounds.

@alamb commented Nov 9, 2023

Any thoughts on @berkaysynnada's proposal; i.e.

I really like it.

My plan was to spend time coding up some of these proposals to see if they fit in the existing framework (and thus potentially reveal in what cases Inexact would be used)

However, I spent much more of my time this week on apache/arrow-rs#5050 and related discussions around #7994 and #8017 than I had planned, so I am sadly behind

I hope to make progress tomorrow

@alamb commented Nov 10, 2023

It is unlikely I will have time to work on this today -- I will pick it up next week

@alamb commented Nov 14, 2023

I have been playing with and studying this code. While the suggestion from @ozankabak and @berkaysynnada in #8078 (comment) is very general and can represent many types of uncertainty in statistics, I haven't yet found cases where that full generality is important.

For example, I can't find (nor think of) an important case where the lower bound would be known with certainty but the upper bound was uncertain (as opposed to simply TYPE::MAX).

Another example would be a use case where it is important to distinguish between ranges like:

min: `PointEstimate::Absent`, max: `PointEstimate::Precise(value)`
min: `PointEstimate::Precise(TYPE::MIN)`, max: `PointEstimate::Precise(value)`

Thus I am going to prototype what adding a Bounded variant to Precision looks like.

I also plan to encapsulate more of the checks into Precision so that if we choose to go with a more general formulation we won't have to change as much of the rest of the code.

pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
    /// The exact value is known
    Exact(T),
    /// The exact value is not known, but the real value is known to be within
    /// the specified range: `lower <= value <= upper`. TODO: we could use
    /// `Interval` here instead, which could represent more complex cases (like
    /// open/closed bounds)
    Bounded { lower: T, upper: T },
    /// The value is not known exactly, but is likely close to this value.
    /// NOTHING can be assumed about the actual value in this case.
    Inexact(T),
    /// Nothing is known about the value
    #[default]
    Absent,
}
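For example, a sketch of how two such row-count statistics could be added across files once the logic is encapsulated (dropping to Absent whenever an Inexact side is involved keeps the sketch simple; a real implementation could do better):

fn add(a: &Precision<usize>, b: &Precision<usize>) -> Precision<usize> {
    use Precision::*;
    match (a, b) {
        (Exact(x), Exact(y)) => Exact(x + y),
        // An exact side shifts both ends of a bounded side.
        (Exact(x), Bounded { lower, upper }) | (Bounded { lower, upper }, Exact(x)) => {
            Bounded { lower: lower + x, upper: upper + x }
        }
        // Two bounded sides add end to end.
        (Bounded { lower: l1, upper: u1 }, Bounded { lower: l2, upper: u2 }) => {
            Bounded { lower: l1 + l2, upper: u1 + u2 }
        }
        _ => Absent,
    }
}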

I'll report back here with how it goes shortly

@ozankabak

When we decide to support slightly more complicated filters, say x <= y, we will probably need range estimations where both sides could be inexact. Therefore, I think it makes sense to keep things flexible and go with the first proposal. Obv. happy to discuss other perspectives.

Do we want to support something like this in the short term? Then we would need inexact bounds on both sides.

@alamb commented Nov 14, 2023

Do we want to support something like this in the short term? Then we would need inexact bounds on both sides.

Ok. Thank you for this point. I am still working through exactly what "inexact" means / when it would be used; I probably misunderstand something.

I plan to work on this ticket in several PRs. In the first few I plan to centralize the use of Precision a bit more (e.g. #8172) so it is clearer how it is used

@alamb commented Nov 14, 2023

After some thought, I think the key information we are trying to capture in statistics is:

  1. Known min/max values
  2. Estimated values that are imprecise (the result of applying some sort of estimate), but are better than just "somewhere between min/max"

I don't think the proposal (at least as I understand it) in #8078 (comment) can capture the known min/max values

However, perhaps something like this could:

pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
    /// The exact value is known
    Exact(T),
    /// The exact value is not known, but the real value is **known** to be within
    /// the specified range: `lower <= value <= upper` and furthermore is estimated
    /// to be `estimate`. However, the real value may fall anywhere in the range
    /// TODO: we could use `Interval` here instead, which could represent more complex cases (like
    /// open/closed bounds rather than using T::MIN/T::MAX)
    Bounded { lower: T, estimate: T, upper: T },
    /// Nothing is known about the value
    #[default]
    Absent,
}

Maybe estimate should be Option<T> rather than just T 🤔

Example 1: Filter x <= 10

Consider a filter where you know the column x has bounds [1, 100] and the row_count is N.

After applying a filter of x <= 10, assuming an even distribution you get a selectivity of 0.1, so we can conclude the following about the output row_count:

  1. It is still between 0 and N (the max value hasn't been changed by the filter)
  2. We estimate that the value is near 0.1 * N given our cardinality estimation, but this is just an estimate

As I understand @berkaysynnada's proposal in #8078 (comment), it would represent the output either as:

Precision::PointEstimate(Inexact(0.1*N))

or

Precision::Range(
  PointEstimate::Exact(0), // min
  PointEstimate::Inexact(0.1 * N), // max
)

But neither captures the fact that we know for sure that row count lies between 0 and N (the max wasn't changed by the filter). Using a Bounded variant could capture this information:

Precision::Bounded {
  lower: 0,
  estimate: 0.1 * N,
  upper: 100
}

Here is the setup in pictures for those who like that

     Output Statistics  for row_count
      value: estimated to be 0.1*N           
      min: Exactly 0           
      max: Exactly 100 (can not possibly be higher)         
                               
             ▲                 
             │                 
┌────────────┴────────────┐    
│                         │    
│       FilterExec        │    
│         x <= 10         │    
│                         │    
└─────────────────────────┘    
             ▲                 
             │                 
             │                 
                               
    Input Statistics for row count:     
     value: Exactly N      
     min: Exactly 0            
     max: Exactly 100          
                               

Example 2: x <= y

The other example raised above is a predicate like x <= y. I assume the challenge is how to represent the column ranges given this information. Let's say we know that x was in [50, 100], y was in [0, 200], and there were N input rows.

Assuming even and uncorrelated distributions, we would get a selectivity estimate of 0.75

  │                                                      │                      
  │                                                      │                      
  │──────────────────────────────────────────────────────┤       y              
  │                                                      │       min: 0         
  │                                                      │       max: 200       
                                                                                
  0             │                      │                200                     
                │                      │                                        
                ├──────────────────────┤                         x              
                │                      │                         min: 50        
                │                      │                         max: 100       
                                                                                
                0                     200                                       
                                                                                
                                                                                
               **     Only range that can be true      **                       
                **                                    **                        
                 **************************************                         
                                                                                

We could again represent the output statistics with the proper ranges as well as the cardinality estimate like this, which captures both the fact that y can't be less than 50 and the 0.75 selectivity:

x: Bounded { lower: 50, estimate: 0.75 * N, upper: 100 }, 
y: Bounded { lower: 50, estimate: 0.75 * N, upper: 200 }

@ozankabak

Thank you for investigating this. There are some parts I don't quite follow. Particularly,

I don't think the proposal (at least as I understand it) in #8078 (comment) can capture the known min/max values

I don't think this is entirely accurate. In addition to the two configurations you identified, it can also be:

Precision::Range(
  PointEstimate::Exact(0), // min
  PointEstimate::Exact(N), // max
)

which captures the known min/max values. However, this loses our inexact guess 0.1 * N. I think what you are trying to convey is that @berkaysynnada's proposal can not represent the triple of (lower bound, guess, upper bound). If this is what you mean, I agree with you.

Example 2 (x <= y)

x: Bounded { lower: 50, estimate: 0.75 * N, upper: 100 }, 
y: Bounded { lower: 50, estimate: 0.75 * N, upper: 200 }

I don't understand what you mean by this. In this snippet, lower/upper bounds refer to column bounds, while estimate refers to row count. Did you mean to write:

x: Bounded { lower: 50, estimate: 75, upper: 100 }, 
y: Bounded { lower: 50, estimate: 125, upper: 200 }
row_count: Bounded { lower: 0, estimate: 62.5, upper: 100}

where the 62.5 figure is derived from the selectivity factor, which is 5/8.
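For the record, the computation behind that factor, assuming independent uniform distributions on the stated ranges (integrating P(x <= y) over the range of x):

$$
P(x \le y) \;=\; \int_{50}^{100} \frac{200 - x}{200} \cdot \frac{dx}{50}
\;=\; \frac{1}{10000}\left[200x - \frac{x^2}{2}\right]_{50}^{100}
\;=\; \frac{6250}{10000} \;=\; \frac{5}{8}
$$

With N = 100, this gives the 62.5 row-count estimate in the snippet above.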

@alamb commented Nov 8, 2024

@ozankabak

Now that the interval library is in expr-common, is it possible we can use:

pub enum Estimate {
    Range { bounds: Interval, value: ScalarValue },
    /// Nothing is known about the value
    #[default]
    Absent,
}

This would be great to have and would support all use cases as far as I can tell.
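For illustration, here is one way the older variants could map onto this shape (a sketch: Estimate is the proposal above, while Interval::try_new and Interval::make_unbounded are the existing interval constructors):

fn from_exact(v: ScalarValue) -> Result<Estimate> {
    // An exact value is a singleton range whose point estimate is the value itself.
    Ok(Estimate::Range {
        bounds: Interval::try_new(v.clone(), v.clone())?,
        value: v,
    })
}

fn from_inexact(v: ScalarValue) -> Result<Estimate> {
    // An inexact value is a point estimate with no guaranteed bounds.
    Ok(Estimate::Range {
        bounds: Interval::make_unbounded(&v.data_type())?,
        value: v,
    })
}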

@alamb commented Dec 15, 2024

There was a bunch of discussion on #13293 that I would like to try to summarize.

What follows tries to combine the proposals from @crepererum in #13293 (comment) and @berkaysynnada in #13293 (comment) with my own hope that we make it cheaper to manage Statistics:

Changes to Statistics

At the time of writing, Statistics looks like:

pub struct Statistics {
    pub num_rows: Precision<usize>,
    pub total_byte_size: Precision<usize>,
    pub column_statistics: Vec<ColumnStatistics>,
}

I propose changing this to:

pub struct Statistics {
    pub num_rows: Precision<usize>,
    pub total_byte_size: Precision<usize>,
    pub column_statistics: Vec<Arc<ColumnStatistics>>,
}

Which would allow for quickly copying ColumnStatistics for unchanged columns.
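For example (a sketch assuming the Arc-wrapped field above): an operator that passes a column through unchanged could reuse the input statistics with a cheap pointer copy instead of a deep clone:

use std::sync::Arc;

fn passthrough_column_stats(input: &Statistics, col: usize) -> Arc<ColumnStatistics> {
    // Arc::clone only bumps a reference count; the underlying
    // ColumnStatistics (which may hold large ScalarValues) is not copied.
    Arc::clone(&input.column_statistics[col])
}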

An alternative may be to just use Arc<Statistics> everywhere (or maybe both 🤔)

Changes to ColumnStatistics

At the time of writing, ColumnStatistics looks like

pub struct ColumnStatistics {
    pub null_count: Precision<usize>,
    pub max_value: Precision<ScalarValue>,
    pub min_value: Precision<ScalarValue>,
    pub distinct_count: Precision<usize>,
}

I believe the proposal involves changing Precision so it can capture the min/max data, which would result in a ColumnStatistics structure like this:

pub struct ColumnStatistics {
    pub null_count: Precision<usize>,
    // The value (or range) that this column takes on
    pub value: Precision<ScalarValue>,
    pub distinct_count: Precision<usize>,
}

Changes to Precision

Some constraints:

  1. Precision today is generic, which allows for both Precision<usize> and Precision<ScalarValue>
  2. Interval already provides the ability to do range analysis and is based on ScalarValue

Option 1: Use ScalarValue exclusively

/// ...
enum Precision {
  /// Nothing is known about the value
  Absent,
  /// Estimated, but very close to the given point.
  ///
  /// This can be treated as an over-simplified normal distribution.
  PointEstimation(ScalarValue),
  /// Value is for sure in the given Interval. If min=max the value is known to be a single value
  Interval(Interval),
}

Pros: Can reuse Interval arithmetic, and will hopefully allow us to unify Interval and statistics calculations
Cons: Will be less efficient for statistics such as count, which we know must be usize

Option 2: Precision remains parameterized on T

/// ...
///
/// # Note
/// Unknown ranges are modeled as `Precision::Range{lower: None, upper: None}`,
enum Precision<T> {
  /// Estimated, but very close to the given point.
  ///
  /// This can be treated as an over-simplified normal distribution.
  PointEstimation(T),
  
  /// Value is for sure in the given open/half-open/closed range.
  ///
  /// If `lower`/`upper` are given, they are INCLUSIVE ends.
  Range {
    lower: Option<T>,
    upper: Option<T>,
  },
}

Pros: More natural analysis for statistics such as count, which we know must be usize
Cons: Harder to reuse Interval arithmetic, which works against unifying Interval and statistics calculations

Option 3: Change Interval to be parameterized on T

/// ...
enum Precision<T> {
  /// Nothing is known about the value
  Absent,
  /// Estimated, but "close" to the given point.
  ///
  /// This can be treated as an over-simplified normal distribution and used for 
  /// Cardinality estimation, for example
  PointEstimation(T),
  /// Value is for sure in the given Interval. If min=max the value is known to be a single value
  Interval(Interval<T>),
}

Cons: The most work/changes required
Pros: Unifies Intervals and Statistics
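A minimal sketch of what the Option 3 genericization could look like (the default type parameter is an assumption that would keep existing Interval users compiling unchanged):

pub struct Interval<T = ScalarValue> {
    lower: T,
    upper: T,
}

// Existing code keeps using `Interval` (i.e. `Interval<ScalarValue>`),
// while counts can use the cheaper `Interval<usize>`.
type RowCountInterval = Interval<usize>;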

Unify Statistics and evaluate_bounds

Today PhysicalExpr::evaluate_bounds computes the bounds of an expression's output using Intervals for the inputs:

trait PhysicalExpr {
    // ...
    fn evaluate_bounds(
        &self,
        _children: &[&Interval],
    ) -> Result<Interval, DataFusionError>;
    // ...
}

@gatesn proposes adding an API that does something similar:

His API computes the output ColumnStatistics given the input Statistics, which I believe is a more general concept than just evaluate_bounds (as it encompasses counts as well as the value):

trait PhysicalExpr {
    // ...
    fn column_statistics(&self, _statistics: &Statistics) -> Result<ColumnStatistics> { /* ... */ }
    // ...
}

If we allowed Precision to represent Intervals (Option 3 above), I think we could deprecate evaluate_bounds and simply use column_statistics, as it would be a strict superset.

Incremental steps to get us there:

The changes contemplated above are substantial and potentially quite disruptive to downstream APIs.

I propose a phased approach:

  1. Step 1 is to make the fields of Statistics / ColumnStatistics / Precision not pub and only allow access through accessors (see the sketch after this list). This will permit us to 1) change internal structures with more limited impact and 2) formalize the API by which statistics are manipulated.
  2. Change Interval to be Interval<T> which will allow Interval<usize> types
  3. Change Precision to follow Option 3 above
  4. Change evaluate_bounds to column_statistics
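A minimal sketch of step 1 (the accessor names are assumptions; the field names are from the current Statistics):

pub struct Statistics {
    num_rows: Precision<usize>, // note: no longer pub
    total_byte_size: Precision<usize>,
    column_statistics: Vec<ColumnStatistics>,
}

impl Statistics {
    /// Reads go through an accessor...
    pub fn num_rows(&self) -> &Precision<usize> {
        &self.num_rows
    }
    /// ...and writes through a builder-style setter, so the internal
    /// representation can change without breaking callers.
    pub fn with_num_rows(mut self, num_rows: Precision<usize>) -> Self {
        self.num_rows = num_rows;
        self
    }
}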

@ozankabak

I like the incremental approach and the ultimate desiderata on functionality, but I have some concerns on the "how" part. I will share my thoughts below; hopefully we will get to a neat design that we will not need to change/mess with again soon.

Precision

There are two related, but independent concepts we need to track here:

  1. Summary statistics; i.e. estimations about location and dispersion (spread).
  2. Definitive mathematical facts, such as the range or a guaranteed superset thereof. In plain English, "hard facts" that can never be violated.

(1) and (2) have different use cases. We typically use (1) for optimization (join algorithm and/or side selection is a good example), and we end up with performance gains when estimations are accurate. If not, we may end up with bad performance, but we shouldn't get incorrect results. However, we use (2) for more critical decisions, like pruning data structures (e.g. trimming hash tables because we know a match has become impossible for a section of the table). A problem in (2) causes incorrect results to be generated.

(1) and (2) are not mutually exclusive. Sometimes we have info on both, sometimes only one but not the other, sometimes we don't know anything about either. The type definition as stated currently doesn't accommodate this, but this is an easy fix.
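For example, one shape that tracks the two kinds of information independently (a sketch only; the name and layout are assumptions, not an agreed design):

struct ColumnEstimate {
    /// (1) Summary statistics: a best guess that may be wrong without
    /// affecting correctness (useful for e.g. join-side selection).
    point_estimate: Option<ScalarValue>,
    /// (2) Hard facts: the value is guaranteed to lie in this interval,
    /// so it is safe to use for pruning.
    bounds: Option<Interval>,
}

Either field (or both, or neither) can be present, matching the four cases described above.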

Column statistics vs Expression evaluation

Evaluating an expression for bounds is a "low-level" concept: a slight generalization of normal expression evaluation where one uses interval arithmetic (or another range computation technique) instead of ordinary arithmetic. There are many use cases where this is what is required, and other statistical concepts such as distinct count and/or null count are not even applicable. We actually have use cases like this at Synnada. One can discuss what the range of an expression is without any dataset at hand; just the ranges of each symbol are enough.

On the other hand, computing column statistics is a "higher-level" task where computing the range is just one component of what is sought after. All in all, I don't currently think merging these things is a good idea. I see that a mechanism for "stats in -> stats out" kind of computations seems to be necessary, but IMO any such mechanism should use evaluate_bounds as a subroutine.

@ozankabak

BTW, we would be happy to devote some person-hours to this project and get it over the finish line. We happen to have thought about (and worked on) this subject quite intensively and would be happy to contribute some of those understandings/code back to the upstream DF community.

@alamb commented Dec 16, 2024

(1) and (2) are not mutually exclusive. Sometimes we have info on both, sometimes only one but not the other, sometimes we don't know anything about either. The type definition as stated currently doesn't accommodate this, but this is an easy fix.

I think this is a great insight. Maybe the problem is that we are trying to overload Statistics to keep both types of information (statistics and bounds) 🤔

The aggregate_statistics pass in the optimizer uses Statistics for the bounds use case (aka it makes transformations that rely on values in Statistics for correctness):

What is the easy fix you envision? One thing I can think of would be to add a structure similar to Statistics but explicitly for tracking the known bounds of a column:

/// Known bounds for a set of columns
/// 
/// This information is guaranteed to be correct and is used for
/// optimization and correctness transformations. See [`Statistics`] for
/// estimated values.
struct Bounds {
  /// Interval, if known, for each output column
  inner: Vec<Option<Interval>>,
}

However, aggregate_statistics does use count statistics (not just value ranges), so we would have to work that in somehow.

Column statistics vs Expression evaluation

This also makes sense (that statistics propagation is a higher level concept that naturally builds on evaluate_bounds)

One way we could support the use case of code that calls the existing evaluate_bounds function could be to make it easy to build a Statistics object from the known ranges of an expression's inputs.

For example, given existing code like

let input_intervals: Vec<&Interval> = ....;
let output_interval = expr.evaluate_bounds(&input_intervals)?;
...

it might look something like:

let input_intervals: Vec<&Interval> = ....;
// wrap input intervals with Statistics
let temp_statistics = Statistics::new_from_bounds(&input_intervals);
// compute output column statistics
let output_column_statistics = expr.column_statistics(&temp_statistics)?;
// use the output value if it was known
let output_interval = match output_column_statistics.value() {
    Precision::Absent | Precision::PointEstimation(_) => None,
    Precision::Interval(interval) => Some(interval),
};

@suremarc

I think this is a great insight. Maybe the problem is that we are trying to overload Statistics to keep both types of information (statistics and bounds) 🤔

Having attempted to implement a new Precision API looking something like this:

struct Precision<T> {
    lower: Option<T>,
    upper: Option<T>,
    point_estimate: Option<T>,
}

I did indeed notice that just about all of the code dealing with cardinality estimates (total_byte_size, distinct_count, null_count, num_rows, etc.) only cared about point_estimate. Meanwhile, for column min/max statistics, we only really care about the upper & lower bounds.

Notwithstanding, I had already taken a stab at implementing this layout for ColumnStatistics (which is what @alamb proposed):

pub struct ColumnStatistics {
    pub null_count: Precision<usize>,
    // The value (or range) that this column takes on
    pub value: Precision<ScalarValue>,
    pub distinct_count: Precision<usize>,
}

As far as I can gather from looking at the codebase, current use cases for column statistics center around cardinality estimates (null_count and distinct_count) and lower/upper bounds (which is what value does). However, this API is technically a regression, as we lose the ability to express when the lower/upper bounds are "exact". My use case for #13296 is to have conservative min/max bounds, so I would be happy to make this change; however, having exact min/maxes would make certain optimizations possible, such as evaluating MIN(value) or MAX(value) only by looking at column statistics, without reading any data.

The other option is to keep ColumnStatistics the same (keep both min_value and max_value), and include a Precision for both min_value and max_value. But this will increase the size of Statistics significantly.

@ozankabak

let input_intervals: Vec<&Interval> = ....;
// wrap input intervals with Statistics
let temp_statistics = Statistics::new_from_bounds(&input_intervals);
// compute output column statistics
let output_column_statistics = expr.column_statistics(&temp_statistics)?;
// use the output value if it was known
let output_interval = match output_column_statistics.value() {
    Precision::Absent | Precision::PointEstimation(_) => None,
    Precision::Interval(interval) => Some(interval),
};

This usage is weird in contexts where the concept of statistics isn't even applicable. I agree that it will work, but only because we are forcing it. I think the right pattern is to have column_statistics simply use evaluate_bounds as a subroutine for computing hard bounds (and any other information of type (2) in my comment). For example, it would be quite natural for us to have things like evaluate_probability (not a great name!) that also takes expressions and does some sort of probabilistic computation (maybe PDF related). Then, column_statistics would also use evaluate_probability as a subroutine.

So I see something like column_statistics as a more general API that defers to lower level APIs to collect information.

@alamb commented Dec 24, 2024

FWIW there is a current move to add statistics into the arrow format itself:

I actually think we could standardize on converting to/from that format.

I don't think the arrow proposal handles the subtlety about intervals, known facts, etc., but we should at least be aware of them (thanks @edmondop for pointing this out to me)

@ozankabak

I don't think the arrow proposal handles the subtlety about intervals, known facts, etc., but we should at least be aware of them (thanks @edmondop for pointing this out to me)

Definitely. We will probably leverage that information to construct our own Statistics.

BTW we will start working on this tomorrow.
