Introduce a way to represent constrained statistics / bounds on values in Statistics #8078

Open · alamb opened this issue Nov 7, 2023 · 32 comments · May be fixed by #13293

Labels: enhancement (New feature or request)
@alamb commented Nov 7, 2023

Is your feature request related to a problem or challenge?

This has come up a few times, most recently in discussions with @berkaysynnada on apache/arrow-rs#5037 (comment)

Use case 1: for large binary/string columns, formats like Parquet allow storing a truncated value that does not actually appear in the data. Since the values are stored in the min/max metadata, storing truncated values keeps the size of the metadata down.

For example, for a string column that has very long values, it requires much less space to store a short value slightly lower than the actual minimum as the "minimum" statistics value, and one that is slightly higher than the actual maximum as the "maximum" statistics value.

For example:

| actual min in data | actual max in data | "min" value in statistics | "max" value in statistics |
|---|---|---|---|
| aaa......z | qqq......q | a | r |
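As a sketch of the kind of logic a writer might use to produce such truncated bounds (the helper names and the byte-wise approach here are illustrative, not any specific writer's code):

fn truncated_min(value: &[u8], len: usize) -> Vec<u8> {
    // Any prefix sorts <= the original value in byte order, so a prefix
    // is always a valid (conservative) lower bound.
    value[..value.len().min(len)].to_vec()
}

fn truncated_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        return Some(value.to_vec());
    }
    let mut prefix = value[..len].to_vec();
    // Increment the last incrementable byte so the result sorts strictly
    // greater than the original value; drop any trailing 0xFF bytes first.
    while let Some(&last) = prefix.last() {
        if last < 0xFF {
            *prefix.last_mut().unwrap() = last + 1;
            return Some(prefix);
        }
        prefix.pop();
    }
    None // e.g. the value starts with `len` 0xFF bytes
}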

There is a similar use case when applying a Filter, as described by @korowa on #5646 (comment). We have a similar one in IOx, where an operator may remove values but won't decrease the minimum value or increase the maximum value of any column.

Currently Precision only represents Exact and Inexact; there is no way to represent "inexact, but bounded above/below".

Describe the solution you'd like

Per @berkaysynnada, I propose changing Precision::Inexact to a new variant Precision::Between, which would store an Interval of the known min/max of the value.

enum Precision {
  ...
  /// The value is known to be in the specified interval
  Between(Interval)
}

This is a quite general formulation, and it can describe "how" inexact the values are.

This would have the benefit of being very expressive (Intervals can represent open/closed bounds, etc)

Describe alternatives you've considered

There is also a possibility of introducing a simpler, but more limited version of these statistics, like:

enum Precision {
  // The value is known to be within the range (it is at most this large for Max,
  // or at least this large for Min), but the actual value may be lower/higher.
  Bounded(ScalarValue)
}

Additional context

No response

@alamb added the enhancement (New feature or request) label on Nov 7, 2023
@alamb changed the title from "Introduce a way to represent constrained staitistics" to "Introduce a way to represent constrained statistics / bounds on values in Statistics" on Nov 7, 2023
@tustvold commented Nov 7, 2023

A further wrinkle that may be worth thinking about, even if the solution is to do nothing, is that some formats, including parquet, special-case certain floating point values. There is some discussion on apache/parquet-format#196 about making those statistics actually usable, and one potential outcome might be to just use totalOrder, in which case no action would be required. Just an FYI.

@berkaysynnada

I have two proposals:

  1. If we have cases where the lower bound of a range might be inexact (or perhaps absent) while the upper bound is exact, or vice versa, we could define the following:

enum Estimation {
    Exact(T),
    Inexact(T),
    Absent,
}
enum Precision {
    Point(Estimation),
    // min - max
    Range(Estimation, Estimation),
}

  2. If we don't need to consider the exactness of bounds and a simple interval suffices, this alternative could be used:
enum Precision {
    Exact(T),
    Inexact(T),
    Absent,
    Between(Interval)
}

@alamb commented Nov 7, 2023

enum Precision {
    Exact(T),
    Inexact(T),
    Absent,
    Between(Interval)
}

@berkaysynnada under what circumstances would we use Inexact? All the cases I can think of Inexact now would be handled by Between (potentially with unbounded intervals)

@alamb commented Nov 7, 2023

Actually, upon more thought I think the best next step here is for me to prototype what this would look like. I'll do that and report back

@berkaysynnada

enum Precision {
    Exact(T),
    Inexact(T),
    Absent,
    Between(Interval)
}

@berkaysynnada under what circumstances would we use Inexact? All the cases I can think of Inexact now would be handled by Between (potentially with unbounded intervals)

You are right, I cannot find any case to use Inexact. Any exact value can be relaxed toward the positive or negative side using intervals.

@ozankabak commented Nov 7, 2023

In proposal two, you still need Inexact; e.g. cases involving selectivity. Assume that you have a column x whose bounds are [1, 100]. After applying a filter of x <= 10, the row count estimate is an inexact estimate of 0.1 * N where N was the estimate prior to the filter.

On the other hand, if you have Between, you do not need Exact anymore: an Exact value is simply a Between value with a singleton interval inside.
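For illustration, with the hypothetical Between variant an exact value is just a degenerate interval (Interval::try_new is the existing interval constructor; the rest is a sketch):

// `Exact(42)` expressed as a singleton interval
let exact = Precision::Between(Interval::try_new(
    ScalarValue::from(42i64),
    ScalarValue::from(42i64),
)?);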

BTW @berkaysynnada's first proposal is the more general one: it allows range estimations where either bound can be an exact or inexact estimate. His second proposal is slightly simpler to implement (and leverages the interval library), but it constrains range estimations to exact bounds.

When we decide to support slightly more complicated filters, say x <= y, we will probably need range estimations where both sides could be inexact. Therefore, I think it makes sense to keep things flexible and go with the first proposal. Obv. happy to discuss other perspectives.

@alamb commented Nov 7, 2023

I think there is a mismatch between the current use of Precision::Inexact and its documentation

Specifically, Precision::Inexact appears to be treated as a "conservative estimate" in several places (which is actually what we need in IOx), so perhaps another option would be to rename Precision::Inexact to Precision::Conservative and document that it is a conservative estimate of the actual value 🤔

For example the comments on Precision::Inexact say

https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L32-L33

However, it is used to skip processing files when the (inexact) row count is above the fetch/limit:

https://github.com/apache/arrow-datafusion/blob/87aeef5aa4d8f38a8328f8e51e530e6c9cd9afa9/datafusion/core/src/datasource/statistics.rs#L69-L73

I am pretty sure this is only valid if the estimate of the number of rows is conservative (aka the real value is at least as large as the statistics), but the code uses the value even for Precision::Inexact:

https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L42-L46

Another example is ColumnStatistics::is_singleton, which I think is also only correct if the statistics are conservative (aka the actual min is no lower than the reported min and the actual max is no larger than the reported max)

https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L281-L286

@alamb commented Nov 7, 2023

Another challenge I found with using Interval as proposed in option 2 of #8078 (comment) is more mechanical but real: Interval is in the datafusion_physical_expr crate but Precision is in datafusion_common, meaning I can't use Interval in Precision without moving code around. Not impossible to do, but a data point.

I will try prototyping option 1 suggested by @berkaysynnada in #8078 (comment) and see what I find

@berkaysynnada

I think there is a mismatch between the current use of Precision::Inexact and its documentation

Specifically, Precision::Inexact appears to be treated as a "conservative estimate" in several places (which is actually what we need in IOx), so perhaps another option would be to rename Precision::Inexact to Precision::Conservative and document that it is a conservative estimate of the actual value 🤔

I think the documentation is correct but you have found 2 bugs in the code:

  1. Here, we must ensure the values are exact.
    https://github.com/apache/arrow-datafusion/blob/87aeef5aa4d8f38a8328f8e51e530e6c9cd9afa9/datafusion/core/src/datasource/statistics.rs#L69-L73

  2. Singleton check must be for exact values only.
    https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L281-L286

I will open a PR now fixing these issues.
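For reference, a minimal sketch of what fix (2) could look like, using the current ColumnStatistics field names (not necessarily the exact code in the PR):

impl ColumnStatistics {
    /// Only exact, equal, non-null min/max values prove the column holds
    /// a single value; inexact bounds prove nothing of the sort.
    pub fn is_singleton(&self) -> bool {
        match (&self.min_value, &self.max_value) {
            (Precision::Exact(min), Precision::Exact(max)) => {
                !min.is_null() && min == max
            }
            _ => false,
        }
    }
}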

@berkaysynnada

Another challenge I found with using Interval as proposed in option 2 of #8078 (comment) is more mechanical but real: Interval is in the datafusion_physical_expr crate but Precision is in datafusion_common, meaning I can't use Interval in Precision without moving code around. Not impossible to do, but a data point.

I will try prototyping option 1 suggested by @berkaysynnada in #8078 (comment) and see what I find

Just so you know, I have completed the interval refactor and submitted it for our internal review. After that, interval arithmetic will be in datafusion_expr, and there will be no significant obstacles to moving it to datafusion_common.

@ozankabak commented Nov 9, 2023

Any thoughts on @berkaysynnada's proposal, i.e.

enum PointEstimate {
    Exact(T),
    Inexact(T),
    Absent,
}
enum Precision {
    PointEstimate(PointEstimate),
    Range(PointEstimate, PointEstimate),
}

I think extending the Precision type like this, along with bugfixes like #8094, will address #8099.

BTW this exercise revealed that Precision is maybe not the best name. Maybe we should just use Estimate, which could be either a PointEstimate or a Range, which is a tuple of PointEstimates for the bounds.

@alamb commented Nov 9, 2023

Any thoughts on @berkaysynnada's proposal; i.e.

I really like it.

My plan was to spend time coding up some of these proposals to see if they fit in the existing framework (and thus potentially reveal in what cases Inexact would be used)

However, I spent much more of my time this week on apache/arrow-rs#5050 and related discussions around #7994 and #8017 than I had planned, so I am sadly behind

I hope to make progress tomorrow

@alamb commented Nov 10, 2023

It is unlikely I will have time to work on this today -- I will pick it up next week

@alamb commented Nov 14, 2023

I have been playing with and studying this code. While the suggestion from @ozankabak and @berkaysynnada in #8078 (comment) is very general and can represent many types of uncertainty in statistics, I haven't yet found cases where that full generality is important.

For example, I can't find (nor think of) an important case where the lower bound would be known with certainty but the upper bound was uncertain (as opposed to simply TYPE::MAX).

Another example would be a use case where it is important to distinguish between ranges like:

min: `PointEstimate::Absent`, max: `PointEstimate::Precise(value)`
min: `PointEstimate::Precise(TYPE::MIN)`, max: `PointEstimate::Precise(value)`

Thus I am going to prototype what adding a Bounded variant to Precision looks like.

I also plan to encapsulate more of the checks into Precision so that if we choose to go with a more general formulation we won't have to change as much of the rest of the code.

pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
    /// The exact value is known
    Exact(T),
    /// The exact value is not known, but the real value is known to be within
    /// the specified range: `lower <= value <= upper`. TODO: we could use
    /// `Interval` here instead, which could represent more complex cases (like
    /// open/closed bounds)
    Bounded { lower: T, upper: T },
    /// The value is not known exactly, but is likely close to this value.
    /// NOTHING can be assumed about the actual value in this case.
    Inexact(T),
    /// Nothing is known about the value
    #[default]
    Absent,
}
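For example, a sketch of how two such row-count statistics could be added across files once the logic is encapsulated (dropping to Absent whenever an Inexact side is involved keeps the sketch simple; a real implementation could do better):

fn add(a: &Precision<usize>, b: &Precision<usize>) -> Precision<usize> {
    use Precision::*;
    match (a, b) {
        (Exact(x), Exact(y)) => Exact(x + y),
        // An exact side shifts both ends of a bounded side.
        (Exact(x), Bounded { lower, upper }) | (Bounded { lower, upper }, Exact(x)) => {
            Bounded { lower: lower + x, upper: upper + x }
        }
        // Two bounded sides add end to end.
        (Bounded { lower: l1, upper: u1 }, Bounded { lower: l2, upper: u2 }) => {
            Bounded { lower: l1 + l2, upper: u1 + u2 }
        }
        _ => Absent,
    }
}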

I'll report back here with how it goes shortly

@ozankabak

When we decide to support slightly more complicated filters, say x <= y, we will probably need range estimations where both sides could be inexact. Therefore, I think it makes sense to keep things flexible and go with the first proposal. Obv. happy to discuss other perspectives.

Do we want to support something like this in the short term? Then we would need inexact bounds on both sides.

@alamb commented Nov 14, 2023

Do we want to support something like this in the short term? Then we would need inexact bounds on both sides.

Ok. Thank you for this point. I am still working through exactly what "inexact" means / when it would be used; I probably misunderstand something.

I plan to work on this ticket in several PRs. In the first few I plan to centralize the use of Precision a bit more (e.g. #8172) so it is clearer how it is used

@alamb commented Nov 14, 2023

After some thought, I think the key information we are trying to capture in statistics is:

  1. Known min/max values
  2. Estimated values that are imprecise (the result of applying some sort of estimate), but are better than just "somewhere between min/max"

I don't think the proposal (at least as I understand it) in #8078 (comment) can capture the known min/max values

However, perhaps something like this could:

pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
    /// The exact value is known
    Exact(T),
    /// The exact value is not known, but the real value is **known** to be within
    /// the specified range: `lower <= value <= upper` and furthermore is estimated
    /// to be `estimate`. However, the real value may fall anywhere in the range
    /// TODO: we could use `Interval` here instead, which could represent more complex cases (like
    /// open/closed bounds rather than using T::MIN/T::MAX)
    Bounded { lower: T, estimate: T, upper: T },
    /// Nothing is known about the value
    #[default]
    Absent,
}

Maybe estimate should be Option<T> rather than just T 🤔

Example 1: Filter x <= 10

Consider a filter where you know the column x has bounds [1, 100] and the row_count is N.

After applying a filter of x <= 10, assuming an even distribution you get a selectivity of 0.1, so we can conclude the following about the output row_count:

  1. It is still between 0 and N (the max value hasn't been changed by the filter)
  2. We estimate that the value is near 0.1 * N given our cardinality estimation, but this is just an estimate

As I understand @berkaysynnada's proposal in #8078 (comment), it would represent the output either as:

Precision::PointEstimate(Inexact(0.1*N))

or

Precision::Range(
  PointEstimate::Exact(0), // min
  PointEstimate::Inexact(0.1 * N), // max
)

But neither captures the fact that we know for sure that row count lies between 0 and N (the max wasn't changed by the filter). Using a Bounded variant could capture this information:

Precision::Bounded {
  lower: 0,
  estimate: 0.1 * N,
  upper: 100
}

Here is the setup in pictures for those who like that

     Output Statistics  for row_count
      value: estimated to be 0.1*N           
      min: Exactly 0           
      max: Exactly 100 (can not possibly be higher)         
                               
             ▲                 
             │                 
┌────────────┴────────────┐    
│                         │    
│       FilterExec        │    
│         x <= 10         │    
│                         │    
└─────────────────────────┘    
             ▲                 
             │                 
             │                 
                               
    Input Statistics for row count:     
     value: Exactly N      
     min: Exactly 0            
     max: Exactly 100          
                               

Example 2: x <= y

The other example raised above is a predicate like x <= y. I assume the challenge is how to represent the column ranges given this information. Let's say we know that x was in [50, 100], y was in [0, 200], and there were N input rows.

Assuming even and uncorrelated distributions, we would get a selectivity estimate of 0.75

  │                                                      │                      
  │                                                      │                      
  │──────────────────────────────────────────────────────┤       y              
  │                                                      │       min: 0         
  │                                                      │       max: 200       
                                                                                
  0             │                      │                200                     
                │                      │                                        
                ├──────────────────────┤                         x              
                │                      │                         min: 50        
                │                      │                         max: 100       
                                                                                
                0                     200                                       
                                                                                
                                                                                
               **     Only range that can be true      **                       
                **                                    **                        
                 **************************************                         
                                                                                

We could again represent the output statistics with the proper ranges as well as the cardinality estimate like this, which captures both the fact that y can't be less than 50 and the 0.75 selectivity:

x: Bounded { lower: 50, estimate: 0.75 * N, upper: 100 }, 
y: Bounded { lower: 50, estimate: 0.75 * N, upper: 200 }

@ozankabak

Thank you for investigating this. There are some parts I don't quite follow. Particularly,

I don't think the proposal (at least as I understand it) in #8078 (comment) can capture the known min/max values

I don't think this is entirely accurate. In addition to the two configurations you identified, it can also be:

Precision::Range(
  PointEstimate::Exact(0), // min
  PointEstimate::Exact(N), // max
)

which captures the known min/max values. However, this loses our inexact guess 0.1 * N. I think what you are trying to convey is that @berkaysynnada's proposal can not represent the triple of (lower bound, guess, upper bound). If this is what you mean, I agree with you.

Example 2 (x <= y)

x: Bounded { lower: 50, estimate: 0.75 * N, upper: 100 }, 
y: Bounded { lower: 50, estimate: 0.75 * N, upper: 200 }

I don't understand what you mean by this. In this snippet, lower/upper bounds refer to column bounds, while estimate refers to row count. Did you mean to write:

x: Bounded { lower: 50, estimate: 75, upper: 100 }, 
y: Bounded { lower: 50, estimate: 125, upper: 200 }
row_count: Bounded { lower: 0, estimate: 62.5, upper: 100}

where the 62.5 figure is derived from the selectivity factor, which is 5/8.
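For the record, the computation behind that factor, assuming independent uniform distributions on the stated ranges (integrating P(x <= y) over the range of x):

$$
P(x \le y) \;=\; \int_{50}^{100} \frac{200 - x}{200} \cdot \frac{dx}{50}
\;=\; \frac{1}{10000}\left[200x - \frac{x^2}{2}\right]_{50}^{100}
\;=\; \frac{6250}{10000} \;=\; \frac{5}{8}
$$

With N = 100, this gives the 62.5 row-count estimate in the snippet above.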

@alamb commented Nov 8, 2024

@ozankabak

Now that the interval library is in expr-common, is it possible we can use:

pub enum Estimate {
    Range { bounds: Interval, value: ScalarValue },
    /// Nothing is known about the value
    #[default]
    Absent,
}

This would be great to have and would support all use cases as far as I can tell.
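For illustration, here is one way the older variants could map onto this shape (a sketch: Estimate is the proposal above, while Interval::try_new and Interval::make_unbounded are the existing interval constructors):

fn from_exact(v: ScalarValue) -> Result<Estimate> {
    // An exact value is a singleton range whose point estimate is the value itself.
    Ok(Estimate::Range {
        bounds: Interval::try_new(v.clone(), v.clone())?,
        value: v,
    })
}

fn from_inexact(v: ScalarValue) -> Result<Estimate> {
    // An inexact value is a point estimate with no guaranteed bounds.
    Ok(Estimate::Range {
        bounds: Interval::make_unbounded(&v.data_type())?,
        value: v,
    })
}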

@alamb commented Dec 15, 2024

There was a bunch of discussion on #13293 that I would like to try to summarize.

What follows tries to combine the proposals from @crepererum in #13293 (comment) and @berkaysynnada in #13293 (comment) with my own hope that we make it cheaper to manage Statistics:

Changes to Statistics

At the time of writing, Statistics looks like:

pub struct Statistics {
    pub num_rows: Precision<usize>,
    pub total_byte_size: Precision<usize>,
    pub column_statistics: Vec<ColumnStatistics>,
}

I propose changing this to:

pub struct Statistics {
    pub num_rows: Precision<usize>,
    pub total_byte_size: Precision<usize>,
    pub column_statistics: Vec<Arc<ColumnStatistics>>,
}

Which would allow for quickly copying ColumnStatistics for unchanged columns.
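For example (a sketch assuming the Arc-wrapped field above): an operator that passes a column through unchanged could reuse the input statistics with a cheap pointer copy instead of a deep clone:

use std::sync::Arc;

fn passthrough_column_stats(input: &Statistics, col: usize) -> Arc<ColumnStatistics> {
    // Arc::clone only bumps a reference count; the underlying
    // ColumnStatistics (which may hold large ScalarValues) is not copied.
    Arc::clone(&input.column_statistics[col])
}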

An alternative may be to just use Arc<Statistics> everywhere (or maybe both 🤔)

Changes to ColumnStatistics

At the time of writing, ColumnStatistics looks like

pub struct ColumnStatistics {
    pub null_count: Precision<usize>,
    pub max_value: Precision<ScalarValue>,
    pub min_value: Precision<ScalarValue>,
    pub distinct_count: Precision<usize>,
}

I believe the proposal involves changing Precision so it can capture the min/max data, which would result in a ColumnStatistics structure like this:

pub struct ColumnStatistics {
    pub null_count: Precision<usize>,
    // The value (or range) that this column takes on
    pub value: Precision<ScalarValue>,
    pub distinct_count: Precision<usize>,
}

Changes to Precision

Some constraints:

  1. Precision today is generic, which allows for both Precision<usize> and Precision<ScalarValue>
  2. Interval already provides the ability to do range analysis and is based on ScalarValue

Option 1: Use ScalarValue exclusively

/// ...
enum Precision {
  /// Nothing is known about the value
  Absent,
  /// Estimated, but very close to the given point.
  ///
  /// This can be treated as an over-simplified normal distribution.
  PointEstimation(ScalarValue),
  /// Value is for sure in the given Interval. If min=max the value is known to be a single value
  Interval(Interval),
}

Pros: Can reuse Interval arithmetic, and will hopefully allow us to unify Interval and statistics calculations
Cons: Will be less efficient for statistics such as count, which we know must be usize

Option 2: Precision remains parameterized on T

/// ...
///
/// # Note
/// Unknown ranges are modeled as `Precision::Range{lower: None, upper: None}`,
enum Precision<T> {
  /// Estimated, but very close to the given point.
  ///
  /// This can be treated as an over-simplified normal distribution.
  PointEstimation(T),
  
  /// Value is for sure in the given open/half-open/closed range.
  ///
  /// If `lower`/`upper` are given, they are INCLUSIVE ends.
  Range {
    lower: Option<T>,
    upper: Option<T>,
  },
}

Pros: More natural analysis for statistics such as count, which we know must be usize
Cons: Harder to reuse Interval arithmetic, which works against unifying Interval and statistics calculations

Option 3: Change Interval to be parameterized on T

/// ...
enum Precision<T> {
  /// Nothing is known about the value
  Absent,
  /// Estimated, but "close" to the given point.
  ///
  /// This can be treated as an over-simplified normal distribution and used for 
  /// Cardinality estimation, for example
  PointEstimation(T),
  /// Value is for sure in the given Interval. If min=max the value is known to be a single value
  Interval(Interval<T>),
}

Cons: The most work/changes required
Pros: Unifies Intervals and Statistics
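A minimal sketch of what the Option 3 genericization could look like (the default type parameter is an assumption that would keep existing Interval users compiling unchanged):

pub struct Interval<T = ScalarValue> {
    lower: T,
    upper: T,
}

// Existing code keeps using `Interval` (i.e. `Interval<ScalarValue>`),
// while counts can use the cheaper `Interval<usize>`.
type RowCountInterval = Interval<usize>;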

Unify Statistics and evaluate_bounds

Today PhysicalExpr::evaluate_bounds computes the bounds of an expression's output using Intervals for the inputs:

trait PhysicalExpr {
    // ...
    fn evaluate_bounds(
        &self,
        _children: &[&Interval],
    ) -> Result<Interval, DataFusionError>;
    // ...
}

@gatesn proposes adding an API that does something similar:

His API computes the output ColumnStatistics given the input Statistics, which I believe is a more general concept than just evaluate_bounds (as it encompasses counts as well as the value):

trait PhysicalExpr {
    // ...
    fn column_statistics(&self, _statistics: &Statistics) -> Result<ColumnStatistics> { /* ... */ }
    // ...
}

If we allowed Precision to represent Intervals (Option 3 above), I think we could deprecate evaluate_bounds and simply use column_statistics, as it would be a strict superset.

Incremental steps to get us there:

The changes contemplated above are substantial and potentially quite disruptive to downstream APIs.

I propose a phased approach:

  1. Step 1 is to make the fields of Statistics / ColumnStatistics / Precision not pub and only allow access through accessors (see the sketch after this list). This will permit us to 1) change internal structures with more limited impact and 2) formalize the API by which statistics are manipulated.
  2. Change Interval to be Interval<T> which will allow Interval<usize> types
  3. Change Precision to follow Option 3 above
  4. Change evaluate_bounds to column_statistics
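A minimal sketch of step 1 (the accessor names are assumptions; the field names are from the current Statistics):

pub struct Statistics {
    num_rows: Precision<usize>, // note: no longer pub
    total_byte_size: Precision<usize>,
    column_statistics: Vec<ColumnStatistics>,
}

impl Statistics {
    /// Reads go through an accessor...
    pub fn num_rows(&self) -> &Precision<usize> {
        &self.num_rows
    }
    /// ...and writes through a builder-style setter, so the internal
    /// representation can change without breaking callers.
    pub fn with_num_rows(mut self, num_rows: Precision<usize>) -> Self {
        self.num_rows = num_rows;
        self
    }
}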

@ozankabak

I like the incremental approach and the ultimate desiderata on functionality, but I have some concerns on the "how" part. I will share my thoughts below; hopefully we will get to a neat design that we will not need to change/mess with again soon.

Precision

There are two related, but independent concepts we need to track here:

  1. Summary statistics; i.e. estimations about location and dispersion (spread).
  2. Definitive mathematical facts, such as the range or a guaranteed superset thereof. In plain English, "hard facts" that can never be violated.

(1) and (2) have different use cases. We typically use (1) for optimization (join algorithm and/or side selection is a good example), and we end up with performance gains when estimations are accurate. If not, we may end up with bad performance, but we shouldn't get incorrect results. However, we use (2) for more critical decisions, like pruning data structures (e.g. trimming hash tables because we know a match has become impossible for a section of the table). A problem in (2) causes incorrect results to be generated.

(1) and (2) are not mutually exclusive. Sometimes we have info on both, sometimes only one but not the other, sometimes we don't know anything about either. The type definition as stated currently doesn't accommodate this, but this is an easy fix.
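For example, one shape that tracks the two kinds of information independently (a sketch only; the name and layout are assumptions, not an agreed design):

struct ColumnEstimate {
    /// (1) Summary statistics: a best guess that may be wrong without
    /// affecting correctness (useful for e.g. join-side selection).
    point_estimate: Option<ScalarValue>,
    /// (2) Hard facts: the value is guaranteed to lie in this interval,
    /// so it is safe to use for pruning.
    bounds: Option<Interval>,
}

Either field (or both, or neither) can be present, matching the four cases described above.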

Column statistics vs Expression evaluation

Evaluating an expression for bounds is a "low-level" concept: a slight generalization of normal expression evaluation where one uses interval arithmetic (or another range computation technique) instead of ordinary arithmetic. There are many use cases where this is what is required, and other statistical concepts such as distinct count and/or null count are not even applicable. We actually have use cases like this at Synnada. One can discuss what the range of an expression is without any dataset at hand; just the ranges of each symbol are enough.

On the other hand, computing column statistics is a "higher-level" task where computing the range is just one component of what is sought after. All in all, I don't currently think merging these things is a good idea. I see that a mechanism for "stats in -> stats out" kind of computations seems to be necessary, but IMO any such mechanism should use evaluate_bounds as a subroutine.

@ozankabak

BTW, we would be happy to devote some person-hours to this project and get it over the finish line. We happen to have thought about (and worked on) this subject quite intensively and would be happy to contribute some of those understandings/code back to the upstream DF community.

@alamb commented Dec 16, 2024

(1) and (2) are not mutually exclusive. Sometimes we have info on both, sometimes only one but not the other, sometimes we don't know anything about either. The type definition as stated currently doesn't accommodate this, but this is an easy fix.

I think this is a great insight. Maybe the problem is that we are trying to overload Statistics to keep both types of information (statistics and bounds) 🤔

The aggregate_statistics pass in the optimizer uses Statistics for the bounds use case (aka it makes transformations that rely on values in Statistics for correctness):

What is the easy fix you envision? One thing I can think of would be to add a structure similar to Statistics but explicitly for tracking the known bounds of a column:

/// Known bounds for a set of columns
/// 
/// This information is guaranteed to be correct and is used for
/// optimization and correctness transformations. See [`Statistics`] for
/// estimated values.
struct Bounds {
  /// Interval, if known, for each output column
  inner: Vec<Option<Interval>>,
}

However, aggregate_statistics does use count statistics (not just value ranges), so we would have to work that in somehow.

Column statistics vs Expression evaluation

This also makes sense (that statistics propagation is a higher level concept that naturally builds on evaluate_bounds)

One way we could support the use case of code that calls the existing evaluate_bounds function could be to make it easy to build a Statistics object from the known ranges of an expression's inputs.

For example, given existing code like

let input_intervals: Vec<&Interval> = ....;
let output_interval = expr.evaluate_bounds(&input_intervals)?;
...

it might look something like:

let input_intervals: Vec<&Interval> = ....;
// wrap input intervals with Statistics
let temp_statistics = Statistics::new_from_bounds(&input_intervals);
// compute output column statistics
let output_column_statistics = expr.column_statistics(&temp_statistics)?;
// use the output value if it was known
let output_interval = match output_column_statistics.value() {
    Precision::Absent | Precision::PointEstimation(_) => None,
    Precision::Interval(interval) => Some(interval),
};

@suremarc

I think this is a great insight. Maybe the problem is that we are trying to overload Statistics to keep both types of information (statistics and bounds) 🤔

Having attempted to implement a new Precision API looking something like this:

struct Precision<T> {
    lower: Option<T>,
    upper: Option<T>,
    point_estimate: Option<T>,
}

I did indeed notice that just about all of the code dealing with cardinality estimates (total_byte_size, distinct_count, null_count, num_rows, etc.) only cared about point_estimate. Meanwhile, for column min/max statistics, we only really care about the upper & lower bounds.

Notwithstanding, I had already taken a stab at implementing this layout for ColumnStatistics (which is what @alamb proposed):

pub struct ColumnStatistics {
    pub null_count: Precision<usize>,
    // The value (or range) that this column takes on
    pub value: Precision<ScalarValue>,
    pub distinct_count: Precision<usize>,
}

As far as I can gather from looking at the codebase, current use cases for column statistics center around cardinality estimates (null_count and distinct_count) and lower/upper bounds (which is what value does). However, this API is technically a regression, as we lose the ability to express when the lower/upper bounds are "exact". My use case for #13296 is to have conservative min/max bounds, so I would be happy to make this change; however, having exact min/maxes would make certain optimizations possible, such as evaluating MIN(value) or MAX(value) only by looking at column statistics, without reading any data.

The other option is to keep ColumnStatistics the same (keep both min_value and max_value), and include a Precision for both min_value and max_value. But this will increase the size of Statistics significantly.

@ozankabak

let input_intervals: Vec<&Interval> = ....;
// wrap input intervals with Statistics
let temp_statistics = Statistics::new_from_bounds(&input_intervals);
// compute output column statistics
let output_column_statistics = expr.column_statistics(&temp_statistics)?;
// use the output value if it was known
let output_interval = match output_column_statistics.value() {
    Precision::Absent | Precision::PointEstimation(_) => None,
    Precision::Interval(interval) => Some(interval),
};

This usage is weird in contexts where the concept of statistics isn't even applicable. I agree that it will work, but only because we are forcing it. I think the right pattern is to have column_statistics simply use evaluate_bounds as a subroutine for computing hard bounds (and any other information of type (2) in my comment). For example, it would be quite natural for us to have things like evaluate_probability (not a great name!) that also takes expressions and does some sort of probabilistic computation (maybe PDF related). Then, column_statistics would also use evaluate_probability as a subroutine.

So I see something like column_statistics as a more general API that defers to lower level APIs to collect information.

@alamb commented Dec 24, 2024

FWIW there is a current move to add statistics into the arrow format itself:

I actually think we could standardize on converting to/from that format.

I don't think the arrow proposal handles the subtlety about intervals, known facts, etc., but we should at least be aware of them (thanks @edmondop for pointing this out to me)

@ozankabak

I don't think the arrow proposal handles the subtlety about intervals, known facts, etc., but we should at least be aware of them (thanks @edmondop for pointing this out to me)

Definitely. We will probably leverage that information to construct our own Statistics.

BTW we will start working on this tomorrow.
