Describe the bug
DataFusion’s approx_percentile_cont and approx_median use a t-digest internally. However, when the number of input rows is small enough that no centroid compression occurs, each centroid represents exactly one data point (weight = 1).
In this scenario, the t-digest interpolation step is still applied, even though it assumes centroids represent clusters of multiple points. This leads to incorrect results that:
- Do not match exact continuous percentiles (percentile_cont)
- Do not match expected discrete nearest-rank semantics (as in Databricks)
- Drift away from actual observed values
This behavior is particularly surprising for small datasets and unit tests, where users expect percentile outputs to correspond to real data points.
To Reproduce
Steps to reproduce the behavior:
Create a small dataset (e.g., TPC-DS call_center table with ~14 rows)
Run:
select approx_percentile(cc_sq_ft, 0.85) from call_center;
Observe the output
Another example:
select approx_median(value)
from (values (-85), (-72), (-56), (-48), (-43), (-25), (-12), (-5), (45), (83)) t(value);
Expected behavior
For small datasets where no t-digest compression occurs:
Results should follow nearest-rank semantics (e.g., ceil(q * n) - 1)
Returned values should be actual observed data points
Behavior should align with systems like Databricks (percentile_approx)
Examples:
approx_percentile(cc_sq_ft, 0.85) → 84336
approx_median(...) → should return a valid input value (not interpolated like -32)
Additional context
Currently, DataFusion applies t-digest interpolation even when:
number_of_centroids == number_of_input_rows
This indicates no compression has occurred, making interpolation both unnecessary and incorrect.
The issue stems from using the interpolation path in estimate_quantile regardless of centroid structure. In the no-compression regime, the algorithm should instead switch to an exact nearest-rank computation.
This inconsistency leads to unintuitive and incorrect results, especially in:
- Small datasets
- Window functions with small frame sizes
- Unit tests expecting deterministic outputs
Describe the bug
DataFusion’s
approx_percentile_contandapprox_medianuse a t-digest internally. However, when the number of input rows is small enough that no centroid compression occurs, each centroid represents exactly one data point (weight = 1).In this scenario, the t-digest interpolation step is still applied, even though it assumes centroids represent clusters of multiple points. This leads to incorrect results that:
This behavior is particularly surprising for small datasets and unit tests, where users expect percentile outputs to correspond to real data points.
To Reproduce
Steps to reproduce the behavior:
Create a small dataset (e.g., TPC-DS call_center table with ~14 rows)
Run:
Observe the output
Another example:
Expected behavior
For small datasets where no t-digest compression occurs:
Results should follow nearest-rank semantics (e.g., ceil(q * n) - 1)
Returned values should be actual observed data points
Behavior should align with systems like Databricks (percentile_approx)
Examples:
approx_percentile(cc_sq_ft, 0.85)→ 84336approx_median(...)→ should return a valid input value (not interpolated like -32)Additional context
Currently, DataFusion applies t-digest interpolation even when:
This indicates no compression has occurred, making interpolation both unnecessary and incorrect.
The issue stems from using the interpolation path in estimate_quantile regardless of centroid structure. In the no-compression regime, the algorithm should instead switch to an exact nearest-rank computation.
This inconsistency leads to unintuitive and incorrect results, especially in: