Incorrect approx_percentile / approx_median results for small datasets due to unnecessary t-digest interpolation

### Describe the bug

DataFusion’s `approx_percentile_cont` and `approx_median` use a t-digest internally. However, when the number of input rows is small enough that no centroid compression occurs, each centroid represents exactly one data point (weight = 1).

In this scenario, the t-digest interpolation step is still applied, even though it assumes centroids represent clusters of multiple points. This leads to incorrect results that:

- Do not match exact continuous percentiles (percentile_cont)
- Do not match expected discrete nearest-rank semantics (as in Databricks)
- Drift away from actual observed values

This behavior is particularly surprising for small datasets and unit tests, where users expect percentile outputs to correspond to real data points.

### To Reproduce

### Steps to reproduce the behavior:

#### Create a small dataset (e.g., TPC-DS call_center table with ~14 rows)

Run:

```sql
select approx_percentile(cc_sq_ft, 0.85) from call_center;
```

Observe the output

Another example:

```sql
select approx_median(value)
from (values (-85), (-72), (-56), (-48), (-43), (-25), (-12), (-5), (45), (83)) t(value);
```

### Expected behavior

For small datasets where no t-digest compression occurs:

Results should follow nearest-rank semantics (e.g., ceil(q * n) - 1)
Returned values should be actual observed data points
Behavior should align with systems like Databricks (percentile_approx)

Examples:

- `approx_percentile(cc_sq_ft, 0.85)` → 84336
- `approx_median(...)` → should return a valid input value (not interpolated like -32)

### Additional context

Currently, DataFusion applies t-digest interpolation even when:

```
number_of_centroids == number_of_input_rows
```

This indicates no compression has occurred, making interpolation both unnecessary and incorrect.

The issue stems from using the interpolation path in estimate_quantile regardless of centroid structure. In the no-compression regime, the algorithm should instead switch to an exact nearest-rank computation.

This inconsistency leads to unintuitive and incorrect results, especially in:

- Small datasets
- Window functions with small frame sizes
- Unit tests expecting deterministic outputs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect approx_percentile / approx_median results for small datasets due to unnecessary t-digest interpolation #21528

Describe the bug

To Reproduce

Steps to reproduce the behavior:

Create a small dataset (e.g., TPC-DS call_center table with ~14 rows)

Expected behavior

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Incorrect approx_percentile / approx_median results for small datasets due to unnecessary t-digest interpolation #21528

Description

Describe the bug

To Reproduce

Steps to reproduce the behavior:

Create a small dataset (e.g., TPC-DS call_center table with ~14 rows)

Expected behavior

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions