docs: refine `AggregateUDFImpl::is_ordered_set_aggregate` documentation #17805

Jefffrey · 2025-09-27T02:19:55Z

Going through some tickets related to ordered set aggregates and got a little confused on DataFusion's support for them.

As I understand it, #13511 made WITHIN GROUP mandatory for ordered set aggregate functions, of which we support only two so far:

- approx_percentile_cont
  - Technically approx_median shares some internals with approx_percentile_cont but itself isn't an ordered set aggregation
- approx_percentile_cont_with_weight (which uses approx_percentile_cont internally)

This was then amended in #16999 to make it optional, at least via the SQL API; it is still mandatory on the DataFrame API:

datafusion/datafusion/functions-aggregate/src/approx_percentile_cont.rs

Lines 53 to 58 in bbb5cc7

    /// Computes the approximate percentile continuous of a set of numbers
    pub fn approx_percentile_cont(
        order_by: Sort,
        percentile: Expr,
        centroids: Option<Expr>,
    ) -> Expr {

I'm updating the doc here to try to clarify things according to my understanding, as a follow-up to the original doc update: #17744

Jefffrey

A question I have is if we should loosen the DataFrame API to allow omitting the sort, as #16999 did for the SQL API?

cc @alamb

Jefffrey · 2025-09-27T02:20:39Z

datafusion/expr/src/udaf.rs

+    /// calculation performed by these functions is dependent on the specific
+    /// sequence of the input rows, unlike other aggregate functions like `SUM`,
+    /// `AVG`, or `COUNT`. If no explicit order is specified then a default order
+    /// of ascending is assumed.


Technically we don't enforce the default order; our only ordered set aggregate functions happen to use ascending as the default internally, so I'm not sure whether we should instead say it's implementation-dependent or try to enforce it somehow?

Jefffrey · 2025-09-27T02:21:00Z

datafusion/expr/src/udaf.rs

+    /// Note that setting this to `true` does not guarantee input sort order to
+    /// the aggregate function; it instead gives the function full control over
+    /// the sorting process and they are expected to handle order of input values
+    /// themselves.


I hope I'm correct in this; some reading I used for reference: https://paquier.xyz/postgresql-2/postgres-9-4-feature-highlight-within-group/
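To make the wording in the doc comment above more concrete, here is a minimal sketch of what "the function handles the ordering itself" could look like. Everything in it is an assumption for illustration: `PercentileContAccumulator` and its methods are hypothetical names, it does not implement DataFusion's `Accumulator` trait, and it computes an exact percentile by linear interpolation rather than the approximate algorithm behind `approx_percentile_cont`.

```rust
/// Hypothetical stand-in for an ordered-set aggregate's accumulator.
/// This is NOT DataFusion's `Accumulator` trait, only an illustration.
struct PercentileContAccumulator {
    values: Vec<f64>,
    percentile: f64,
    /// Direction taken from `WITHIN GROUP (ORDER BY ... DESC)`;
    /// `false` models the ascending default discussed above.
    descending: bool,
}

impl PercentileContAccumulator {
    fn new(percentile: f64, descending: bool) -> Self {
        assert!(
            (0.0..=1.0).contains(&percentile),
            "percentile must be in [0, 1]"
        );
        Self { values: Vec::new(), percentile, descending }
    }

    /// Rows are buffered exactly as they arrive; no input order is assumed.
    fn update_batch(&mut self, batch: &[f64]) {
        self.values.extend_from_slice(batch);
    }

    /// The accumulator sorts the buffered values itself, then interpolates
    /// linearly between the two neighbouring values.
    fn evaluate(&self) -> Option<f64> {
        if self.values.is_empty() {
            return None;
        }
        let mut sorted = self.values.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));
        if self.descending {
            sorted.reverse();
        }
        let pos = self.percentile * (sorted.len() - 1) as f64;
        let lower = pos.floor() as usize;
        let upper = pos.ceil() as usize;
        let frac = pos - lower as f64;
        Some(sorted[lower] + frac * (sorted[upper] - sorted[lower]))
    }
}
```

In this sketch the result does not depend on the order in which batches arrive, only on the direction the function was told to use: feeding `[40.0, 10.0]` and then `[30.0, 20.0]` with percentile `0.5` gives the same answer as feeding the values already sorted.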

alamb · 2025-09-29T17:11:47Z

> This was then amended in #16999 to make it optional, at least via the SQL API; it is still mandatory on the DataFrame API:

In my mind this is mostly for backwards compatibility reasons -- #13511 basically broke a bunch of our existing user queries, so I wanted to revert the unnecessarily strict interpretation

> As I understand it, #13511 made WITHIN GROUP mandatory for ordered set aggregate functions, of which we support only two so far:

Indeed -- and both of these functions have the property that their argument will often be the same as the ORDER BY in the WITHIN GROUP clause -- for example, computing approx_median(x) implicitly means approx_median(x ORDER BY x WITHIN GROUP)

Though allowing different arguments means you can write expressions like approx_median(first_name ORDER BY salary WITHIN GROUP) and save yourself a subquery

> A question I have is if we should loosen the DataFrame API to allow omitting the sort, as #16999 did for the SQL API?
>
> cc @alamb

I suggest we hold off unless someone explicitly asks about it, though I am not opposed to it either

alamb

Thank you @Jefffrey -- this seems like an improvement to me

alamb · 2025-09-29T17:15:20Z

datafusion/expr/src/udaf.rs

-    /// An example of an ordered-set aggregate function is `percentile_cont`
-    /// which computes a specific percentile value from a sorted list of values, and
-    /// is only meaningful when the input data is ordered.
+    /// Note that setting this to `true` does not guarantee input sort order to


this is a good clarification

If DataFusion ever supports more ordered set aggregation functions, we may want to revisit this

In addition to saying what this setting doesn't do, maybe we could also say what setting it to true does do? Specifically, it seems like it only affects the output display somehow 🤔

Jefffrey · 2025-09-30

> In addition to saying what this setting doesn't do, maybe we could also say what setting it to true does do?

This is a good point 🤔

Let me look into the code a bit more to clarify my understanding and I'll update the doc accordingly.

Jefffrey · 2025-09-30T08:26:09Z

> In my mind this is mostly for backwards compatibility reasons -- #13511 basically broke a bunch of our existing user queries, so I wanted to revert the unnecessarily strict interpretation

> I suggest we hold off unless someone explicitly asks about it, though I am not opposed to it either

I might raise a separate issue to track keeping the SQL API and the DataFrame API in parity with regard to this. It matters especially once we consider adding more ordered set aggregate functions, since we would need to decide whether to enforce WITHIN GROUP for those but not for the existing ones.

> Though allowing different arguments means you can write expressions like approx_median(first_name ORDER BY salary WITHIN GROUP) and save yourself a subquery

I'm a bit confused by this example; is this just a hypothetical, or something that is feasible with ordered set aggregate functions? I thought they would expect one column/expression, which is the same as the ORDER BY in the WITHIN GROUP 🤔

alamb · 2025-09-30T18:30:03Z

> I'm a bit confused by this example; is this just a hypothetical, or something that is feasible with ordered set aggregate functions? I thought they would expect one column/expression, which is the same as the ORDER BY in the WITHIN GROUP

I am clearly a little confused myself. Now I am not sure if there is some example where the arguments differ 🤔

Jefffrey · 2025-10-01T04:37:18Z

> > I'm a bit confused by this example; is this just a hypothetical, or something that is feasible with ordered set aggregate functions? I thought they would expect one column/expression, which is the same as the ORDER BY in the WITHIN GROUP
>
> I am clearly a little confused myself. Now I am not sure if there is some example where the arguments differ 🤔

I'll move this PR back to draft and do some more research to update the docs, since it seems we're both still confused about this 😅

docs: refine AggregateUDFImpl::is_ordered_set_aggregate documentation

440ab7b

github-actions bot added the logical-expr Logical plan and expressions label Sep 27, 2025

I am very good at the English language

cb8facd

Jefffrey marked this pull request as ready for review September 27, 2025 02:36

alamb approved these changes Sep 29, 2025

Jefffrey marked this pull request as draft October 1, 2025 04:37
