HIVE-28302: Let SUM UDF return NULL when all rows have non-numeric texts #5283

okumin · 2024-06-06T12:30:34Z

What changes were proposed in this pull request?

Make SUM return NULL instead of 0.0 when both of the following conditions are satisfied:

The input type is a STRING-like type.
All values are non-numeric, so none of them can be cast to DOUBLE.
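
The intended behavior can be modeled with a small sketch. This is a plain Python illustration, not Hive's Java implementation: the helper names `to_double` and `sum_strings` are hypothetical, Python's `float` parsing stands in for the cast to DOUBLE and need not match Hive's rules, and it assumes values that fail the cast are skipped like NULLs.

```python
from typing import Iterable, Optional


def to_double(text: Optional[str]) -> Optional[float]:
    """Hypothetical cast: return a float, or None when parsing fails."""
    if text is None:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def sum_strings(values: Iterable[Optional[str]]) -> Optional[float]:
    """Model of the proposed SUM: None (NULL) unless at least one value casts."""
    total = 0.0
    seen_numeric = False
    for text in values:
        number = to_double(text)
        if number is None:
            continue  # NULL or non-numeric input contributes nothing
        total += number
        seen_numeric = True
    # Before this change the all-non-numeric case produced 0.0 instead of NULL.
    return total if seen_numeric else None
```

Under these assumptions, `sum_strings(["abc", "x"])` yields `None`, while `sum_strings(["1.5", "abc", "2"])` still sums the values that do cast.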

https://issues.apache.org/jira/browse/HIVE-28302

Why are the changes needed?

As I understand it, the SQL standard requires a UDAF to return NULL when all rows are NULL. Other UDAFs such as AVG already behave that way.

There is more discussion in #5091.

Does this PR introduce any user-facing change?

Yes, but I believe the current behavior is a bug rather than an intentional design.

Is the change a dependency upgrade?

No.

How was this patch tested?

Updated integration tests.

okumin · 2024-06-07T10:30:58Z

ql/src/test/results/clientpositive/llap/udaf_number_format.q.out

@@ -92,4 +92,4 @@ FROM src
 POSTHOOK: type: QUERY
 POSTHOOK: Input: default@src
 #### A masked pattern was here ####
-0.0	NULL	NULL	NULL
+NULL	NULL	NULL	NULL


We could remove this test case, since cbo_aggregate_reduce_functions_rule.q covers it.
https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/udaf_number_format.q

kasakrisz · 2024-07-05T07:22:52Z

In cbo_aggregate_reduce_functions_rule.q, the aggregate functions under test (sum, count, stddev*, etc.) are wrapped in a round function call. I understand the reason, but I think it can add noise: if an expression like

ROUND(SUM(c_numeric), 3)

returns an invalid result, there are several possible causes: sum is wrong, round is wrong, or both.
In theory, if both are wrong, the expression can still return a correct result in some cases as a side effect, so a potential bug in both function implementations stays hidden.
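
The masking effect can be illustrated with made-up numbers. The sketch below is Python rather than HiveQL, the values `correct_sum` and `buggy_sum` are invented, and `truncate` is a hypothetical buggy stand-in for ROUND; none of this reflects an actual Hive defect.

```python
import math


def truncate(value: float, digits: int) -> float:
    """Hypothetical buggy ROUND that truncates instead of rounding."""
    scale = 10 ** digits
    return math.floor(value * scale) / scale


correct_sum = 1.2346  # assumed true aggregate value
buggy_sum = 1.2351    # hypothetical wrong aggregate value

expected = round(correct_sum, 3)

# A wrong sum passes the check even with a correct round.
wrong_sum_hidden = round(buggy_sum, 3) == expected

# A wrong sum combined with a wrong round also passes the check.
both_wrong_hidden = truncate(buggy_sum, 3) == expected

# Comparing the raw aggregates exposes the difference immediately.
raw_values_differ = buggy_sum != correct_sum
```

With these numbers all three flags are true: the rounded expression matches the expected output in both faulty scenarios, and only the unrounded comparison reveals the wrong sum.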

WDYT?

sonarcloud · 2024-07-06T16:50:14Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

okumin · 2024-07-07T13:02:51Z

@kasakrisz I don't have any objections. The original intent was human readability and mitigation of floating-point errors. I tried removing ROUND.

To be fair, we now observe differences between the CBO and non-CBO results.

< +228.0  228.0   0.0     0.0     193.0   193.0   28.5    28.5    NULL    NULL    32.166666666666664      32.166666666666664      47.34448225506326       47.34448225506326       NULL    NULL    53.52387836802893       53.52387836802893       50.61338050075578  50.61338050075578       NULL    NULL    58.63247109466449       58.63247109466449       2241.5  2241.5  NULL    NULL    2864.805555555555       2864.805555555555       2561.714285714286       2561.714285714286       NULL    NULL    3437.7666666666664 3437.7666666666664      228.0   228.0   NULL    NULL    193.0   193.0   8       8       8       0       8       6
---
> +228.0  228.0   0.0     0.0     193.0   193.0   28.5    28.5    NULL    NULL    32.166666666666664      32.166666666666664      47.34448225506326       47.34448225506326       NULL    NULL    53.52387836802894       53.52387836802894       50.61338050075578  50.61338050075578       NULL    NULL    58.632471094664496      58.632471094664496      2241.5  2241.5  NULL    NULL    2864.805555555556       2864.805555555556       2561.714285714286       2561.714285714286       NULL    NULL    3437.7666666666673 3437.7666666666673      228.0   228.0   NULL    NULL    193.0   193.0   8       8       8       0       8       6

For better visibility,

< 53.52387836802893
< 53.52387836802893
---
> 53.52387836802894
> 53.52387836802894
71,72c71,72
< 58.63247109466449
< 58.63247109466449
---
> 58.632471094664496
> 58.632471094664496
77,78c77,78
< 2864.805555555555
< 2864.805555555555
---
> 2864.805555555556
> 2864.805555555556
83,84c83,84
< 3437.7666666666664
< 3437.7666666666664
---
> 3437.7666666666673
> 3437.7666666666673

I presume these differences are acceptable as floating-point errors.
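
As a rough sanity check, the differing pairs from the diff above can be compared with a relative tolerance. This is a Python sketch; the tolerance `REL_TOL` is an assumed value, not a threshold used by Hive's test framework, and the reordering example only shows one plausible cause of such last-digit differences, which the thread does not confirm.

```python
import math

# (left, right) values copied from the two sides of the diff above
pairs = [
    (53.52387836802893, 53.52387836802894),
    (58.63247109466449, 58.632471094664496),
    (2864.805555555555, 2864.805555555556),
    (3437.7666666666664, 3437.7666666666673),
]

REL_TOL = 1e-12  # assumed tolerance, chosen for illustration only


def within_tolerance(values, rel_tol=REL_TOL):
    """Return True when every pair agrees up to the relative tolerance."""
    return all(math.isclose(a, b, rel_tol=rel_tol) for a, b in values)


# Floating-point addition is not associative, so evaluating the same
# expression in a different order can change the last digits.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(within_tolerance(pairs))
print(left_to_right == right_to_left)
```

Every pair agrees to well within the assumed tolerance, which is consistent with reading the differences as rounding noise rather than a change in the computed result.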

asf-ci-hive added tests pending tests unstable and removed tests pending labels Jun 6, 2024

okumin changed the title ~~[WIP] HIVE-28302: Let SUM UDF return NULL when all rows have non-numeric texts~~ [WIP] Jun 7, 2024

okumin force-pushed the HIVE-28302-sum branch from 8be7c91 to c73280d Compare June 7, 2024 01:16

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Jun 7, 2024

HIVE-28302: Let SUM UDF return NULL when all rows have non-numeric texts

5337685

okumin force-pushed the HIVE-28302-sum branch from c73280d to 5337685 Compare June 7, 2024 06:03

asf-ci-hive added tests pending tests passed and removed tests unstable tests pending labels Jun 7, 2024

okumin changed the title ~~[WIP]~~ HIVE-28302: Let SUM UDF return NULL when all rows have non-numeric texts Jun 7, 2024

okumin marked this pull request as ready for review June 7, 2024 10:31

Unround expressions in cbo_aggregate_reduce_functions_rule.q

8b8136e

asf-ci-hive added tests pending and removed tests passed labels Jul 6, 2024

asf-ci-hive added tests passed and removed tests pending labels Jul 6, 2024
