Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28302: Let SUM UDF return NULL when all rows have non-numeric texts #5283

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

okumin
Copy link
Contributor

@okumin okumin commented Jun 6, 2024

What changes were proposed in this pull request?

Make SUM return not 0.0 but NULL when both of the following conditions are satisfied.

  • The input type is a STRING like type
  • All values are non-numeric and it is impossible to successfully cast to DOUBLE

https://issues.apache.org/jira/browse/HIVE-28302

Why are the changes needed?

I understand the SQL standard requires a UDAF to return NULL when all rows are NULL. Actually, other UDAFs such as AVG behave like that.

We can see some more discussions in #5091

Does this PR introduce any user-facing change?

Yes. But I believe the current behavior is not an intentional one but a kind of bug.

Is the change a dependency upgrade?

No.

How was this patch tested?

Updated integration tests.

@okumin okumin changed the title [WIP] HIVE-28302: Let SUM UDF return NULL when all rows have non-numeric texts [WIP] Jun 7, 2024
@okumin okumin changed the title [WIP] HIVE-28302: Let SUM UDF return NULL when all rows have non-numeric texts Jun 7, 2024
@@ -92,4 +92,4 @@ FROM src
POSTHOOK: type: QUERY
POSTHOOK: Input: default@src
#### A masked pattern was here ####
0.0 NULL NULL NULL
NULL NULL NULL NULL
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may remove this test case as cbo_aggregate_reduce_functions_rule.q covers the case.
https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/udaf_number_format.q

@okumin okumin marked this pull request as ready for review June 7, 2024 10:31
@kasakrisz
Copy link
Contributor

In the test cbo_aggregate_reduce_functions_rule.q aggregate functions under test (sum, count, stddev*, etc) are wrapped into a round function call. I understand the reason however I think it can cause some noise: if an expression like

ROUND(SUM(c_numeric), 3)

returns invalid result it can have multiple reasons: sum is wrong or round is wrong or both.
In theory if both are wrong as a side effect the expression itself can return a good result in some cases hence a potential bug in both function implementations remains hidden.

WDYT?

Copy link

sonarcloud bot commented Jul 6, 2024

@okumin
Copy link
Contributor Author

okumin commented Jul 7, 2024

@kasakrisz I don't have any objections. The original intention is human-readability and floating-point error mitigation. I tried to remove ROUND.

To be fair, we observe difference between CBO and non-CBO.

< +228.0  228.0   0.0     0.0     193.0   193.0   28.5    28.5    NULL    NULL    32.166666666666664      32.166666666666664      47.34448225506326       47.34448225506326       NULL    NULL    53.52387836802893       53.52387836802893       50.61338050075578  50.61338050075578       NULL    NULL    58.63247109466449       58.63247109466449       2241.5  2241.5  NULL    NULL    2864.805555555555       2864.805555555555       2561.714285714286       2561.714285714286       NULL    NULL    3437.7666666666664 3437.7666666666664      228.0   228.0   NULL    NULL    193.0   193.0   8       8       8       0       8       6
---
> +228.0  228.0   0.0     0.0     193.0   193.0   28.5    28.5    NULL    NULL    32.166666666666664      32.166666666666664      47.34448225506326       47.34448225506326       NULL    NULL    53.52387836802894       53.52387836802894       50.61338050075578  50.61338050075578       NULL    NULL    58.632471094664496      58.632471094664496      2241.5  2241.5  NULL    NULL    2864.805555555556       2864.805555555556       2561.714285714286       2561.714285714286       NULL    NULL    3437.7666666666673 3437.7666666666673      228.0   228.0   NULL    NULL    193.0   193.0   8       8       8       0       8       6

For better visibility,

< 53.52387836802893
< 53.52387836802893
---
> 53.52387836802894
> 53.52387836802894
71,72c71,72
< 58.63247109466449
< 58.63247109466449
---
> 58.632471094664496
> 58.632471094664496
77,78c77,78
< 2864.805555555555
< 2864.805555555555
---
> 2864.805555555556
> 2864.805555555556
83,84c83,84
< 3437.7666666666664
< 3437.7666666666664
---
> 3437.7666666666673
> 3437.7666666666673

I presume they are acceptable as floating-point errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants