Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CH]Optimize nested function calls #8557

Open
KevinyhZou opened this issue Jan 17, 2025 · 11 comments · May be fixed by #8558 or #8733
Open

[CH]Optimize nested function calls #8557

KevinyhZou opened this issue Jan 17, 2025 · 11 comments · May be fixed by #8558 or #8733
Labels
enhancement New feature or request

Comments

@KevinyhZou
Copy link
Contributor

KevinyhZou commented Jan 17, 2025

Description

In our online sqls, there are many nested functions calls, such as

  • select a.b.c from test_tbl => select get_struct_field((get_struct_field(a, b),c) from test_tbl
  • select count(1) from test_tbl where a = 1 and b = 2 and c = 3 => select count(1) from test_tbl where and(and(a = 1, b = 2), c = 3 )
  • select count(1) from test_tbl where a = 1 or b = 2 or c = 3 => select count(1) from test_tbl where or(or(a = 1, b = 2), c = 3 )
    .....

and these functions can be optimized by make the nested arguments collapsed , which means
get_struct_field((get_struct_field(a, b),c) => get_struct_field(a, b, c)
and(and(a = 1, b = 2), c = 3 ) => and(a =1, b=2, c=3)
or(or(a=1, b= 2), c=3) => or(a=1, b=2, c=3)
get_json_object(get_json_object(a, '$.b'), '$.c') => get_json_object(a, '$.b', '$.c')
.....

which can reduce the function calls and improve performance.

@lgbo-ustc
Copy link
Contributor

get_json_object(get_json_object(a, '$.b'), '$.c') => get_json_object(a, '$.b', '$.c')

印象之前是用spark rule改写成get_json_object(a, '$.b.c'), 这里为何改写成这样,这是spark function支持的形式?

@KevinyhZou
Copy link
Contributor Author

改写成get_json_object(a, '$.b.c') , 有些case 兼容不了。 get_json_object(a, '$.b', '$.c') 不是spark get_json_object 兼容的形式,这里是通过ScalaUDF 的方式将get_json_object 转为 ScalaUDF,来兼容get_json_object(a, '$.b', '$.c') 这种形式。 @lgbo-ustc

@lgbo-ustc
Copy link
Contributor

什么case

@KevinyhZou
Copy link
Contributor Author

  1. get_json_object(get_json_object({"a":"{\"x\":5}"}', '$.a'), '$.x'), the json string has backslashes to escape quotes ;
  2. get_json_object(get_json_object('{"a.b": 0}', '$.a), '$.b'), the json key contains dot character(.) and it's same as the collapsed json path;

you can also look at the comments at CollapseGetJsonObjectExpressionRule

@lgbo-ustc
Copy link
Contributor

Let's make things clear.
We cannot know the type of one field of a json. It could be a object or string. When the field's type is string, the corrent results for following cases are

query result
get_json_object(get_json_object(s, '$.a'), '$.b') something
get_json_object(s, '$.a.b') null

So we cannot use get_json_object(s, '$.a.b') to replace get_json_object(get_json_object(s, '$.a'), '$.b'). It will cause difference with vinilla for both following choices

  • keep the behavior of get_json_object(s, '$.a.b') as before, but the result is different from get_json_object(get_json_object(s, '$.a'), '$.b')
  • somehow let get_json_object(s, '$.a.b') have same result as get_json_object(get_json_object(s, '$.a'), '$.b'), but it will make the result of normal call get_json_object(s, '$.a.b') different from vanilla

@lgbo-ustc
Copy link
Contributor

lgbo-ustc commented Feb 13, 2025

The optimization for nested get_json_object should be put into a seperate issue and PR, I think it will be a large change

@lgbo-ustc
Copy link
Contributor

Is the rule in #8305 disable?

@KevinyhZou
Copy link
Contributor Author

rule in #8305 is disable by default as the not passed case.

@KevinyhZou
Copy link
Contributor Author

I have open a seperate PR for get_json_object: #8733 @lgbo-ustc , you can have a review at this.

@PHILO-HE
Copy link
Contributor

Hi @KevinyhZou, for get_json_object(get_json_object('{"a.b": 0}', '$.a'), '$.b'), we expect the result is still null after collapsed as get_json_object('{"a.b": 0}', '$.a.b'), right? I did a test in velox, it is null as expected (simdjson-based implementation). If CH got wrong result, I am wondering whether it is a bug in its native implementation.

@KevinyhZou
Copy link
Contributor Author

Yes, it should be the problem of ch backend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants