Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-8557][CH] Collapse nested function calls for And/Or for performance optimization #8558

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

KevinyhZou
Copy link
Contributor

@KevinyhZou KevinyhZou commented Jan 17, 2025

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #8557)

How was this patch tested?

test by ut

@KevinyhZou KevinyhZou marked this pull request as draft January 17, 2025 09:11
@github-actions github-actions bot added CORE works for Gluten Core CLICKHOUSE labels Jan 17, 2025
Copy link

#8557

Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls [GLUTEN-8557][CH]Optimize nested function calls for And, GetStructField Jan 17, 2025
@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls for And, GetStructField [GLUTEN-8557][CH]Optimize nested function calls for And/GetStructField Jan 17, 2025
Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls for And/GetStructField [GLUTEN-8557][CH]Optimize nested function calls for And/Or/GetStructField/GetJsonObject Jan 21, 2025
@KevinyhZou KevinyhZou closed this Jan 21, 2025
@KevinyhZou KevinyhZou reopened this Jan 21, 2025
@KevinyhZou KevinyhZou marked this pull request as ready for review January 21, 2025 11:06
Copy link

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Feb 4, 2025

Run Gluten Clickhouse CI on x86

5 similar comments
Copy link

github-actions bot commented Feb 4, 2025

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Feb 5, 2025

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou
Copy link
Contributor Author

性能测试

性能测试:测试数据量6千万行,分别执行三次,查看端到端执行时间

get_json_object: select count(1) from test_tbl1 where get_json_object(get_json_object(d, '$.a'), '$.b') = 'c';
优化前:53.194s, 52.18s, 51.603s
优化后:24.894s, 25.042s, 24.724s

and: select count(1) from test_tbl34 where id = 1 and d != 'axx' and d != 'cxx' and d != 'zxx';
优化前: 3.918s,4.106s,3.835s
优化后: 3.123s, 3.303s, 3.288s

or: select count(1) from test_tbl34 where id = 1 or d != 'axx' or d != 'cxx' or d != 'zxx';
优化前:2.611s,2.765s, 2.484s
优化后:2.3s, 2.409s, 2.283s;

get_struct_field: select count(1) from test_tbl31 where d.e.d.y = 'y123';
优化前:80.819s, 80.517s,83.3s
优化后:80.885s, 80.108s, 81.121s;

.internal()
.doc("Collapse nested functions as one for optimization.")
.stringConf
.createWithDefault("get_struct_field,get_json_object");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about and, or

@@ -97,7 +97,10 @@ object ExpressionConverter extends SQLConfHelper with Logging {
if (udf.udfName.isEmpty) {
throw new GlutenNotSupportException("UDF name is not found!")
}
val substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
var substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collapsedFunctionsMap和udf有什么关系?看起来逻辑上没必要耦合在一块,最好能解耦

import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.types.{DataType, DataTypes}

case class CollapseNestedExpressions(spark: SparkSession) extends Rule[SparkPlan] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add necessary comment

}

private def canBeOptimized(plan: SparkPlan): Boolean = plan match {
case p: ProjectExecTransformer =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can expression in generate operator be optimized ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems can not

@@ -462,8 +463,55 @@ class GetJsonObjectImpl

static size_t getNumberOfIndexArguments(const DB::ColumnsWithTypeAndName & arguments) { return arguments.size() - 1; }

bool insertResultToColumn(DB::IColumn & dest, typename JSONParser::Element & root, std::vector<std::shared_ptr<DB::GeneratorJSONPath<JSONParser>>> & generator_json_paths, size_t & json_path_pos) const
Copy link
Contributor

@taiyang-li taiyang-li Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgbo-ustc pls review changes related to get_json_object

@@ -59,9 +59,9 @@ class GetJSONObjectParser : public FunctionParser
DB::ActionsDAG & actions_dag) const override
{
const auto & args = substrait_func.arguments();
if (args.size() != 2)
if (args.size() < 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_json_object(get_json_object(d, '$.a'), '$.b') => optimize to get_json_object(d, '$.a', '$.b'), which may have more than 2 arguments.

@lgbo-ustc
Copy link
Contributor

lgbo-ustc commented Feb 12, 2025

解释下get_json_object的优化思路,不是太明白为何在这里要改成多个路径参数。
已经有 #8305 优化的情况下,这个改动带来哪些变化?

@lgbo-ustc
Copy link
Contributor

建议不同函数的优化拆分到不同的PR里面

mutable size_t total_normalized_rows = 0;

template<typename JSONParser, typename JSONStringSerializer>
void insertResultToColumn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be complex, I guess there should be a simpler implement with less branches

Copy link
Contributor

@lgbo-ustc lgbo-ustc Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should explain which case it is for each branch

Comment on lines 801 to 802
const char * query_begin = reinterpret_cast<const char *>(sub_field.c_str());
const char * query_end = sub_field.c_str() + sub_field.size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not used the normalized one?

@KevinyhZou
Copy link
Contributor Author

KevinyhZou commented Feb 13, 2025

解释下get_json_object的优化思路,不是太明白为何在这里要改成多个路径参数。 已经有 #8305 优化的情况下,这个改动带来哪些变化?

#8305 pr 有些corner case解决不了,这个pr 可以解决, 兼容性更好一些。另外,嵌套调用优化都可以走这套流程,形式上更加统一一点 @lgbo-ustc

优化思路:对于嵌套的get_json_object(get_json_object(d, '$.a'), '$.b') 来讲,会经历两次get_json_object 函数调用:意味着需要经过两次parse json的操作;

优化后,变成get_json_object(d, '$.a', '$.b') 只需要一次get_json_object 调用,减少了函数调用次数,减少了parse json的次数到一次,直接可以从第一个路径parse的结果中,再进行第二个路径的查找。

为了在一次get_json_object 中获取到结果,所以需要将嵌套的get_json_object 的嵌套路径 都放到 get_json_object 函数中传递下去,就变成了多个路径。

Comment on lines 88 to 105
String getJsonPathOfGetJSONObject(const substrait::Expression_ScalarFunction & func)
{
String json_path = "";
for (size_t i = 1; i < func.arguments().size(); ++i)
{
auto json_path_pb = func.arguments(i).value();
if (!json_path_pb.has_literal() || !json_path_pb.literal().has_string())
{
break;
}
json_path += json_path_pb.literal().string();
if (i != func.arguments().size() - 1)
{
json_path += "#" ;
}
}
return json_path;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When things become complex, I don't think this representation could be flexible enogh any more. Let's consider the case with three levels of nested calls. A tree structure should be OK. For example a json text as following

[
  "path1",
  "path2",
  "path3":[
     "path3_1",
     "path3_2": [ ....]
  ],
  ...
]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about more levels of netsted calls here? The disccussion should be given

Copy link
Contributor Author

@KevinyhZou KevinyhZou Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If more levels of nested get_json_objectcall mixed with other get_json_object call, like get_json_object(get_json_object(get_json_object(d, '$.a'), '$.b'), '$.c'), get_json_object(d, '$.e') and this will be optimized to get_json_object(d, '$.a', '$.b', '$.c'), get_json_object(d, '$.e') by the optimized rule CollapsedNestedExpressions, at next steps, the path of them will be converted to '$.a#$.b#$.c | $.e', and it will be executed by SparkFunctionGetJsonObject to return a tuple eventually, So more levels case can be handled here. @lgbo-ustc

I will and a ut for this case

@lgbo-ustc
Copy link
Contributor

性能测试

性能测试:测试数据量6千万行,分别执行三次,查看端到端执行时间

get_json_object: select count(1) from test_tbl1 where get_json_object(get_json_object(d, '$.a'), '$.b') = 'c';
优化前:53.194s, 52.18s, 51.603s
优化后:24.894s, 25.042s, 24.724s

and: select count(1) from test_tbl34 where id = 1 and d != 'axx' and d != 'cxx' and d != 'zxx';
优化前: 3.918s,4.106s,3.835s
优化后: 3.123s, 3.303s, 3.288s

or: select count(1) from test_tbl34 where id = 1 or d != 'axx' or d != 'cxx' or d != 'zxx';
优化前:2.611s,2.765s, 2.484s
优化后:2.3s, 2.409s, 2.283s;

get_struct_field: select count(1) from test_tbl31 where d.e.d.y = 'y123';
优化前:80.819s, 80.517s,83.3s
优化后:80.885s, 80.108s, 81.121s;

For get_struct_field, the coalesce is not necessary. get_struct_field is just to extract one of the nested column, the cost is tiny for a batch. I prefer to keep it not changed

@KevinyhZou
Copy link
Contributor Author

Yes, I will remove the changes of get_struct_field.

Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls for And/Or/GetStructField/GetJsonObject [GLUTEN-8557][CH]Optimize nested function calls for And/Or Feb 14, 2025
Copy link

Run Gluten Clickhouse CI on x86

2 similar comments
Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

@@ -142,6 +160,8 @@ object ExpressionConverter extends SQLConfHelper with Logging {
expr match {
case p: PythonUDF =>
return replacePythonUDFWithExpressionTransformer(p, attributeSeq, expressionsMap)
case s: ScalaUDF if CollapsedExpressionMappings.supported(s.udfName.getOrElse("")) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does any scala udf expression need to be collapsed?

Copy link
Contributor Author

@KevinyhZou KevinyhZou Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not support scala udf now, but it use scala udf to convert nested expressions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PHILO-HE can you help review code changes under gluten/substrait. In theory, it doesn't have impact on the behavior of velox backend.

Copy link
Contributor

@taiyang-li taiyang-li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

Run Gluten Clickhouse CI on x86

1 similar comment
Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

1 similar comment
Copy link

Run Gluten Clickhouse CI on x86

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work!

.internal()
.doc("Collapse nested functions as one for optimization.")
.stringConf
.createWithDefault("and,or");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KevinyhZou, do you find any unsuitable corner cases? E.g., wrong result, performance degradation. If no, can we always enable this optimization without introducing a config?

Copy link
Contributor Author

@KevinyhZou KevinyhZou Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not found unsuitable case for the ci testing and my own testing. But I'm afraid there maybe some unsuitable case in our online sqls, which the ci testing do not cover,So I think it‘s better to keep them

* ScalaUDF at first, and then pass the ScalaUDF to clickhouse backend. e.g. And(And(a=1,
* b=2),c=3) can be optimized to And(a=1, b=2, c=3),but And(a=1, b=2, c=3) can not be
* supported by spark `And` function, so we need to convert it to ScalaUDF, with name is
* `And`, and have 3 arguments, when pass the `ScalaUDF(#and(a=1,b=2,c=3))` to clickhouse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must we need a ScalaUDF? Can we make an optimized plan just compatible with the backend? Regardless of that it's not compatible with Spark.

@PHILO-HE PHILO-HE changed the title [GLUTEN-8557][CH]Optimize nested function calls for And/Or [GLUTEN-8557][CH] Collapse nested function calls for And/Or for performance optimization Feb 21, 2025
Copy link

github-actions bot commented Mar 3, 2025

Run Gluten Clickhouse CI on x86

2 similar comments
Copy link

github-actions bot commented Mar 3, 2025

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Mar 3, 2025

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Mar 4, 2025

Run Gluten Clickhouse CI on x86

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLICKHOUSE CORE works for Gluten Core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CH]Optimize nested function calls
4 participants