[GLUTEN-8557][CH] Collapse nested function calls for `And`/`Or` for performance optimization #8558

KevinyhZou · 2025-01-17T09:10:57Z

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #8557)

How was this patch tested?

Tested by unit tests (UT).

github-actions · 2025-01-17T09:11:16Z

#8557

github-actions · 2025-01-17T09:11:29Z

Run Gluten Clickhouse CI on x86

KevinyhZou · 2025-02-11T09:08:48Z

Performance test

Performance test: 60 million rows of test data; each query was run three times and the end-to-end execution time was recorded.

get_json_object: select count(1) from test_tbl1 where get_json_object(get_json_object(d, '$.a'), '$.b') = 'c';
Before optimization: 53.194s, 52.18s, 51.603s
After optimization: 24.894s, 25.042s, 24.724s

and: select count(1) from test_tbl34 where id = 1 and d != 'axx' and d != 'cxx' and d != 'zxx';
Before optimization: 3.918s, 4.106s, 3.835s
After optimization: 3.123s, 3.303s, 3.288s

or: select count(1) from test_tbl34 where id = 1 or d != 'axx' or d != 'cxx' or d != 'zxx';
Before optimization: 2.611s, 2.765s, 2.484s
After optimization: 2.3s, 2.409s, 2.283s

get_struct_field: select count(1) from test_tbl31 where d.e.d.y = 'y123';
Before optimization: 80.819s, 80.517s, 83.3s
After optimization: 80.885s, 80.108s, 81.121s
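
To read the timings above as ratios, the three runs of each query can be averaged and divided. The following is a minimal sketch of that arithmetic; the helper names `mean` and `speedup` are hypothetical and not part of the PR.

```cpp
#include <numeric>
#include <vector>

// Arithmetic mean of the measured end-to-end times (seconds).
double mean(const std::vector<double>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0.0) / static_cast<double>(xs.size());
}

// Ratio of mean time before the optimization to mean time after it.
// Values above 1.0 mean the optimized run was faster on average.
double speedup(const std::vector<double>& before, const std::vector<double>& after) {
    return mean(before) / mean(after);
}
```

Applied to the numbers in this comment, `speedup({53.194, 52.18, 51.603}, {24.894, 25.042, 24.724})` comes out at roughly 2.1x for get_json_object, while the get_struct_field runs give a ratio close to 1.0, i.e. no measurable change.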

taiyang-li · 2025-02-12T02:14:10Z

shims/common/src/main/scala/org/apache/gluten/config/GlutenConfig.scala

+      .internal()
+      .doc("Collapse nested functions as one for optimization.")
+      .stringConf
+      .createWithDefault("get_struct_field,get_json_object");


What about `and` and `or`?

taiyang-li · 2025-02-12T02:16:47Z

gluten-substrait/src/main/scala/org/apache/gluten/expression/ExpressionConverter.scala

@@ -97,7 +97,10 @@ object ExpressionConverter extends SQLConfHelper with Logging {
    if (udf.udfName.isEmpty) {
      throw new GlutenNotSupportException("UDF name is not found!")
    }
-    val substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
+    var substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)


What does collapsedFunctionsMap have to do with UDFs? Logically there seems to be no need to couple them; it would be better to decouple them.

taiyang-li · 2025-02-12T02:19:17Z

backends-clickhouse/src/main/scala/org/apache/gluten/extension/CollapseNestedExpressions.scala

+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.types.{DataType, DataTypes}
+
+case class CollapseNestedExpressions(spark: SparkSession) extends Rule[SparkPlan] {


Please add the necessary comments.

taiyang-li · 2025-02-12T02:20:28Z

backends-clickhouse/src/main/scala/org/apache/gluten/extension/CollapseNestedExpressions.scala

+  }
+
+  private def canBeOptimized(plan: SparkPlan): Boolean = plan match {
+    case p: ProjectExecTransformer =>


Can expressions in the Generate operator be optimized?

It seems they cannot.

taiyang-li · 2025-02-12T02:32:56Z

cpp-ch/local-engine/Functions/SparkFunctionGetJsonObject.h

@@ -462,8 +463,55 @@ class GetJsonObjectImpl

    static size_t getNumberOfIndexArguments(const DB::ColumnsWithTypeAndName & arguments) { return arguments.size() - 1; }

+    bool insertResultToColumn(DB::IColumn & dest, typename JSONParser::Element & root, std::vector<std::shared_ptr<DB::GeneratorJSONPath<JSONParser>>> & generator_json_paths, size_t & json_path_pos) const


@lgbo-ustc please review the changes related to get_json_object.

lgbo-ustc · 2025-02-12T06:03:46Z

cpp-ch/local-engine/Parser/scalar_function_parser/getJSONObject.cpp

@@ -59,9 +59,9 @@ class GetJSONObjectParser : public FunctionParser
        DB::ActionsDAG & actions_dag) const override
    {
        const auto & args = substrait_func.arguments();
-        if (args.size() != 2)
+        if (args.size() < 2)


get_json_object(get_json_object(d, '$.a'), '$.b') is optimized to get_json_object(d, '$.a', '$.b'), which may have more than 2 arguments.

lgbo-ustc · 2025-02-12T06:18:30Z

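Please explain the optimization approach for get_json_object; I don't quite understand why it needs to take multiple path arguments here.
Given that #8305 already optimizes this, what does this change add?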

lgbo-ustc · 2025-02-12T06:18:50Z

I suggest splitting the optimizations for different functions into separate PRs.

lgbo-ustc · 2025-02-12T06:30:33Z

cpp-ch/local-engine/Functions/SparkFunctionGetJsonObject.h

+    mutable size_t total_normalized_rows = 0;
+
+    template<typename JSONParser, typename JSONStringSerializer>
+    void insertResultToColumn(


This seems complex. I guess there should be a simpler implementation with fewer branches.

Each branch should explain which case it handles.

lgbo-ustc · 2025-02-12T06:46:28Z

cpp-ch/local-engine/Functions/SparkFunctionGetJsonObject.h

+                    const char * query_begin = reinterpret_cast<const char *>(sub_field.c_str());
+                    const char * query_end = sub_field.c_str() + sub_field.size();


Why not use the normalized one?

KevinyhZou · 2025-02-13T01:43:08Z

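Please explain the optimization approach for get_json_object; I don't quite understand why it needs to take multiple path arguments here. Given that #8305 already optimizes this, what does this change add?

PR #8305 cannot handle some corner cases that this PR can, so this PR is more compatible. In addition, all nested-call optimizations can go through this same flow, which makes the approach more uniform. @lgbo-ustc

Optimization approach: a nested get_json_object(get_json_object(d, '$.a'), '$.b') goes through two get_json_object function calls, which means the JSON is parsed twice.

After the optimization it becomes get_json_object(d, '$.a', '$.b'), which needs only one get_json_object call. This reduces the number of function calls and cuts JSON parsing to a single pass: the lookup for the second path runs directly on the result parsed for the first path.

To obtain the result within a single get_json_object call, the paths of the nested get_json_object calls must all be passed down to that one function, which is why it takes multiple paths.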
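
The rewrite described here can be illustrated with a small standalone sketch. It is not the PR's code: `JsonCall`, `FlatCall`, and `collapseGetJsonObject` are hypothetical names, and the sketch assumes every level of the nesting is a get_json_object call with a literal path.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical model of get_json_object(<input>, <path>), where <input> is
// either a column (innermost call) or another get_json_object call.
struct JsonCall {
    std::string column;               // set only on the innermost call
    std::shared_ptr<JsonCall> input;  // nested call, or nullptr when innermost
    std::string path;                 // literal JSON path such as "$.a"
};

// Collapsed form: one call on the column with all paths, innermost first.
struct FlatCall {
    std::string column;
    std::vector<std::string> paths;
};

FlatCall collapseGetJsonObject(const JsonCall& outer) {
    FlatCall flat;
    std::vector<std::string> outerFirst;
    // Walk from the outermost call inwards, remembering each path.
    for (const JsonCall* cur = &outer; cur != nullptr; cur = cur->input.get()) {
        outerFirst.push_back(cur->path);
        if (cur->input == nullptr) {
            flat.column = cur->column;
        }
    }
    // The innermost path must be applied first, so reverse the order.
    flat.paths.assign(outerFirst.rbegin(), outerFirst.rend());
    return flat;
}
```

For get_json_object(get_json_object(d, '$.a'), '$.b') this yields the column `d` with the paths `$.a`, `$.b`, matching the single-call form get_json_object(d, '$.a', '$.b') discussed above.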

lgbo-ustc · 2025-02-13T06:53:46Z

cpp-ch/local-engine/Rewriter/ExpressionRewriter.h

+    String getJsonPathOfGetJSONObject(const substrait::Expression_ScalarFunction & func)
+    {
+        String json_path = "";
+        for (size_t i = 1; i < func.arguments().size(); ++i)
+        {
+            auto json_path_pb = func.arguments(i).value();
+            if (!json_path_pb.has_literal() || !json_path_pb.literal().has_string())
+            {
+                break;
+            }
+            json_path += json_path_pb.literal().string();
+            if (i != func.arguments().size() - 1)
+            {
+                json_path += "#" ;
+            }
+        }
+        return json_path;
+    }


When things become complex, I don't think this representation will be flexible enough any more. Consider the case with three levels of nested calls. A tree structure should be OK, for example JSON text like the following:

[ "path1", "path2", "path3":[ "path3_1", "path3_2": [ ....] ], ... ]

What about more levels of nested calls here? This should be discussed.

If more levels of nested get_json_object calls are mixed with other get_json_object calls, for example get_json_object(get_json_object(get_json_object(d, '$.a'), '$.b'), '$.c'), get_json_object(d, '$.e'), the optimization rule CollapseNestedExpressions rewrites them to get_json_object(d, '$.a', '$.b', '$.c'), get_json_object(d, '$.e'). In the next step their paths are converted to '$.a#$.b#$.c | $.e', which SparkFunctionGetJsonObject executes to return a tuple. So cases with more levels can be handled here. @lgbo-ustc

I will add a UT for this case.
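
The path encoding mentioned in this reply can be sketched as follows. This is an illustration only: `joinWith` and `encodePaths` are hypothetical helpers, the `#` separator follows the getJsonPathOfGetJSONObject excerpt above, and the ` | ` separator between calls is copied from the comment and may not match the real implementation byte for byte.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Join strings with a separator placed between consecutive elements.
std::string joinWith(const std::vector<std::string>& parts, const std::string& sep) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) {
            out += sep;
        }
        out += parts[i];
    }
    return out;
}

// Each inner vector holds the paths of one collapsed get_json_object call.
// Paths within a call are joined with "#"; calls are joined with " | ".
std::string encodePaths(const std::vector<std::vector<std::string>>& calls) {
    std::vector<std::string> encoded;
    for (const auto& paths : calls) {
        encoded.push_back(joinWith(paths, "#"));
    }
    return joinWith(encoded, " | ");
}
```

Given the two calls from the example, with paths `$.a`, `$.b`, `$.c` and `$.e`, the sketch produces the string `$.a#$.b#$.c | $.e` quoted in the reply.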

lgbo-ustc · 2025-02-13T07:16:59Z

Performance test

Performance test: 60 million rows of test data; each query was run three times and the end-to-end execution time was recorded.

get_json_object: select count(1) from test_tbl1 where get_json_object(get_json_object(d, '$.a'), '$.b') = 'c';
Before optimization: 53.194s, 52.18s, 51.603s
After optimization: 24.894s, 25.042s, 24.724s

and: select count(1) from test_tbl34 where id = 1 and d != 'axx' and d != 'cxx' and d != 'zxx';
Before optimization: 3.918s, 4.106s, 3.835s
After optimization: 3.123s, 3.303s, 3.288s

or: select count(1) from test_tbl34 where id = 1 or d != 'axx' or d != 'cxx' or d != 'zxx';
Before optimization: 2.611s, 2.765s, 2.484s
After optimization: 2.3s, 2.409s, 2.283s

get_struct_field: select count(1) from test_tbl31 where d.e.d.y = 'y123';
Before optimization: 80.819s, 80.517s, 83.3s
After optimization: 80.885s, 80.108s, 81.121s

For get_struct_field, the collapsing is not necessary. get_struct_field just extracts one of the nested columns, so its cost per batch is tiny. I prefer to leave it unchanged.

KevinyhZou · 2025-02-13T07:18:39Z

Yes, I will remove the changes to get_struct_field.

taiyang-li · 2025-02-19T06:16:42Z

gluten-substrait/src/main/scala/org/apache/gluten/expression/ExpressionConverter.scala

@@ -142,6 +160,8 @@ object ExpressionConverter extends SQLConfHelper with Logging {
    expr match {
      case p: PythonUDF =>
        return replacePythonUDFWithExpressionTransformer(p, attributeSeq, expressionsMap)
+      case s: ScalaUDF if CollapsedExpressionMappings.supported(s.udfName.getOrElse("")) =>


Does any Scala UDF expression need to be collapsed?

Scala UDFs are not supported for now, but the rule uses a ScalaUDF to convert the nested expressions.

@PHILO-HE, can you help review the code changes under gluten/substrait? In theory, they have no impact on the behavior of the Velox backend.

taiyang-li

LGTM

PHILO-HE

Thanks for the work!

PHILO-HE · 2025-02-21T05:21:23Z

shims/common/src/main/scala/org/apache/gluten/config/GlutenConfig.scala

+      .internal()
+      .doc("Collapse nested functions as one for optimization.")
+      .stringConf
+      .createWithDefault("and,or");


@KevinyhZou, did you find any unsuitable corner cases, e.g., wrong results or performance degradation? If not, can we always enable this optimization without introducing a config?

I found no unsuitable cases in CI testing or in my own testing. But I'm afraid there may be unsuitable cases in our online SQL that CI testing does not cover, so I think it's better to keep the config.
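
For context on what the config value carries: the default `"and,or"` is a comma-separated list of function names. A minimal sketch of how such a list could be parsed and queried is shown below. The diff references `CollapsedExpressionMappings.supported` but does not show its body, so the helper names here, and the trimming and lowercasing behaviour, are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <sstream>
#include <string>

// Trim surrounding whitespace and lowercase the name (assumed normalization).
std::string normalizeName(std::string s) {
    auto notSpace = [](unsigned char c) { return !std::isspace(c); };
    s.erase(s.begin(), std::find_if(s.begin(), s.end(), notSpace));
    s.erase(std::find_if(s.rbegin(), s.rend(), notSpace).base(), s.end());
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Split a value such as "and,or" into the set of collapsible function names.
std::set<std::string> parseCollapsibleFunctions(const std::string& conf) {
    std::set<std::string> names;
    std::stringstream stream(conf);
    std::string item;
    while (std::getline(stream, item, ',')) {
        item = normalizeName(item);
        if (!item.empty()) {
            names.insert(item);
        }
    }
    return names;
}

// True when the optimization is enabled for the given function name.
bool isCollapsible(const std::set<std::string>& names, const std::string& function) {
    return names.count(normalizeName(function)) > 0;
}
```

Under this reading, keeping the config lets a deployment drop a function from the list, or set the list to empty, if a problematic query shows up in production.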

PHILO-HE · 2025-02-21T05:55:05Z

backends-clickhouse/src/main/scala/org/apache/gluten/extension/CollapseNestedExpressions.scala

+       * ScalaUDF at first, and then pass the ScalaUDF to clickhouse backend. e.g. And(And(a=1,
+       * b=2),c=3) can be optimized to And(a=1, b=2, c=3)，but And(a=1, b=2, c=3) can not be
+       * supported by spark `And` function, so we need to convert it to ScalaUDF, with name is
+       * `And`, and have 3 arguments, when pass the `ScalaUDF(#and(a=1,b=2,c=3))` to clickhouse


Do we really need a ScalaUDF? Can we produce an optimized plan that is only compatible with the backend, regardless of whether it is compatible with Spark?
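
The collapse described in the code comment above, And(And(a=1, b=2), c=3) becoming And(a=1, b=2, c=3), can be sketched on a toy expression tree. This is not the Scala rule from the PR: `Expr`, `leaf`, `node`, `collapse`, and `toString` are hypothetical, and the sketch only shows the flattening step, not the ScalaUDF wrapping.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical n-ary expression node; a leaf has no children.
struct Expr {
    std::string op;                               // "and", "or", or leaf text such as "a=1"
    std::vector<std::shared_ptr<Expr>> children;
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr leaf(const std::string& text) { return std::make_shared<Expr>(Expr{text, {}}); }
ExprPtr node(const std::string& op, std::vector<ExprPtr> children) {
    return std::make_shared<Expr>(Expr{op, std::move(children)});
}

ExprPtr collapse(const ExprPtr& e);

// Gather operands of a chain of the same operator; other subtrees are
// collapsed on their own and kept as single operands.
void collectOperands(const ExprPtr& e, const std::string& op, std::vector<ExprPtr>& out) {
    if (e->op == op && !e->children.empty()) {
        for (const auto& child : e->children) collectOperands(child, op, out);
    } else {
        out.push_back(collapse(e));
    }
}

ExprPtr collapse(const ExprPtr& e) {
    if (e->children.empty()) return e;
    std::vector<ExprPtr> operands;
    const bool flattenable = (e->op == "and" || e->op == "or");
    for (const auto& child : e->children) {
        if (flattenable) collectOperands(child, e->op, operands);
        else operands.push_back(collapse(child));
    }
    return node(e->op, std::move(operands));
}

// Render the tree, e.g. "and(a=1, b=2, c=3)".
std::string toString(const ExprPtr& e) {
    if (e->children.empty()) return e->op;
    std::string out = e->op + "(";
    for (std::size_t i = 0; i < e->children.size(); ++i) {
        if (i != 0) out += ", ";
        out += toString(e->children[i]);
    }
    return out + ")";
}
```

Only chains of the same operator are merged, so And(Or(a, b), c) keeps its inner Or. Because AND and OR are associative, the flattened form has the same meaning as the nested one; the open question in this thread is only how to represent the n-ary node on the Spark side.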

KevinyhZou marked this pull request as draft January 17, 2025 09:11

github-actions bot added the CORE (works for Gluten Core) and CLICKHOUSE labels Jan 17, 2025

KevinyhZou changed the title ~~[GLUTEN-8557][CH]Optimize nested function calls~~ [GLUTEN-8557][CH]Optimize nested function calls for And, GetStructField Jan 17, 2025

KevinyhZou changed the title ~~[GLUTEN-8557][CH]Optimize nested function calls for And, GetStructField~~ [GLUTEN-8557][CH]Optimize nested function calls for And/GetStructField Jan 17, 2025

KevinyhZou changed the title ~~[GLUTEN-8557][CH]Optimize nested function calls for And/GetStructField~~ [GLUTEN-8557][CH]Optimize nested function calls for And/Or/GetStructField/GetJsonObject Jan 21, 2025

KevinyhZou closed this Jan 21, 2025

KevinyhZou reopened this Jan 21, 2025

KevinyhZou marked this pull request as ready for review January 21, 2025 11:06

KevinyhZou force-pushed the opt_nested_funcs branch from b86d9e7 to 2cc64f2 Compare February 4, 2025 08:31

taiyang-li reviewed Feb 12, 2025

lgbo-ustc reviewed Feb 12, 2025

lgbo-ustc requested changes Feb 12, 2025

lgbo-ustc requested changes Feb 13, 2025

KevinyhZou changed the title ~~[GLUTEN-8557][CH]Optimize nested function calls for And/Or/GetStructField/GetJsonObject~~ [GLUTEN-8557][CH]Optimize nested function calls for And/Or Feb 14, 2025

KevinyhZou force-pushed the opt_nested_funcs branch from 93dcc08 to a9e0512 Compare February 19, 2025 03:24

taiyang-li reviewed Feb 19, 2025

taiyang-li approved these changes Feb 19, 2025

KevinyhZou force-pushed the opt_nested_funcs branch from 9ad3176 to 6e59310 Compare February 19, 2025 10:41

PHILO-HE reviewed Feb 21, 2025

PHILO-HE changed the title ~~[GLUTEN-8557][CH]Optimize nested function calls for And/Or~~ [GLUTEN-8557][CH] Collapse nested function calls for And/Or for performance optimization Feb 21, 2025

optimize nested function calls

72857c0

KevinyhZou force-pushed the opt_nested_funcs branch from ba9497b to 72857c0 Compare March 4, 2025 12:02
