[GLUTEN-8557][CH] Collapse nested function calls for And/Or for performance optimization #8558
Conversation
Run Gluten Clickhouse CI on x86
And, GetStructField
And, GetStructField
And/GetStructField
Run Gluten Clickhouse CI on x86
And/GetStructField
And/Or/GetStructField/GetJsonObject
Run Gluten Clickhouse CI on x86
Force-pushed from b86d9e7 to 2cc64f2 (Compare)
Run Gluten Clickhouse CI on x86
5 similar comments
Performance test
.internal()
.doc("Collapse nested functions as one for optimization.")
.stringConf
.createWithDefault("get_struct_field,get_json_object");
What about `and`, `or`?
@@ -97,7 +97,10 @@ object ExpressionConverter extends SQLConfHelper with Logging {
    if (udf.udfName.isEmpty) {
      throw new GlutenNotSupportException("UDF name is not found!")
    }
-   val substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
+   var substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
What is the relationship between collapsedFunctionsMap and UDFs? They do not seem to need to be logically coupled; it would be better to decouple them.
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.types.{DataType, DataTypes}

case class CollapseNestedExpressions(spark: SparkSession) extends Rule[SparkPlan] {
Please add the necessary comments.
}

private def canBeOptimized(plan: SparkPlan): Boolean = plan match {
  case p: ProjectExecTransformer =>
Can expressions in the generate operator be optimized?
It seems they cannot.
@@ -462,8 +463,55 @@ class GetJsonObjectImpl

static size_t getNumberOfIndexArguments(const DB::ColumnsWithTypeAndName & arguments) { return arguments.size() - 1; }

bool insertResultToColumn(DB::IColumn & dest, typename JSONParser::Element & root, std::vector<std::shared_ptr<DB::GeneratorJSONPath<JSONParser>>> & generator_json_paths, size_t & json_path_pos) const
@lgbo-ustc please review the changes related to get_json_object.
@@ -59,9 +59,9 @@ class GetJSONObjectParser : public FunctionParser
    DB::ActionsDAG & actions_dag) const override
{
    const auto & args = substrait_func.arguments();
-   if (args.size() != 2)
+   if (args.size() < 2)
?
get_json_object(get_json_object(d, '$.a'), '$.b') is optimized to get_json_object(d, '$.a', '$.b'), which may have more than two arguments.
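The collapse described in this reply can be sketched as a bottom-up rewrite over an expression tree. Below is a minimal, hypothetical model (not the actual Gluten rule, which works on Catalyst expressions): a call is a tuple `("get_json_object", input, paths)` and leaves are column names.

```python
# Hypothetical expression model: either a column name (str) or a tuple
# ("get_json_object", child_expr, [path, ...]).

def collapse(expr):
    """Collapse nested get_json_object calls into one multi-path call."""
    if isinstance(expr, tuple) and expr[0] == "get_json_object":
        _, child, paths = expr
        child = collapse(child)  # collapse inner calls first
        if isinstance(child, tuple) and child[0] == "get_json_object":
            # The inner call's paths are applied first, then the outer ones.
            return ("get_json_object", child[1], child[2] + paths)
        return ("get_json_object", child, paths)
    return expr

nested = ("get_json_object",
          ("get_json_object", "d", ["$.a"]),
          ["$.b"])
print(collapse(nested))  # → ('get_json_object', 'd', ['$.a', '$.b'])
```

The same recursion handles arbitrarily deep nesting, which is why the collapsed call may carry more than two arguments.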
Please explain the optimization approach for get_json_object; it is not clear why it is changed here to take multiple path arguments.
I suggest splitting the optimizations for different functions into separate PRs.
mutable size_t total_normalized_rows = 0;

template<typename JSONParser, typename JSONStringSerializer>
void insertResultToColumn(
It seems complex; I guess there should be a simpler implementation with fewer branches.
You should explain which case each branch is for.
const char * query_begin = reinterpret_cast<const char *>(sub_field.c_str());
const char * query_end = sub_field.c_str() + sub_field.size();
Why not use the normalized one?
PR #8305 cannot handle some corner cases that this PR can handle, so compatibility is better here. In addition, all nested-call optimizations can go through this flow, which is more uniform in form. @lgbo-ustc The optimization idea: for nested get_json_object(get_json_object(d, '$.a'), '$.b'), there are two get_json_object function calls, which means the JSON is parsed twice. After optimization it becomes get_json_object(d, '$.a', '$.b'), a single get_json_object call: the number of function calls drops, and the JSON is parsed only once, since the lookup for the second path can be done directly on the result of parsing the first path. To obtain the result in a single get_json_object call, the paths of the nested get_json_object calls must all be passed down into one get_json_object call, which is why it takes multiple path arguments.
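The single-parse idea in this comment can be illustrated with a small sketch. This is an assumption-laden re-implementation, handling only simple '$.key' paths (real get_json_object supports full JSONPath): the document is parsed once, and each subsequent path is applied to the in-memory value from the previous lookup instead of re-serializing and re-parsing.

```python
import json

def get_json_object_multi(doc, *paths):
    """Evaluate get_json_object(doc, p1, p2, ...) with one JSON parse.

    Simplified sketch: only '$.key' paths over JSON objects are handled.
    """
    value = json.loads(doc)  # parse the document exactly once
    for p in paths:
        key = p[2:] if p.startswith("$.") else p  # strip the '$.' prefix
        if not isinstance(value, dict) or key not in value:
            return None
        # Apply the next path to the in-memory value; no re-parse needed.
        value = value[key]
    # Mimic get_json_object's string result for the final value.
    return value if isinstance(value, str) else json.dumps(value)

print(get_json_object_multi('{"a": {"b": 1}}', "$.a", "$.b"))  # → 1
```

The nested form would instead serialize the intermediate `{"b": 1}` back to a string and parse it again, which is exactly the cost the collapsed call avoids.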
String getJsonPathOfGetJSONObject(const substrait::Expression_ScalarFunction & func)
{
    String json_path = "";
    for (size_t i = 1; i < func.arguments().size(); ++i)
    {
        auto json_path_pb = func.arguments(i).value();
        if (!json_path_pb.has_literal() || !json_path_pb.literal().has_string())
        {
            break;
        }
        json_path += json_path_pb.literal().string();
        if (i != func.arguments().size() - 1)
        {
            json_path += "#";
        }
    }
    return json_path;
}
When things become complex, I don't think this representation can be flexible enough any more. Consider the case with three levels of nested calls. A tree structure should be OK, for example a JSON text such as the following:
[
  "path1",
  "path2",
  "path3": [
    "path3_1",
    "path3_2": [ ... ]
  ],
  ...
]
What about more levels of nested calls here? This discussion should be included.
If more levels of nested get_json_object calls are mixed with other get_json_object calls, like get_json_object(get_json_object(get_json_object(d, '$.a'), '$.b'), '$.c'), get_json_object(d, '$.e'), they will be optimized to get_json_object(d, '$.a', '$.b', '$.c'), get_json_object(d, '$.e') by the optimizer rule CollapsedNestedExpressions. In the next steps, their paths will be converted to '$.a#$.b#$.c | $.e' and executed by SparkFunctionGetJsonObject to eventually return a tuple. So deeper nesting can be handled here. @lgbo-ustc
I will add a UT for this case.
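The path encoding described in this reply can be sketched as follows. The '#' and ' | ' separators are taken from the comment above; the actual wire format consumed by SparkFunctionGetJsonObject may differ in detail.

```python
def encode_paths(calls):
    """Encode the path lists of several collapsed get_json_object calls.

    Paths within one collapsed call are joined with '#'; the calls
    themselves are joined with ' | '.
    """
    return " | ".join("#".join(paths) for paths in calls)

def decode_paths(encoded):
    """Invert encode_paths back into per-call path lists."""
    return [part.split("#") for part in encoded.split(" | ")]

# Two collapsed calls from the example in the comment above.
print(encode_paths([["$.a", "$.b", "$.c"], ["$.e"]]))  # → $.a#$.b#$.c | $.e
```

A flat, separator-based encoding like this is what the reviewer's tree-structure suggestion above is reacting to: it works for linear chains of nesting but cannot express branching path structures.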
Yes, I will remove the changes to get_struct_field.
Run Gluten Clickhouse CI on x86
And/Or/GetStructField/GetJsonObject
And/Or
Run Gluten Clickhouse CI on x86
2 similar comments
Force-pushed from 93dcc08 to a9e0512 (Compare)
Run Gluten Clickhouse CI on x86 |
@@ -142,6 +160,8 @@ object ExpressionConverter extends SQLConfHelper with Logging {
    expr match {
      case p: PythonUDF =>
        return replacePythonUDFWithExpressionTransformer(p, attributeSeq, expressionsMap)
+     case s: ScalaUDF if CollapsedExpressionMappings.supported(s.udfName.getOrElse("")) =>
Does any Scala UDF expression need to be collapsed?
Scala UDFs are not supported now, but this uses ScalaUDF to represent the collapsed nested expressions.
@PHILO-HE can you help review the code changes under gluten/substrait? In theory, they have no impact on the behavior of the Velox backend.
LGTM
Run Gluten Clickhouse CI on x86
1 similar comment
Force-pushed from 9ad3176 to 6e59310 (Compare)
Run Gluten Clickhouse CI on x86
1 similar comment
Thanks for the work!
.internal()
.doc("Collapse nested functions as one for optimization.")
.stringConf
.createWithDefault("and,or");
@KevinyhZou, do you find any unsuitable corner cases? E.g., wrong result, performance degradation. If no, can we always enable this optimization without introducing a config?
No unsuitable cases were found in the CI testing or my own testing. But I'm afraid there may be unsuitable cases in our online SQLs that the CI testing does not cover, so I think it's better to keep the config.
* ScalaUDF at first, and then pass the ScalaUDF to the ClickHouse backend. e.g. And(And(a=1,
* b=2), c=3) can be optimized to And(a=1, b=2, c=3), but And(a=1, b=2, c=3) cannot be
* supported by Spark's `And` function, so we need to convert it to a ScalaUDF named
* `And` with 3 arguments; then pass the `ScalaUDF(#and(a=1,b=2,c=3))` to the ClickHouse
Must we use a ScalaUDF? Can we make the optimized plan compatible just with the backend, regardless of whether it is compatible with Spark?
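The And/Or flattening described in the quoted comment can be sketched language-agnostically. The tuple model here is a hypothetical stand-in; the real rule works on Catalyst And/Or expressions.

```python
def flatten(op, expr):
    """Flatten nested binary op nodes, e.g. And(And(a, b), c) -> [a, b, c]."""
    if isinstance(expr, tuple) and expr[0] == op:
        children = []
        for child in expr[1:]:
            children.extend(flatten(op, child))  # recurse into same-op nodes
        return children
    return [expr]

# And(And(a=1, b=2), c=3) collapses to a single 3-argument conjunction;
# Spark's And is strictly binary, hence the ScalaUDF wrapper in the PR.
nested_and = ("And", ("And", "a=1", "b=2"), "c=3")
print(flatten("And", nested_and))  # → ['a=1', 'b=2', 'c=3']
```

Backends that support n-ary conjunctions can then evaluate all conjuncts in one call instead of a chain of binary nodes, which is the performance win this PR targets.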
And/Or
And/Or for performance optimization
Run Gluten Clickhouse CI on x86
2 similar comments
Force-pushed from ba9497b to 72857c0 (Compare)
Run Gluten Clickhouse CI on x86 |
What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
(Fixes: #8557)
How was this patch tested?
Tested by unit tests.