Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-8379][VL] Support query trace #8380

Merged
merged 1 commit into from
Jan 22, 2025

Conversation

jinchengchenghh
Copy link
Contributor

@jinchengchenghh jinchengchenghh commented Dec 31, 2024

Run the MicroBenchmark to generate stage level plan and then enable query trace in benchmark to profile node level query. Benchmark with query trace enabled replaces ValueStreamNode which is hard to serialize to ValuesNode. This issue may be fixed by plan serialization optimization that only serializes the plan node to profile in velox query trace. facebookincubator/velox#12084

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Dec 31, 2024
Copy link

#8379

Copy link

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Jan 3, 2025

Run Gluten Clickhouse CI on x86

@zhli1142015
Copy link
Contributor

We are also looking into this. I tried it locally but encountered the error below. Would you fix it in your PR?

Caused by: org.apache.gluten.exception.GlutenException: Exception: VeloxUserError
Error Source: USER
Error Code: UNSUPPORTED
Reason: ValueStream plan node is not serializable
Retriable: False
Function: serialize
File: /var/git/incubator-gluten/cpp/velox/operators/plannodes/RowVectorStream.h
Line: 136
Stack trace:
# 0  facebook::velox::VeloxException::VeloxException(char const*, unsigned long, char const*, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, bool, facebook::velox::VeloxException::Type, std::basic_string_view<char, std::char_traits<char> >)
# 1  void facebook::velox::detail::veloxCheckFail<facebook::velox::VeloxUserError, char const*>(facebook::velox::detail::VeloxCheckFailArgs const&, char const*)
# 2  gluten::ValueStreamNode::serialize() const
# 3  facebook::velox::core::PlanNode::serialize() const
# 4  facebook::velox::core::TopNNode::serialize() const
# 5  facebook::velox::core::PlanNode::serialize() const
# 6  facebook::velox::core::ProjectNode::serialize() const
# 7  facebook::velox::exec::trace::TaskTraceMetadataWriter::write(std::shared_ptr<facebook::velox::core::QueryCtx> const&, std::shared_ptr<facebook::velox::core::PlanNode const> const&)
# 8  facebook::velox::exec::Task::maybeInitTrace()

@jinchengchenghh
Copy link
Contributor Author

No, I just add the config. The exception is because we don't support to serialize ValueStream Node to json, we could support it, it is not hard.

@jinchengchenghh
Copy link
Contributor Author

QueryTrace only supports some operators, we need to extend it.

@zhli1142015
Copy link
Contributor

I think we might need to convert the exception below into a warning log to tolerate the issue you mentioned, where only some operators are supported.
https://github.com/facebookincubator/velox/blob/9da5fa7ee803540e330bc626a57388c9cbcad1b3/velox/exec/Operator.cpp#L130

@zhli1142015
Copy link
Contributor

And we also need to register spark fucntions in velox_query_replayer.

@jinchengchenghh
Copy link
Contributor Author

If you don't set the trace config, we won't run into this exception. The exception indicates you could not enable query trace in that case.

Looks like we still have further more work to do to enable query trace in Gluten

@FelixYBW
Copy link
Contributor

FelixYBW commented Jan 3, 2025

@jinchengchenghh what's the difference/advantage from our microbenchmark tool?

@jinchengchenghh
Copy link
Contributor Author

Our benchmark accepts a total velox plan, but query trace can specify the plan node id, it can save the input of any plan node @FelixYBW

Copy link

github-actions bot commented Jan 6, 2025

Run Gluten Clickhouse CI on x86

@zhztheplayer
Copy link
Member

Our benchmark accepts a total velox plan, but query trace can specify the plan node id, it can save the input of any plan node @FelixYBW

Can we consolidate the two functionalities? They sound overlap more or less?

@FelixYBW
Copy link
Contributor

FelixYBW commented Jan 6, 2025

Can we consolidate the two functionalities? They sound overlap more or less?

If it can save the input of any plan node and reproduce the node only, it will be useful to debug.

@@ -558,6 +558,18 @@ std::unordered_map<std::string, std::string> WholeStageResultIterator::getQueryC
configs[velox::core::QueryConfig::kSparkLegacyDateFormatter] = "false";
}

const auto setIfExists = [&](const std::string& glutenKey, const std::string& veloxKey) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this func is common, should we declare it as static member function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good suggestion, I think the config needs to do the refactor, now if the config not set by java side, we will set a default value of velox config to velox query config, we should change to not set the config.

If velox config default value is changed, we can change automatic with them if we don't have different default value with velox.

Copy link

Run Gluten ClickHouse CI on ARM

@github-actions github-actions bot added the DOCS label Jan 15, 2025
Copy link

Run Gluten ClickHouse CI on ARM

2 similar comments
Copy link

Run Gluten ClickHouse CI on ARM

Copy link

Run Gluten ClickHouse CI on ARM

Copy link
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible we finally get the generic benchmark and this feature merged in Gluten? Since both are invasive to the core execution code path. E.g., Generic benchmark requires for a individual parameter passed through JNI bridge and this feature requires for changing the ValueStreamNode to ValueNode. Do both features have own users as of now?

Comment on lines 54 to 61
std::string getQueryId(const std::unordered_map<std::string, std::string>& confMap) {
auto it = confMap.find(kQueryTraceQueryId);
if (it != confMap.end()) {
return it->second;
}
return "";
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little bit confused with the empty query ID. Can we always give a meaningful ID to Velox?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we uses the applicationId_stageId as the queryId?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we uses the applicationId_stageId as the queryId?

Do we have applicationId passed through JNI? Or could just align it with the task name somehow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If applicationId is important for the feature and not yet passed through JNI, could also add it and use it for query context and task names / IDs in another PR. Thank you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The queryId concept is not really query id, it is velox query id, not Spark queryId, now one query for one task, so I think we can use the same name with task name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we don't pass the applicationId

VELOX_CHECK_LT(streamIdx, inputIters_.size(), "Could not find stream index {} in input iterator list.", streamIdx);
const auto iterator = inputIters_[streamIdx];
while (iterator->hasNext()) {
auto cb = VeloxColumnarBatch::from(defaultLeafVeloxMemoryPool().get(), iterator->next());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the memory managed by Spark? Given that defaultLeafVeloxMemoryPool is global.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used in benchmark, so we don't need to track it.

Comment on lines 1 to 6
---
layout: page
title: How To Use Gluten
nav_order: 1
parent: Developer Overview
---
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The header needs to update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jinchengchenghh

nit: Would you update title (and perhaps other fields as well) in the header? Thanks.

@duanmeng
Copy link

@jinchengchenghh It is awesome to support query tracing in gluten. By the way, I am planning to support PlanNode serialization without leaf nodes to avoid serializing potential ValueStreamNode so that we can support more general query tracing in gluten.

Copy link

Run Gluten ClickHouse CI on ARM

@jinchengchenghh
Copy link
Contributor Author

The GenericBenchmark is frequently used by us CC @JkSelf , it is really usefully for debug in cpp side, we can use vscode to do step-by-step debug. And query trace is widely used in Velox to do correctness verification and performance profiling, so I think we need to enable it in Gluten. and with this optimization #8380 (comment), we could drop the ValueStreamNode to ValusNode replace. @zhztheplayer

@duanmeng
Copy link

duanmeng commented Jan 17, 2025

The GenericBenchmark is frequently used by us CC @JkSelf , it is really usefully for debug in cpp side, we can use vscode to do step-by-step debug. And query trace is widely used in Velox to do correctness verification and performance profiling, so I think we need to enable it in Gluten. and with this optimization #8380 (comment), we could drop the ValueStreamNode to ValusNode replace. @zhztheplayer

As discussed with @xiaoxmeng and @jinchengchenghh offline, we could add simple ValueStreamNode serde methods that serialize only id and output type and create a ValueStreamNode with empty ResultIterator. Because we do not need to trace the ValueStreamNode.

@jinchengchenghh
Copy link
Contributor Author

I tried the serialize ValueStreamNode as empty serialization, but failed by deserialization.

The plan node deserialization is registered in velox, but ValueStreamNode exists in gluten, so we cannot do that.
https://github.com/facebookincubator/velox/blob/main/velox/core/PlanNode.cpp#L2716

root@sr249:/mnt/DP_disk1/code/velox/build/velox/tool/trace# ./velox_query_replayer  --root_dir /tmp/query_trace --task_id Gluten_Stage_0_TID_0_VTID_0 --query_id=query_1 --node_id=7 --summary
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0117 20:08:46.359496 2923199 HiveConnector.cpp:56] Hive connector test-hive created with maximum of 20000 cached file handles.
terminate called after throwing an instance of 'facebook::velox::VeloxUserError'
  what():  Exception: VeloxUserError
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Deserialization function for class: ValueStreamNode is not registered
Retriable: False
Expression: registry.Has(name)
Function: deserialize
File: /mnt/DP_disk1/code/velox/./velox/common/serialization/Serializable.h
Line: 196
Stack trace:
Stack trace has been disabled. Use --velox_exception_user_stacktrace_enabled=true to enable it

@jinchengchenghh
Copy link
Contributor Author

Can you help review again? Thanks! @zhztheplayer

Copy link
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits. Thank you!

Comment on lines 1 to 6
---
layout: page
title: How To Use Gluten
nav_order: 1
parent: Developer Overview
---
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jinchengchenghh

nit: Would you update title (and perhaps other fields as well) in the header? Thanks.

.createWithDefault(false)

val QUERY_TRACE_DIR = buildConf("spark.gluten.sql.columnar.backend.velox.queryTraceDir")
.doc("Base dir of a query to store tracing data.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if one leaves this empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the config is not set correctly, Velox will throw exception, so as other config.

***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
W0120 15:09:59.768762 3040717 GenericBenchmark.cc:253] Setting CPU for thread 0 to 0
terminate called after throwing an instance of 'facebook::velox::VeloxUserError'
  what():  Exception: VeloxUserError
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Query trace enabled but the trace dir is not set
Retriable: False
Expression: !queryConfig.queryTraceDir().empty()
Function: maybeMakeTraceConfig
File: /mnt/DP_disk1/code/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Task.cpp
Line: 3027
Stack trace:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we set it with default /tmp/query_trace?
BTW, if we want expose the config to user, we need remove the internal()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This config is only used in internal, user will not set it by this config name, because we need to enable query trace in benchmark, so user set it by FLAGS_XX in benchmark

Comment on lines 54 to 61
std::string getQueryId(const std::unordered_map<std::string, std::string>& confMap) {
auto it = confMap.find(kQueryTraceQueryId);
if (it != confMap.end()) {
return it->second;
}
return "";
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we uses the applicationId_stageId as the queryId?

Do we have applicationId passed through JNI? Or could just align it with the task name somehow.

Copy link

Run Gluten ClickHouse CI on ARM

Copy link

Run Gluten ClickHouse CI on ARM

@jinchengchenghh jinchengchenghh merged commit 57fa103 into apache:main Jan 22, 2025
49 of 50 checks passed
@FelixYBW
Copy link
Contributor

Thank you, Chengcheng for your effort!

baibaichen pushed a commit to baibaichen/gluten that referenced this pull request Feb 1, 2025
Run the MicroBenchmark to generate stage level plan and then enable query trace in benchmark to profile node level query. Benchmark with query trace enabled replaces ValueStreamNode which is hard to serialize to ValuesNode. This issue may be fixed by plan serialization optimization that only serializes the plan node to profile in velox query trace. facebookincubator/velox#12084
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CORE works for Gluten Core DOCS VELOX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants