[GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode #5075

Merged (7 commits) on Mar 25, 2024

@kecookier kecookier commented Mar 21, 2024

What changes were proposed in this pull request?

  1. Remove driverUdfLibPath and retain only udfLibPath, since there is no need to differentiate between the driver and the executor. How the file is accessed is determined by whether --master=yarn is specified.

     For the reasoning behind this change, see issue #5074.

  2. The UDF is loaded by VeloxBackend::initUdf on both the driver and the executor. UdfResolver does not load the UDF again; it only retrieves the function signatures.

(Fixes: #5074)

How was this patch tested?

I tested with and without the --files/--archives arguments in local, yarn-client, and yarn-cluster modes.

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename the commit message and pull request title to the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@kecookier kecookier changed the title from "[VL] Fix UDF load error in yarn-cluster mode" to "[GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode" on Mar 21, 2024

#5074

@zhouyuan zhouyuan requested a review from marin-ma March 22, 2024 00:34
if (!canAccessSparkFiles) {
throw new IllegalArgumentException(
"On yarn-client mode, driver only accepts absolute paths, but got " + f)
val uri = Utils.resolveURI(f)
Contributor

It looks like f can also be a relative filename or a file tag without a scheme. Can this logic handle such a case?

e.g.

--files /path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so
--conf spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=libmyudf.so

Contributor Author

Yes, if f is libmyudf.so, Utils.resolveURI() will return file://${PWD}/libmyudf.so
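
To make the behavior described in this reply concrete, here is a minimal sketch using only the JDK. It is an approximation written for illustration, not Spark's actual Utils.resolveURI implementation; the object name ResolveUriSketch and the hdfs URI in the usage note are hypothetical.

```scala
import java.io.File
import java.net.{URI, URISyntaxException}

// Hypothetical helper, for illustration only (not Spark's Utils.resolveURI).
object ResolveUriSketch {
  // Keep any string that already carries a scheme; otherwise treat it as a
  // local path and resolve it against the current working directory.
  def resolve(path: String): URI = {
    val parsed =
      try Some(new URI(path))
      catch { case _: URISyntaxException => None }
    parsed
      .filter(_.getScheme != null)
      .getOrElse(new File(path).getAbsoluteFile.toURI)
  }
}
```

With this sketch, ResolveUriSketch.resolve("libmyudf.so") yields a file URI under the current working directory, while a value such as hdfs://namenode/udf/libmyudf.so is returned unchanged. A bare file name therefore looks "local" even when no such file exists in the working directory.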

Contributor

But file://${PWD}/libmyudf.so is not the expected path, right? It should be /path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so

Contributor Author

The call to Utils.resolveURI() is used to determine whether the file is local or remote.

In your example, libmyudf.so is a local file, and we will not use uri.path directly.

The --files argument copies this file to a different destination directory. When --master=yarn is specified, the file is copied to the working directory on all nodes (both the driver and executors). In local mode, the files are added using SparkContext.addFile, and they can then be accessed using SparkFiles.get(f).

Contributor

> In your example, libmyudf.so is a local file

That doesn't make sense to me. libmyudf.so here should refer to the file uploaded by --files, but it is being resolved as a path relative to the runtime directory.

> The call to Utils.resolveURI() is used to determine whether the file is local or remote.

It would be better to check whether it is a relative path first. Although it is resolved as a local file and passes the if condition, the resolved path does not even exist.

> When --master=yarn is specified, the file is copied to the working directory on all nodes (both the driver and executors).

Based on my previous experience, in yarn-client mode the files are copied to the executor containers and the AM container, but not to the driver node. In this case, if we only use the two configurations below

--files /path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so
--conf spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=libmyudf.so

the driver will fail to get the actual path for libmyudf.so, and that's the reason for adding spark.gluten.sql.columnar.backend.velox.driver.udfLibraryPaths
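
The point about deploy modes can be summarized in a small sketch. It only encodes what this thread states (plus the fact that in yarn-cluster mode the driver runs inside the AM container); the object, trait, and method names are hypothetical and do not exist in Spark or Gluten.

```scala
// Hypothetical summary of the discussion, for illustration only.
object DistFilesVisibility {
  sealed trait Mode
  case object Local extends Mode
  case object YarnClient extends Mode
  case object YarnCluster extends Mode

  // Can the driver reach a file shipped with --files by its bare name?
  def driverSeesDistributedFile(mode: Mode): Boolean = mode match {
    case Local       => true  // reachable through SparkFiles.get after SparkContext.addFile
    case YarnCluster => true  // the driver runs inside the AM container, which receives the files
    case YarnClient  => false // the driver stays on the submitting host; only AM and executors get the files
  }
}
```

Under these assumptions, yarn-client is the only mode where a bare libmyudf.so cannot be found by the driver, which is why the driver needs a path it can actually reach.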

Contributor Author

Thank you for your patient explanation, Rong!

I previously thought that under yarn-client mode, spark.yarn.dist.files would also copy the files to the driver. I got the AM and driver mixed up.

Let's fix this issue.

Contributor

I would suggest supporting two kinds of values for spark.gluten.sql.columnar.backend.velox.udfLibraryPaths:

  • relative path
  • URI

And make an absolute local path without a scheme, such as spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=/path/to/..., invalid.
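
One way to read this suggestion is as a three-way classification of each configured value. The sketch below assumes Unix-style paths and uses only the JDK; UdfPathKind and its members are hypothetical names, not part of Gluten.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical classifier for a single udfLibraryPaths entry, for illustration only.
object UdfPathKind {
  sealed trait Kind
  case object RelativePath extends Kind  // e.g. a file name shipped via --files
  case object UriWithScheme extends Kind // e.g. file://..., hdfs://...
  case object Rejected extends Kind      // absolute local path without a scheme

  def classify(value: String): Kind = {
    // A value that parses as a URI and has a scheme is accepted as a URI.
    val scheme = Try(new URI(value)).toOption.flatMap(u => Option(u.getScheme))
    if (scheme.isDefined) UriWithScheme
    else if (value.startsWith("/")) Rejected
    else RelativePath
  }
}
```

Rejecting scheme-less absolute paths removes the ambiguity discussed above: a relative name always refers to a distributed file, and anything else must say explicitly where it lives.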

Contributor Author

@marin-ma I have fixed the code according to the discussion above. Could you review it again?

@kecookier kecookier force-pushed the vec-udf branch 2 times, most recently from 4c3a092 to 76bde19 Compare March 23, 2024 14:06
@@ -82,7 +82,8 @@ class VeloxUdfSuiteLocal extends VeloxUdfSuite {
   override val master: String = "local[2]"
   override protected def sparkConf: SparkConf = {
     super.sparkConf
-      .set("spark.gluten.sql.columnar.backend.velox.udfLibraryPaths", udfLibPath)
+      .set("spark.files", udfLibPath)
+      .set("spark.gluten.sql.columnar.backend.velox.udfLibraryPaths", "libmyudf.so")
Contributor

Please do not hard-code "libmyudf.so" here. Use a helper function to extract the filename from udfLibPath. Note that udfLibPath can also be a comma-separated string with multiple paths.
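
A helper along the lines requested here could look like the following sketch. The object and method names are hypothetical, and the code actually merged in this PR may differ.

```scala
import java.io.File

// Hypothetical test helper, for illustration only.
object UdfLibNames {
  // Turn a comma-separated list of library paths into a comma-separated
  // list of bare file names, skipping empty entries.
  def fileNames(udfLibPath: String): String =
    udfLibPath
      .split(",")
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(p => new File(p).getName)
      .mkString(",")
}
```

In the test suite, the hard-coded string could then be replaced by something like UdfLibNames.fileNames(udfLibPath), which keeps working when udfLibPath lists several libraries.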

@marin-ma
Contributor

Could you also update the examples in the documentation to use either a relative path or a URI?

@marin-ma marin-ma left a comment

LGTM. Thanks!

@marin-ma marin-ma merged commit cb63b38 into apache:main Mar 25, 2024
28 checks passed
Successfully merging this pull request may close these issues.

[VL] UDF load failed in yarn-cluster mode