how to print out GPUOverrides #9112
Replies: 8 comments 4 replies
-
Hi @jackie71111, sorry for the late reply. There could be a number of reasons why you are not seeing the warning messages for the explain. If you have steps to reproduce the issue that would be very helpful. In the meantime I have some guesses that may explain how this is happening:
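For background (my note, not part of the original reply): the explain output discussed in this thread is controlled by the `spark.rapids.sql.explain` setting. A minimal `spark-defaults.conf` sketch, assuming the plugin jar is already on the classpath:

```properties
# NOT_ON_GPU logs only the operations that fall back to the CPU;
# ALL logs the placement decision for every operation.
spark.rapids.sql.explain=ALL
```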
-
1. All config is enabled, and I can see the WARN messages about RAPIDS being enabled in one notebook, so the env settings are the same.
-
Below is my Spark configuration:

```python
import os
from pyspark import SparkConf
from pyspark.sql import SparkSession

# You need to update with your real hardware resource
SPARK_MASTER_URL = os.getenv("SPARK_MASTER_URL", "local[16]")

# Common spark settings
conf = SparkConf()

# The tasks will run on GPU memory, so there is no need to set a high host memory
conf.set("spark.executor.memory", executorMem)

# The tasks will run on GPU cores, so there is no need to use many cpu cores
#conf.set("spark.executor.cores", 2)

# Plugin settings
#conf.set("spark.executor.resource.gpu.amount", "1")

# 2 tasks will run concurrently per GPU
conf.set("spark.rapids.sql.concurrentGpuTasks", concurrentGpuTasks)

# Pinned 8g host memory to transfer data between GPU and host memory
conf.set("spark.rapids.memory.pinnedPool.size", pinnedPoolSize)

# 16 tasks will run concurrently per executor, as we set spark.executor.cores=16
#conf.set("spark.task.resource.gpu.amount", gpuPerExecutor)

conf.set("spark.rapids.sql.enabled", "true")
conf.set("spark.eventLog.enabled", "true")

# Create spark session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```
-
If the churn demo using the exact same notebook setup (and env settings) works but your project code does not, that indicates the problem lies in the project code. If it's possible to create a small repro project that can be shared, or if you can share the eventlog from the problematic project, that would be awesome.

Here's another thing to check: is the project using the DataFrame or SQL APIs in Spark, or just RDDs? The RAPIDS Accelerator only accelerates applications that use the Spark DataFrame/SQL APIs (directly or indirectly). If you're not sure, check the Spark UI from the eventlog for the application. You should see one or more jobs on the Jobs tab (if the application is doing anything distributed with Spark). Assuming you see some jobs, there should be a "SQL / DataFrame" tab that shows at least one query. If there is no "SQL / DataFrame" tab, or that tab contains no queries, then the application is not using Spark's SQL / DataFrame API, and that would explain why there's no explain output when running the project.

If you do see queries in the Spark UI for your project but still no output, yet the churn demo on the same Spark env emits output, then that would be quite mysterious. It would be interesting to know whether you see any GPU operators replaced in the queries visible in the Spark SQL UI, and whether you see all the proper config settings under the Environment tab in the Spark UI.

Some extra notes from what I can see in the notebook setup:
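The SQL/DataFrame check described above can also be done without the UI by scanning the eventlog directly. A minimal sketch (my own illustration, not from the thread; the event name is the listener event Spark writes for each SQL/DataFrame execution, which is what backs the "SQL / DataFrame" tab):

```python
import json

def uses_sql_api(eventlog_lines):
    """Return True if a Spark eventlog contains any SQL/DataFrame executions.

    Spark eventlogs are JSON lines; each SQL/DataFrame query produces a
    SparkListenerSQLExecutionStart event. An RDD-only application never
    emits one, which matches an empty "SQL / DataFrame" tab in the UI.
    """
    for line in eventlog_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated/garbled lines
        if event.get("Event", "").endswith("SparkListenerSQLExecutionStart"):
            return True
    return False
```

Feeding it the uncompressed eventlog file line by line is enough to tell the two cases apart.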
-
There is a profiling of the SQL result. I found the CPU time is too high.
-
You could try to create a smaller version of the project that only executes a couple of SQL queries and still replicates the problem. That will generate a smaller eventlog (which you can also compress). This effort may also help create a small repro case, which would really accelerate our ability to diagnose the logging issue. If we can reproduce the logging issue on our end, it won't take long to figure out what's happening. It may also be interesting to start with the working churn demo and slowly add/replace pieces with the new project to see at what point the logging stops working. For example, if the project is added after the churn processing in the same application, does the logging still work up until the point the project processing occurs? What if the churn demo is placed after the project processing? If the logging works for the churn demo in the same Spark env as the project, the project must be doing something that suppresses the logging.
This is probably because a significant portion of these queries were not translated to GPU operations. Note that for each one it lists a potential problem of UDF, indicating there was a user-defined function that could not be translated to the GPU. Depending upon how expensive the UDF is to compute, how many rows were sent through it, and what other operations fell back along with it, a significant amount of time could be spent doing CPU processing. Seeing the eventlog for one of these queries would help, along with any RAPIDS Accelerator explain output as to why some operations were not placed on the GPU. The eventlog would still be useful without the explain output.

Note that even if all of the SQL operations of a query are translated to GPU-accelerated operations, the CPU time will likely not be zero. There are some phases of a typical query where the CPU is still significantly involved, such as distributed filesystem read/write (especially if there is TLS encryption to handle) and shuffle data compress/decompress and read/write. So a query that is heavy on data transfers, either via read/write or shuffle, will have a higher CPU time than one that is not.
-
Thanks for sharing the eventlog! I can see from the eventlog that the RAPIDS Accelerator is enabled and is operating on the queries; almost everything gets translated to GPU operations. So the good news is the RAPIDS Accelerator appears to be operating as expected.

Unfortunately, I cannot explain why the log messages are not being emitted in the driver log. The config settings look correct to me. Have you had any luck whittling down the project code to a minimal repro? If the logging is working with the churn demo, then I don't think the logging problem is an issue with the RAPIDS Accelerator. It's just using slf4j APIs for the logging, just like Spark, and in most cases is using the same logging setup. My best guess at this point is that something in the project is adjusting the logging setup, and that's why it works for churn but not in the project. If I could reproduce it locally, I would next hook up a debugger and put a breakpoint at https://github.com/NVIDIA/spark-rapids/blob/branch-23.10/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala#L4507 to verify that the RAPIDS Accelerator is trying to log, and then step into the logging call to see where the message is being dropped.
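To illustrate the guess above about the project adjusting the logging setup, here is an analogy using Python's stdlib `logging` (my own sketch; the plugin actually logs through slf4j/log4j, but the failure mode is the same): raising the level on an ancestor logger silently suppresses a library logger's WARN messages.

```python
import logging

# A library logger, analogous to the plugin's GpuOverrides logger.
lib_logger = logging.getLogger("rapids.GpuOverrides")

# With default settings, WARN-level messages would be emitted.
warn_before = lib_logger.isEnabledFor(logging.WARNING)

# Something a project might do while "tuning" its own log noise:
# raise the level on a parent logger in the hierarchy.
logging.getLogger("rapids").setLevel(logging.ERROR)

# The child's effective level is now ERROR, so its warnings vanish.
warn_after = lib_logger.isEnabledFor(logging.WARNING)
```

This is why it can work in one notebook (churn demo) and not another (the project) on the same cluster: the suppression lives in application code, not in the Spark or plugin config.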
-
Regarding suggestions for config tuning, I recommend checking out the tuning guide. You might get a performance boost by tuning spark.sql.files.maxPartitionBytes, for example. I recommend looking at the time spent in each node of the query plan(s) that are slower than expected to see where all the time is going. Another way to tackle this is to examine the stages view, see which stages are taking a long time, and see which operations are being performed in those stages (from the stage DAG view). That can help focus tuning efforts on the operations that are taking the most time. I'm assuming the eventlog that was posted is just a toy example, as almost all queries execute in under a second. Note that for such quick queries, the overhead of Spark starts to be a significant factor.

Regarding the qualification tool, it is only an estimate and not a guarantee of performance. Some queries will be mispredicted to some extent, since not all details are available in the eventlog. However, we're always working to improve the accuracy of the qualification tool, and if you have a repro case we can analyze, that would be great. Ideally we would be able to see the CPU eventlog fed to the qualification tool and the eventlog of the GPU run for the same query showing the lower-than-predicted performance. cc: @mattahrens
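As a concrete sketch of the kind of tuning mentioned above, in `spark-defaults.conf` form (the values are illustrative assumptions on my part, not recommendations from the thread):

```properties
# Larger input partitions often keep a GPU busier than the CPU default;
# 512m is only an illustrative starting point.
spark.sql.files.maxPartitionBytes=512m
# Another tuning-guide knob already present in the notebook config above.
spark.rapids.sql.concurrentGpuTasks=2
```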
-
Using the same settings, the demo can print out GpuOverrides messages like the picture below.
But in our project, those messages can't be printed out.