Qualification Tool

The Qualification tool analyzes Spark events generated from CPU-based Spark applications to help quantify the expected acceleration of migrating a Spark application to GPU.

The tool first analyzes the CPU event log and determines which operators are likely to run on the GPU.
The tool then uses data from historical queries and benchmarks to estimate a speed-up for each individual operator, that is, how much that operator would accelerate on GPU for the specific application.
It calculates an "Estimated GPU App Duration" by adding up the accelerated operator durations along with durations that could not run on GPU because they are unsupported operators or not SQL/Dataframe.
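As a rough illustration of the arithmetic described above, the sketch below divides the duration of each supported operator by a speed-up factor and then adds the time that stays on the CPU. The operator names, durations, and factors are made-up values, and dividing a duration by its factor is an assumption about how an "accelerated operator duration" is obtained; the tool's actual model may differ.

```python
# Hypothetical per-operator data:
# (name, CPU duration in ms, speed-up factor, supported on GPU?)
operators = [
    ("Scan parquet", 4000.0, 2.0, True),
    ("HashAggregate", 3000.0, 3.0, True),
    ("SomeUnsupportedExec", 1000.0, 1.0, False),  # stays on the CPU
]
non_sql_duration_ms = 2000.0  # time outside SQL/Dataframe operations (assumed value)


def estimate_gpu_duration(ops, non_sql_ms):
    """Sum accelerated durations for supported operators and unchanged
    durations for everything that cannot run on the GPU."""
    total = non_sql_ms
    for _name, cpu_ms, factor, supported in ops:
        total += cpu_ms / factor if supported else cpu_ms
    return total


cpu_duration_ms = non_sql_duration_ms + sum(op[1] for op in operators)
gpu_duration_ms = estimate_gpu_duration(operators, non_sql_duration_ms)
# Speed-up factor: original CPU duration divided by the estimated GPU duration.
speedup = cpu_duration_ms / gpu_duration_ms

print(f"CPU: {cpu_duration_ms:.0f} ms, estimated GPU: {gpu_duration_ms:.0f} ms, "
      f"speed-up: {speedup:.2f}x")
```

In this toy case the unsupported operator and the non-SQL time are carried over unchanged, which is why the overall speed-up is lower than any single operator's factor.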

This tool is intended to give the users a starting point and does not guarantee the applications with the highest recommendation will actually be accelerated the most. Currently, it reports by looking at the amount of time spent in tasks of SQL Dataframe operations.

Disclaimer!
Estimates provided by the Qualification tool are based on the currently supported "SparkPlan" or "Executor Nodes" used in the application. It currently does not look at the expressions or datatypes used.
Please refer to the Supported Operators guide to check that the types and expressions you are using are supported.

This document covers the following topics:

  • TOC {:toc}

How to use the Qualification tool

The Qualification tool can be run in two different ways. One is to run it as a standalone tool on the Spark event logs after the application(s) have run, and the other is to integrate it into a running Spark application.

Running the Qualification tool standalone on Spark event logs

Prerequisites

  • Java 8 or above, Spark 3.0.1+ jars.
  • Spark event log(s) from Spark 2.0 or above. Supports both rolled and compressed event logs with .lz4, .lzf, .snappy and .zstd suffixes as well as Databricks-specific rolled and compressed (.gz) event logs.
  • The tool does not support nested directories. Event log files or event log directories should be at the top level when specifying a directory.

Note: Spark event logs can be downloaded from Spark UI using a "Download" button on the right side, or can be found in the location specified by spark.eventLog.dir. See the Apache Spark Monitoring documentation for more information.

Step 1 Download the tools jar and Apache Spark 3 Distribution

The Qualification tool requires the Spark 3.x jars to be able to run but does not need an Apache Spark runtime. If you do not already have Spark 3.x installed, you can download the Spark distribution to any machine and include the jars in the classpath.

Step 2 Run the Qualification tool

  1. The Qualification tool reads the log files and processes them in memory, so the heap memory should be increased when processing a large volume of events. It is recommended to pass the VM option -Xmx10g and adjust it according to the number of apps and the size of the logs being processed.

     export QUALIFICATION_HEAP=-Xmx10g
    
  2. Event logs stored on a local machine:

    • Extract the Spark distribution into a local directory if necessary.
    • Either set SPARK_HOME to point to that directory or just put the path inside of the classpath java -cp toolsJar:pathToSparkJars/*:... when you run the Qualification tool.

    This tool parses the Spark CPU event log(s) and creates an output report. Acceptable inputs are individual event log files, multiple event log files, or directories containing Spark event logs in the local filesystem, HDFS, S3, or a mix of these.

    Usage: java ${QUALIFICATION_HEAP} \
             -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
             com.nvidia.spark.rapids.tool.qualification.QualificationMain [options] \
             <eventlogs | eventlog directories ...>
    Sample: java ${QUALIFICATION_HEAP} \
              -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
              com.nvidia.spark.rapids.tool.qualification.QualificationMain /usr/logs/app-name1
  3. Event logs stored on an on-premises HDFS cluster:

    Example running on files in HDFS: (include $HADOOP_CONF_DIR in classpath)

    Usage: java ${QUALIFICATION_HEAP} \
             -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
             com.nvidia.spark.rapids.tool.qualification.QualificationMain  /eventlogDir

    Note: on an HDFS cluster, the default filesystem is likely HDFS for both the input and output, so if you want to point to the local filesystem be sure to include file: in the path.

Qualification tool options

Note: --help should be before the trailing event logs.

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
 com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

RAPIDS Accelerator Qualification tool for Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
       <eventlogs | eventlog directories ...>

      --all                        Apply multiple event log filtering criteria
                                   and process only logs for which all
                                   conditions are satisfied.Example: <Filter1>
                                   <Filter2> <Filter3> --all -> result is
                                   <Filter1> AND <Filter2> AND <Filter3>.
                                   Default is all=true
      --any                        Apply multiple event log filtering criteria
                                   and process only logs for which any condition
                                   is satisfied.Example: <Filter1> <Filter2>
                                   <Filter3> --any -> result is <Filter1> OR
                                   <Filter2> OR <Filter3>
  -a, --application-name  <arg>    Filter event logs by application name. The
                                   string specified can be a regular expression,
                                   substring, or exact match. For filtering
                                   based on complement of application name, use
                                   ~APPLICATION_NAME. i.e Select all event logs
                                   except the ones which have application name
                                   as the input string.
  -f, --filter-criteria  <arg>     Filter newest or oldest N eventlogs based on
                                   application start timestamp, unique
                                   application name or filesystem timestamp.
                                   Filesystem based filtering happens before any
                                   application based filtering.For application
                                   based filtering, the order in which filters
                                   areapplied is: application-name,
                                   start-app-time, filter-criteria.Application
                                   based filter-criteria are:100-newest (for
                                   processing newest 100 event logs based on
                                   timestamp insidethe eventlog) i.e application
                                   start time)  100-oldest (for processing
                                   oldest 100 event logs based on timestamp
                                   insidethe eventlog) i.e application start
                                   time)  100-newest-per-app-name (select at
                                   most 100 newest log files for each unique
                                   application name) 100-oldest-per-app-name
                                   (select at most 100 oldest log files for each
                                   unique application name)Filesystem based
                                   filter criteria are:100-newest-filesystem
                                   (for processing newest 100 event logs based
                                   on filesystem timestamp).
                                   100-oldest-filesystem (for processing oldest
                                   100 event logsbased on filesystem timestamp).
  -h, --html-report                Default is to generate an HTML report.
      --no-html-report             Disables generating the HTML report.
  -m, --match-event-logs  <arg>    Filter event logs whose filenames contain the
                                   input string. Filesystem based filtering
                                   happens before any application based
                                   filtering.
  -n, --num-output-rows  <arg>     Number of output rows in the summary report.
                                   Default is 1000.
      --num-threads  <arg>         Number of thread to use for parallel
                                   processing. The default is the number of
                                   cores on host divided by 4.
      --order  <arg>               Specify the sort order of the report. desc or
                                   asc, desc is the default. desc (descending)
                                   would report applications most likely to be
                                   accelerated at the top and asc (ascending)
                                   would show the least likely to be accelerated
                                   at the top.
  -o, --output-directory  <arg>    Base output directory. Default is current
                                   directory for the default filesystem. The
                                   final output will go into a subdirectory
                                   called rapids_4_spark_qualification_output.
                                   It will overwrite any existing directory with
                                   the same name.
  -r, --report-read-schema         Whether to output the read formats and
                                   datatypes to the CSV file. This can be very
                                   long. Default is false.
      --spark-property  <arg>...   Filter applications based on certain Spark
                                   properties that were set during launch of the
                                   application. It can filter based on key:value
                                   pair or just based on keys. Multiple configs
                                   can be provided where the filtering is done
                                   if any of theconfig is present in the
                                   eventlog. filter on specific configuration:
                                   --spark-property=spark.eventLog.enabled:truefilter
                                   all eventlogs which has config:
                                   --spark-property=spark.driver.portMultiple
                                   configs:
                                   --spark-property=spark.eventLog.enabled:true
                                   --spark-property=spark.driver.port
  -s, --start-app-time  <arg>      Filter event logs whose application start
                                   occurred within the past specified time
                                   period. Valid time periods are
                                   min(minute),h(hours),d(days),w(weeks),m(months).
                                   If a period is not specified it defaults to
                                   days.
  -t, --timeout  <arg>             Maximum time in seconds to wait for the event
                                   logs to be processed. Default is 24 hours
                                   (86400 seconds) and must be greater than 3
                                   seconds. If it times out, it will report what
                                   it was able to process up until the timeout.
  -u, --user-name  <arg>           Applications which a particular user has
                                   submitted.
      --help                       Show help message

 trailing arguments:
  eventlog (required)   Event log filenames(space separated) or directories
                        containing event logs. eg: s3a://<BUCKET>/eventlog1
                        /path/to/eventlog2

Example commands:

  • Process the 10 newest logs, and only output the top 3 in the output:
java ${QUALIFICATION_HEAP} \
  -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 10-newest -n 3 /eventlogDir
  • Process last 100 days' logs:
java ${QUALIFICATION_HEAP} \
  -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain -s 100d /eventlogDir
  • Process only the newest log with the same application name:
java ${QUALIFICATION_HEAP} \
  -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 1-newest-per-app-name /eventlogDir

Note: The “regular expression” used by -a option is based on java.util.regex.Pattern.
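When the tool is launched from a script, the command can be assembled programmatically. The following is a minimal sketch, not part of the tool itself: the function name `build_qualification_command` and the `/opt/spark` path are hypothetical, and only flags listed in the help output above are used. It places the options before the trailing event log paths, as in the usage string.

```python
import shlex

MAIN_CLASS = "com.nvidia.spark.rapids.tool.qualification.QualificationMain"


def build_qualification_command(tools_jar, spark_home, event_logs,
                                options=None, heap="-Xmx10g"):
    """Return the argument list for a standalone Qualification tool run.

    Options are placed before the trailing event log paths, matching the
    usage string shown in the help output.
    """
    if not event_logs:
        raise ValueError("at least one event log or directory is required")
    classpath = f"{tools_jar}:{spark_home}/jars/*"
    return ["java", heap, "-cp", classpath, MAIN_CLASS,
            *(options or []), *event_logs]


# Placeholder jar name and paths; substitute the real ones.
cmd = build_qualification_command(
    tools_jar="rapids-4-spark-tools_2.12-<version>.jar",
    spark_home="/opt/spark",
    event_logs=["/eventlogDir"],
    options=["-f", "10-newest", "-n", "3"],
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list keeps the shell from expanding the `*` in the classpath, so the JVM receives the wildcard as written.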

The Qualification tool output

After the above command is executed, the summary report goes to STDOUT and by default it outputs log/CSV files under ./rapids_4_spark_qualification_output/ that contain the processed applications. The output will go into your default filesystem and it supports both local filesystem and HDFS. Note that if you are on an HDFS cluster the default filesystem is likely HDFS for both the input and output. If you want to point to the local filesystem be sure to include file: in the path.

The Qualification tool generates a brief summary on the STDOUT, which also gets saved as a text file. The detailed report of the processed apps is saved as a set of CSV files that can be used for post-processing. The CSV reports include the estimated performance if the app is run on the GPU for each of the following: app execution; stages; and execs.

Starting with release "22.06", the default is to generate the report in two formats: text files and HTML.

For information on the file contents and on processing the Qualification report and the recommendation, refer to the Understanding the Qualification tool output and Output Formats sections below.

Running the Qualification tool inside a running Spark application

Prerequisites

  • Java 8 or above, Spark 3.0.1+

Download the tools jar

Modify your application code to call the APIs

Currently only Scala APIs are supported.

Create the RunningQualificationApp:

val qualApp = new com.nvidia.spark.rapids.tool.qualification.RunningQualificationApp()

Get the event listener from it and install it as a Spark listener:

val listener = qualApp.getEventListener
spark.sparkContext.addSparkListener(listener)

Run your queries and then get the summary or detailed output to see the results.

The summary output API:

/**
 * Get the summary report for qualification.
 * @param delimiter The delimiter separating fields of the summary report.
 * @param prettyPrint Whether to include the separator at the start and end and
 *                    add spacing so the data rows align with column headings.
 * @return String containing the summary report.
 */
getSummary(delimiter: String = "|", prettyPrint: Boolean = true): String

The detailed output API:

/**
 * Get the detailed report for qualification.
 * @param delimiter The delimiter separating fields of the detailed report.
 * @param prettyPrint Whether to include the separator at the start and end and
 *                    add spacing so the data rows align with column headings.
 * @param reportReadSchema Whether to include the read formats and datatypes in the report.
 * @return String containing the detailed report.
 */
getDetailed(delimiter: String = "|", prettyPrint: Boolean = true, reportReadSchema: Boolean = false): String

Example:

// run your sql queries ...

// To get the summary output:
val summaryOutput = qualApp.getSummary()

// To get the detailed output:
val detailedOutput = qualApp.getDetailed()

// print the output somewhere for user to see
println(summaryOutput)
println(detailedOutput)

If you need to specify the tools jar as a maven dependency to compile the Spark application:

<dependency>
   <groupId>com.nvidia</groupId>
   <artifactId>rapids-4-spark-tools_2.12</artifactId>
   <version>${version}</version>
</dependency>

Run the Spark application

  • Run your Spark application, include the tools jar you downloaded using the Spark '--jars' option, and view the output wherever you had it printed.

For example, if running the spark-shell:

$SPARK_HOME/bin/spark-shell --jars rapids-4-spark-tools_2.12-<version>.jar

Understanding the Qualification tool output

For each processed Spark application, the Qualification tool generates two main fields to help quantify the expected acceleration of migrating a Spark application to GPU.

  1. Estimated GPU Duration: Predicted runtime of the app if it was run on GPU. It is the sum of the accelerated operator durations along with durations that could not run on GPU because they are unsupported operators or not SQL/Dataframe.
  2. Estimated Speed-up factor: The estimated speed-up factor is simply the original CPU duration of the app divided by the estimated GPU duration. That will estimate how much faster the application would run on GPU.

The lower the estimated GPU duration, the higher the "Estimated Speed-up". The processed applications are ranked by the "Estimated Speed-up". Based on how high the speed-up factor is, the tool classifies the applications into the following categories:

  • Strongly Recommended
  • Recommended
  • Not Recommended
  • Not Applicable: Indicates that the app has job or stage failures.

As mentioned before, the tool does not guarantee the applications with the highest recommendation will actually be accelerated the most. Please refer to Supported Operators section.

In addition to the recommendation, the Qualification tool reports a set of metrics on tasks of SQL Dataframe operations within the scope of: "Entire App"; "Stages"; and "Execs". The fields are described in detail in the Fields Interpretations section, and the output formats and their file locations are described in the Output Formats section.

Detailed App Report

The report represents the entire app execution, including unsupported operators and non-SQL operations.

  1. App Name
  2. App ID
  3. Recommendation: Recommendation based on Estimated Speed-up Factor, where an app can be "Strongly Recommended", "Recommended", "Not Recommended", or "Not Applicable". The latter indicates that the app has job or stage failures.
  4. App Duration: Wall-Clock time measured from when the application starts until it completes. If an app is not completed, an estimated completion time is computed.
  5. SQL DF duration: Wall-Clock time duration that includes only SQL-Dataframe queries.
  6. GPU Opportunity: Wall-Clock time that shows how much of the SQL duration can be accelerated on the GPU.
  7. Estimated GPU Duration: Predicted runtime of the app if it was run on GPU. It is the sum of the accelerated operator durations along with durations that could not run on GPU because they are unsupported operators or not SQL/Dataframe.
  8. Estimated GPU Speed-up: The speed-up factor is simply the original CPU duration of the app divided by the estimated GPU duration. That will estimate how much faster the application would run on GPU.
  9. Estimated GPU Time Saved: Estimated Wall-Clock time saved if it was run on the GPU.
  10. SQL Dataframe Task Duration: Amount of time spent in tasks of SQL Dataframe operations.
  11. Executor CPU Time Percent: This is an estimate of how much time the tasks spent doing processing on the CPU vs waiting on IO. This is not always a good indicator because sometimes the IO is encrypted and the CPU has to do work to decrypt it, so the environment you are running on needs to be taken into account.
  12. SQL Ids with Failures: SQL Ids of queries with failed jobs.
  13. Unsupported Read File Formats and Types: Looks at the Read Schema and reports the file formats along with types which may not be fully supported. Example: JDBC[*]. Note that this is based on the current version of the plugin and future versions may add support for more file formats and types.
  14. Unsupported Write Data Format: Reports the data format which we currently don’t support, i.e. if the result is written in JSON or CSV format.
  15. Complex Types: Looks at the Read Schema and reports if there are any complex types (array, struct, or map) in the schema.
  16. Nested Complex Types: Nested complex types are complex types which contain other complex types (Example: array<struct<string,string>>). Note that it can read all the schemas for DataSource V1. The Data Source V2 truncates the schema, so if you see ..., then the full schema is not available. For such schemas we read until ... and report if there are any complex types and nested complex types in that.
  17. Potential Problems: Some UDFs and nested complex types. Please keep in mind that the tool is only able to detect certain issues.
  18. Longest SQL Duration: The maximum amount of time spent in a single task of SQL Dataframe operations.
  19. NONSQL Task Duration Plus Overhead: Time duration that does not span any running SQL task.
  20. Unsupported Task Duration: Sum of task durations for any unsupported operators.
  21. Supported SQL DF Task Duration: Sum of task durations that are supported by RAPIDS GPU acceleration.
  22. Task Speedup Factor: The average speed-up of all stages.
  23. App Duration Estimated: True or False, indicating whether the application duration had to be estimated. True means the event log was missing the application finished event, so the last job or SQL execution time found is used as the end time to calculate the duration.
  24. Read Schema: Shows the datatypes and read formats. This field is only listed when the argument --report-read-schema is passed to the CLI.
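To make the "Complex Types" and "Nested Complex Types" fields more concrete, here is a small sketch of how a schema type string could be inspected. It is an illustration only, not the tool's implementation: the function name `inspect_schema_type` is hypothetical, and the logic simply reads the type string up to a truncation marker (`...`) and checks whether one complex type opens inside another.

```python
import re

# Complex type names as they appear in a Spark read schema string,
# plus the closing bracket so nesting depth can be tracked.
_COMPLEX_TOKEN = re.compile(r"\b(array|struct|map)<|>")


def inspect_schema_type(type_str):
    """Return (complex_types, has_nested, truncated) for one schema type string."""
    truncated = "..." in type_str
    # Only the part before "..." can be inspected when the schema is truncated.
    readable = type_str.split("...", 1)[0]

    complex_types = set()
    has_nested = False
    depth = 0  # number of complex types currently open
    for match in _COMPLEX_TOKEN.finditer(readable):
        if match.group(0) == ">":
            depth = max(depth - 1, 0)
        else:
            complex_types.add(match.group(1))
            if depth > 0:
                has_nested = True  # a complex type inside another one
            depth += 1
    return complex_types, has_nested, truncated


for schema in ("array<struct<string,string>>", "map<string,int>",
               "array<struct<string,..."):
    print(schema, "->", inspect_schema_type(schema))
```

The third input mimics a truncated Data Source V2 schema: everything before `...` is still inspected, and the `truncated` flag signals that the result may be incomplete.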

Note: the Qualification tool won't catch all UDFs, and some of the UDFs can be handled with additional steps. Please refer to Supported Operators for more details on UDF.

Stages report

For each stage used in SQL operations, the Qualification tool generates the following information:

  1. App ID
  2. Stage ID
  3. Average Speedup Factor: The average estimated speed-up of all the operators in the given stage.
  4. Stage Task Duration: Amount of time spent in tasks of SQL Dataframe operations for the given stage.
  5. Unsupported Task Duration: Sum of task durations for the unsupported operators. For more details, see Supported Operators.
  6. Stage Estimated: True or False indicates if we had to estimate the stage duration.

Note that this report is currently limited to the CSV file format and not supported in the HTML report.

Execs report

The Qualification tool generates a report of the "Exec" in the "SparkPlan" or "Executor Nodes" along with the estimated acceleration on the GPU. Please refer to the Supported Operators guide for more details on limitations on UDFs and unsupported operators.

  1. App ID
  2. SQL ID
  3. Exec Name: example Filter, HashAggregate
  4. Expression Name
  5. Task Speedup Factor: The average acceleration of the operators, based on the original CPU duration of the operator divided by the GPU duration. The tool uses historical queries and benchmarks to estimate a speed-up at an individual operator level to calculate how much a specific operator would accelerate on GPU.
  6. Exec Duration: Wall-Clock time measured from when the operator starts until it completes.
  7. SQL Node Id
  8. Exec Is Supported: Whether the Exec is supported by RAPIDS or not. Please refer to the Supported Operators section.
  9. Exec Stages: An array of stage IDs
  10. Exec Children
  11. Exec Children Node Ids
  12. Exec Should Remove

Note that this report is currently limited to the CSV file format and not supported in the HTML report.

Output Formats

The Qualification tool generates the output as CSV/log files. Starting with release "22.06", the default is to generate the report in two formats: CSV/log files and HTML.

HTML Report

Starting with release "22.06", the HTML report is generated by default under the output directory ${OUTPUT_FOLDER}/ui. The HTML report is disabled by passing --no-html-report as described in the Qualification tool options section above.
To browse the content of the HTML report:

  1. For HDFS or remote node, copy the directory of ${OUTPUT_FOLDER}/ui to your local node.
  2. Open ui/index.html in your local machine's web-browser (Chrome/Firefox are recommended).

The HTML view renders the detailed information into tables that support the following features:

  • searching
  • ordering by specific column
  • exporting table into CSV file
  • interactive filter by recommendations and/or user-name.

The following sections describe the HTML views.

Recommendations Summary

index.html shows the summary of the estimated GPU performance. The "GPU Recommendations Table" lists the processed applications ranked by the "Estimated GPU Speed-up" along with the ability to search, and filter the results.

  1. At the top of the page, the report shows a global summary of statistics, including: total number of apps analyzed; the number of apps that are recommended to run on the GPU; and the estimated time saved if the apps were run on the GPU.
  2. Filter panes with the capability to search the result table by selecting rows in the panes. The "Recommendations" and "Spark User" filters are cascaded which allows the panes to be filtered based on the values selected in the other pane.
  3. Text Search field that allows further filtering, removing data from the result set as keywords are entered. The search box will match on multiple columns including: "App ID", "App Name", "Recommendation".
  4. The Raw Data link in the left navigation bar redirects to a detailed report.
  5. HTML5 export button saves the table to CSV file named Qualification Tool Dashboard.csv into the browser's default download folder.

Qualification-HTML-Recommendation-View

Raw Data

raw.html displays all the fields listed in "Detailed App Report" in a more readable format. Columns representing time durations are rounded to the nearest "ms", "seconds", "minutes", and "hours".
The search box will match on multiple columns including: "App ID", "App Name", "Recommendation", "User Name", "Unsupported Write Data Format", "Complex Types", "Nested Complex Types", and "Read Schema". The detailed table can also be exported as Qualification Tool Dashboard – Raw Data.csv.

Text and CSV files

The Qualification tool generates a set of log/CSV files in the output folder ${OUTPUT_FOLDER}/rapids_4_spark_qualification_output. The content of each file is summarized in the following two sections.

Report Summary

The Qualification tool generates a brief summary that includes the projected application's performance if the application is run on the GPU. Besides sending the summary to STDOUT, the Qualification tool saves it as a text file named rapids_4_spark_qualification_output.log.

The summary report outputs the following information: App Name, App ID, App Duration, SQL DF duration, GPU Opportunity, Estimated GPU Duration, Estimated GPU Speed-up, Estimated GPU Time Saved, and Recommendation.

Note: the duration(s) reported are in milliseconds. Sample output in text:

+------------+--------------+----------+----------+-------------+-----------+-----------+-----------+--------------------+
|  App Name  |    App ID    |    App   |  SQL DF  |     GPU     | Estimated | Estimated | Estimated |  Recommendation    |
|            |              | Duration | Duration | Opportunity |    GPU    |    GPU    |    GPU    |                    |
|            |              |          |          |             |  Duration |  Speedup  |    Time   |                    |
|            |              |          |          |             |           |           |   Saved   |                    |
+============+==============+==========+==========+=============+===========+===========+===========+====================+
| appName-01 | app-ID-01-01 |    898429|    879422|       879422|  273911.92|       3.27|  624517.06|Strongly Recommended|
+------------+--------------+----------+----------+-------------+-----------+-----------+-----------+--------------------+
| appName-02 | app-ID-02-01 |      9684|      1353|         1353|    8890.09|       1.08|      793.9|     Not Recommended|
+------------+--------------+----------+----------+-------------+-----------+-----------+-----------+--------------------+

In the above example, two application event logs were analyzed. “app-ID-01-01” is "Strongly Recommended" because Estimated GPU Speedup is ~3.27. On the other hand, the estimated acceleration running “app-ID-02-01” on the GPU is not high enough; hence the app is not recommended.
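The relation between the columns can be checked by hand. The sketch below recomputes the speed-up as App Duration divided by Estimated GPU Duration, as defined earlier in this document, and assumes that Estimated GPU Time Saved is App Duration minus Estimated GPU Duration (this relation is an assumption, not stated explicitly here). The recomputed values differ from the table by at most a few hundredths, which is consistent with rounding or truncation of the displayed values.

```python
# Values copied from the sample summary table above (durations in milliseconds).
sample_rows = [
    # (app_id, app_duration, est_gpu_duration, reported_speedup, reported_time_saved)
    ("app-ID-01-01", 898429, 273911.92, 3.27, 624517.06),
    ("app-ID-02-01", 9684, 8890.09, 1.08, 793.9),
]

computed = {}
for app_id, app_ms, gpu_ms, reported_speedup, reported_saved in sample_rows:
    speedup = app_ms / gpu_ms      # CPU duration divided by estimated GPU duration
    time_saved = app_ms - gpu_ms   # assumed relation (not stated in the document)
    computed[app_id] = (speedup, time_saved)
    print(f"{app_id}: speed-up {speedup:.4f} (reported {reported_speedup}), "
          f"time saved {time_saved:.2f} ms (reported {reported_saved})")
```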

Detailed App Report

1. Entire App report

The first part of the detailed report is saved as rapids_4_spark_qualification_output.csv. The apps are processed and ranked by the Estimated GPU Speed-up. In addition to the fields listed in the "Report Summary", it shows all the app fields. The duration(s) are reported in milliseconds.
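Because the app report is a plain CSV file, it can be post-processed with standard tooling. The following is a minimal sketch that keeps only the recommended apps and sorts them by speed-up. The header names used here ("Recommendation", "Estimated GPU Speedup", and so on) are assumed from the field list in this document and may not match the real file exactly, so check the header row of your own output first. The function name `load_recommended` is hypothetical.

```python
import csv
import io

# Stand-in for rapids_4_spark_qualification_output.csv; the header names are
# assumed from the field list in this document and may differ in the real file.
sample_csv = """App Name,App ID,Recommendation,Estimated GPU Speedup,App Duration
appName-01,app-ID-01-01,Strongly Recommended,3.27,898429
appName-02,app-ID-02-01,Not Recommended,1.08,9684
"""


def load_recommended(csv_file, keep=("Strongly Recommended", "Recommended")):
    """Return rows whose recommendation is in `keep`, highest speed-up first."""
    rows = [r for r in csv.DictReader(csv_file) if r["Recommendation"] in keep]
    rows.sort(key=lambda r: float(r["Estimated GPU Speedup"]), reverse=True)
    return rows


recommended = load_recommended(io.StringIO(sample_csv))
for row in recommended:
    print(row["App ID"], row["Estimated GPU Speedup"], row["Recommendation"])
```

For a real run, open the report with `open(path, newline="")` and pass the file object to `load_recommended` instead of the in-memory sample.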

2. Stages report

The second file is saved as rapids_4_spark_qualification_output_stages.csv.

Sample output in text:

+--------------+----------+-----------------+------------+---------------+-----------+
|    App ID    | Stage ID | Average Speedup | Stage Task |  Unsupported  |   Stage   |
|              |          |      Factor     |  Duration  | Task Duration | Estimated |
+==============+==========+=================+============+===============+===========+
| app-ID-01-01 |       25 |             2.1 |         23 |             0 |     false |
+--------------+----------+-----------------+------------+---------------+-----------+
| app-ID-02-01 |       29 |            1.86 |          0 |             0 |      true |
+--------------+----------+-----------------+------------+---------------+-----------+

3. Execs report

The last file is saved as rapids_4_spark_qualification_output_execs.csv. Similar to the app and stage information, the table shows estimated GPU performance of the SQL Dataframe operations.

Sample output in text:

+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+
|    App ID    | SQL ID |         Exec Name         |    Expression Name    | Task Speedup |   Exec   | SQL Node |  Exec Is  |  Exec  |        Exec Children       | Exec Children | Exec Should |
|              |        |                           |                       |    Factor    | Duration |    Id    | Supported | Stages |                            |    Node Ids   |    Remove   |
+==============+========+===========================+=======================+==============+==========+==========+===========+========+============================+===============+=============+
| app-ID-02-01 |      7 | Execute CreateViewCommand |                       |          1.0 |        0 |        0 |     false |        |                            |               |       false |
+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+
| app-ID-02-01 |     24 |                   Project |                       |          2.0 |        0 |       21 |      true |        |                            |               |       false |
+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+
| app-ID-02-01 |     24 |              Scan parquet |                       |          2.0 |      260 |       36 |      true |     24 |                            |               |       false |
+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+
| app-ID-02-01 |     15 | Execute CreateViewCommand |                       |          1.0 |        0 |        0 |     false |        |                            |               |       false |
+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+
| app-ID-02-01 |     24 |                   Project |                       |          2.0 |        0 |       14 |      true |        |                            |               |       false |
+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+
| app-ID-02-01 |     24 |     WholeStageCodegen (6) | WholeStageCodegen (6) |          2.8 |      272 |        2 |      true |     30 | Project:BroadcastHashJoin: |         3:4:5 |       false |
|              |        |                           |                       |              |          |          |           |        |              HashAggregate |               |             |
+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+
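One practical use of the execs report is to list the operators that are flagged as unsupported. The sketch below groups unsupported Execs by name and sums their durations. It uses a few rows modeled on the sample table above; the header names are assumed to match the column titles shown there, and the function name `unsupported_execs` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# A few rows modeled on the sample table above; header names are assumed.
sample_execs_csv = """App ID,SQL ID,Exec Name,Exec Duration,Exec Is Supported
app-ID-02-01,7,Execute CreateViewCommand,0,false
app-ID-02-01,24,Project,0,true
app-ID-02-01,24,Scan parquet,260,true
app-ID-02-01,15,Execute CreateViewCommand,0,false
"""


def unsupported_execs(csv_file):
    """Map each unsupported Exec name to (occurrences, summed Exec Duration)."""
    summary = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(csv_file):
        if row["Exec Is Supported"].strip().lower() == "false":
            entry = summary[row["Exec Name"]]
            entry[0] += 1
            entry[1] += int(row["Exec Duration"])
    return {name: tuple(values) for name, values in summary.items()}


result = unsupported_execs(io.StringIO(sample_execs_csv))
print(result)
```

Sorting the result by summed duration would show which unsupported operators account for the most time and are worth checking against the Supported Operators guide first.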

How to compile the tools jar

Note: This step is optional.

git clone https://github.com/NVIDIA/spark-rapids.git
cd spark-rapids
mvn -Pdefault -pl .,tools clean verify -DskipTests

The jar is generated in the following directory:

./tools/target/rapids-4-spark-tools_2.12-<version>.jar

If any input is an S3 file path or directory path, two extra steps are needed to access S3 in Spark:

  1. Download the matching jars based on the Hadoop version:

    • hadoop-aws-<version>.jar
    • aws-java-sdk-<version>.jar

     Taking Hadoop 2.7.4 as an example, download the jars below and include them in the '--jars' option to spark-shell or spark-submit: hadoop-aws-2.7.4.jar and aws-java-sdk-1.7.4.jar

  2. In $SPARK_HOME/conf, create hdfs-site.xml with the AWS S3 keys below inside:

<?xml version="1.0"?>
<configuration>
<property>
  <name>fs.s3a.access.key</name>
  <value>xxx</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>xxx</value>
</property>
</configuration>

Please refer to this doc for more options on integrating the hadoop-aws module with S3.