Merge dev into main
Signed-off-by: spark-rapids automation <[email protected]>
nvauto committed Oct 18, 2023
2 parents 4b43fe4 + 568adf8 commit 2d13259
Showing 123 changed files with 3,018 additions and 451 deletions.
4 changes: 4 additions & 0 deletions .pylintrc
@@ -58,6 +58,10 @@ disable=
too-many-arguments,
# R0914: Too many local variables
too-many-locals,
# R0801: Similar lines in 2 files
duplicate-code,
# R0401: Cyclic import
cyclic-import,
# R0912: Too many branches
too-many-branches,
useless-object-inheritance,
27 changes: 15 additions & 12 deletions core/docs/spark-qualification-tool.md
@@ -23,15 +23,18 @@ Spark resources.
The estimations for GPU duration are available for different environments and are based on benchmarks run in the
applicable environments. Here is the cluster information for the ETL benchmarks used for the estimates:

-| Environment      | CPU Cluster       | GPU Cluster                    |
-|------------------|-------------------|--------------------------------|
-| On-prem          | 8x 128-core       | 8x 128-core + 8x A100 40 GB    |
-| Dataproc (T4)    | 4x n1-standard-32 | 4x n1-standard-32 + 8x T4 16GB |
-| Dataproc (L4)    | 8x n1-standard-16 | 8x g2-standard-16              |
-| EMR (T4)         | 8x m5d.8xlarge    | 4x g4dn.12xlarge               |
-| EMR (A10)        | 8x m5d.8xlarge    | 8x g5.8xlarge                  |
-| Databricks AWS   | 8x m6gd.8xlarge   | 8x g5.8xlarge                  |
-| Databricks Azure | 8x E8ds_v4        | 8x NC8as_T4_v3                 |
+| Environment              | CPU Cluster       | GPU Cluster                    |
+|--------------------------|-------------------|--------------------------------|
+| On-prem                  | 8x 128-core       | 8x 128-core + 8x A100 40 GB    |
+| Dataproc (T4)            | 4x n1-standard-32 | 4x n1-standard-32 + 8x T4 16GB |
+| Dataproc (L4)            | 8x n1-standard-16 | 8x g2-standard-16              |
+| Dataproc Serverless (L4) | 8x 16 cores       | 8x 16 cores + 8x L4 24GB       |
+| Dataproc GKE (T4)        | 8x n1-standard-32 | 8x n1-standard-32 + 8x T4 16GB |
+| Dataproc GKE (L4)        | 8x n1-standard-32 | 8x n1-standard-32 + 8x L4 24GB |
+| EMR (T4)                 | 8x m5d.8xlarge    | 4x g4dn.12xlarge               |
+| EMR (A10)                | 8x m5d.8xlarge    | 8x g5.8xlarge                  |
+| Databricks AWS           | 8x m6gd.8xlarge   | 8x g5.8xlarge                  |
+| Databricks Azure         | 8x E8ds_v4        | 8x NC8as_T4_v3                 |

Note that all benchmarks were run using the [NDS benchmark](https://github.com/NVIDIA/spark-rapids-benchmarks/tree/dev/nds) at SF3K (3 TB).

@@ -247,9 +250,9 @@ Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
-p, --per-sql Report at the individual SQL query level.
--platform <arg> Cluster platform where Spark CPU workloads were
executed. Options include onprem, dataproc-t4,
-dataproc-l4, emr-t4, emr-a10, databricks-aws, and
-databricks-azure.
-Default is onprem.
+dataproc-l4, dataproc-serverless-l4, dataproc-gke-t4,
+dataproc-gke-l4, emr-t4, emr-a10, databricks-aws,
+and databricks-azure. Default is onprem.
-r, --report-read-schema Whether to output the read formats and
datatypes to the CSV file. This can be very
long. Default is false.
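
As an illustration of the `--platform` option described in the usage text above, the following minimal Python sketch validates a platform string against the ten values the updated help text lists and falls back to `onprem` when none is given. The names `SUPPORTED_PLATFORMS` and `resolve_platform` are hypothetical and are not taken from the tool's source; the actual tool parses this option in its own code, which is not shown in this commit.

```python
# Hypothetical helper: not part of the spark-rapids-tools code base.
# The platform names are copied from the updated --platform help text.
SUPPORTED_PLATFORMS = (
    "onprem",
    "dataproc-t4",
    "dataproc-l4",
    "dataproc-serverless-l4",
    "dataproc-gke-t4",
    "dataproc-gke-l4",
    "emr-t4",
    "emr-a10",
    "databricks-aws",
    "databricks-azure",
)

# The help text states that onprem is the default.
DEFAULT_PLATFORM = "onprem"


def resolve_platform(platform=None):
    """Return a validated platform name, defaulting to onprem."""
    name = platform if platform else DEFAULT_PLATFORM
    if name not in SUPPORTED_PLATFORMS:
        options = ", ".join(SUPPORTED_PLATFORMS)
        raise ValueError(f"Unsupported platform '{name}'. Options: {options}")
    return name
```

Rejecting unknown values early, rather than silently falling back to the default, is a design choice of this sketch and may not match how the tool itself behaves.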
2 changes: 1 addition & 1 deletion core/pom.xml
@@ -23,7 +23,7 @@
<artifactId>rapids-4-spark-tools_2.12</artifactId>
<name>RAPIDS Accelerator for Apache Spark tools</name>
<description>RAPIDS Accelerator for Apache Spark tools</description>
-<version>23.08.1</version>
+<version>23.08.2-SNAPSHOT</version>
<packaging>jar</packaging>
<url>http://github.com/NVIDIA/spark-rapids-tools</url>

6 changes: 6 additions & 0 deletions core/src/main/resources/operatorsScore-databricks-aws.csv
@@ -75,6 +75,7 @@ CollectSet,2.45
Concat,2.45
ConcatWs,2.45
Contains,2.45
Conv,2.45
Cos,2.45
Cosh,2.45
Cot,2.45
@@ -93,6 +94,7 @@ DayOfWeek,2.45
DayOfYear,2.45
DenseRank,2.45
Divide,2.45
DynamicPruningExpression,2.45
ElementAt,2.45
EndsWith,2.45
EqualNullSafe,2.45
@@ -101,6 +103,7 @@ Exp,2.45
Explode,2.45
Expm1,2.45
First,2.45
Flatten,2.45
Floor,2.45
FromUTCTimestamp,2.45
FromUnixTime,2.45
@@ -128,6 +131,8 @@ IntegralDivide,2.45
IsNaN,2.45
IsNotNull,2.45
IsNull,2.45
JsonToStructs,2.45
JsonTuple,2.45
KnownFloatingPointNormalized,2.45
KnownNotNull,2.45
Lag,2.45
@@ -248,6 +253,7 @@ VarianceSamp,2.45
WeekDay,2.45
WindowExpression,2.45
WindowSpecDefinition,2.45
XxHash64,2.45
Year,2.45
AggregateInPandasExec,1.2
ArrowEvalPythonExec,1.2
6 changes: 6 additions & 0 deletions core/src/main/resources/operatorsScore-databricks-azure.csv
@@ -75,6 +75,7 @@ CollectSet,2.73
Concat,2.73
ConcatWs,2.73
Contains,2.73
Conv,2.73
Cos,2.73
Cosh,2.73
Cot,2.73
@@ -93,6 +94,7 @@ DayOfWeek,2.73
DayOfYear,2.73
DenseRank,2.73
Divide,2.73
DynamicPruningExpression,2.73
ElementAt,2.73
EndsWith,2.73
EqualNullSafe,2.73
@@ -101,6 +103,7 @@ Exp,2.73
Explode,2.73
Expm1,2.73
First,2.73
Flatten,2.73
Floor,2.73
FromUTCTimestamp,2.73
FromUnixTime,2.73
@@ -128,6 +131,8 @@ IntegralDivide,2.73
IsNaN,2.73
IsNotNull,2.73
IsNull,2.73
JsonToStructs,2.73
JsonTuple,2.73
KnownFloatingPointNormalized,2.73
KnownNotNull,2.73
Lag,2.73
@@ -248,6 +253,7 @@ VarianceSamp,2.73
WeekDay,2.73
WindowExpression,2.73
WindowSpecDefinition,2.73
XxHash64,2.73
Year,2.73
AggregateInPandasExec,1.2
ArrowEvalPythonExec,1.2
256 changes: 256 additions & 0 deletions core/src/main/resources/operatorsScore-dataproc-gke-l4.csv
@@ -0,0 +1,256 @@
CPUOperator,Score
CoalesceExec,3.74
CollectLimitExec,3.74
ExpandExec,4.07
FileSourceScanExec,2.65
FilterExec,3.52
GenerateExec,3.74
GlobalLimitExec,3.74
LocalLimitExec,3.74
ProjectExec,3.74
RangeExec,3.74
SampleExec,3.74
SortExec,3.74
TakeOrderedAndProjectExec,3.74
HashAggregateExec,4.29
ObjectHashAggregateExec,4.29
SortAggregateExec,4.29
DataWritingCommandExec,3.74
ExecutedCommandExec,3.74
BatchScanExec,2.65
ShuffleExchangeExec,3.2
BroadcastHashJoinExec,3.4
BroadcastNestedLoopJoinExec,1.62
CartesianProductExec,3.74
ShuffledHashJoinExec,3.74
SortMergeJoinExec,5.1
WindowExec,3.74
Abs,3.74
Acos,3.74
Acosh,3.74
Add,3.74
AggregateExpression,3.74
Alias,3.74
And,3.74
ApproximatePercentile,3.74
ArrayContains,3.74
ArrayExcept,3.74
ArrayExists,3.74
ArrayIntersect,3.74
ArrayMax,3.74
ArrayMin,3.74
ArrayRemove,3.74
ArrayRepeat,3.74
ArrayTransform,3.74
ArrayUnion,3.74
ArraysOverlap,3.74
ArraysZip,3.74
Asin,3.74
Asinh,3.74
AtLeastNNonNulls,3.74
Atan,3.74
Atanh,3.74
AttributeReference,3.74
Average,3.74
BRound,3.74
BitLength,3.74
BitwiseAnd,3.74
BitwiseNot,3.74
BitwiseOr,3.74
BitwiseXor,3.74
CaseWhen,3.74
Cbrt,3.74
Ceil,3.74
CheckOverflow,3.74
Coalesce,3.74
CollectList,3.74
CollectSet,3.74
Concat,3.74
ConcatWs,3.74
Contains,3.74
Conv,3.74
Cos,3.74
Cosh,3.74
Cot,3.74
Count,3.74
CreateArray,3.74
CreateMap,3.74
CreateNamedStruct,3.74
CurrentRow$,3.74
DateAdd,3.74
DateAddInterval,3.74
DateDiff,3.74
DateFormatClass,3.74
DateSub,3.74
DayOfMonth,3.74
DayOfWeek,3.74
DayOfYear,3.74
DenseRank,3.74
Divide,3.74
DynamicPruningExpression,3.74
ElementAt,3.74
EndsWith,3.74
EqualNullSafe,3.74
EqualTo,3.74
Exp,3.74
Explode,3.74
Expm1,3.74
First,3.74
Flatten,3.74
Floor,3.74
FromUTCTimestamp,3.74
FromUnixTime,3.74
GetArrayItem,3.74
GetArrayStructFields,3.74
GetJsonObject,3.74
GetMapValue,3.74
GetStructField,3.74
GetTimestamp,3.74
GreaterThan,3.74
GreaterThanOrEqual,3.74
Greatest,3.74
HiveGenericUDF,3.74
HiveSimpleUDF,3.74
Hour,3.74
Hypot,3.74
If,3.74
In,3.74
InSet,3.74
InitCap,3.74
InputFileBlockLength,3.74
InputFileBlockStart,3.74
InputFileName,3.74
IntegralDivide,3.74
IsNaN,3.74
IsNotNull,3.74
IsNull,3.74
JsonToStructs,3.74
JsonTuple,3.74
KnownFloatingPointNormalized,3.74
KnownNotNull,3.74
Lag,3.74
LambdaFunction,3.74
Last,3.74
LastDay,3.74
Lead,3.74
Least,3.74
Length,3.74
LessThan,3.74
LessThanOrEqual,3.74
Like,3.74
Literal,3.74
Log,3.74
Log10,3.74
Log1p,3.74
Log2,3.74
Logarithm,3.74
Lower,3.74
MakeDecimal,3.74
MapConcat,3.74
MapEntries,3.74
MapFilter,3.74
MapKeys,3.74
MapValues,3.74
Max,3.74
Md5,3.74
MicrosToTimestamp,3.74
MillisToTimestamp,3.74
Min,3.74
Minute,3.74
MonotonicallyIncreasingID,3.74
Month,3.74
Multiply,3.74
Murmur3Hash,3.74
NaNvl,3.74
NamedLambdaVariable,3.74
NormalizeNaNAndZero,3.74
Not,3.74
NthValue,3.74
OctetLength,3.74
Or,3.74
PercentRank,3.74
PivotFirst,3.74
Pmod,3.74
PosExplode,3.74
Pow,3.74
PreciseTimestampConversion,3.74
PromotePrecision,3.74
PythonUDF,3.74
Quarter,3.74
RLike,3.74
RaiseError,3.74
Rand,3.74
Rank,3.74
RegExpExtract,3.74
RegExpExtractAll,3.74
RegExpReplace,3.74
Remainder,3.74
ReplicateRows,3.74
Reverse,3.74
Rint,3.74
Round,3.74
RowNumber,3.74
ScalaUDF,3.74
ScalarSubquery,3.74
Second,3.74
SecondsToTimestamp,3.74
Sequence,3.74
ShiftLeft,3.74
ShiftRight,3.74
ShiftRightUnsigned,3.74
Signum,3.74
Sin,3.74
Sinh,3.74
Size,3.74
SortArray,3.74
SortOrder,3.74
SparkPartitionID,3.74
SpecifiedWindowFrame,3.74
Sqrt,3.74
StartsWith,3.74
StddevPop,3.74
StddevSamp,3.74
StringInstr,3.74
StringLPad,3.74
StringLocate,3.74
StringRPad,3.74
StringRepeat,3.74
StringReplace,3.74
StringSplit,3.74
StringToMap,3.74
StringTranslate,3.74
StringTrim,3.74
StringTrimLeft,3.74
StringTrimRight,3.74
Substring,3.74
SubstringIndex,3.74
Subtract,3.74
Sum,3.74
Tan,3.74
Tanh,3.74
TimeAdd,3.74
ToDegrees,3.74
ToRadians,3.74
ToUnixTimestamp,3.74
TransformKeys,3.74
TransformValues,3.74
UnaryMinus,3.74
UnaryPositive,3.74
UnboundedFollowing$,3.74
UnboundedPreceding$,3.74
UnixTimestamp,3.74
UnscaledValue,3.74
Upper,3.74
VariancePop,3.74
VarianceSamp,3.74
WeekDay,3.74
WindowExpression,3.74
WindowSpecDefinition,3.74
XxHash64,3.74
Year,3.74
AggregateInPandasExec,1.2
ArrowEvalPythonExec,1.2
FlatMapGroupsInPandasExec,1.2
FlatMapCoGroupsInPandasExec,1.2
MapInPandasExec,1.2
WindowInPandasExec,1.2
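
The operator score files above share a simple two-column layout (`CPUOperator,Score`). As a rough sketch of how such a file could be read, the Python snippet below parses a few rows copied from `operatorsScore-dataproc-gke-l4.csv` into a dictionary and looks up an operator's score. The function names and the fallback value of `1.0` for operators missing from the file are assumptions made for illustration; they do not describe how the qualification tool itself consumes these files.

```python
import csv
import io

# A few rows copied from operatorsScore-dataproc-gke-l4.csv in this commit.
SAMPLE_SCORES_CSV = """CPUOperator,Score
FilterExec,3.52
SortMergeJoinExec,5.1
BroadcastNestedLoopJoinExec,1.62
XxHash64,3.74
MapInPandasExec,1.2
"""


def load_operator_scores(csv_text):
    """Parse 'CPUOperator,Score' rows into a {operator: score} dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["CPUOperator"]: float(row["Score"]) for row in reader}


def score_for(scores, operator, default=1.0):
    """Look up an operator's score.

    The default of 1.0 for unknown operators is an assumption of this
    sketch, not documented behavior of the tool.
    """
    return scores.get(operator, default)


scores = load_operator_scores(SAMPLE_SCORES_CSV)
print(score_for(scores, "SortMergeJoinExec"))
print(score_for(scores, "SomeUnlistedExec"))
```

To read a full file from disk instead of the inline sample, the same `csv.DictReader` call can be given an open file handle in place of the `io.StringIO` wrapper.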