
Introduce hybrid (CPU) scan for Parquet read #11720

Open · wants to merge 23 commits into base: branch-25.02
Conversation

@res-life (Collaborator) commented Nov 13, 2024

Introduce hybrid (CPU) scan for Parquet read
This PR leverages Gluten/Velox to perform the Parquet scan on the CPU.

The hybrid feature consists of:

  • Gluten repo: in the internal GitLab repo gluten-public
  • Hybrid MR: in the internal GitLab repo rapids-hybrid-execution, branch 1.2
  • This Spark-Rapids PR

This PR

Add Shims

Builds for all shims: 320-324, 330-334, 340-344, 350-353, CDH, and Databricks; throws a runtime error if the runtime is CDH or Databricks.

Checks

  • In the Hybrid MR: the Gluten bundle version
  • Scala version is 2.12
  • Java version is 1.8
  • In the Hybrid MR: the architecture is amd64 and the OS is Ubuntu 20.04 or 22.04
  • Spark is not Databricks or CDH
  • The Hybrid jar is on the classpath if Hybrid is enabled
  • The scan still runs properly when the Hybrid jar is not on the classpath and Hybrid is disabled

Calls into the Hybrid JNI to do the Parquet scan

Limitations

Supports more Spark versions than Gluten officially supports.

The official Gluten doc says only Spark 3.2.2, 3.3.1, 3.4.2, and 3.5.1 are supported. Those four versions pass all UTs (for the supported data types). Hybrid supports 19 Spark versions in total (320-324, 330-334, 340-344, 350-353); the doc for the HYBRID_PARQUET_READER config notes that versions other than the ones Gluten officially supports are not fully tested.

Tests

config          | jars exist?                     | result | comment
Hybrid enabled  | Hybrid/Gluten jars exist        | pass   |
Hybrid enabled  | Hybrid/Gluten jars do not exist | pass   | reports that the jar is not in the classpath
Hybrid disabled | Hybrid/Gluten jars exist        | pass   | no error reported
Hybrid disabled | Hybrid/Gluten jars do not exist | pass   | no error reported
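
For context, a minimal sketch of how the reader is toggled per session. The config key below is the one quoted in this PR's error messages; the path is made up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// Enable the hybrid CPU Parquet reader (key taken from this PR's error messages).
spark.conf.set("spark.rapids.sql.parquet.useHybridReader", "true")
// A Parquet scan issued now goes through the Gluten/Velox CPU path.
spark.read.parquet("/data/example_table").count() // hypothetical path
```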

Signed-off-by: sperlingxx [email protected]
Signed-off-by: Chong Gao [email protected]

@res-life (Collaborator, Author) commented Nov 13, 2024

It's a draft and may be missing some code changes; I will double-check later.
This cannot pass the build yet, because the Gluten backends-velox 1.2.0 jar is not deployed to a public Maven repo by the Gluten community.
The build will pass if the Gluten jars are installed locally via mvn install.

@res-life res-life requested review from jlowe and sperlingxx November 14, 2024 01:13
@jlowe (Member) left a comment


Please elaborate in the headline and description what this PR is doing. C2C is not a well-known acronym in the project and is not very descriptive.

@sameerz sameerz added the performance A performance related task/issue label Nov 16, 2024
@revans2 (Collaborator) left a comment


Just a quick look at the code. Nothing too in depth.

@res-life res-life changed the base branch from branch-24.12 to branch-25.02 November 25, 2024 09:53
@res-life res-life marked this pull request as ready for review November 25, 2024 10:25
@res-life (Collaborator, Author) commented

Passed IT. Tested both the conventional Spark-Rapids jar and the regular Spark-Rapids jar.
Passed the NDS test.
Will address the review comments later.
Will push commits to make an uber jar for all Spark versions.

@revans2 (Collaborator) left a comment


I need to do some manual testing on my own to try and understand what is happening here and how this is all working. It may take a while.

sql-plugin/pom.xml
case MapType(kt, vt, _) if kt.isInstanceOf[MapType] || vt.isInstanceOf[MapType] => false
// For the time being, BinaryType is not supported yet
case _: BinaryType => false
case _ => true
(Collaborator)

facebookincubator/velox#9560 — I am not an expert, and I don't even know what version of Velox we will end up using; it sounds like it is pluggable. But according to this, even the latest version of Velox cannot handle bytes/TINYINT. We are also not checking for spaces in column names, among other issues. I know that other implementations fall back for even more things. Should we be concerned about this?

(Collaborator, Author)

Gluten uses another Velox repo (code link):

VELOX_REPO=https://github.com/oap-project/velox.git
VELOX_BRANCH=gluten-1.2.1

(Collaborator)

This will be something we should remember once we switch to using facebookincubator/velox directly.

(Collaborator)

My main concern is that if the gluten/velox version we use is pluggable, then we need to have some clear documentation on exactly which version you need to be based off of.

@res-life res-life marked this pull request as draft November 26, 2024 00:59
@winningsix winningsix changed the title Merge C2C code to main Introduce hybrid (CPU) scan for Parquet read Nov 26, 2024
@res-life res-life marked this pull request as ready for review December 11, 2024 08:50
@res-life (Collaborator, Author) commented

build

@res-life (Collaborator, Author) commented

Depends on deploying the Hybrid 25.02 jar into the Maven repo. @NvTimLiu

@res-life (Collaborator, Author) commented

build

Comment on lines +187 to +189
"Hybrid jar is not in the classpath, Please add Hybrid jar into the class path, or " +
"Please disable Hybrid feature by setting " +
"spark.rapids.sql.parquet.useHybridReader=false")
(Collaborator)

Wrong exception message.

(Collaborator, Author)

I didn't get the point; could you provide the message?

(Collaborator)

It's in checkJavaVersion. Shouldn't the message be related to the Java version? I think you copied the code from another place but forgot to modify it.

Comment on lines 166 to 173
try {
Class.forName(HYBRID_JAR_PLUGIN_CLASS_NAME)
} catch {
case e: ClassNotFoundException => throw new RuntimeException(
"Hybrid jar is not in the classpath, Please add Hybrid jar into the class path, or " +
"Please disable Hybrid feature by setting " +
"spark.rapids.sql.parquet.useHybridReader=false", e)
}
(Collaborator)

I think this way of checking the class only works on the driver side.
Do we need to check on the executor side as well?

(Collaborator, Author)

> I think this way of checking the class only works on the driver side.

Yes.

> Do we need to check on the executor side as well?

Yes. Will check.
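
A minimal sketch of what the executor-side counterpart could look like, assuming the same HYBRID_JAR_PLUGIN_CLASS_NAME constant used above and that the check gets invoked from the executor plugin's initialization (the object and method names here are illustrative):

```scala
// Hypothetical sketch: run the same classpath probe on the executor so a
// missing Hybrid jar fails fast on both sides.
object HybridJarChecks {
  // Assumed to be the same constant referenced by the driver-side check.
  private val HYBRID_JAR_PLUGIN_CLASS_NAME = "..." // elided; defined elsewhere in this PR

  def checkHybridJarInClassPath(): Unit = {
    try {
      Class.forName(HYBRID_JAR_PLUGIN_CLASS_NAME)
    } catch {
      case e: ClassNotFoundException => throw new RuntimeException(
        "Hybrid jar is not in the classpath. Please add the Hybrid jar to the " +
        "classpath, or disable the Hybrid feature by setting " +
        "spark.rapids.sql.parquet.useHybridReader=false", e)
    }
  }
}
// Calling this from both the driver plugin's init and the executor plugin's
// init would cover the executor-side gap discussed above.
```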

@revans2 (Collaborator) left a comment


Mostly just more questions for me to understand what is happening. This looks a lot better. I assume a lot of the code that is very picky about getting the exact setup right is here just because that is what this code has been tested with.

integration_tests/src/main/python/parquet_test.py
# MapGen(StringGen(pattern='key_[0-9]', nullable=False), simple_string_to_string_map_gen)
],
]

(Collaborator)

Can we add some tests to validate that predicate push-down and filtering work correctly? It would be nice to have:

  1. simple filters
  2. complex filters that are not supported by normal Parquet predicate push-down (like ORs at the top level instead of ANDs)
  3. filters that contain operators that Velox does not support but spark-rapids does (a hedged sketch of these three shapes follows below)
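
A hedged illustration of those three filter shapes as DataFrame queries (the real tests would live in the Python integration suite; the path, column names, and the choice of regexp_extract as an "unsupported" operator are all assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = spark.read.parquet("/tmp/parquet_data") // hypothetical path

// 1. A simple filter that ordinary Parquet predicate push-down handles.
df.filter($"a" > 0).collect()

// 2. A top-level OR, which ordinary Parquet push-down does not handle.
df.filter($"a" > 0 || $"b" === "x").collect()

// 3. A filter containing an operator the native backend may not support but
//    spark-rapids does (the exact unsupported operator set is backend-specific).
df.filter(regexp_extract($"b", "key_(\\d)", 1) === "1").collect()
```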

@res-life (Collaborator, Author) commented Dec 16, 2024

Discussed internally before; the decision is to put this into a follow-up PR.

(Collaborator, Author)

Follow-up issue filed: #11892


lazy val allSupportedTypes = fsse.requiredSchema.exists { field =>
TrampolineUtil.dataTypeExistsRecursively(field.dataType, {
// For the time being, the native backend may return incorrect results over nestedMap
case MapType(kt, vt, _) if kt.isInstanceOf[MapType] || vt.isInstanceOf[MapType] => false
(Collaborator)

What if it is a MapType, but kt or vt is not directly a map: say a LIST of MAP, or a STRUCT with a MAP in it? Do we know the cause of this error so that we can limit things properly? If not, I would rather just allow a MAP at the top level and disallow any nested maps.

Also, internally in Parquet a MAP is just a LIST<STRUCT<KEY, VALUE>>; if the data is such a LIST, would we have similar issues when one of them is nested?

@sperlingxx (Collaborator) commented Dec 17, 2024

Hi @revans2, I am sorry that I did not carefully audit which types are unsupported by the native backend. I just ran a rather comprehensive test:

hybrid_gens_test = [
     # failed
    [decimal_gen_32bit_neg_scale],
    [decimal_gen_128bit],
    decimal_64_map_gens,
    [MapGen(TimestampGen(nullable=False), ArrayGen(string_gen))],
    [MapGen(RepeatSeqGen(IntegerGen(nullable=False), 10), TimestampGen())],
    [MapGen(RepeatSeqGen(IntegerGen(nullable=False), 10), decimal_gen_32bit)],
    [MapGen(RepeatSeqGen(IntegerGen(nullable=False), 10), decimal_gen_64bit)],
    # failed
    [MapGen(StringGen(pattern='key_[0-9]', nullable=False), decimal_gen_128bit)],
    [MapGen(StringGen(pattern='key_[0-9]', nullable=False), ArrayGen(string_gen))],
    [MapGen(StringGen(pattern='key_[0-9]', nullable=False), ArrayGen(ArrayGen(long_gen)))],
    [MapGen(StringGen(pattern='key_[0-9]', nullable=False), ArrayGen(ArrayGen(string_gen)))],
    [MapGen(StringGen(pattern='key_[0-9]', nullable=False),
            StructGen([['child0', string_gen],
                       ['child1', double_gen],
                       ['child2', int_gen],
                       ['child3', StructGen([['child0', ArrayGen(byte_gen)],
                                             ['child1', byte_gen],
                                             ['child2', float_gen],
                                             ['child3', decimal_gen_64bit]])]]))
     ],
    [MapGen(StringGen(pattern='key_[0-9]', nullable=False),
            StructGen([['child0', ArrayGen(ArrayGen(long_gen))],
                       ['child1', ArrayGen(string_gen)],
                       ['child2', ArrayGen(ArrayGen(string_gen))]]))
     ],
    [MapGen(StringGen(pattern='key_[0-9]', nullable=False),
            ArrayGen(MapGen(LongGen(nullable=False), long_gen)))],
    [MapGen(StringGen(pattern='key_[0-9]', nullable=False),
            ArrayGen(MapGen(IntegerGen(nullable=False), string_gen)))],
    [MapGen(StringGen(pattern='key_[0-9]', nullable=False),
            ArrayGen(ArrayGen(MapGen(IntegerGen(nullable=False), string_gen))))],

    [ArrayGen(ArrayGen(string_gen))],
    [ArrayGen(ArrayGen(long_gen))],
    # failed
    [ArrayGen(MapGen(LongGen(nullable=False), long_gen))],
    # failed
    [ArrayGen(MapGen(StringGen(pattern='key_[0-9]', nullable=False), long_gen))],

     # failed
    [MapGen(StringGen(pattern='key_[0-9]', nullable=False), MapGen(LongGen(nullable=False), long_gen))],
    [MapGen(StringGen(pattern='key_[0-9]', nullable=False), MapGen(LongGen(nullable=False), string_gen))],
    [MapGen(StringGen(pattern='key_[0-9]', nullable=False), simple_string_to_string_map_gen)],
     # failed
    [StructGen([['child0', MapGen(LongGen(nullable=False), long_gen)],
                ['child1', MapGen(StringGen(pattern='key_[0-9]', nullable=False), long_gen)],
                ['child2', MapGen(IntegerGen(nullable=False), decimal_gen_64bit)],
                ['child3', StructGen([["cc", MapGen(IntegerGen(nullable=False), decimal_gen_32bit)]])]
                ]),
     ],
    [StructGen([['cc', MapGen(IntegerGen(nullable=False), decimal_gen_64bit)]])],
     # failed
    [StructGen([['cc', ArrayGen(MapGen(IntegerGen(nullable=False), string_gen))]])],
    [StructGen([['cc', ArrayGen(ArrayGen(MapGen(IntegerGen(nullable=False), string_gen)))]])],
]

The test results suggest the unsupported types are:

  1. Decimal with negative scale is NOT supported
  2. Decimal128 inside nested types is NOT supported
  3. BinaryType is NOT supported
  4. MapType inside nested types (Struct of Map/Array of Map/Map of Map) is NOT fully supported

I reworked the typeCheck function and the integration tests according to the new findings. A sketch of the resulting rule set follows below.
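
A minimal sketch of a recursive check expressing findings (1)-(4); this illustrates the rule set, not the exact typeCheck code in the PR (the function name and the precision-18 cutoff for Decimal128 are assumptions):

```scala
import org.apache.spark.sql.types._

// Reject: (1) negative-scale decimals, (2) Decimal128 inside nested types,
// (3) BinaryType anywhere, (4) MapType nested inside other complex types.
def supportedByNativeBackend(dt: DataType, nested: Boolean = false): Boolean = dt match {
  case d: DecimalType if d.scale < 0 => false                 // (1)
  case d: DecimalType if nested && d.precision > 18 => false  // (2) precision > 18 needs 128 bits
  case _: BinaryType => false                                 // (3)
  case m: MapType =>                                          // (4) maps only at the top level
    !nested &&
      supportedByNativeBackend(m.keyType, nested = true) &&
      supportedByNativeBackend(m.valueType, nested = true)
  case a: ArrayType => supportedByNativeBackend(a.elementType, nested = true)
  case s: StructType => s.fields.forall(f => supportedByNativeBackend(f.dataType, nested = true))
  case _ => true
}
```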

if (javaVersion == null) {
throw new RuntimeException("Hybrid feature: Can not read java.version, get null")
}
if (!javaVersion.startsWith("1.8")) {
(Collaborator)

Why does it only work with java 1.8? Newer versions are supposed to be backwards compatible.

(Collaborator, Author)

Will test other Java versions.
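
If newer JVMs turn out to work, a hedged sketch of how the check could accept 8+ rather than pinning 1.8 (java.version is "1.8.0_x" on Java 8 and "11.0.x"/"17.0.x" on newer JVMs; this is a suggestion, not code from the PR):

```scala
// Hypothetical relaxation: derive the major version and require >= 8.
private def javaMajorVersion(): Int = {
  val v = System.getProperty("java.version")
  if (v == null) {
    throw new RuntimeException("Hybrid feature: can not read java.version, got null")
  }
  // "1.8.0_392" -> 8, "11.0.21" -> 11, "17" -> 17
  if (v.startsWith("1.")) v.split('.')(1).toInt
  else v.takeWhile(_.isDigit).toInt
}
```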

*/
private def checkScalaVersion(): Unit = {
val scalaVersion = scala.util.Properties.versionString
if (!scalaVersion.startsWith("version 2.12")) {
(Collaborator)

We already have shims and a separate jar for Scala 2.13. @gerashegalov is there a way for us to have Scala 2.13-specific code that would just fail instead of doing a check like this? (one possible shape is sketched below)
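
One possible shape, assuming the build already supports Scala-version-specific source directories (the directory layout and object name here are illustrative, not an existing mechanism in this PR):

```scala
// Hypothetical src/main/scala-2.13 variant: the 2.13 build fails the feature at
// the call site instead of string-matching scala.util.Properties.versionString.
object HybridBackendSupport {
  def assertSupported(): Unit = throw new UnsupportedOperationException(
    "The hybrid Parquet scan currently requires the Scala 2.12 build")
}

// The src/main/scala-2.12 variant would be a no-op:
// object HybridBackendSupport { def assertSupported(): Unit = () }
```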

(fsse, conf, p, r) => {
// TODO: HybridScan supports DataSourceV2
if (HybridFileSourceScanExecMeta.useHybridScan(conf, fsse)) {
// Check if runtimes are satisfied: Spark is not Databricks or CDH; Java version is 1.8;
(Collaborator)

Why not databricks or CDH? Is it just that we have not tested with these yet?

(Collaborator, Author)

Yes, because we have not tested with CDH and Databricks.

(Collaborator, Author)

We currently do not have customers using CDH or Databricks, and we did not test performance on them.

(Collaborator)

We are not confident that Hybrid fully supports Databricks Spark, so for the first version we chose not to support Databricks.

@@ -2895,6 +2912,10 @@ class RapidsConf(conf: Map[String, String]) extends Logging {

lazy val avroDebugDumpAlways: Boolean = get(AVRO_DEBUG_DUMP_ALWAYS)

lazy val useHybridParquetReader: Boolean = get(HYBRID_PARQUET_READER)

lazy val loadHybridBackend: Boolean = get(LOAD_HYBRID_BACKEND)
(Collaborator)

Right now, the only use is to require it to be true when useHybridParquetReader is true.
Where is the code that checks this config and then loads the backend?

(Collaborator, Author)

LOAD_HYBRID_BACKEND is a startup config, while HYBRID_PARQUET_READER is not. Users can set LOAD_HYBRID_BACKEND to true at startup, then enable/disable HYBRID_PARQUET_READER at runtime on the fly. This is more flexible (sketched below).
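
A sketch of that split, assuming an active SparkSession named spark; the exact key string for LOAD_HYBRID_BACKEND is not shown in this diff, so it is left elided:

```scala
// Startup (spark-defaults or --conf): set the LOAD_HYBRID_BACKEND key to true so
// the backend is loaded once during plugin initialization. Key string elided here.

// Runtime: toggle the reader per query without restarting the application.
spark.conf.set("spark.rapids.sql.parquet.useHybridReader", "true")
spark.read.parquet("/data/t").count()   // hypothetical path; uses the hybrid CPU scan
spark.conf.set("spark.rapids.sql.parquet.useHybridReader", "false")
spark.read.parquet("/data/t").count()   // back to the regular GPU scan
```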

@GaryShen2008 (Collaborator) commented Dec 18, 2024

Should we have some code that checks LOAD_HYBRID_BACKEND and then tries to load the jar when initializing the driver and executor plugins?

@revans2 (Collaborator) left a comment


My issues have pretty much all been addressed and my questions answered. I do want to see a follow-on issue filed for #11720 (comment).

I also want to understand the plan for documentation. I get that this is still very early and the configs are all marked as internal so I am okay with where it is at right now. I am not going to approve it yet because I want to hear from others on this too.

@GaryShen2008 (Collaborator) commented

As discussed with Chong, we also need a doc describing how to build the Gluten/Velox jar for external users who want to try it.
