
[BUG] Fix issue 11790 #11792

Merged: 1 commit merged into NVIDIA:branch-24.12 on Nov 29, 2024

Conversation

@binmahone (Collaborator) commented on Nov 29, 2024:

This PR closes #11790 by fixing two randomly occurring exceptions.

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
@binmahone (Collaborator, Author) commented:

build

@binmahone self-assigned this on Nov 29, 2024
@pxLi added the "bug (Something isn't working)" label on Nov 29, 2024
@pxLi changed the title from "fix issue 11790" to "[BUG] Fix issue 11790" on Nov 29, 2024
@binmahone (Collaborator, Author) commented:

build

@binmahone merged commit aa2da41 into NVIDIA:branch-24.12 on Nov 29, 2024 (50 checks passed)
@binmahone (Collaborator, Author) commented:

@revans2 this is a follow-up fix for #11712. The nightly CI is completely broken, so I asked @GaryShen2008 and @firestarman for urgent approval and got it checked in as soon as possible. Please double-check this PR and let me know of any concerns.

```scala
nextLayerBuckets) || repartitionHappened
nextLayerBuckets
if (hashSeed + 7 > 200) {
  log.warn("Too many times of repartition, may hit a bug? Size for each batch in " +
```
@gerashegalov (Collaborator) commented:

Please file an issue for the bug and link it here.

@binmahone (Collaborator, Author) replied:

Hi @gerashegalov, there's no known bug here.

Collaborator replied:

I filed #11834. I agree that this is not a bug, but the check is not defensive either, and it requires tight coupling with the code that sets the hashSeed in order to work properly. I would like us to fix that.

@binmahone (Collaborator, Author) replied:

The decoupling should be fixed by #11816; please take a look.

"current bucket: " + bucket.map(_.sizeInBytes).mkString(", ") + " rows: " +
bucket.map(_.numRows()).mkString(", ") + " targetMergeBatchSize: "
+ targetMergeBatchSize)
ArrayBuffer.apply(bucket)
Collaborator commented:

nit: could use the syntactic sugar here and elsewhere:

Suggested change:

```diff
- ArrayBuffer.apply(bucket)
+ ArrayBuffer(bucket)
```
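
For context, `ArrayBuffer(bucket)` is sugar for `ArrayBuffer.apply(bucket)`: both invoke the companion object's `apply` method. A minimal standalone illustration (not from this PR):

```scala
import scala.collection.mutable.ArrayBuffer

object ApplySugarDemo extends App {
  // The compiler rewrites ArrayBuffer("bucket") to ArrayBuffer.apply("bucket"),
  // so both expressions build the same one-element buffer.
  val explicit = ArrayBuffer.apply("bucket")
  val sugared  = ArrayBuffer("bucket")
  assert(explicit == sugared)
}
```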

@binmahone (Collaborator, Author) replied:

done

Collaborator replied:

But I still see the ArrayBuffer.apply(bucket)???

Did you forget to check in your last changes before you merged this? Or is there a follow-on issue that you plan to file?

@abellina (Collaborator) commented on Dec 2, 2024:

It wasn't obvious to me that this now logs a warning instead of throwing an exception. @binmahone could you add some context in the description on why we need to stop throwing?

Could you also explain in the description why the repartition iterator's next call changed? (https://github.com/NVIDIA/spark-rapids/pull/11792/files#diff-db626a2b4f67ef3e74c3caea3364989ab2ab7cb440b6b087bda770fc4ea2c64cR1083)

Just so that in the future we can look at this PR and know at a glance what the bug was.
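
For context, the behavior change being asked about is roughly the following. This is a hedged reconstruction from the hunks quoted above, with hypothetical names and simplified control flow, not the verbatim spark-rapids code:

```scala
import scala.collection.mutable.ArrayBuffer
import org.slf4j.LoggerFactory

object RepartitionLimitSketch {
  private val log = LoggerFactory.getLogger(getClass)

  // `Batch` stands in for the real GPU batch type; the repartition path is elided.
  def repartitionOrGiveUp[Batch](hashSeed: Int, bucket: Batch): ArrayBuffer[Batch] = {
    if (hashSeed + 7 > 200) {
      // Before this PR: this path threw, failing the query outright.
      // After this PR: it only warns and hands the bucket back unchanged,
      // which is what raises the live-lock question in the review below.
      log.warn("Too many times of repartition, may hit a bug?")
      ArrayBuffer(bucket)
    } else {
      ??? // recursive repartition path elided
    }
  }
}
```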

@revans2 (Collaborator) left a comment:

I am also concerned, like @abellina, that your fix has actually turned this into a livelock. We need a hard limit of some kind, and we should decouple the hashSeed from that limit so it is simpler to understand what the limit is.

```diff
@@ -219,9 +219,6 @@ object AggregateUtils extends Logging {
   ): Boolean = {

     var repartitionHappened = false
-    if (hashSeed > 200) {
```
@revans2 (Collaborator) commented:

Could we please decouple the hashSeed from the number of re-partition passes being done? I think the code would be much more readable if we could say how many times we tried to re-partition the data instead of "too many times".
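
One possible shape for the decoupling being requested, sketched with hypothetical names (this is not the actual fix in #11816, just an illustration of an explicit attempt counter with a hard limit):

```scala
object RepartitionCounterSketch {
  // Hypothetical stubs standing in for the real aggregation logic.
  private def needsRepartition(): Boolean = false
  private def repartition(seed: Int): Unit = ()

  // An explicit, readable hard limit on repartition attempts.
  val maxRepartitionAttempts = 30

  def repartitionLoop(initialSeed: Int): Unit = {
    var attempts = 0
    var hashSeed = initialSeed
    while (needsRepartition() && attempts < maxRepartitionAttempts) {
      repartition(hashSeed)
      attempts += 1
      hashSeed += 7 // the seed bump stays an implementation detail
    }
    if (attempts >= maxRepartitionAttempts) {
      // Failing hard here also addresses the live-lock concern above.
      throw new IllegalStateException(
        s"Gave up after $attempts repartition attempts")
    }
  }
}
```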

@binmahone (Collaborator, Author) commented:

Hi @gerashegalov, @abellina, @revans2: please check whether https://github.com/NVIDIA/spark-rapids/pull/11816/files addresses all of your concerns.
