[BUG] Fix issue 11790 #11792
Conversation
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
build
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuAggregateExec.scala (resolved review threads)
build
@revans2 this is a follow-up fix for #11712. The nightly CI is all broken, so I asked for an urgent approval from @GaryShen2008 and @firestarman and got it checked in as soon as possible. Please double-check this PR and let me know about any concerns.
nextLayerBuckets) || repartitionHappened
nextLayerBuckets
if (hashSeed + 7 > 200) {
  log.warn("Too many times of repartition, may hit a bug? Size for each batch in " +
Please file an issue for the bug and link it here
Hi @gerashegalov, there's no known bug here.
I filed #11834. I agree that this is not a bug, but it is also not defensive, and the check requires tight coupling between the code that sets the hashSeed and this code to work properly. I would like us to fix that.
The decoupling should be fixed by #11816, please take a look.
"current bucket: " + bucket.map(_.sizeInBytes).mkString(", ") + " rows: " + | ||
bucket.map(_.numRows()).mkString(", ") + " targetMergeBatchSize: " | ||
+ targetMergeBatchSize) | ||
ArrayBuffer.apply(bucket) |
nit: could use the syntactic sugar here and elsewhere:
ArrayBuffer.apply(bucket)
ArrayBuffer(bucket)
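As an aside on the suggestion above: in Scala, Foo(args) is syntactic sugar for Foo.apply(args) on the companion object, so the two forms are equivalent; the longer one just spells out the apply call. A minimal, self-contained illustration:

    import scala.collection.mutable.ArrayBuffer

    val explicit = ArrayBuffer.apply(1, 2, 3) // spells out the companion object's apply
    val sugared  = ArrayBuffer(1, 2, 3)       // desugars to ArrayBuffer.apply(1, 2, 3)
    assert(explicit == sugared)               // same contents, same type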
done
But I still see the ArrayBuffer.apply(bucket)???
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuAggregateExec.scala
Line 285 in 738c8e3
ArrayBuffer.apply(bucket)
Did you forget to check in your last changes before you merged this? Or is there a follow-on issue that you plan on doing?
It wasn't obvious to me that this is now logging a warning and not throwing an exception. @binmahone could you add some context in the description on why we need to stop throwing? Could you also explain in the description why the repartition iterator's next call changed? (https://github.com/NVIDIA/spark-rapids/pull/11792/files#diff-db626a2b4f67ef3e74c3caea3364989ab2ab7cb440b6b087bda770fc4ea2c64cR1083) Just so that in the future we can look at this PR and know at a glance what the bug was.
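For readers coming to this thread later, a minimal sketch of the behavioral change being asked about, pieced together from the diff hunks quoted above. The method name, the Batch stand-in type, and the splitWithSeed helper are assumptions for illustration, not the actual code in GpuAggregateExec.scala:

    import scala.collection.mutable.ArrayBuffer

    object WarnInsteadOfThrowSketch {
      // Stand-in for a spillable batch; only the fields used by the warning message.
      case class Batch(sizeInBytes: Long, numRows: Long)

      // Hypothetical helper: split a bucket into smaller buckets using the next seed.
      private def splitWithSeed(bucket: ArrayBuffer[Batch], seed: Int): ArrayBuffer[ArrayBuffer[Batch]] =
        ArrayBuffer(bucket.take(bucket.size / 2), bucket.drop(bucket.size / 2))

      def repartition(bucket: ArrayBuffer[Batch], hashSeed: Int): ArrayBuffer[ArrayBuffer[Batch]] = {
        if (hashSeed + 7 > 200) {
          // New behavior (per the diff): warn and hand the bucket back unsplit so the
          // caller can keep going, instead of (previously) throwing and failing the task.
          println("WARN: Too many times of repartition, may hit a bug? Size for each batch in " +
            "current bucket: " + bucket.map(_.sizeInBytes).mkString(", "))
          ArrayBuffer(bucket)
        } else {
          splitWithSeed(bucket, hashSeed + 7)
        }
      }
    }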
I am also concerned, like @abellina, that your fix has actually turned this into a live lock. We need a hard limit of some kind, and we should decouple the hashSeed from that limit so it is simpler to understand what that limit is.
@@ -219,9 +219,6 @@ object AggregateUtils extends Logging {
): Boolean = {

var repartitionHappened = false
if (hashSeed > 200) {
Could we please decouple the hashSeed from the number of re-partition attempts? I think the code would be much more readable if we could say how many times we tried to re-partition the data instead of "too many times".
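A minimal sketch, assuming nothing about the actual fix in #11816, of what decoupling the attempt counter from the hash seed could look like; all names (maxRepartitionAttempts, repartitionAttempts, tryRepartition) are made up for illustration:

    object RepartitionLimitSketch {
      // Hypothetical: count repartition attempts explicitly and put a hard limit on
      // that count, while the hash seed advances independently for each new pass.
      val maxRepartitionAttempts = 10
      var repartitionAttempts = 0
      var hashSeed = 42

      def tryRepartition(): Boolean = {
        if (repartitionAttempts >= maxRepartitionAttempts) {
          // Hard limit reached: the message can report an exact attempt count
          // instead of "too many times".
          println(s"WARN: giving up after $repartitionAttempts repartition attempts")
          false
        } else {
          repartitionAttempts += 1
          hashSeed += 7 // next seed; no longer used to decide when to stop
          true
        }
      }
    }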
Hi @gerashegalov @abellina @revans2, please check whether https://github.com/NVIDIA/spark-rapids/pull/11816/files addresses all of your concerns.
This PR closes #11790 by fixing two randomly occurring exceptions.