
[BUG] Fix issue 11790 #11792

Merged: 1 commit merged into NVIDIA:branch-24.12 on Nov 29, 2024

Conversation

@binmahone (Collaborator) commented on Nov 29, 2024:

This PR closes #11790 by fixing two randomly occurring exceptions.

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
@binmahone (Collaborator, Author) commented:

build

@binmahone self-assigned this on Nov 29, 2024
@pxLi added the "bug (Something isn't working)" label on Nov 29, 2024
@pxLi changed the title from "fix issue 11790" to "[BUG] Fix issue 11790" on Nov 29, 2024
@binmahone (Collaborator, Author) commented:

build

@binmahone merged commit aa2da41 into NVIDIA:branch-24.12 on Nov 29, 2024 (50 checks passed)
@binmahone (Collaborator, Author) commented:

@revans2 this is a follow-up fix for #11712. The nightly CI is completely broken, so I asked @GaryShen2008 and @firestarman for urgent approval and got it checked in as soon as possible. Please double-check this PR and let me know of any concerns.

```scala
nextLayerBuckets) || repartitionHappened
nextLayerBuckets
if (hashSeed + 7 > 200) {
  log.warn("Too many times of repartition, may hit a bug? Size for each batch in " +
```
@gerashegalov (Collaborator) commented:

Please file an issue for the bug and link it here.

@binmahone (Collaborator, Author) replied:

Hi @gerashegalov, there's no known bug here.

Collaborator replied:

I filed #11834. I agree that this is not a bug, but the check is not defensive either, and it requires tight coupling with the code that sets the hashSeed in order to work properly. I would like us to fix that.

@binmahone (Collaborator, Author) replied:

The decoupling should be fixed by #11816; please take a look.

"current bucket: " + bucket.map(_.sizeInBytes).mkString(", ") + " rows: " +
bucket.map(_.numRows()).mkString(", ") + " targetMergeBatchSize: "
+ targetMergeBatchSize)
ArrayBuffer.apply(bucket)
Collaborator commented:

nit: could use the syntactic sugar here and elsewhere:

Suggested change:

```diff
- ArrayBuffer.apply(bucket)
+ ArrayBuffer(bucket)
```
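
For context, `ArrayBuffer(bucket)` is sugar for `ArrayBuffer.apply(bucket)`: both invoke the companion object's `apply` method. A minimal standalone illustration (not from this PR):

```scala
import scala.collection.mutable.ArrayBuffer

object ApplySugarDemo extends App {
  // The compiler rewrites ArrayBuffer("bucket") to ArrayBuffer.apply("bucket"),
  // so both expressions build the same one-element buffer.
  val explicit = ArrayBuffer.apply("bucket")
  val sugared  = ArrayBuffer("bucket")
  assert(explicit == sugared)
}
```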

@binmahone (Collaborator, Author) replied:

done

Collaborator replied:

But I still see the ArrayBuffer.apply(bucket)???

Did you forget to check in your last changes before you merged this? Or is there a follow-on issue that you plan to file?

@abellina (Collaborator) commented on Dec 2, 2024:

It wasn't obvious to me that this now logs a warning instead of throwing an exception. @binmahone could you add some context in the description on why we need to stop throwing?

Could you also explain in the description why the repartition iterator's next call changed? (https://github.com/NVIDIA/spark-rapids/pull/11792/files#diff-db626a2b4f67ef3e74c3caea3364989ab2ab7cb440b6b087bda770fc4ea2c64cR1083)

Just so that in the future we can look at this PR and know at a glance what the bug was.
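
For context, the behavior change being asked about is roughly the following. This is a hedged reconstruction from the hunks quoted above, with hypothetical names and simplified control flow, not the verbatim spark-rapids code:

```scala
import scala.collection.mutable.ArrayBuffer
import org.slf4j.LoggerFactory

object RepartitionLimitSketch {
  private val log = LoggerFactory.getLogger(getClass)

  // `Batch` stands in for the real GPU batch type; the repartition path is elided.
  def repartitionOrGiveUp[Batch](hashSeed: Int, bucket: Batch): ArrayBuffer[Batch] = {
    if (hashSeed + 7 > 200) {
      // Before this PR: this path threw, failing the query outright.
      // After this PR: it only warns and hands the bucket back unchanged,
      // which is what raises the live-lock question in the review below.
      log.warn("Too many times of repartition, may hit a bug?")
      ArrayBuffer(bucket)
    } else {
      ??? // recursive repartition path elided
    }
  }
}
```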

@revans2 (Collaborator) left a comment:

I am also concerned, like @abellina, that your fix has actually turned this into a livelock. We need a hard limit of some kind, and we should decouple the hashSeed from that limit so it is simpler to understand what the limit is.

```diff
@@ -219,9 +219,6 @@ object AggregateUtils extends Logging {
   ): Boolean = {

     var repartitionHappened = false
-    if (hashSeed > 200) {
```
@revans2 (Collaborator) commented:

Could we please decouple the hashSeed from the number of re-partition passes being done? I think the code would be much more readable if we could say how many times we tried to re-partition the data instead of "too many times".
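
One possible shape for the decoupling being requested, sketched with hypothetical names (this is not the actual fix in #11816, just an illustration of an explicit attempt counter with a hard limit):

```scala
object RepartitionCounterSketch {
  // Hypothetical stubs standing in for the real aggregation logic.
  private def needsRepartition(): Boolean = false
  private def repartition(seed: Int): Unit = ()

  // An explicit, readable hard limit on repartition attempts.
  val maxRepartitionAttempts = 30

  def repartitionLoop(initialSeed: Int): Unit = {
    var attempts = 0
    var hashSeed = initialSeed
    while (needsRepartition() && attempts < maxRepartitionAttempts) {
      repartition(hashSeed)
      attempts += 1
      hashSeed += 7 // the seed bump stays an implementation detail
    }
    if (attempts >= maxRepartitionAttempts) {
      // Failing hard here also addresses the live-lock concern above.
      throw new IllegalStateException(
        s"Gave up after $attempts repartition attempts")
    }
  }
}
```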

@binmahone (Collaborator, Author) commented:

Hi @gerashegalov, @abellina, @revans2: please check whether https://github.com/NVIDIA/spark-rapids/pull/11816/files addresses all of your concerns.
