[BUG] Decouple the retry logic from the seed in GpuAggregateExec and have a retry limit #11834

Open
revans2 opened this issue Dec 6, 2024 · 0 comments
Labels
task Work required that improves the product but is not user facing tech debt

revans2 commented Dec 6, 2024

Describe the bug
In GpuAggregateExec we can re-partition data if it is too large to fit on the GPU. But if we get unlucky and the hashes skew into too few buckets, we might need to partition the data again. Currently this is done by updating the hash seed and trying again.

Some recent changes (https://github.com/NVIDIA/spark-rapids/pull/11792/files) removed a limit on the number of repartitions that we can do. But the warning is printed out based on some cryptic code: if (hashSeed + 7 > 200).

We should have the hash seed be only a hash seed, so that it does not need to carry information about how many times a repartition has happened. We should also have a limit on the number of repartitions that we do, just so that if something bad happens we don't get into a livelock situation. That limit can be huge, like 20, and we can have a separate limit to log a warning, hopefully with more human-readable code.
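
Something like the following is roughly what I have in mind. This is only a sketch: every name in it (RepartitionState, decide, repartitionUntilFits, the two limits) is made up for illustration and is not taken from GpuAggregateExec, the warning threshold of 5 is an arbitrary assumption, and the seed step of 7 is only guessed from the condition quoted above. The `fits` callback stands in for "all buckets now fit on the GPU".

```scala
// Sketch only: all names and thresholds here are hypothetical, not spark-rapids code.
object RepartitionRetrySketch {
  val WarnAfterRepartitions = 5 // assumed threshold for logging a warning
  val MaxRepartitions = 20      // hard limit, "huge like 20" as suggested above

  // The seed is only a seed; the number of repartitions is tracked separately.
  final case class RepartitionState(hashSeed: Int, attempts: Int) {
    // Seed step of 7 is an assumption based on the quoted condition.
    def next: RepartitionState = RepartitionState(hashSeed + 7, attempts + 1)
  }

  sealed trait Decision
  case object Proceed extends Decision
  case object ProceedWithWarning extends Decision
  case object GiveUp extends Decision

  // Readable replacement for a check like `hashSeed + 7 > 200`.
  def decide(state: RepartitionState): Decision =
    if (state.attempts >= MaxRepartitions) GiveUp
    else if (state.attempts >= WarnAfterRepartitions) ProceedWithWarning
    else Proceed

  // Repartition with a new seed until the data fits or the hard limit is hit.
  def repartitionUntilFits(initialSeed: Int)(
      fits: RepartitionState => Boolean): Either[String, RepartitionState] = {
    @scala.annotation.tailrec
    def loop(state: RepartitionState): Either[String, RepartitionState] =
      if (fits(state)) Right(state)
      else decide(state) match {
        case GiveUp =>
          Left(s"gave up after ${state.attempts} repartitions")
        case ProceedWithWarning =>
          Console.err.println(
            s"WARN: already repartitioned ${state.attempts} times, trying again")
          loop(state.next)
        case Proceed =>
          loop(state.next)
      }
    loop(RepartitionState(initialSeed, 0))
  }
}
```

With this shape the warning and the hard limit are both expressed in terms of the attempt count, so the way the seed is chosen can change without touching either check.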

revans2 added the "? - Needs Triage", "task", and "tech debt" labels on Dec 6, 2024
mattahrens removed the "? - Needs Triage" label on Dec 10, 2024