[FEA] explore alternative hashing for non-shuffle partitioning #11900

Open
revans2 opened this issue Dec 20, 2024 · 0 comments
Labels
- ? - Needs Triage: Need team to review and classify
- feature request: New feature or request
- performance: A performance related task/issue
- reliability: Features to improve reliability or bugs that severely impact the reliability of the plugin


revans2 commented Dec 20, 2024

Is your feature request related to a problem? Please describe.
I recently spent some time debugging an issue with some hash aggregate tests that use a really small target batch size. That exposed an interesting realization: the way Spark handles nulls when hashing can cause an excessive number of hash collisions. I personally consider this a bug in Spark's hashing code, but it cannot be fixed in Spark itself, because changing the hash could make it incompatible with the results Spark already produces.

But for internal code where we repartition data for joins or hash aggregates, we should look at using an alternative algorithm that does not have this limitation. I am not sure whether the CUDF hash implementation is better, but we probably want to try it out and see, both in terms of the performance of computing the hash and in terms of how it deals with nulls.
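To make the collision problem concrete, here is a minimal sketch in plain Python. It is not plugin or CUDF code. It assumes the behavior in question is that Spark's row hash chains each column's hash into the seed for the next column and leaves the running hash unchanged when a value is null. The `mix_int` helper is a Murmur3-style 32-bit mixer written for illustration and is not guaranteed to be bit-identical to Spark's. `null_aware_hash` and `NULL_SENTINEL` are hypothetical names for one possible alternative, not an existing API.

```python
from typing import Optional, Sequence

MASK32 = 0xFFFFFFFF
SEED = 42  # Spark's default seed for its Murmur3-based hash
NULL_SENTINEL = 0x9E3779B9  # hypothetical marker mixed in for nulls


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def mix_int(value: int, seed: int) -> int:
    """Murmur3-style mix of one 32-bit int into a running hash (illustrative)."""
    k1 = ((value & MASK32) * 0xCC9E2D51) & MASK32
    k1 = _rotl32(k1, 15)
    k1 = (k1 * 0x1B873593) & MASK32
    h1 = (seed & MASK32) ^ k1
    h1 = _rotl32(h1, 13)
    h1 = (h1 * 5 + 0xE6546B64) & MASK32
    # Finalization step, using 4 as the input length in bytes.
    h1 ^= 4
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK32
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK32
    h1 ^= h1 >> 16
    return h1


def null_skipping_hash(row: Sequence[Optional[int]], seed: int = SEED) -> int:
    """Chain column hashes; a null leaves the running hash untouched."""
    h = seed
    for v in row:
        if v is not None:
            h = mix_int(v, h)
    return h


def null_aware_hash(row: Sequence[Optional[int]], seed: int = SEED) -> int:
    """Hypothetical alternative: a null still advances the running hash."""
    h = seed
    for v in row:
        h = mix_int(NULL_SENTINEL if v is None else v, h)
    return h


rows = [(None, 1, 2), (1, None, 2), (1, 2, None)]
print("null-skipping:", [null_skipping_hash(r) for r in rows])
print("null-aware:   ", [null_aware_hash(r) for r in rows])
```

Under the null-skipping scheme all three rows hash like the two-column row `(1, 2)`, so they always land in the same partition, and a row made up only of nulls hashes to the seed itself. The sentinel variant makes the position of a null matter, at the cost of a null now colliding with the sentinel value. Whether the CUDF hash behaves like either of these is exactly what would need to be checked.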

We had to do a lot to make the hash code compatible with Spark, so I suspect that the compatibility work may also be slowing down some of the computation.
