-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[WIP][SPARK-47031] Union determinism fix #52950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Open question: is not pushing through the union the right thing to do OR is it ok? I'm... fuzzy on this. I think we should mark the attribute ref as non-deterministic though regardless since non-determinism has broader implications but filter pushdown is where I found this. WDYT @cloud-fan ? |
|
oh nvm test failures were not what I thought, sorry for the ping. |
…ons as we do with nullability Co-authored-by: Holden Karau <[email protected]>
Co-authored-by: Holden Karau <[email protected]>
5f4be55 to
00f3972
Compare
|
Ok did more digging and thinking about the test failures. The problem with the current attempted solution is that we have expression which start out non-deterministic and then become deterministic once a seed assigned but the toAttribute is not updated on that and similarily the unions logic on toAttribute would not be updated on that either. One option I considered was passing in a callback (which I don't love of course) but given that we resolve the attrs in the union once that still would break (although if we made it not cached that could be ok). A "partial" fix could be to only propegate the non-deterministic into attr ref IFF it's something where a seed being set would not resolve the non-determinism but I don't see a super clean way to do that in the code right now. I'm going to give this more thought. Maybe a new trait for "SortOfNonDeterministic" and keep the existing NonDeterministic trait alive? |
|
Ok it's not just the RNG ones, SparkPartitionID is also "nondeterministic." which ... hmmm.... |
|
|
||
| // not deterministic in first sub-query | ||
| val originalQuery = Union(Seq(subq2, subq1)) | ||
| .where($"j" > 5L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's wrong with pushing down j > 5 in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the j column should be non-deterministic (union of non-deterministic and deterministic fields) I don't think we can push? Right?
What changes were proposed in this pull request?
We plumb through non-determinism information, as with nullability, from projection to union.
WIP: I need to double check the other set operations.
Why are the changes needed?
We need to be careful pushing through non-deterministic components & other changes.
Does this PR introduce any user-facing change?
Yes, the attribute reference now has a determinism field. For backwards compat, unpack does not unpack that field, and it has a default value.
How was this patch tested?
Added a test in the filter pushdown logic.
Was this patch authored or co-authored using generative AI tooling?
No