Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: materialize the result to handle non-deterministic functions #24
base: main
Are you sure you want to change the base?
RFC: materialize the result to handle non-deterministic functions #24
Changes from all commits
b340a39
9b0444e
f43a16a
d66e5d2
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just came up with a problem: IIRC currently UDF is just an expression, which can be used in other places beside Project, e.g., Filter. Should we make them occur only in Project? Besides the determinism problem, there are also optimizations like risingwavelabs/risingwave#8703
cc @wangrunji0408 @TennyZhuang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, https://github.com/risingwavelabs/rfcs/pull/12/files#diff-135315ddd7b7929c8daf439f35043ba3c168ce620ab72c56740a6055d15e8f9fR141 refused to add a new operator, but it seems we eventually need to add some 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Firstly, I need to clarify that the "UDF" and "non-deterministic expression" are independent concepts.
PROC_TIME()
andRAND()
are built-in expressions but they are non-deterministic. A user-defined function can also be deterministic if a user makes a contract with us.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And materialized Project is just used for non-deterministic expression on an "updatable stream". We can use a normal project on the append-only stream for non-deterministic expression too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally agree to refuse to add a new operator for UDF or async expression but not for non-deterministic expression. In another word, if we did add a "UDFProject", now we maybe have to accept
StreamProject
,StreamAsyncProject
,StreamMaterializeProject
,StreamAsyncMaterializeProject
😅 🥵There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, after thinking a while I feel that although conceptually different, practically they cannot be so cleanly separated 😅:
Since we want to "always materializing UDF" (except for rare cases and used with caution),
MaterializeProject
would become the main executor for UDFs. Optimizations for UDF inProject
like risingwavelabs/risingwave#8703 would be in vain. On the other hand, do you think we should optimizeMaterializeProject
for async execution?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I'm wrong again😇😇😇. I forgot append-only stream. But the points might still apply
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all the optimization for async execution should be done behind the expression framework 🤔 Well there also are some executor-level optimization. For example, some users might not care about the order of the records on the append-only stream and we might reorder them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that enough? If so we can only tweak the performance for expr on one chunk, but not amoung chunks, like
buffered
in https://github.com/risingwavelabs/risingwave/pull/8703/filesThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, actually I don’t mind adding many different executor variants in the future, just like the TopN variants 😅
Let’s conclude the discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Find a problem: not all expressions can be extracted to a
Project
easily, e.g., inFILTER
clause. 🤡This is mentioned in #12 as a reason against a dedicated operator for UDF.