-
Notifications
You must be signed in to change notification settings - Fork 103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Table function UDFs #201
base: main
Are you sure you want to change the base?
Table function UDFs #201
Conversation
PS, dont merge yet, I would like a review but c allocations are handled sloppily, and often not de-allocated. Also more return types can (and should) be implemented still |
@JAicewizard Thanks for the contribution. I think it makes sense to also add a lot of test cases for this feature. |
@ajzo90 Those are some great changes, one thing I didn't want to do however was add a bunch of In the design of the I will however definitely look at how you did predicatepushdown, I do not really know what this actually does so I didn't touch it. |
I will, and a benchmark as well. This APi is very much not optimal, but I am glad you're open to this PR :) |
I implemented a generic interface for the However it adds some new functions in If all is well I can implement some of the features from ajzos branch to make this feature complete! |
todo list:
|
I force-pushed to cleanup the history. This PR ended up growing a bit out of hand, especially due to the wanting a safe API for setting the types of columns, without forcing the user to pass us a value. I think the current version is nice, as it the duckdb logical type is fully managed by go-duckdb, while also giving users a nice API. One thing I am not entirely happy with is the integration with the existing I think this ready for a review. |
Hi @JAicewizard, with the release of duckdb 0.10.3, we now also expose scalar UDFs in the C API: duckdb/duckdb#11786. I am currently looking into adding support for this to go-duckdb. Your PR is a great help! 😊 I noticed that there are many functions that both implementations could benefit from, like your changes to |
I am also open to finalising this PR first, and then starting on the scalar UDFs. In that case, I can give a thorough review, or open a PR to your PR with suggestions? Let me know what you think! |
Ignore part of my comments, and sorry for any confusion. The scalar UDFs are on our |
haha I was confused about scalar UDFs in 10.3 already. I indeed wrote much of the code in a way that in the future scalar UDFs would also be able to benefit from the infrastructure. Thats also why I put the I was also thinking about doing scalar UDFs, but you can take that on if you want. I think much of the code can be copied/used for inspiration, but I'm afraid it will be difficult to make much of the callback code shared. Most code is handling parameters and returned values which can't be shared. After this PR I will start working on another one adding vectorised table UDFs, do you have any ideas on how to implement a nice and fast way to represent a duckdb vector? I was thinking maybe something with a method A review would definitely be highly appreciated, I can review your code as well once its done! |
@taniabogatsch I renamed the files to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job with the UDF implementation.
I left a few comments related to projection pushdown and threading. In my opinion it make sense to wait with these (performance related) features until we know where we want to go with vectorised APIs.
Thank you for the review! I will add some documentation as well soon-ish since that is very much still missing. |
I rebased and forcepushed to integrate datachunk changes. Also added a datachunk API for tablefunctions. |
@taniabogatsch Do you have write access? if so could you run CI for this? I think I fixed all the issues in CI, but I would like to make sure. Also do you know what the next steps are to getting this merged? |
@JAicewizard, I ran the CI. Regarding the next steps to getting this merged, ideally, we have the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey! I just had another more detailed look at the PR. This is going in a great direction. 😄 I think we still need to iron out two main things before we add the 'finishing touches' to the PR.
- Changes to
vector
and the custom functions there. See my comment on the respective file. - Discuss the
Type
approach.
Other remarks.
- Could you also run
gofumpt -l -w .
on your PR for more uniform formatting? - It is nice to see the increased number of tests.
- Much improved documentation in
udtf.go
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some changes I have not pushed yet, ill push them soon.
Rebased to the latest, and added tests for the parallel functions just like the normal ones. I also made some small performance improvements WRT |
I've deleted some duplicate comments and reviewed the PR again to align it with the scalar UDF PR. Could you move the changes in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just looked at this PR again before merging the scalar UDF PR. :)
We should be able to reuse a few functions from there and finally finish up this PR.
My comments are primarily nits and naming suggestions to align the PR with DuckDB and go-duckdb naming conventions.
e16352e
to
df3f51c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments about the included example.
Beyond that, I'll try to review the changes in udtf.go
and udtf_test.go
tomorrow or later this week.
Thanks for the PR, I added a parallel example. While doing so I realised the parallel test was completely wrong and parallel functions did not work at all, so I fixed those too. |
Uhh oops, I mean review instead of PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left more comments, but this is probably almost done. Most of my comments are about test coverage.
It's good to hear that you found and fixed that bug in the parallel execution. 😄
udtf_test.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a few more tests?
- The input parameter differs from the standard vector size.
- Test all types in
getValue
, possibly in a loop with a function that emits a constant value or similar. - Is it possible to have a table function that does not take any parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: To be consistent with the scalar functions, let's rename the files to table_udf.go
and table_udf_test.go
.
if err != nil { | ||
setFuncError(info, err.Error()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we abort if we encounter an error?
Same for the ThreadedRowTableSource
case.
if err != nil { | ||
log.Fatal(err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove.
if err != nil { | ||
log.Fatal(err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove.
This implements table function UDFs.
The function takes in an interface, from which it detects the type of the parameters and return values.
This interface also has a method used to generate new rows
The interface is quite simple, but not ideal. There is an issue that we need to obtain the type of the UDF from the user-input, but we also don't want the user to specify the duckdb types manually. This solution allows the user to return arbitrary types, and they will be matched to duckdb types automatically.
Another limitation is that the value interface only allows retrieving string and int64 types. This is only a limitation for the arguments, however it is a large limitation.