Table function UDFs #201

JAicewizard · 2024-04-27T20:03:12Z

This implements table function UDFs.

The function takes an interface, from which it detects the types of the parameters and return values.
This interface also has a method that is used to generate new rows.

The interface is quite simple, but not ideal. The problem is that we need to obtain the type of the UDF from user input, but we also don't want the user to specify the duckdb types manually. This solution lets the user return arbitrary types, which are matched to duckdb types automatically.

Another limitation is that the value interface only allows retrieving string and int64 types. This only affects the arguments, but it is a significant limitation.

JAicewizard · 2024-04-27T20:04:36Z

PS: don't merge yet. I would like a review, but C allocations are handled sloppily and are often not de-allocated. More return types can (and should) still be implemented as well.

marcboeker · 2024-04-28T14:23:18Z

@JAicewizard Thanks for the contribution. I think it makes sense to also add a lot of test cases for this feature.

ajzo90 · 2024-04-28T20:16:21Z

Great initiative! Based on this contribution I have experimented with a different API for my own needs. In particular, it supports predicate pushdown and concurrent scanners, and exposes an API to work with table states.

udf.go
example

JAicewizard · 2024-04-29T12:08:10Z

@ajzo90 Those are some great changes. One thing I didn't want to do, however, was add a bunch of Chunk and Vector APIs to this library. I agree that they would be great, but I think we should leave this PR un-vectorised; we can always add a vectorised version later.

In the design of the Chunk and Vector APIs, it would be wise to also think of a way we might be able to vectorise the Appender API at the same time. (The scanner type, for example, looks a lot like something that could be used in the appender.)
The first thing I will do is create benchmarks to see what the bottleneck is. My suspicion is that it is the conversion to any for every value, which should be removable.

I will, however, definitely look at how you did predicate pushdown. I do not really know what it actually does, so I didn't touch it.

JAicewizard · 2024-04-29T12:08:55Z

> @JAicewizard Thanks for the contribution. I think it makes sense to also add a lot of test cases for this feature.

I will, and a benchmark as well. This API is very much not optimal, but I am glad you're open to this PR :)

JAicewizard · 2024-05-01T12:54:32Z

I implemented a generic interface for the vector type which allows me to speed up the UDF by 3x, as it no longer needs to push the values into an interface (at least for primitive types).

However, it adds some new functions in appender_vector.go, so I would like a quick look at this code. (Note that Go does not allow generic parameters on methods.)

If all is well, I can implement some of the features from ajzo90's branch to make this feature complete!
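As a rough illustration of the constraint mentioned above: because a Go method cannot declare its own type parameters, a typed setter on a non-generic vector has to be a top-level generic function. The `vector` type, `newVector`, and `setPrimitive` below are hypothetical stand-ins, not the contents of appender_vector.go.

```go
package main

import "fmt"

// vector is a hypothetical, non-generic stand-in for a column vector.
// data holds a []T for some primitive T; the slice is boxed once,
// not once per value.
type vector struct {
	data any
}

// newVector allocates a vector backed by a []T of length n.
func newVector[T any](n int) *vector {
	return &vector{data: make([]T, n)}
}

// setPrimitive writes val at idx without converting val to any.
// It is a function rather than a method because Go methods cannot
// have type parameters of their own.
func setPrimitive[T any](v *vector, idx int, val T) error {
	s, ok := v.data.([]T)
	if !ok {
		return fmt.Errorf("vector does not hold %T values", val)
	}
	if idx < 0 || idx >= len(s) {
		return fmt.Errorf("index %d out of range", idx)
	}
	s[idx] = val
	return nil
}

func main() {
	v := newVector[int64](3)
	fmt.Println(setPrimitive(v, 1, int64(42)))
	fmt.Println(setPrimitive(v, 0, "wrong type"))
	fmt.Println(v.data)
}
```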

JAicewizard · 2024-05-12T16:21:47Z

todo list:

named parameters
max threads
testing of projection pushdown
making sure all allocations are freed

JAicewizard · 2024-05-24T10:28:49Z

I force-pushed to clean up the history. This PR ended up growing a bit out of hand, mostly because I wanted a safe API for setting the types of columns without forcing the user to pass us a value. I think the current version is nice, as the duckdb logical type is fully managed by go-duckdb, while users still get a nice API.

One thing I am not entirely happy with is the integration with the existing vector type; however, I did not feel the need to implement a fully custom one just for table functions. The added code allows setting values without having to turn them into an any. This significantly speeds up table functions, as converting a value into any does a heap allocation, which is expensive.

I think this is ready for a review.

taniabogatsch · 2024-05-24T12:43:17Z

Hi @JAicewizard, with the release of duckdb 0.10.3, we now also expose scalar UDFs in the C API: duckdb/duckdb#11786. I am currently looking into adding support for this to go-duckdb. Your PR is a great help! 😊 I noticed that there are many functions that both implementations could benefit from, like your changes to type.go. I am wondering how best to organise efforts to avoid writing similar code twice. 🤔 Maybe I could work off of your PR, or maybe you have other suggestions?

taniabogatsch · 2024-05-24T12:44:29Z

I am also open to finalising this PR first, and then starting on the scalar UDFs. In that case, I can give a thorough review, or open a PR to your PR with suggestions? Let me know what you think!

taniabogatsch · 2024-05-24T12:53:12Z

Ignore part of my comments, and sorry for any confusion. The scalar UDFs are on our feature branch, so not yet part of the release. However, they will be in the not-too-far future, haha. So I'll just draft the scalar UDFs, and review this PR with that in mind. 😄

JAicewizard · 2024-05-24T22:17:40Z

Haha, I was confused about scalar UDFs being in 0.10.3 already. I indeed wrote much of the code so that scalar UDFs would also be able to benefit from the infrastructure in the future. That's also why I put the Type and Row in separate files.

I was also thinking about doing scalar UDFs, but you can take that on if you want. I think much of the code can be copied/used for inspiration, but I'm afraid it will be difficult to make much of the callback code shared. Most code is handling parameters and returned values which can't be shared.

After this PR I will start working on another one adding vectorised table UDFs. Do you have any ideas on how to implement a nice and fast way to represent a duckdb vector? I was thinking maybe something with a method SetValues[T any](data []T), which sets the first len(data) values to the specified ones and clears the others. But I am open to ideas!
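One way the SetValues idea could look in valid Go is sketched below. Since methods cannot take type parameters, the parameter is moved onto the type. The `chunkColumn` type and its validity slice are assumptions made for illustration; "clears" is interpreted here as zeroing the value and marking it invalid.

```go
package main

import "fmt"

// chunkColumn is a hypothetical typed column with a validity mask.
type chunkColumn[T any] struct {
	data  []T
	valid []bool
}

// newColumn allocates a column with the given capacity.
func newColumn[T any](capacity int) *chunkColumn[T] {
	return &chunkColumn[T]{
		data:  make([]T, capacity),
		valid: make([]bool, capacity),
	}
}

// SetValues copies data into the first len(data) slots, marks them
// valid, and clears every remaining slot.
func (c *chunkColumn[T]) SetValues(data []T) error {
	if len(data) > len(c.data) {
		return fmt.Errorf("got %d values, capacity is %d", len(data), len(c.data))
	}
	copy(c.data, data)
	for i := range data {
		c.valid[i] = true
	}
	var zero T
	for i := len(data); i < len(c.data); i++ {
		c.data[i] = zero
		c.valid[i] = false
	}
	return nil
}

func main() {
	c := newColumn[int64](4)
	fmt.Println(c.SetValues([]int64{7, 8}))
	fmt.Println(c.data, c.valid)
}
```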

A review would definitely be highly appreciated, and I can review your code as well once it's done!

type.go

JAicewizard · 2024-05-25T12:50:35Z

@taniabogatsch I renamed the files to udtf*.go so that you can create new ones.
It might also be nice for scalar UDFs to be able to extract the type information from the function being passed, like in Python.

ajzo90

Nice job with the UDF implementation.
I left a few comments related to projection pushdown and threading. In my opinion, it makes sense to wait with these (performance-related) features until we know where we want to go with vectorised APIs.

udf.go

JAicewizard · 2024-05-27T08:08:52Z

Thank you for the review! I will add some documentation as well soon-ish since that is very much still missing.

JAicewizard · 2024-06-16T12:35:40Z

I rebased and force-pushed to integrate the data chunk changes. I also added a data chunk API for table functions.

JAicewizard · 2024-06-16T14:19:57Z

@taniabogatsch Do you have write access? If so, could you run CI for this? I think I fixed all the issues in CI, but I would like to make sure.

Also do you know what the next steps are to getting this merged?

taniabogatsch · 2024-06-18T07:23:44Z

@JAicewizard, I ran the CI. Regarding the next steps to getting this merged, ideally, we have the DataChunk support in place. But I'll have another look at the current state, with the changes from main, and maybe we can merge it sooner.

taniabogatsch

Hey! I just had another more detailed look at the PR. This is going in a great direction. 😄 I think we still need to iron out two main things before we add the 'finishing touches' to the PR.

Changes to vector and the custom functions there. See my comment on the respective file.
Discuss the Type approach.

Other remarks.

Could you also run gofumpt -l -w . on your PR for more uniform formatting?
It is nice to see the increased number of tests.
Much improved documentation in udtf.go 👍

examples/udf/udf.go

udtf.go

Makefile

data_chunk.go

errors.go

vector.go

type.go

JAicewizard

I made some changes that I have not pushed yet; I'll push them soon.

type_info.go

JAicewizard · 2024-09-17T20:33:07Z

Rebased to the latest, and added tests for the parallel functions just like the normal ones.

I also made some small performance improvements WRT Row, and especially when using the chunk API it is very fast. The biggest bottleneck is importing the chunks from duckdb. Maybe in the future this can be sped up as well!

taniabogatsch · 2024-09-18T11:38:47Z

I've deleted some duplicate comments and reviewed the PR again to align it with the scalar UDF PR.

Could you move the changes in the vector*.go files into a separate PR? My understanding is that they generally speed up go-duckdb by using fewer any types. That separate PR should then be fairly self-contained and quickly merged.

taniabogatsch

I just looked at this PR again before merging the scalar UDF PR. :)
We should be able to reuse a few functions from there and finally finish up this PR.
My comments are primarily nits and naming suggestions to align the PR with DuckDB and go-duckdb naming conventions.

Makefile

deps/darwin_amd64/libduckdb.a

errors.go

row.go

udtf.go

taniabogatsch

I left a few comments about the included example.
Beyond that, I'll try to review the changes in udtf.go and udtf_test.go tomorrow or later this week.

examples/table_udf/main.go

udf_utils.go

JAicewizard · 2024-09-25T09:37:13Z

Thanks for the PR, I added a parallel example. While doing so I realised the parallel test was completely wrong and parallel functions did not work at all, so I fixed those too.

JAicewizard · 2024-09-25T09:37:33Z

Uhh oops, I meant review instead of PR.

taniabogatsch

I've left more comments, but this is probably almost done. Most of my comments are about test coverage.
It's good to hear that you found and fixed that bug in the parallel execution. 😄

taniabogatsch · 2024-09-25T09:10:53Z

udtf_test.go

Should we add a few more tests?

The input parameter differs from the standard vector size.

Test all types in getValue, possibly in a loop with a function that emits a constant value or similar.

Is it possible to have a table function that does not take any parameters?

taniabogatsch · 2024-09-25T09:13:27Z

udtf.go

Minor nit: To be consistent with the scalar functions, let's rename the files to table_udf.go and table_udf_test.go.

taniabogatsch · 2024-09-25T10:32:25Z

udtf.go

+			if err != nil {
+				setFuncError(info, err.Error())
+			}


Should we abort if we encounter an error?
Same for the ThreadedRowTableSource case.
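A minimal sketch of the abort-on-error behaviour being asked about. Only the name `setFuncError` comes from the snippet above; the `funcInfo` type, the `rowSource` interface, and the decision to report zero rows on failure are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// funcInfo is a hypothetical stand-in for the function info handle.
type funcInfo struct{ errMsg string }

// setFuncError records the error on the info handle (stubbed here).
func setFuncError(info *funcInfo, msg string) { info.errMsg = msg }

// rowSource is a hypothetical row-based source. FillRow reports
// whether a row was written at the given index.
type rowSource interface {
	FillRow(row int) (bool, error)
}

// fillChunk fills up to capacity rows and stops at the first error
// instead of continuing with the remaining rows.
func fillChunk(info *funcInfo, src rowSource, capacity int) (rows int, ok bool) {
	for row := 0; row < capacity; row++ {
		wrote, err := src.FillRow(row)
		if err != nil {
			setFuncError(info, err.Error())
			return 0, false // abort: do not process further rows
		}
		if !wrote {
			return row, true // source is exhausted
		}
	}
	return capacity, true
}

// countingSource yields total rows and fails at row failAt (if >= 0).
type countingSource struct {
	total, failAt, calls int
}

func (s *countingSource) FillRow(row int) (bool, error) {
	s.calls++
	if row == s.failAt {
		return false, errors.New("boom")
	}
	return row < s.total, nil
}

func main() {
	info := &funcInfo{}
	src := &countingSource{total: 5, failAt: 1}
	rows, ok := fillChunk(info, src, 10)
	fmt.Println(rows, ok, info.errMsg, src.calls)
}
```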

taniabogatsch · 2024-09-25T10:33:48Z

examples/table_udf_basic/main.go

+	if err != nil {
+		log.Fatal(err)
+	}


taniabogatsch · 2024-09-25T10:35:08Z

examples/table_udf_parallel/main.go

+	if err != nil {
+		log.Fatal(err)
+	}


JAicewizard force-pushed the main branch from de25eb9 to aec1205 Compare May 4, 2024 14:05

JAicewizard force-pushed the main branch from 451c660 to 7454a0f Compare May 12, 2024 16:29

JAicewizard force-pushed the main branch from a2af5e3 to 0ff54bd Compare May 24, 2024 10:19

JAicewizard commented May 24, 2024

type.go (outdated)

ajzo90 reviewed May 26, 2024

udf.go (outdated)

udf.go (outdated)

This was referenced May 27, 2024

API draft to create and scan data chunks #219

Closed

[Feature] Scalar UDF support #222

Merged

taniabogatsch added the feature [feature] request or PR label Jun 6, 2024

JAicewizard force-pushed the main branch from 253bf9e to e8e3d80 Compare June 16, 2024 09:54

taniabogatsch requested changes Jun 20, 2024

JAicewizard force-pushed the main branch from f3d23ae to c8f828a Compare July 13, 2024 16:21

JAicewizard commented Jul 13, 2024

JAicewizard force-pushed the main branch from 9bf034a to 6bfcf24 Compare July 13, 2024 18:58

taniabogatsch mentioned this pull request Sep 10, 2024

[Feature] Type interface #272

Merged

JAicewizard force-pushed the main branch from df21c9f to 9ffa307 Compare September 14, 2024 11:54

taniabogatsch reviewed Sep 16, 2024

type_info.go (outdated)

JAicewizard force-pushed the main branch from 6250c4c to c4ba790 Compare September 17, 2024 18:39

Repository owner deleted a comment from JAicewizard Sep 18, 2024

taniabogatsch mentioned this pull request Sep 20, 2024

Add support for create_function() #267

Closed

JAicewizard force-pushed the main branch from b312ea3 to 6cd997a Compare September 22, 2024 09:49

taniabogatsch requested changes Sep 23, 2024

JAicewizard force-pushed the main branch from bda27a5 to dc9e6b7 Compare September 24, 2024 09:48

JAicewizard added 3 commits September 24, 2024 11:50

Add a Row type, representing a single row in a chunk

008c9e7

Add User Defined TableFunctions support

4e4972b

Add an example for UDTFs

c25b9b7

JAicewizard force-pushed the main branch 2 times, most recently from e16352e to df3f51c Compare September 24, 2024 10:27

Post rebase fixups, and share more with scalar UDF

ce66a7c

JAicewizard force-pushed the main branch from df3f51c to ce66a7c Compare September 24, 2024 13:07

Fix format

78e1a43

taniabogatsch reviewed Sep 24, 2024

Fix issues with the example

4ecdd1c

JAicewizard force-pushed the main branch from 1e1060b to 4ecdd1c Compare September 25, 2024 08:01

Add parallel example, and fix parallel table function

453ee5c

JAicewizard and others added 2 commits September 25, 2024 11:41

Make linter happy too

cf91897

Re-build static libraries

812c94f

taniabogatsch reviewed Sep 25, 2024
