-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Improve async_udf example and docs #16846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fn default() -> Self { | ||
Self::new() | ||
} | ||
} | ||
|
||
impl AsyncUpper { | ||
impl AskLLM { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:)
@@ -419,6 +432,7 @@ impl ScalarUDFImpl for AsyncUpper { | |||
Ok(DataType::Utf8) | |||
} | |||
|
|||
// Note the normal invoke_with_args method is not called for Async UDFs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wish it didn't exist :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we could potentially refactor the traits a bit more to avoid this, but I think that might cause more API churn than the value it added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We certainly should.
The functions we plug into LP shouldn't have to answer things like name, aliases, description, schema_name.
It's SQL frontend stuff, not the LP stuff.
/// to the normal `invoke_with_args` except it returns an `ArrayRef` | ||
/// instead of `ColumnarValue` and is `async`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's obviously async, but why ArrayRef instead of ColumnarValue?
Is this inconsistency warranted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if there is any reason this API doesn't return ColumnarValue
consistently with the ScalarUDFImpl -- @goldmedal do you remember if there is any reason?
If not, I can file a ticket to make the APIs consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed a ticket to track:
|
||
// Similarly to regular UDFs, you create an AsyncScalarUDF by implementing | ||
// `AsyncScalarUDFImpl` and creating an instance of `AsyncScalarUDF`. | ||
let async_equal = AskLLM::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow the trend 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 I am not above using the latest shiny tech trend to advertise DataFusion :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this (external service calls) is actually a nice use-case for async functions.
It's good they allow for batching, since they are invoked on a batch of data (as everything in DF).
Do async functions also allow interleaved execution of batches? I.e. can next call to an async UDF start before the previous completed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do async functions also allow interleaved execution of batches? I.e. can next call to an async UDF start before the previous completed?
They are invoked per batch as I understand batch
-- so internally the implementation could make 8000 concurrent requests (or pipeline the requests, etc)
There is no pipelineling across batches that I know of -- so the next call to an async UDF will not happen until the first call has completed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no pipelineling across batches
would it be something to consider to keep the pipeline busy? (in case the bottleneck is the remote call, which is not unlikely when query uses the async UDFs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I do think it is likely a good idea. I think the first thing to do might be to create an example / test case showing what this would look like (maybe fetching urls or something) before we design the API
Thanks again |
Which issue does this PR close?
Rationale for this change
I was working on writing up DataFusion 49 release notes
49.0.0
release post datafusion-site#91and I wasn't able to find the async docs quite as nicely as I wanted so I got inspired and went through the example to showcase the LLM usecase a bit more.
What changes are included in this PR?
async_udf.rs
example to examples index pageAre these changes tested?
By existing CI
Are there any user-facing changes?
Better docs