Skip to content

Improve async_udf example and docs #16846

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 24, 2025
Merged

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 21, 2025

Which issue does this PR close?

Rationale for this change

I was working on writing up DataFusion 49 release notes

and I wasn't able to find the async docs quite as nicely as I wanted so I got inspired and went through the example to showcase the LLM usecase a bit more.

What changes are included in this PR?

  1. Add async_udf.rs example to examples index page
  2. Tweak the library user guide a bit
  3. Upgrade the async_udf.rs example to focus on an LLM usecase to try and make it clearer how the async thing benefits

Are these changes tested?

By existing CI

Are there any user-facing changes?

Better docs

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 21, 2025
fn default() -> Self {
Self::new()
}
}

impl AsyncUpper {
impl AskLLM {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

@@ -419,6 +432,7 @@ impl ScalarUDFImpl for AsyncUpper {
Ok(DataType::Utf8)
}

// Note the normal invoke_with_args method is not called for Async UDFs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wish it didn't exist :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could potentially refactor the traits a bit more to avoid this, but I think that might cause more API churn than the value it added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We certainly should.
The functions we plug into LP shouldn't have to answer things like name, aliases, description, schema_name.
It's SQL frontend stuff, not the LP stuff.

Comment on lines +452 to +453
/// to the normal `invoke_with_args` except it returns an `ArrayRef`
/// instead of `ColumnarValue` and is `async`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's obviously async, but why ArrayRef instead of ColumnarValue?
Is this inconsistency warranted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if there is any reason this API doesn't return ColumnarValue consistently with the ScalarUDFImpl -- @goldmedal do you remember if there is any reason?

If not, I can file a ticket to make the APIs consistent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions github-actions bot added the core Core DataFusion crate label Jul 22, 2025

// Similarly to regular UDFs, you create an AsyncScalarUDF by implementing
// `AsyncScalarUDFImpl` and creating an instance of `AsyncScalarUDF`.
let async_equal = AskLLM::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow the trend 🤣

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 I am not above using the latest shiny tech trend to advertise DataFusion :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this (external service calls) is actually a nice use-case for async functions.
It's good they allow for batching, since they are invoked on a batch of data (as everything in DF).
Do async functions also allow interleaved execution of batches? I.e. can next call to an async UDF start before the previous completed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do async functions also allow interleaved execution of batches? I.e. can next call to an async UDF start before the previous completed?

They are invoked per batch as I understand batch -- so internally the implementation could make 8000 concurrent requests (or pipeline the requests, etc)

There is no pipelineling across batches that I know of -- so the next call to an async UDF will not happen until the first call has completed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no pipelineling across batches

would it be something to consider to keep the pipeline busy? (in case the bottleneck is the remote call, which is not unlikely when query uses the async UDFs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I do think it is likely a good idea. I think the first thing to do might be to create an example / test case showing what this would look like (maybe fetching urls or something) before we design the API

@alamb alamb merged commit d553ffd into apache:main Jul 24, 2025
50 of 51 checks passed
@alamb
Copy link
Contributor Author

alamb commented Jul 24, 2025

Thanks again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants