-
Notifications
You must be signed in to change notification settings - Fork 106
feat: Support images and PDFs in tool results #735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
By moving these content types out of the tool results and into the abstract user turn
This better links the content to its source, but generally hides the markup from user view (shinychat doesn't show the XML tags in assistant output).
Rather than the ad-hoc AsIs data structure, adopt tidyverse/ellmer#735's approach
| # ellmer (development version) | ||
|
|
||
| * ellmer now supports tools that return image or PDF content types, for example using `content_image_file()` or `content_image_pdf()`. (#735) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW new style is no empty line between bullets
| is_tool <- map_lgl(x@contents, S7_inherits, ContentToolResult) | ||
| content <- as_json(provider, x@contents[!is_tool], ...) | ||
| if (length(content) > 0) { | ||
| data <- tool_results_separate_content(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to update chat_openai_responses() too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so! I merged the changes but have been in meetings all day; I'll pick this back up on Monday and will take a look at chat_openai_responses() then too.
| chat <- chat_openai() | ||
| #> Using model = "gpt-4.1". | ||
| chat$register_tool(screenshot_website) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice example!
| turn_matched <- match_tools(turn, tools) | ||
| expect_equal(turn_matched, fixture_turn_with_tool_requests(with_tool = TRUE)) | ||
| }) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have a couple of end-to-end tests of a tool that returns an image and the chat understands it? Maybe just for anthropic + gemini + openai? Definitely need to use cassettes.
| Turn("user", contents = results[is_tool_result]) | ||
| } | ||
|
|
||
| is_extra_content <- function(x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expected to see a check_tool_result() somewhere that would check that a tool call returns either a string, a ContentType, or a list of ContentTypes. Without that this test check feels a bit wibbly to me.
|
|
||
| tool_results_separate_content <- function(turn) { | ||
| if (!some(turn@contents, is_tool_result)) { | ||
| return(list(tool_results = list(), contents = turn@contents)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that most of the providers do c(data$tool_results, data$contents) I think you should put the contents first in the list. But can you tell me more about why this just doesn't return a single list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In almost all cases, the tool results need to be first in the content list in the user turn. Most APIs will error out if the tool results aren't immediately next in the messages list after the assistant turn. So that motivated the "tool results first" in this helper.
They're two separate items in the list because some APIs want tool results in a set of separate messages. In particular, I believe this is the case with OpenAI. I thought it'd be better to return separate items and combine them as needed than to need to repeat the filtering in some places later. Secondarily, I also liked that the naming makes it clear that we're ordering content as tool results then contents when we do so, rather than hiding that detail in the helper function.
(Those are loose preferences though...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry that separating them out might introduce some subtle bug because we might now be reordering them. But more importantly, I think the reification of our data structure to what the provider expects is best keep close to the provider, and given that OpenAI is the exception rather than the rule, I think this should just return a simple list. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're both identifying the same risk: provider APIs certainly care about how content and tool results are ordered, and sometimes they don't even want them passed together in the same message.
separating them out might introduce some subtle bug because we might now be reordering them
Interestingly, we have this exact subtle bug in ellmer currently where Deepseek and Openai both have code like this:
# Deepseek
c(texts_out, tools_out)
# OpenAI (/completions)
# (...after moving tools into their own messages)
c(user, tools)These will both cause API errors because the order needs to be reversed: tools, then text. We never hit those errors because it's unnatural in current usage.
Given that providers have opinions about the ordering of tool results and text, my take is that it's better to require that we explicitly make a choice in the provider implementations about how they are ordered.
I think the reification of our data structure to what the provider expects is best keep close to the provider,
I would say I'm making this exact argument. Where and how the tool results are placed in the JSON sent to the API is a provider expectation, and it being okay to have results and other content in the same message (with results first and content second) is explicitly a provider choice; not all providers do this. A flat list of c(tool_results, contents) is the most common ordering, but as for OpenAI different providers may make their own choices.
Anyway, I'm not super attached to this and I'll be happy to implement the flat list to see how that feels. I'm just giving some background because I feel we're thinking of the same risks and approaching this with similar design philosophies, just ending up with slightly different conclusions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, that's reasonable. I think my intuition about this design is coming from my strong belief that the Claude model is "correct" and the openAI is clearly inferior (because of the way that Claude makes it clear that there's a user turn consisting of various types of content, then an assistant turn consisting of various types of content).
That said, this code is only used for non-Claude APIs so maybe that doesn't apply.
But I still feel like there are two things happening here — first we flatten tool requests with multiple content types and then we pull out the tool results into their own list for providers that need it. So maybe I'd prefer an API like tool_results_flatten() followed by tool_results_split(). But I haven't closely read the implementation, so it might be that this would make the code more complicated and/or less clear.
This PR adds support for tool results to return images or PDF results.
This isn't a feature that's widely supported in provider APIs, but we get around this limitation by moving image and PDF content out of the tool result and into the abstract user turn that carries the tool results.
We support two cases:
content_image()orcontent_pdf()as a tool result.In all cases, we replace the value in the tool result with
"[see below]"(or"[see below: item N]"in the list case) and we wrap the extra content in<content tool-call-id="abc123" item="N">...content...</content>XML tags.Notes
as_json()methods forTurnand updated them to returntool_message, user_message.tool_string()doesn't support having these content types in the tool result because it callsjsonlite::toJSON(). I updated this function so that internally we can force the JSON conversion for printing, but require this work for the actual tool results that we send across the wire. If it fails, it now fails with a more informative error message. (Internally we call this function when echoing the tool result, before we've pulled out the content types.)Example