Conversation

@gadenbuie gadenbuie commented Aug 27, 2025

This PR adds support for tool results that return image or PDF content.

This isn't a feature that's widely supported in provider APIs, but we get around this limitation by moving image and PDF content out of the tool result and into the abstract user turn that carries the tool results.

We support two cases:

  • Directly returning a content_image() or content_pdf() as a tool result.
  • Returning a list that contains these content types, nested at most one level deep.

In all cases, we replace the value in the tool result with "[see below]" (or "[see below: item N]" in the list case) and we wrap the extra content in <content tool-call-id="abc123" item="N">...content...</content> XML tags.
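
To make that concrete, here's a rough sketch of the rewriting (illustrative only; wrap_extra_content() is a made-up name, not the helper used in this PR):

wrap_extra_content <- function(id, items) {
  n <- length(items)
  if (n == 1) {
    placeholders <- "[see below]"
    tags <- sprintf('<content tool-call-id="%s">\n%s\n</content>', id, items)
  } else {
    placeholders <- sprintf("[see below: item %d]", seq_len(n))
    tags <- sprintf(
      '<content tool-call-id="%s" item="%d">\n%s\n</content>',
      id, seq_len(n), items
    )
  }
  # placeholders replace the values in the tool result; the tagged content
  # is appended to the user turn that carries the tool results
  list(result = placeholders, extra = tags)
}

wrap_extra_content("abc123", c("[inline image]", "[inline image]"))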

Notes

  • OpenAI requires that tool results be sent as separate messages that directly follow the assistant message. This appears to be common among providers that put tool results in their own messages. I checked all as_json() methods for Turn and updated them to return tool_message, user_message (sketched below).
  • tool_string() doesn't support having these content types in the tool result because it calls jsonlite::toJSON(). I updated this function so that internally we can force the JSON conversion for printing, but require the conversion to succeed for the actual tool results that we send across the wire; if it fails, it now fails with a more informative error message (also sketched below). (Internally we call this function when echoing the tool result, before we've pulled out the content types.)
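
As a sketch of the first note, this is roughly what the tool_message, user_message ordering looks like for an OpenAI-style messages API (function and field names here are illustrative, not the actual as_json() code):

as_openai_style_messages <- function(tool_results, other_content) {
  tool_messages <- lapply(tool_results, function(res) {
    list(role = "tool", tool_call_id = res$id, content = res$value)
  })
  user_message <- if (length(other_content) > 0) {
    list(list(role = "user", content = other_content))
  } else {
    list()
  }
  # tool messages must directly follow the assistant message that issued
  # the tool calls, so they come first
  c(tool_messages, user_message)
}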
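
And a sketch of the second note's behaviour, lenient when echoing but strict when building the payload (tool_result_json() is a stand-in name, not the actual tool_string() implementation):

tool_result_json <- function(value, check = TRUE) {
  tryCatch(
    as.character(jsonlite::toJSON(value, auto_unbox = TRUE)),
    error = function(err) {
      if (check) {
        # strict path: used for the results we actually send to the provider
        stop("Tool result can't be converted to JSON: ", conditionMessage(err))
      }
      # lenient path: used when echoing the tool result locally
      "<unserializable tool result>"
    }
  )
}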

Example

pkgload::load_all()
#> ℹ Loading ellmer

get_cat_image <- function() {
  size <- sample(200:300, 1)
  url <- sprintf("https://placecats.com/%d/%d", size, size)

  tmpf <- withr::local_tempfile(fileext = ".jpg")
  download.file(url, tmpf, quiet = TRUE)

  content_image_file(tmpf, resize = "none")
}

chat <- chat("openai/gpt-5-nano", echo = "none")
# chat <- chat("anthropic")
# chat <- chat("google_gemini")
# chat <- chat_deepseek(echo = "output")
# There aren't many tool+vision Ollama models, but this one should work (but not on my M1)
# chat <- chat_ollama(model = "mistral-small3.2", echo = "output")
chat$register_tool(
  tool(
    function(n_images = 1) {
      if (n_images == 1) {
        get_cat_image()
      } else {
        lapply(seq_len(n_images), function(i) get_cat_image())
      }
    },
    name = "get_cat_image",
    description = "Gets a random cat image.",
    arguments = list(
      n_images = type_integer("Number of cat images to get at once.")
    )
  )
)

. <- chat$chat(
  "Get a random cat image and describe what the cat is feeling."
)
. <- chat$chat(
  "Get 2 random cat images and describe what the cats are feeling."
)
chat
#> <Chat OpenAI/gpt-5-nano turns=8 tokens=1826/1942 $0.00>
#> ── user [149] ──────────────────────────────────────────────────────────────────
#> Get a random cat image and describe what the cat is feeling.
#> ── assistant [281] ─────────────────────────────────────────────────────────────
#> [tool request (call_jjoIvBbPW336sG0FWh6U9b5U)]: get_cat_image(n_images = 1L)
#> ── user [-62] ──────────────────────────────────────────────────────────────────
#> [tool result  (call_jjoIvBbPW336sG0FWh6U9b5U)]: [see below]
#> <content tool-call-id="call_jjoIvBbPW336sG0FWh6U9b5U">
#> [inline image]
#> </content>
#> ── assistant [624] ─────────────────────────────────────────────────────────────
#> The cat looks curious and attentive, perhaps a touch cautious. Reasons:
#> - Ears are forward and upright, signaling interest.
#> - Wide, focused eyes suggest it’s watching or evaluating something.
#> - Whiskers are forward, which often happens when a cat is exploring or concentrating.
#> - Body is upright and alert, not relaxed or scared.
#> 
#> In short: curious, observant, and a bit cautious about its surroundings. If you’d like, I can give you a few short captions to pair with the image.
#> ── user [-497] ─────────────────────────────────────────────────────────────────
#> Get 2 random cat images and describe what the cats are feeling.
#> ── assistant [346] ─────────────────────────────────────────────────────────────
#> [tool request (call_M2b8yTCopQZWj0HyA5zVT0d1)]: get_cat_image(n_images = 2L)
#> ── user [-27] ──────────────────────────────────────────────────────────────────
#> [tool result  (call_M2b8yTCopQZWj0HyA5zVT0d1)]: ["[see below: item 1]","[see below: item 2]"]
#> <content tool-call-id="call_M2b8yTCopQZWj0HyA5zVT0d1" item="1">
#> [inline image]
#> </content>
#> <content tool-call-id="call_M2b8yTCopQZWj0HyA5zVT0d1" item="2">
#> [inline image]
#> </content>
#> ── assistant [691] ─────────────────────────────────────────────────────────────
#> Here are feel descriptions for the two images:
#> 
#> - Item 1:
#>   - Left cat: confident and curious. Ears forward, eyes open and focused, relaxed posture.
#>   - Right cat: content and sleepy. Eyes closed, resting head/face on paws, relaxed body.
#> 
#> - Item 2:
#>   - Orange cat: playful and curious. Body lowered, eyes toward the green toy, ears forward, paw/face near the toy, engaged in play or exploration.
#> 
#> Want me to suggest short captions for each image?
chat$get_turns()[[3]] |> contents_markdown() |> knitr::asis_output()
[rendered cat image]
chat$get_turns()[[7]] |> contents_markdown() |> knitr::asis_output()
[two rendered cat images]

Moving these content types out of the tool results and into the abstract user turn
better links the content to its source, but generally hides the markup
from user view (shinychat doesn't show the XML tags in assistant output).
@gadenbuie gadenbuie marked this pull request as ready for review August 27, 2025 21:01
@gadenbuie gadenbuie requested a review from hadley August 27, 2025 21:02
@CChen89 CChen89 mentioned this pull request Oct 7, 2025
simonpcouch added a commit to simonpcouch/bluffbench that referenced this pull request Oct 10, 2025
simonpcouch added a commit to tidyverse/vitals that referenced this pull request Oct 10, 2025
Rather than the ad-hoc AsIs data structure, adopt tidyverse/ellmer#735's approach
simonpcouch added a commit to simonpcouch/bluffbench that referenced this pull request Oct 10, 2025
@hadley hadley added this to the v0.4.0 milestone Oct 23, 2025
# ellmer (development version)

* ellmer now supports tools that return image or PDF content types, for example using `content_image_file()` or `content_pdf_file()`. (#735)

Member

FWIW new style is no empty line between bullets

is_tool <- map_lgl(x@contents, S7_inherits, ContentToolResult)
content <- as_json(provider, x@contents[!is_tool], ...)
if (length(content) > 0) {
data <- tool_results_separate_content(x)
Member

Do you need to update chat_openai_responses() too?

Collaborator Author

Yes, I think so! I merged the changes but have been in meetings all day; I'll pick this back up on Monday and will take a look at chat_openai_responses() then too.

chat <- chat_openai()
#> Using model = "gpt-4.1".
chat$register_tool(screenshot_website)
Member

Nice example!

turn_matched <- match_tools(turn, tools)
expect_equal(turn_matched, fixture_turn_with_tool_requests(with_tool = TRUE))
})

Member

Could we have a couple of end-to-end tests where a tool returns an image and the chat understands it? Maybe just for Anthropic + Gemini + OpenAI? We'd definitely need to use cassettes.
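
Something along these lines, assuming vcr cassettes (the test name, fixture image, and assertion are just illustrative, not the actual test setup):

test_that("tools can return images (OpenAI)", {
  vcr::use_cassette("tool-result-image-openai", {
    chat <- chat_openai(model = "gpt-4.1")
    chat$register_tool(tool(
      function() content_image_file(test_path("apple.png")),
      name = "get_image",
      description = "Returns a test image of an apple.",
      arguments = list()
    ))
    # the model should call the tool, receive the image, and describe it
    response <- chat$chat("Call get_image and tell me what fruit you see.")
    expect_match(response, "apple", ignore.case = TRUE)
  })
})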

Turn("user", contents = results[is_tool_result])
}

is_extra_content <- function(x) {
Member

I expected to see a check_tool_result() somewhere that would check that a tool call returns either a string, a ContentType, or a list of ContentTypes. Without that, this check feels a bit wibbly to me.
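
Roughly something like this (check_tool_result() is hypothetical and leans on ellmer internals like S7_inherits(), map_lgl(), and the Content base class):

check_tool_result <- function(value, call = rlang::caller_env()) {
  # a valid result is a string, a content object, or a list of those
  is_valid <- function(x) is.character(x) || S7_inherits(x, Content)
  ok <- is_valid(value) ||
    (is.list(value) && length(value) > 0 && all(map_lgl(value, is_valid)))
  if (!ok) {
    cli::cli_abort(
      "Tool results must be a string, a content object, or a list of content objects.",
      call = call
    )
  }
  invisible(value)
}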


tool_results_separate_content <- function(turn) {
if (!some(turn@contents, is_tool_result)) {
return(list(tool_results = list(), contents = turn@contents))
Member

Given that most of the providers do c(data$tool_results, data$contents), I think you should put the contents first in the list. But can you tell me more about why this doesn't just return a single list?

Collaborator Author
@gadenbuie gadenbuie Oct 24, 2025

In almost all cases, the tool results need to come first in the content list of the user turn. Most APIs will error if the tool results don't immediately follow the assistant turn in the messages list, which motivated the "tool results first" ordering in this helper.

They're two separate items in the list because some APIs want tool results in a set of separate messages; in particular, I believe this is the case with OpenAI. I thought it'd be better to return separate items and combine them as needed than to have to repeat the filtering later. I also liked that the naming makes it clear that we're ordering content as tool results then contents when we do so, rather than hiding that detail in the helper function.

(Those are loose preferences though...)

Member

I worry that separating them out might introduce some subtle bug because we might now be reordering them. But more importantly, I think the reification of our data structure to what the provider expects is best kept close to the provider, and given that OpenAI is the exception rather than the rule, I think this should just return a simple list. Does that make sense?

Collaborator Author

I think we're both identifying the same risk: provider APIs certainly care about how content and tool results are ordered, and sometimes they don't even want them passed together in the same message.

separating them out might introduce some subtle bug because we might now be reordering them

Interestingly, we currently have this exact subtle bug in ellmer, where DeepSeek and OpenAI both have code like this:

# Deepseek
c(texts_out, tools_out)

# OpenAI (/completions)
# (...after moving tools into their own messages)
c(user, tools)

These will both cause API errors because the order needs to be reversed: tools, then text. We just never hit those errors because that combination is unnatural in current usage.

Given that providers have opinions about the ordering of tool results and text, my take is that it's better to require that we explicitly make a choice in the provider implementations about how they are ordered.

I think the reification of our data structure to what the provider expects is best kept close to the provider,

I would say I'm making this exact argument. Where and how the tool results are placed in the JSON sent to the API is a provider expectation, and it being okay to have results and other content in the same message (with results first and content second) is explicitly a provider choice; not all providers do this. A flat list of c(tool_results, contents) is the most common ordering, but, as with OpenAI, different providers may make their own choices.

Anyway, I'm not super attached to this and I'll be happy to implement the flat list to see how that feels. I'm just giving some background because I feel we're thinking of the same risks and approaching this with similar design philosophies, just ending up with slightly different conclusions.

Member

Hmmm, that's reasonable. I think my intuition about this design comes from my strong belief that the Claude model is "correct" and the OpenAI one is clearly inferior (because of the way that Claude makes it clear that there's a user turn consisting of various types of content, then an assistant turn consisting of various types of content).

That said, this code is only used for non-Claude APIs, so maybe that doesn't apply.

But I still feel like there are two things happening here: first we flatten tool results with multiple content types, and then we pull the tool results out into their own list for providers that need it. So maybe I'd prefer an API like tool_results_flatten() followed by tool_results_split(). But I haven't closely read the implementation, so it might be that this would make the code more complicated and/or less clear.
