Conversation

@gadenbuie gadenbuie commented Aug 27, 2025

This PR adds support for tool results that return image or PDF content.

This isn't a feature that's widely supported in provider APIs, but we get around this limitation by moving image and PDF content out of the tool result and into the abstract user turn that carries the tool results.

We support two cases:

  • Directly returning a content_image() or content_pdf() as a tool result.
  • Returning a list that contains these content types, nested at most one level deep.

In all cases, we replace the value in the tool result with "[see below]" (or "[see below: item N]" in the list case) and we wrap the extra content in <content tool-call-id="abc123" item="N">...content...</content> XML tags.
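
To make that concrete, here's a rough sketch of the rewriting (illustrative only; wrap_extra_content() is a made-up name, not the helper used in this PR):

wrap_extra_content <- function(id, items) {
  n <- length(items)
  if (n == 1) {
    placeholders <- "[see below]"
    tags <- sprintf('<content tool-call-id="%s">\n%s\n</content>', id, items)
  } else {
    placeholders <- sprintf("[see below: item %d]", seq_len(n))
    tags <- sprintf(
      '<content tool-call-id="%s" item="%d">\n%s\n</content>',
      id, seq_len(n), items
    )
  }
  # placeholders replace the values in the tool result; the tagged content
  # is appended to the user turn that carries the tool results
  list(result = placeholders, extra = tags)
}

wrap_extra_content("abc123", c("[inline image]", "[inline image]"))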

Notes

  • OpenAI requires that tool results be sent as separate messages that directly follow the assistant message. This appears to be common among providers that put tool results in their own messages. I checked all as_json() methods for Turn and updated them to return tool_message, user_message (sketched below).
  • tool_string() doesn't support having these content types in the tool result because it calls jsonlite::toJSON(). I updated this function so that internally we can force the JSON conversion for printing, but require the conversion to succeed for the actual tool results that we send across the wire; if it fails, it now fails with a more informative error message (also sketched below). (Internally we call this function when echoing the tool result, before we've pulled out the content types.)
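
As a sketch of the first note, this is roughly what the tool_message, user_message ordering looks like for an OpenAI-style messages API (function and field names here are illustrative, not the actual as_json() code):

as_openai_style_messages <- function(tool_results, other_content) {
  tool_messages <- lapply(tool_results, function(res) {
    list(role = "tool", tool_call_id = res$id, content = res$value)
  })
  user_message <- if (length(other_content) > 0) {
    list(list(role = "user", content = other_content))
  } else {
    list()
  }
  # tool messages must directly follow the assistant message that issued
  # the tool calls, so they come first
  c(tool_messages, user_message)
}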
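
And a sketch of the second note's behaviour, lenient when echoing but strict when building the payload (tool_result_json() is a stand-in name, not the actual tool_string() implementation):

tool_result_json <- function(value, check = TRUE) {
  tryCatch(
    as.character(jsonlite::toJSON(value, auto_unbox = TRUE)),
    error = function(err) {
      if (check) {
        # strict path: used for the results we actually send to the provider
        stop("Tool result can't be converted to JSON: ", conditionMessage(err))
      }
      # lenient path: used when echoing the tool result locally
      "<unserializable tool result>"
    }
  )
}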

Example

pkgload::load_all()
#> ℹ Loading ellmer

get_cat_image <- function() {
  size <- sample(200:300, 1)
  url <- sprintf("https://placecats.com/%d/%d", size, size)

  tmpf <- withr::local_tempfile(fileext = ".jpg")
  download.file(url, tmpf, quiet = TRUE)

  content_image_file(tmpf, resize = "none")
}

chat <- chat("openai/gpt-5-nano", echo = "none")
# chat <- chat("anthropic")
# chat <- chat("google_gemini")
# chat <- chat_deepseek(echo = "output")
# There aren't many tool+vision Ollama models, but this one should work (but not on my M1)
# chat <- chat_ollama(model = "mistral-small3.2", echo = "output")
chat$register_tool(
  tool(
    function(n_images = 1) {
      if (n_images == 1) {
        get_cat_image()
      } else {
        lapply(seq_len(n_images), function(i) get_cat_image())
      }
    },
    name = "get_cat_image",
    description = "Gets a random cat image.",
    arguments = list(
      n_images = type_integer("Number of cat images to get at once.")
    )
  )
)

. <- chat$chat(
  "Get a random cat image and describe what the cat is feeling."
)
. <- chat$chat(
  "Get 2 random cat images and describe what the cats are feeling."
)
chat
#> <Chat OpenAI/gpt-5-nano turns=8 tokens=1826/1942 $0.00>
#> ── user [149] ──────────────────────────────────────────────────────────────────
#> Get a random cat image and describe what the cat is feeling.
#> ── assistant [281] ─────────────────────────────────────────────────────────────
#> [tool request (call_jjoIvBbPW336sG0FWh6U9b5U)]: get_cat_image(n_images = 1L)
#> ── user [-62] ──────────────────────────────────────────────────────────────────
#> [tool result  (call_jjoIvBbPW336sG0FWh6U9b5U)]: [see below]
#> <content tool-call-id="call_jjoIvBbPW336sG0FWh6U9b5U">
#> [inline image]
#> </content>
#> ── assistant [624] ─────────────────────────────────────────────────────────────
#> The cat looks curious and attentive, perhaps a touch cautious. Reasons:
#> - Ears are forward and upright, signaling interest.
#> - Wide, focused eyes suggest it’s watching or evaluating something.
#> - Whiskers are forward, which often happens when a cat is exploring or concentrating.
#> - Body is upright and alert, not relaxed or scared.
#> 
#> In short: curious, observant, and a bit cautious about its surroundings. If you’d like, I can give you a few short captions to pair with the image.
#> ── user [-497] ─────────────────────────────────────────────────────────────────
#> Get 2 random cat images and describe what the cats are feeling.
#> ── assistant [346] ─────────────────────────────────────────────────────────────
#> [tool request (call_M2b8yTCopQZWj0HyA5zVT0d1)]: get_cat_image(n_images = 2L)
#> ── user [-27] ──────────────────────────────────────────────────────────────────
#> [tool result  (call_M2b8yTCopQZWj0HyA5zVT0d1)]: ["[see below: item 1]","[see below: item 2]"]
#> <content tool-call-id="call_M2b8yTCopQZWj0HyA5zVT0d1" item="1">
#> [inline image]
#> </content>
#> <content tool-call-id="call_M2b8yTCopQZWj0HyA5zVT0d1" item="2">
#> [inline image]
#> </content>
#> ── assistant [691] ─────────────────────────────────────────────────────────────
#> Here are feel descriptions for the two images:
#> 
#> - Item 1:
#>   - Left cat: confident and curious. Ears forward, eyes open and focused, relaxed posture.
#>   - Right cat: content and sleepy. Eyes closed, resting head/face on paws, relaxed body.
#> 
#> - Item 2:
#>   - Orange cat: playful and curious. Body lowered, eyes toward the green toy, ears forward, paw/face near the toy, engaged in play or exploration.
#> 
#> Want me to suggest short captions for each image?
chat$get_turns()[[3]] |> contents_markdown() |> knitr::asis_output()
[rendered cat image]
chat$get_turns()[[7]] |> contents_markdown() |> knitr::asis_output()
[two rendered cat images]

Moving these content types out of the tool results and into the abstract user turn
better links the content to its source, but generally hides the markup
from user view (shinychat doesn't show the XML tags in assistant output).
@gadenbuie gadenbuie marked this pull request as ready for review August 27, 2025 21:01
@gadenbuie gadenbuie requested a review from hadley August 27, 2025 21:02
@CChen89 CChen89 mentioned this pull request Oct 7, 2025
simonpcouch added a commit to simonpcouch/bluffbench that referenced this pull request Oct 10, 2025
simonpcouch added a commit to tidyverse/vitals that referenced this pull request Oct 10, 2025
Rather than the ad-hoc AsIs data structure, adopt tidyverse/ellmer#735's approach
simonpcouch added a commit to simonpcouch/bluffbench that referenced this pull request Oct 10, 2025
@hadley hadley added this to the v0.4.0 milestone Oct 23, 2025
# ellmer (development version)

* ellmer now supports tools that return image or PDF content types, for example using `content_image_file()` or `content_pdf_file()`. (#735)

Member

FWIW new style is no empty line between bullets

is_tool <- map_lgl(x@contents, S7_inherits, ContentToolResult)
content <- as_json(provider, x@contents[!is_tool], ...)
if (length(content) > 0) {
data <- tool_results_separate_content(x)
Member

Do you need to update chat_openai_responses() too?

Collaborator Author

Yes, I think so! I merged the changes but have been in meetings all day; I'll pick this back up on Monday and will take a look at chat_openai_responses() then too.

chat <- chat_openai()
#> Using model = "gpt-4.1".
chat$register_tool(screenshot_website)
Member

Nice example!

turn_matched <- match_tools(turn, tools)
expect_equal(turn_matched, fixture_turn_with_tool_requests(with_tool = TRUE))
})

Member

Could we have a couple of end-to-end tests where a tool returns an image and the chat understands it? Maybe just for Anthropic + Gemini + OpenAI? We'd definitely need to use cassettes.
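
Something along these lines, assuming vcr cassettes (the test name, fixture image, and assertion are just illustrative, not the actual test setup):

test_that("tools can return images (OpenAI)", {
  vcr::use_cassette("tool-result-image-openai", {
    chat <- chat_openai(model = "gpt-4.1")
    chat$register_tool(tool(
      function() content_image_file(test_path("apple.png")),
      name = "get_image",
      description = "Returns a test image of an apple.",
      arguments = list()
    ))
    # the model should call the tool, receive the image, and describe it
    response <- chat$chat("Call get_image and tell me what fruit you see.")
    expect_match(response, "apple", ignore.case = TRUE)
  })
})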

Turn("user", contents = results[is_tool_result])
}

is_extra_content <- function(x) {
Member

I expected to see a check_tool_result() somewhere that would check that a tool call returns either a string, a ContentType, or a list of ContentTypes. Without that, this check feels a bit wibbly to me.
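
Roughly something like this (check_tool_result() is hypothetical and leans on ellmer internals like S7_inherits(), map_lgl(), and the Content base class):

check_tool_result <- function(value, call = rlang::caller_env()) {
  # a valid result is a string, a content object, or a list of those
  is_valid <- function(x) is.character(x) || S7_inherits(x, Content)
  ok <- is_valid(value) ||
    (is.list(value) && length(value) > 0 && all(map_lgl(value, is_valid)))
  if (!ok) {
    cli::cli_abort(
      "Tool results must be a string, a content object, or a list of content objects.",
      call = call
    )
  }
  invisible(value)
}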


tool_results_separate_content <- function(turn) {
if (!some(turn@contents, is_tool_result)) {
return(list(tool_results = list(), contents = turn@contents))
Member

Given that most of the providers do c(data$tool_results, data$contents), I think you should put the contents first in the list. But can you tell me more about why this doesn't just return a single list?

Collaborator Author
@gadenbuie gadenbuie Oct 24, 2025

In almost all cases, the tool results need to come first in the content list of the user turn. Most APIs will error if the tool results don't immediately follow the assistant turn in the messages list, which motivated the "tool results first" ordering in this helper.

They're two separate items in the list because some APIs want tool results in a set of separate messages; in particular, I believe this is the case with OpenAI. I thought it'd be better to return separate items and combine them as needed than to have to repeat the filtering later. I also liked that the naming makes it clear that we're ordering content as tool results then contents when we do so, rather than hiding that detail in the helper function.

(Those are loose preferences though...)

Member

I worry that separating them out might introduce some subtle bug because we might now be reordering them. But more importantly, I think the reification of our data structure to what the provider expects is best kept close to the provider, and given that OpenAI is the exception rather than the rule, I think this should just return a simple list. Does that make sense?

Collaborator Author

I think we're both identifying the same risk: provider APIs certainly care about how content and tool results are ordered, and sometimes they don't even want them passed together in the same message.

separating them out might introduce some subtle bug because we might now be reordering them

Interestingly, we currently have this exact subtle bug in ellmer, where DeepSeek and OpenAI both have code like this:

# Deepseek
c(texts_out, tools_out)

# OpenAI (/completions)
# (...after moving tools into their own messages)
c(user, tools)

These will both cause API errors because the order needs to be reversed: tools, then text. We just never hit those errors because that combination is unnatural in current usage.

Given that providers have opinions about the ordering of tool results and text, my take is that it's better to require that we explicitly make a choice in the provider implementations about how they are ordered.

I think the reification of our data structure to what the provider expects is best kept close to the provider,

I would say I'm making this exact argument. Where and how the tool results are placed in the JSON sent to the API is a provider expectation, and it being okay to have results and other content in the same message (with results first and content second) is explicitly a provider choice; not all providers do this. A flat list of c(tool_results, contents) is the most common ordering, but, as with OpenAI, different providers may make their own choices.

Anyway, I'm not super attached to this and I'll be happy to implement the flat list to see how that feels. I'm just giving some background because I feel we're thinking of the same risks and approaching this with similar design philosophies, just ending up with slightly different conclusions.

Member

Hmmm, that's reasonable. I think my intuition about this design comes from my strong belief that the Claude model is "correct" and the OpenAI one is clearly inferior (because of the way that Claude makes it clear that there's a user turn consisting of various types of content, then an assistant turn consisting of various types of content).

That said, this code is only used for non-Claude APIs, so maybe that doesn't apply.

But I still feel like there are two things happening here: first we flatten tool results with multiple content types, and then we pull the tool results out into their own list for providers that need it. So maybe I'd prefer an API like tool_results_flatten() followed by tool_results_split(). But I haven't closely read the implementation, so it might be that this would make the code more complicated and/or less clear.
