Add multimodal support in prompt_template for easier prompting. #242
Note that users can pass in several images, and images can be part of any user message, not just the final message. I suggest something like this:

```
prompt_template = """
SYSTEM:
...
USER:
<you need a way to pass a string or a list of multiple input types here>
ASSISTANT:
<AI can also output text, audio, images, etc., just like in GPT-4o>
USER:
<you need a way to pass a string or a list of multiple input types here>
"""
```

History should also take this into account. I'm not sure if this is the right abstraction; maybe OpenAI's abstraction of representing chat history as a list is already great. Because this is a chat model, not an instruct model, if you want to utilize Python features, you should model the prompt template as a list, not as a string. It feels like we are repeating the mistake of LangChain. Forcing ...
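The list-shaped modeling suggested here could look roughly like the following. This is a hypothetical sketch only; the tuple/dict shapes are illustrative and not an existing API:

```python
# Hypothetical list-based prompt: each message is a role plus a list of parts,
# mirroring the OpenAI chat-history format instead of one template string.
prompt = [
    ("system", ["You are a helpful assistant."]),
    ("user", ["What is in this image?", {"image": b"<jpeg bytes>"}]),
    ("assistant", ["It looks like a cat."]),
    ("user", ["And how about this one?", {"image": b"<other jpeg bytes>"}]),
]
```

Plain Python lists like this get ordering, repetition, and mixed input types for free, which is the point being made about not inventing a custom string language.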
I agree, which is why we originally opted for enabling writing the messages array directly. We also enable the ... I do think there is a way to update the prompt template parser to provide good DX for the multi-modal case; however, as you mentioned, I also don't think the solution is a self-defined language. Just because a multi-modal user message has a content array doesn't necessarily mean that we also need to have an array in the prompt template through a custom language. In fact, I think there is potentially a rather nice way of writing multi-modal messages still as a single string. For example:

```python
from mirascope.openai import OpenAICall, OpenAIImage


class MultiModalCall(OpenAICall):
    prompt_template = "Can you please describe this image? {image}"

    img_bytes: bytes

    @property
    def image(self) -> OpenAIImage:
        return OpenAIImage(media_type="jpeg", bytes=self.img_bytes)
```

To me, this feels more natural as a transcript and how I would generally interact with the chat model anyway. Then, under the hood, we can parse the user message into the correct content array if images are provided. Of course, we would also want to ensure:

What do you think @off6atomic @brenkao?
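Under the hood, that single-string template would presumably expand into an OpenAI-style content array. A minimal sketch of that expansion (the function name and details are hypothetical, assuming the standard chat-completions message shape):

```python
import base64


def render_user_message(text: str, image_bytes: bytes, media_type: str = "jpeg") -> dict:
    # Hypothetical expansion of "Can you please describe this image? {image}"
    # into a single user message whose content is a list of typed parts.
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/{media_type};base64,{b64}"},
            },
        ],
    }
```

If no image fields are present, the parser could fall back to a plain string `content`, so existing text-only templates would be unaffected.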
If this works across all our various providers, then I'm all for it.
@willbakst I think that's a better syntax for producing a list of inputs indeed. I totally missed that. However, I still think this is a custom markup, which means it needs to be very easy for users to understand how it's parsed and there should be a page that explicitly explains how the custom markup is parsed into OpenAI format (or internal Mirascope format). I would suggest using this syntax in a way that tells to the user it's simply being parsed to a list (and users can specify order of the items in the list). For example, if user wants to pass Users should also be allowed to type the inputs in multiple lines e.g. """
USER:
What is the following image?
{image}
How does it relate to the following audio and video?
{audio} {video}
I want you to describe the relationship in {style} tone.
""" would be translated to I think this provides a simple mental representation for users to understand the parser. It's just splitting the string by non-text inputs. One thing we need to be clear to users is how we remove whitespaces and newlines surrounding non-text inputs. Here is a typical use case: """
USER:
Please look at the cat and dog images and tell me which one is more cute.
{cat_image} {dog_image}
""" Note that |
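The splitting behavior described above could be sketched as follows. The name `split_template` and the `"media"` part shape are hypothetical; the point is that the string is split on non-text placeholders, with surrounding whitespace and newlines stripped from the adjacent text and empty segments dropped:

```python
import re


def split_template(template: str, media_keys: set[str]) -> list[dict]:
    """Split a user-message template into text and non-text parts.

    Hypothetical sketch: placeholders named in `media_keys` become standalone
    parts; whitespace/newlines around them are stripped from adjacent text.
    """
    pattern = "|".join(r"\{%s\}" % re.escape(k) for k in media_keys)
    parts: list[dict] = []
    pos = 0
    for match in re.finditer(pattern, template):
        text = template[pos:match.start()].strip()
        if text:  # drop segments that were only whitespace between placeholders
            parts.append({"type": "text", "text": text})
        parts.append({"type": "media", "name": match.group(0)[1:-1]})
        pos = match.end()
    tail = template[pos:].strip()
    if tail:
        parts.append({"type": "text", "text": tail})
    return parts
```

On the cat/dog example above, this would yield one text part followed by the two image parts, with the space between `{cat_image}` and `{dog_image}` discarded rather than emitted as an empty text part.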
@off6atomic 100%, everything you've described is pretty much exactly the behavior I would expect. The goal is for the parser to feel intuitive and behave how you would expect, so it's "convenient" and not "magic" (but still feels like magic).

Of course, I totally agree that in order for this not to be "magic" we need extremely clear documentation. For the README examples, this will likely be simple comments plus examples of what the output messages will look like, so it stays succinct. In the concepts/writing_prompts.md docs page we should add a more detailed writeup of exactly what is happening under the hood so it's extremely clear to users. We can also mention in the README with this update that users should read the docs for more details.

How we handle the parser will need some more thought as we work towards implementing this feature and see what makes sense both from an internal implementation perspective and from an external DX perspective. Mostly I want to make sure that any decisions we make for parsing image/audio prompts don't have unintended effects on other prompts.

I'm hoping to find some time soon to prioritize this feature now that we've got a good idea of the interface and DX.
Description

The first thing that comes to mind is to add something like this: ...