
Expand core api to allow for reading images #1

Open
bdougie opened this issue Feb 26, 2025 · 1 comment
bdougie commented Feb 26, 2025

I am working on an agent using the API: https://github.com/bdougie/vision

A lot of the work in my agent is preparing video frames so the vision model can return a description. Today the API doesn't actually send the image to Ollama for processing.

https://github.com/bdougie/vision/blob/51fb3a17f0b7e9273798c05f86ca435aa575d109/main.go#L41-L59

func analyzeImage(ctx context.Context, a *agent.DefaultAgent, imagePath string) (string, error) {
	imageData, err := os.ReadFile(imagePath)
	if err != nil {
		return "", err
	}

	// Create vision prompt with image data. Note: %s interpolates the raw
	// file bytes into the prompt string here, not a base64 encoding.
	prompt := fmt.Sprintf(`[
		{"type": "text", "text": "Describe this image in detail."},
		{"type": "image", "source": {"data": "%s", "media_type": "image/jpeg"}}
	]`, imageData)

	response, err := a.Run(ctx, prompt, agent.DefaultStopCondition)
	if err != nil {
		return "", err
	}

	// this line does not return what I need today
	return response[0].Message.Content, nil
}

I may be wrong on this as I am still figuring it out, but with Ollama locally I can simply do this:

ollama run llama3.2-vision:11b
>>> describe the image at this location /./frame_0003.jpg
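
For context on what needs to happen under the hood (this is my read of Ollama's REST API, not of agent-api): Ollama's /api/generate endpoint takes base64-encoded images in an images array, so a direct call looks roughly like this sketch:

package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Read the frame and base64-encode it, since Ollama expects
	// images as base64 strings in the "images" field.
	raw, err := os.ReadFile("frame_0003.jpg")
	if err != nil {
		panic(err)
	}

	body, _ := json.Marshal(map[string]any{
		"model":  "llama3.2-vision:11b",
		"prompt": "Describe this image in detail.",
		"images": []string{base64.StdEncoding.EncodeToString(raw)},
		"stream": false,
	})

	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}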

What would I like to see?

Perhaps a tool that is ready to prepare an image for the model.

jpmcb self-assigned this Feb 26, 2025

jpmcb commented Feb 27, 2025

Currently, core doesn't really support images, although a field for them exists on the Message type:

// Message represents a single message in a conversation with multimodal support
type Message struct {
	// etc. etc.

	// A list of base64-encoded images (for multimodal models such as llava
	// or llama3.2-vision)
	Images []string
}

It should be as simple as getting the base64 encodings for the images and passing them to the agent. Something like:

func (a *Agent) WithImages(base64Img string) 
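
A minimal sketch of the caller side, assuming an option along those lines existed (WithImages is the proposal above, not a current agent-api method):

import (
	"context"
	"encoding/base64"
	"os"
	// plus the agent package import from agent-api (path omitted here)
)

func describeImage(ctx context.Context, a *agent.DefaultAgent, imagePath string) (string, error) {
	raw, err := os.ReadFile(imagePath)
	if err != nil {
		return "", err
	}

	// The caller does the base64 encoding, then hands the result to the
	// agent via the proposed (hypothetical) WithImages option.
	a.WithImages(base64.StdEncoding.EncodeToString(raw))

	response, err := a.Run(ctx, "Describe this image in detail.", agent.DefaultStopCondition)
	if err != nil {
		return "", err
	}
	return response[0].Message.Content, nil
}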

A few things to think through:

  • Do we expect that an end user would already have the images base64 encoded? Or should that be something that agent-api does for you? (A sketch of the latter option is below.)
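
If agent-api did the encoding for you, a convenience helper could accept a file path and handle the base64 step internally. A hypothetical sketch (WithImageFile and the images field are illustrations, not existing API):

// WithImageFile is a hypothetical convenience method: it reads the file,
// base64-encodes it, and stores it for the next message's Images slice.
// (a.images is an assumed internal field, for illustration only.)
func (a *Agent) WithImageFile(path string) error {
	raw, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	a.images = append(a.images, base64.StdEncoding.EncodeToString(raw))
	return nil
}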
