Hello, great work here. I wonder if we could make this run completely locally, e.g. with an Ollama-based model? Has anyone tried this? Are the small models that fit on a PC/Mac with, say, 16 GB of memory good enough to understand screenshots?
One of the key enablers of computer control is the LM looking at the image and predicting the action with proper coordinates.
This feature is surprisingly accurate on the new claude-3.5-sonnet model.
I'm not very confident that the smaller VLMs can do that accurately (as of today). Hopefully someone can create a fine-tuning dataset so that smaller/quantized models become accurate at this step (which might then affect their reasoning capabilities).
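For anyone who wants to try this, here is a rough sketch of the step described above (model looks at a screenshot and predicts a click coordinate) against a local Ollama server. It is not how this project does it. It assumes Ollama is running on its default port and that a vision-capable model is pulled; the `llava` model name, the prompt wording, and the helper names `build_payload`, `parse_click`, and `ask_local_model` are all my own placeholders. The bounds check matters because a small model may return coordinates that are off-screen or not valid JSON at all.

```python
import base64
import json
import re
import urllib.request

# Ollama's default local endpoint for single-turn generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes, instruction, model="llava"):
    """Build the request body; the model name and prompt are assumptions."""
    prompt = (
        f"Task: {instruction}\n"
        'Reply with only JSON of the form {"x": <int>, "y": <int>} '
        "giving the pixel to click."
    )
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def parse_click(text, width, height):
    """Extract (x, y) from the reply; return None if missing or off-screen."""
    match = re.search(r"\{[^{}]*\}", text)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
        x, y = int(data["x"]), int(data["y"])
    except (ValueError, KeyError, TypeError):
        return None
    if 0 <= x < width and 0 <= y < height:
        return (x, y)
    return None


def ask_local_model(image_bytes, instruction, width, height, model="llava"):
    """Send one screenshot to the local server and parse the predicted click."""
    body = json.dumps(build_payload(image_bytes, instruction, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())["response"]
    return parse_click(reply, width, height)
```

Calling `ask_local_model(open("screen.png", "rb").read(), "click the OK button", 1920, 1080)` on a few known screenshots and comparing against the true button positions would give a quick read on whether a given small model is accurate enough.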
Hope to hear back from you @deedy