Prerequisites
Feature Description
LLaVA has a new version called OneVision, which was released on 2024/08/06:
HuggingFace
GitHub
Release Notes
LLaVA OneVision uses SigLIP SO400M as the vision encoder and Qwen-2 as the language model, with trainable components including the projector and, in the later training stages, the full model.
I'm no expert, but as I understand it, the architecture is similar to the previous versions; both the vision encoder and the language model are different, though.
llama.cpp LLaVA support: https://github.com/ggerganov/llama.cpp/tree/master/examples/llava
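For reference, the model already runs through the Hugging Face transformers port, which is probably the most useful implementation to study for a llama.cpp conversion. Below is a minimal single-image sketch; the LlavaOnevisionForConditionalGeneration class and the llava-hf checkpoint ID are taken from the Hugging Face hub and assume a recent transformers release (4.45 or later), so verify them against your installed version:

```python
# Minimal single-image sketch using the Hugging Face `transformers` port
# of LLaVA OneVision (not llama.cpp). Assumes transformers >= 4.45 and the
# llava-hf checkpoint below; verify both against your environment.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"  # smallest (0.5B) variant
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Build a single-image prompt with the model's chat template.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example.jpg")  # any local test image
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```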
Motivation
Compared to the currently supported LLaVA 1.6, it provides the following features:
- Supports various input resolutions, up to 2304 × 2304 pixels.
- Single-image input is represented by at most 729 × (9+1) tokens under the anyres_max_9 mode.
- Supports multi-image and video inputs: multi-image input is represented by 729 tokens per image, and video input by 196 tokens per frame.
- Available in three sizes (0.5B, 7B, and 72B parameters) to fit different memory and inference-latency requirements.
- Better support for Set-of-Mark prompting.
- And more...
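To make the token arithmetic above concrete, here is a small illustrative sketch. The constants are the ones quoted in the list; the helper function and its name are hypothetical and not part of any library:

```python
# Hypothetical helper illustrating the visual-token budgets quoted above.
# The constants come from the feature list; the function is illustrative only.
TOKENS_PER_IMAGE = 729   # one SigLIP SO400M grid (27 x 27 patches)
TOKENS_PER_FRAME = 196   # pooled representation per video frame

def visual_token_budget(num_images: int = 0, num_frames: int = 0,
                        anyres_max: int = 9) -> int:
    """Upper bound on visual tokens for one request."""
    if num_images == 1 and num_frames == 0:
        # Single image under anyres_max_9: base view plus up to 9 tiles.
        return TOKENS_PER_IMAGE * (anyres_max + 1)
    # Multi-image: a flat 729 tokens per image; video: 196 per frame.
    return num_images * TOKENS_PER_IMAGE + num_frames * TOKENS_PER_FRAME

print(visual_token_budget(num_images=1))   # 7290
print(visual_token_budget(num_images=4))   # 2916
print(visual_token_budget(num_frames=32))  # 6272
```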
Possible Implementation
No response