Hello,
I'm new to LLM serving and multi-modal LLMs. I'm looking for an example for the LongVILA model similar to the one provided for the VILA1.5 models:
python -W ignore llava/eval/run_vila.py --model-path Efficient-Large-Model/VILA1.5-13b --conv-mode vicuna_v1 --query "<video>\n Describe what happened in the video." --video-file "./example.mp4" --num-video-frames 20
Specifically, I'd like to know which conv-mode I should use and the maximum number of frames supported, for both LongVILA and the VILA1.5 models. I also noticed the paper mentions a downsampler that reduces the number of tokens per image; do you have an example of how to use it?
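For reference, this is roughly what I have been trying, adapted from the VILA1.5 command above. The model path, conv-mode, and frame count here are only my guesses based on the model card naming, so please correct whatever is wrong:

# guessed checkpoint name, conv-mode, and frame budget; same run_vila.py flags as the VILA1.5 example
python -W ignore llava/eval/run_vila.py --model-path Efficient-Large-Model/Llama-3-LongVILA-8B-1024Frames --conv-mode llama_3 --query "<video>\n Describe what happened in the video." --video-file "./example.mp4" --num-video-frames 256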
Thanks!