Hello,
I'm new to LLM serving and multi-modal LLMs. I'm looking for an example for the LongVILA model similar to the one provided for the VILA1.5 models:
python -W ignore llava/eval/run_vila.py --model-path Efficient-Large-Model/VILA1.5-13b --conv-mode vicuna_v1 --query "<video>\n Describe what happened in the video." --video-file "./example.mp4" --num-video-frames 20
Specifically, I'd like to know which conv-mode I should use and the maximum number of frames supported, for both LongVILA and the VILA1.5 models. I also noticed the paper mentions a downsampler that reduces the number of tokens per image; do you have an example of how to use it?
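For reference, this is roughly what I have been trying, adapted from the VILA1.5 command above. The model path, conv-mode, and frame count here are only my guesses based on the model card naming, so please correct whatever is wrong:

# guessed checkpoint name, conv-mode, and frame budget; same run_vila.py flags as the VILA1.5 example
python -W ignore llava/eval/run_vila.py --model-path Efficient-Large-Model/Llama-3-LongVILA-8B-1024Frames --conv-mode llama_3 --query "<video>\n Describe what happened in the video." --video-file "./example.mp4" --num-video-frames 256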
Thanks!