
Context size and examples for LongVILA #141

Open
yulinzou opened this issue Sep 30, 2024 · 1 comment

yulinzou commented Sep 30, 2024

Hello,

I'm new to LLM serving and multi-modal LLMs. I'm looking for an example for the LongVILA model similar to the one provided for the VILA1.5 models:

python -W ignore llava/eval/run_vila.py --model-path Efficient-Large-Model/VILA1.5-13b --conv-mode vicuna_v1 --query "<video>\n Describe what happened in the video." --video-file "./example.mp4" --num-video-frames 20

Specifically, I'd like to know which conv-mode I should use and the maximum number of frames for both the LongVILA and VILA1.5 models. I also noticed the paper mentions a downsampler that can reduce the number of tokens per image; do you have an example of how to use that?

Thanks!

Lyken17 (Collaborator) commented Nov 19, 2024

@yukang2017 can you help confirm the context length? I think the conv mode should be llama3.
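
For reference, here is a sketch of how the VILA1.5 command above could be adapted to LongVILA with the llama3 conv mode. The checkpoint name (Efficient-Large-Model/Llama-3-LongVILA-8B-1024Frames) and the frame count of 256 are assumptions, not confirmed values; please check the released model card for the exact model path and the maximum supported number of frames:

python -W ignore llava/eval/run_vila.py --model-path Efficient-Large-Model/Llama-3-LongVILA-8B-1024Frames --conv-mode llama3 --query "<video>\n Describe what happened in the video." --video-file "./example.mp4" --num-video-frames 256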
