From 3aa462e00bef6da3a5445a4588dda2e267f3eac1 Mon Sep 17 00:00:00 2001
From: Haotian Tang
Date: Fri, 3 May 2024 10:49:39 -0400
Subject: [PATCH 1/3] Update README.md

---
 README.md              | 9 +++++----
 demo_trt_llm/README.md | 6 +++---
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index a55efa8b..7d30fb60 100644
--- a/README.md
+++ b/README.md
@@ -16,8 +16,9 @@ VILA is a visual language model (VLM) pretrained with interleaved image-text dat
 
 ## 💡 News
 
-- [2024/05] We release [AWQ](https://arxiv.org/pdf/2306.00978.pdf)-quantized 4bit VILA-1.5 models supported by [TinyChat](https://github.com/mit-han-lab/llm-awq/tree/main/tinychat) and [TensorRT-LLM](demo_trt_llm) backends.
-- [2024/05] We release VILA-1.5, which comes with four model sizes (3B/8B/13B/40B) and offers native support for multi-image and video understanding.
+- [2024/05] We release VILA-1.5, which offers **video understanding support** and comes with four model sizes (3B/8B/13B/40B).
+- [2024/05] We release [AWQ](https://arxiv.org/pdf/2306.00978.pdf)-quantized 4bit VILA-1.5 models supported by [TinyChat](https://github.com/mit-han-lab/llm-awq/tree/main/tinychat) and [TensorRT-LLM](demo_trt_llm) backends.
+- [2024/03] VILA has been accepted by CVPR 2024!
 - [2024/02] We release [AWQ](https://arxiv.org/pdf/2306.00978.pdf)-quantized 4bit VILA models, deployable on Jetson Orin and laptops through [TinyChat](https://github.com/mit-han-lab/llm-awq/tree/main/tinychat) and [TinyChatEngine](https://github.com/mit-han-lab/TinyChatEngine).
 - [2024/02] VILA is released. We propose interleaved image-text pretraining that enables multi-image VLM. VILA comes with impressive in-context learning capabilities. We open source everything: including training code, evaluation code, datasets, model ckpts.
 - [2023/12] [Paper](https://arxiv.org/abs/2312.07533) is on Arxiv!
@@ -224,7 +225,7 @@ python -W ignore llava/eval/run_vila.py \
     --model-path Efficient-Large-Model/VILA1.5-3b \
     --conv-mode vicuna_v1 \
     --query "