Support VILA (#94)
RaymondWang0 authored Feb 23, 2024
1 parent 2b93f2c commit c2d2de4
Showing 27 changed files with 762 additions and 164 deletions.
107 changes: 106 additions & 1 deletion README.md
@@ -11,6 +11,9 @@ Feel free to check out our [slides](assets/slides.pdf) for more details!
### Code LLaMA Demo on an NVIDIA GeForce RTX 4070 laptop:
![coding_demo_gpu](assets/figures/coding_demo_gpu.gif)

### VILA Demo on an Apple MacBook Pro (M1, 2021):
![vlm_demo_m1](assets/figures/vlm_demo_m1.gif)

### LLaMA Chat Demo on an Apple MacBook Pro (M1, 2021):
![chat_demo_m1](assets/figures/chat_demo_m1.gif)

@@ -34,6 +37,8 @@ Feel free to check out our [slides](assets/slides.pdf) for more details!

## News

- **(2024/02)** 🔥We extended support for vision language models (VLMs). Feel free to try running [VILA](#deploy-vision-language-model-vlm-chatbot-with-tinychatengine) on your edge device.
- **(2024/01)** 🔥We released TinyVoiceChat, a voice chatbot that can be deployed on your edge devices, such as MacBook and Jetson Orin Nano. Check out our [demo video](https://youtu.be/Bw5Dm3aWMnA?si=CCvZDmq3HwowEQcC) and follow the [instructions](#deploy-speech-to-speech-chatbot-with-tinychatengine-demo) to deploy it on your device!
- **(2023/10)** We extended support for the coding assistant [Code Llama](#download-and-deploy-models-from-our-model-zoo). Feel free to check it out.
- **(2023/10)** ⚡We released a new CUDA backend that supports NVIDIA GPUs with compute capability >= 6.1 for both server and edge GPUs. It is also ~40% faster than the previous version. Feel free to check it out!

@@ -132,6 +137,66 @@ Here, we provide step-by-step instructions to deploy LLaMA2-7B-chat with TinyChatEngine
```


## Deploy speech-to-speech chatbot with TinyChatEngine [[Demo]](https://youtu.be/Bw5Dm3aWMnA?si=CCvZDmq3HwowEQcC)

TinyChatEngine offers versatile capabilities suitable for various applications. Here, we introduce a voice chatbot and provide easy-to-follow instructions to deploy a speech-to-speech chatbot (LLaMA2-7B-chat) with TinyChatEngine.

- Follow the instructions above to set up the basic environment, i.e., [Prerequisites](#prerequisites) and [Step-by-step to Deploy LLaMA2-7B-chat with TinyChatEngine](#step-by-step-to-deploy-llama2-7b-chat-with-tinychatengine).

- Run the shell script to set up the environment for the speech-to-speech chatbot.
```bash
cd llm
./voicechat_setup.sh
```

- Start the speech-to-speech chat locally.
```bash
./chat -v # chat.exe -v on Windows
```

- If you encounter any issues or errors during setup, please follow the step-by-step debugging guide [here](llm/application/README.md).


## Deploy vision language model (VLM) chatbot with TinyChatEngine

TinyChatEngine supports not only LLMs but also VLMs. Here, we introduce a text/voice chatbot for VLMs and provide easy-to-follow instructions to deploy a vision language model chatbot (VILA-7B) with TinyChatEngine.

- Follow the instructions above to set up the basic environment, i.e., [Prerequisites](#prerequisites) and [Step-by-step to Deploy LLaMA2-7B-chat with TinyChatEngine](#step-by-step-to-deploy-llama2-7b-chat-with-tinychatengine).

- To display images in the terminal, please download and install the following toolkit.
- Install [termvisage](https://github.com/AnonymouX47/termvisage).
- (For macOS) Install [iTerm2](https://iterm2.com/index.html).
- (For other OSes) Please refer to [the termvisage requirements](https://github.com/AnonymouX47/termvisage?tab=readme-ov-file#requirements) to get an appropriate terminal ready.

- (Optional) To enable the speech-to-speech chatbot for VLM, please follow the [instructions above](#deploy-speech-to-speech-chatbot-with-tinychatengine-demo) to run the shell script that sets up the environment.

- Download the quantized VILA-7B model from our model zoo.

- On an x86 device (e.g., Intel/AMD laptop)
```bash
python tools/download_model.py --model VILA_7B_awq_int4_CLIP_ViT-L --QM QM_x86
```
- On an ARM device (e.g., M1/M2 MacBook, Raspberry Pi)
```bash
python tools/download_model.py --model VILA_7B_awq_int4_CLIP_ViT-L --QM QM_ARM
```
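If you are unsure which command applies to your machine, the choice can be automated. This is a small sketch, not part of the repo: the `--QM` values are the ones listed above, while the `uname -m` mapping is an assumption about common platform names.

```shell
# Pick the --QM flag for tools/download_model.py from the host CPU.
# QM_x86/QM_ARM are the values from the commands above; the uname -m
# mapping below is an illustrative assumption, not part of the repo.
case "$(uname -m)" in
    x86_64)        QM=QM_x86 ;;
    arm64|aarch64) QM=QM_ARM ;;
    *) echo "unsupported architecture: $(uname -m)" >&2; exit 1 ;;
esac
echo "python tools/download_model.py --model VILA_7B_awq_int4_CLIP_ViT-L --QM $QM"
```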

- (For macOS) Start the chatbot locally. Please use an appropriate terminal (e.g., iTerm2).
- Image/Text to text
```bash
./scripts/vila.sh ../assets/figures/vlm_demo/pedestrian.png
```

- Image/Speech to speech
```bash
./scripts/voice_vila.sh ../assets/figures/vlm_demo/pedestrian.png
```

- There are several images under the path `../assets/figures/vlm_demo`. Feel free to try different images with VILA on your device!

- For other OSes, please modify line 4 of [vila.sh](llm/scripts/vila.sh) and [voice_vila.sh](llm/scripts/voice_vila.sh) to use the correct terminal.
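To browse the available demo images before picking one, a tiny helper can enumerate them. This is a hypothetical sketch: the function name is ours, and only the default path comes from the README above.

```shell
# Hypothetical helper: list every .png in a directory so you can pick one
# to pass to vila.sh. The default path is the demo directory used above.
list_demo_images() {
    local dir="${1:-../assets/figures/vlm_demo}"
    shopt -s nullglob
    local imgs=("$dir"/*.png)
    printf '%s\n' "${imgs[@]}"
}

# Example usage: ./scripts/vila.sh "$(list_demo_images | head -n 1)"
```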


## Backend Support

| Precision | x86<br /> (Intel/AMD CPU) | ARM<br /> (Apple M1/M2 & RPi) | Nvidia GPU | Apple GPU |
@@ -258,7 +323,47 @@ We offer a selection of models that have been tested with TinyChatEngine. These
<td> ✅ </td>
</tr>
<tr>
<td rowspan="2">VILA-7B</td>
<td> fp32</td>
<td> VILA_7B_CLIP_ViT-L_fp32 </td>
<td> ✅ </td>
<td> ✅ </td>
<td> </td>
</tr>
<tr>
<!-- No data for the first column here because it's merged with data1 -->
<td> int4</td>
<td> VILA_7B_awq_int4_CLIP_ViT-L </td>
<td></td>
<td></td>
<td> </td>
</tr>
<tr>
<td rowspan="2">LLaVA-v1.5-13B</td>
<td> fp32</td>
<td> LLaVA_13B_CLIP_ViT-L_fp32 </td>
<td></td>
<td></td>
<td> </td>
</tr>
<tr>
<!-- No data for the first column here because it's merged with data1 -->
<td> int4</td>
<td> LLaVA_13B_awq_int4_CLIP_ViT-L </td>
<td> ✅ </td>
<td> ✅ </td>
<td> </td>
</tr>
<tr>
<td rowspan="2">LLaVA-v1.5-7B</td>
<td> fp32</td>
<td> LLaVA_7B_CLIP_ViT-L_fp32 </td>
<td> ✅ </td>
<td> ✅ </td>
<td> </td>
</tr>
<tr>
<!-- No data for the first column here because it's merged with data1 -->
<td> int4</td>
<td> LLaVA_7B_awq_int4_CLIP_ViT-L </td>
<td></td>
Binary file added assets/figures/vlm_demo/animal_blocking.png
File renamed without changes
Binary file added assets/figures/vlm_demo/windmill_people.png
2 changes: 2 additions & 0 deletions kernels/matmul.h
@@ -123,6 +123,8 @@ class MatmulOperator {
void mat_mul_accelerator_int4_fast(const struct matmul_params *params);
void mat_mul_accelerator_int4_fast_no_offset(const struct matmul_params *params);
void mat_mul_accelerator_int8_int4_fast_no_offset(struct matmul_params *params);
void gemv_accelerator_int8_int4_fast_no_offset(struct matmul_params *params);
void gemm_accelerator_int8_int4_fast_no_offset(struct matmul_params *params);
void naive_mat_mul_int4(const struct matmul_params *params);
void naive_mat_mul_int4_with_offset(const struct matmul_params *params);
// cuda
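The two new entry points separate the single-row (GEMV) case of token-by-token decoding from the batched (GEMM) case of prompt prefill. Below is a minimal dispatch sketch; the struct fields and the `m == 1` heuristic are illustrative assumptions, not the repository's actual logic.

```cpp
#include <cassert>

// Illustrative stand-in for struct matmul_params (the real one in
// kernels/matmul.h also carries quantized weights, scales, and zero points).
struct matmul_params_sketch {
    int m;  // activation rows: 1 per step during autoregressive decoding
    int n;  // output features
    int k;  // reduction (input-feature) dimension
};

// Assumed dispatch rule: single-row matmuls take the GEMV entry point,
// batched matmuls (e.g., prompt prefill) take the GEMM entry point.
bool use_gemv_path(const matmul_params_sketch &p) {
    assert(p.m > 0 && p.n > 0 && p.k > 0);
    return p.m == 1;
}
```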
