# InternVL

## Introduction
InternVL2 is a powerful open-source vision-language model (VLM) series developed by OpenGVLab ([OpenGVLab/InternVL2](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e)). LMDeploy supports the following InternVL series of models, which are detailed in the table below:

| Model | Size | Supported Inference Engine |
| :---------: | :-----: | :------------------------: |
| InternVL | 13B-19B | TurboMind |
| InternVL1.5 | 2B-26B | TurboMind, PyTorch |
| InternVL2 | 1B | PyTorch |
| InternVL2 | 2B-76B | TurboMind, PyTorch |

The following sections demonstrate how to deploy an InternVL model using LMDeploy, with [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) as an example.

## Installation

Please install LMDeploy by following the [installation guide](../installation.md), and install the other packages that InternVL2 needs:

```shell
pip install timm
# It is recommended to install a pre-built flash-attn wheel that matches your environment from the releases on https://github.com/Dao-AILab/flash-attention.
pip install flash-attn
```

Or, you can build a docker image to set up the inference environment. Install Docker, Docker Compose and the [Compose GPU support](https://docs.docker.com/compose/gpu-support/) first, and note that PyTorch inside the image will not work if the image's CUDA version is newer than the one your host driver supports. If the CUDA version on your host machine is `>=12.4`, you can run:

```shell
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
docker build --build-arg CUDA_VERSION=cu12 -t openmmlab/lmdeploy:internvl . -f ./docker/InternVL_Dockerfile
```

Otherwise, you can go with:

```shell
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
docker build --build-arg CUDA_VERSION=cu11 -t openmmlab/lmdeploy:internvl . -f ./docker/InternVL_Dockerfile
```
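
If you are unsure which variant to build, the highest CUDA version supported by your host driver is shown in the header of `nvidia-smi`:

```shell
# The "CUDA Version" field in the header reports the newest CUDA runtime the driver supports
nvidia-smi
```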

## Offline inference

The following sample code shows the basic usage of the VLM pipeline. For detailed information, please refer to [VLM Offline Inference Pipeline](./vl_pipeline.md).

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```
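
As the table in the introduction shows, most InternVL models can run on either the TurboMind or the PyTorch engine. Below is a minimal sketch of selecting the backend explicitly; the `session_len` value is only an illustrative assumption:

```python
from lmdeploy import PytorchEngineConfig, TurbomindEngineConfig, pipeline
from lmdeploy.vl import load_image

# TurboMind engine, e.g. for InternVL2-8B
pipe = pipeline('OpenGVLab/InternVL2-8B',
                backend_config=TurbomindEngineConfig(session_len=8192))

# PyTorch engine, e.g. for InternVL2-1B, which the table lists as PyTorch-only
# pipe = pipeline('OpenGVLab/InternVL2-1B',
#                 backend_config=PytorchEngineConfig(session_len=8192))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
print(pipe(('describe this image', image)))
```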

More examples are listed below:

<details>
<summary>
<b>multi-image multi-round conversation, combined images</b>
</summary>

```python
from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl.constants import IMAGE_TOKEN

pipe = pipeline('OpenGVLab/InternVL2-8B', log_level='INFO')
messages = [
    dict(role='user', content=[
        dict(type='text', text=f'<img>{IMAGE_TOKEN}{IMAGE_TOKEN}</img>\nDescribe the two images in detail.'),
        dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image1.jpg')),
        dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image2.jpg'))
    ])
]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))

messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='What are the similarities and differences between these two images.'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
```

</details>

<details>
<summary>
<b>multi-image multi-round conversation, separate images</b>
</summary>

```python
from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl.constants import IMAGE_TOKEN

pipe = pipeline('OpenGVLab/InternVL2-8B', log_level='INFO')
messages = [
    dict(role='user', content=[
        dict(type='text', text=f'Image-1: <img>{IMAGE_TOKEN}</img>\nImage-2: <img>{IMAGE_TOKEN}</img>\nDescribe the two images in detail.'),
        dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image1.jpg')),
        dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image2.jpg'))
    ])
]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))

messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='What are the similarities and differences between these two images.'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
```

</details>

<details>
<summary>
<b>video multi-round conversation</b>
</summary>

```python
import numpy as np
from lmdeploy import pipeline, GenerationConfig
from decord import VideoReader, cpu
from lmdeploy.vl.constants import IMAGE_TOKEN
from lmdeploy.vl.utils import encode_image_base64
from PIL import Image
pipe = pipeline('OpenGVLab/InternVL2-8B', log_level='INFO')


def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices


def load_video(video_path, bound=None, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    imgs = []
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        imgs.append(img)
    return imgs


video_path = 'red-panda.mp4'
imgs = load_video(video_path, num_segments=8)

question = ''
for i in range(len(imgs)):
    question = question + f'Frame{i+1}: <img>{IMAGE_TOKEN}</img>\n'

question += 'What is the red panda doing?'

content = [{'type': 'text', 'text': question}]
for img in imgs:
    content.append({'type': 'image_url', 'image_url': {'max_dynamic_patch': 1, 'url': f'data:image/jpeg;base64,{encode_image_base64(img)}'}})

messages = [dict(role='user', content=content)]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))

messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='Describe this video in detail. Don\'t repeat.'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
```

</details>

## Online serving

You can launch the server by the `lmdeploy serve api_server` CLI:

```shell
lmdeploy serve api_server OpenGVLab/InternVL2-8B
```
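
The server accepts further options; for example, an illustrative multi-GPU launch (the argument values below are assumptions to adapt to your hardware, not required settings):

```shell
lmdeploy serve api_server OpenGVLab/InternVL2-8B \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 4 \
    --cache-max-entry-count 0.5
```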

You can also start the service using the docker image built above:

```shell
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 23333:23333 \
--ipc=host \
openmmlab/lmdeploy:internvl \
lmdeploy serve api_server OpenGVLab/InternVL2-8B
```

The docker compose is another option. Create a `docker-compose.yml` configuration file in the root directory of the lmdeploy project as follows. The `CUDA_VISIBLE_DEVICES` environment variable controls which host GPUs the container can use; if you expose more than one GPU, add `--tp <num_gpus>` to the `api_server` command so they are used for tensor parallelism:

```yaml
version: '3.5'

services:
  lmdeploy:
    container_name: lmdeploy
    image: openmmlab/lmdeploy:internvl
    ports:
      - "23333:23333"
    environment:
      HUGGING_FACE_HUB_TOKEN: <secret>
      CUDA_VISIBLE_DEVICES: 0,1,2,3
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    stdin_open: true
    tty: true
    ipc: host  # required for NCCL communication in multi-GPU serving
    command: lmdeploy serve api_server OpenGVLab/InternVL2-8B
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: "all"
              capabilities: [gpu]
```
Then, you can execute the startup command as below:
```shell
docker-compose up -d
```

If you see the following logs after running `docker logs -f lmdeploy`, the service has launched successfully.

```text
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
```
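
Once the service is up, a quick sanity check (assuming the default port 23333 used throughout this guide) is to query the OpenAI-compatible model list endpoint:

```shell
curl http://0.0.0.0:23333/v1/models
```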

The arguments of `lmdeploy serve api_server` can be reviewed in detail by `lmdeploy serve api_server -h`.

### Testing

The following Python script sends an image question to the deployed InternVL2 service through its OpenAI-compatible API, which can be used to verify the deployment:

```python
import asyncio

from openai import AsyncOpenAI


async def main():
    client = AsyncOpenAI(api_key='YOUR_API_KEY',
                         base_url='http://0.0.0.0:23333/v1')
    # query the model list instead of hard-coding the served model name
    model_cards = await client.models.list()
    response = await client.chat.completions.create(
        model=model_cards.data[0].id,
        messages=[
            {
                'role': 'system',
                'content': 'You are a helpful assistant.'
            },
            {
                'role': 'user',
                'content': [
                    {
                        'type': 'text',
                        'text': 'What kind of clothing are the characters wearing in the picture?'
                    },
                    {
                        'type': 'image_url',
                        'image_url': {
                            'url': 'https://c-ssl.duitang.com/uploads/item/201707/12/20170712134209_chaxS.jpeg'
                        }
                    },
                ]
            },
        ],
        temperature=0.8,
        top_p=0.8)
    print(response)


asyncio.run(main())
```
More information about `api_server` as well as how to access the service can be found [here](api_server_vl.md).