
Commit 635e13f

Merge pull request #6 from VectorInstitute/develop
v0.2.0
2 parents 9e532c4 + bf0b8d4 commit 635e13f


43 files changed: +2481 -1247 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -149,3 +149,6 @@ logs/
 local/
 slurm/
 scripts/
+
+# vLLM bug reporting files
+collect_env.py

Dockerfile

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+FROM nvidia/cuda:12.3.1-devel-ubuntu20.04
+
+# Non-interactive apt-get commands
+ARG DEBIAN_FRONTEND=noninteractive
+
+# No GPUs visible during build
+ARG CUDA_VISIBLE_DEVICES=none
+
+# Specify CUDA architectures -> 7.5: RTX 6000 & T4, 8.0: A100, 8.6+PTX
+ARG TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6+PTX"
+
+# Set the Python version
+ARG PYTHON_VERSION=3.10.12
+
+# Install dependencies for building Python
+RUN apt-get update && apt-get install -y \
+    wget \
+    build-essential \
+    libssl-dev \
+    zlib1g-dev \
+    libbz2-dev \
+    libreadline-dev \
+    libsqlite3-dev \
+    libffi-dev \
+    libncursesw5-dev \
+    xz-utils \
+    tk-dev \
+    libxml2-dev \
+    libxmlsec1-dev \
+    liblzma-dev \
+    git \
+    vim \
+    && rm -rf /var/lib/apt/lists/*
+
+# Download, build, and install Python from source
+RUN wget https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tgz && \
+    tar -xzf Python-$PYTHON_VERSION.tgz && \
+    cd Python-$PYTHON_VERSION && \
+    ./configure --enable-optimizations && \
+    make -j$(nproc) && \
+    make altinstall && \
+    cd .. && \
+    rm -rf Python-$PYTHON_VERSION.tgz Python-$PYTHON_VERSION
+
+# Download and install pip using get-pip.py
+RUN wget https://bootstrap.pypa.io/get-pip.py && \
+    python3.10 get-pip.py && \
+    rm get-pip.py
+
+# Ensure pip for Python 3.10 is used
+RUN python3.10 -m pip install --upgrade pip
+
+# Install Poetry using Python 3.10
+RUN python3.10 -m pip install poetry
+
+# Clone the repository
+RUN git clone https://github.com/VectorInstitute/vector-inference /vec-inf
+
+# Set the working directory
+WORKDIR /vec-inf
+
+# Configure Poetry to not create virtual environments
+RUN poetry config virtualenvs.create false
+
+# Update Poetry lock file if necessary
+RUN poetry lock
+
+# Install project dependencies via Poetry
+RUN poetry install
+
+# Install Flash Attention 2 backend
+RUN python3.10 -m pip install flash-attn --no-build-isolation
+
+# Move nccl to accessible location
+RUN mkdir -p /vec-inf/nccl
+RUN mv /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1 /vec-inf/nccl/libnccl.so.2.18.1
+
+# Set the default command to start an interactive shell
+CMD ["bash"]

README.md

Lines changed: 14 additions & 23 deletions
@@ -1,8 +1,8 @@
 # Vector Inference: Easy inference on Slurm clusters
-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). All scripts in this repository runs natively on the Vector Institute cluster environment, and can be easily adapted to other environments.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment.** To adapt to other environments, update the config files in the `models` folder and the environment variables in the model launching scripts accordingly.
 
 ## Installation
-If you are using the Vector cluster environment, and you don't need any customization to the inference server environment, all you need to do is run `pip install vllm-nccl-cu12` and go to the next section. Otherwise, you might need up to 10GB of storage to setup your own virtual environment. The following steps needs to be run only once for each user.
+If you are using the Vector cluster environment and you don't need any customization to the inference server environment, you can go straight to the next section, as a default container environment is already in place. Otherwise, you might need up to 10GB of storage to set up your own virtual environment. The following steps need to be run only once for each user.
 
 1. Setup the virtual environment for running inference servers, run
 ```bash
@@ -29,7 +29,7 @@ pip install vllm-flash-attn
 ## Launch an inference server
 We will use the Llama 3 model as example, to launch an inference server for Llama 3 8B, run
 ```bash
-bash models/llama3/launch_server.sh
+bash src/launch_server.sh --model-family llama3
 ```
 You should see an output like the following:
 > Job Name: vLLM/Meta-Llama-3-8B
@@ -44,35 +44,26 @@ You should see an output like the following:
 
 If you want to use your own virtual environment, you can run this instead:
 ```bash
-bash models/llama3/launch_server.sh -e $(poetry env info --path)
+bash src/launch_server.sh --model-family llama3 --venv $(poetry env info --path)
 ```
-By default, the `launch_server.sh` script in Llama 3 folder uses the 8B variant, you can switch to other variants with the `-v` flag, and make sure to change the requested resource accordingly. More information about the flags and customizations can be found in the [`models`](models) folder. The inference server is compatible with the OpenAI `Completion` and `ChatCompletion` API. You can inspect the Slurm output files to check the inference server status.
+By default, the `launch_server.sh` script uses the 8B variant for Llama 3, based on the config file in the `models/llama3` folder. You can switch to other variants with the `--model-variant` argument; make sure to change the requested resources accordingly. More information about the flags and customizations can be found in the [`models`](models) folder. The inference server is compatible with the OpenAI `Completion` and `ChatCompletion` API. You can inspect the Slurm output files to check the inference server status.
 
 Here is a more complicated example that launches a model variant using multiple nodes, say we want to launch Mixtral 8x22B, run
 ```bash
-bash models/mixtral/launch_server.sh -v 8x22B-v0.1 -N 2 -n 4
+bash src/launch_server.sh --model-family mixtral --model-variant 8x22B-v0.1 --num-nodes 2 --num-gpus 4
+```
+
+And for launching a multimodal model, here is an example for launching LLaVa-NEXT Mistral 7B (default variant):
+```bash
+bash src/launch_server.sh --model-family llava-next --is-vlm
 ```
-The default partition for Mixtral models is a40, and we need 8 a40 GPUs to load Mixtral 8x22B, so we requested 2 a40 nodes with 4 GPUs per node. You should see an output like the following:
-> Number of nodes set to: 2
->
-> Number of GPUs set to: 4
->
-> Model variant set to: 8x22B-v0.1
->
-> Job Name: vLLM/Mixtral-8x22B-v0.1
->
-> Partition: a40
->
-> Generic Resource Scheduling: gpu:8
->
-> Data Type: auto
->
-> Submitted batch job 12430232
 
 ## Send inference requests
-Once the inference server is ready, you can start sending in inference requests. We provide example [Python](examples/inference.py) and [Bash](examples/inference.sh) scripts for sending inference requests in [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. You can run either `python examples/inference.py` or `bash examples/inference.sh`, and you should expect to see an output like the following:
+Once the inference server is ready, you can start sending in inference requests. We provide example scripts for sending inference requests in the [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/completions.py`, and you should expect to see an output like the following:
 > {"id":"cmpl-bdf43763adf242588af07af88b070b62","object":"text_completion","created":2983960,"model":"/model-weights/Llama-2-7b-hf","choices":[{"index":0,"text":"\nCanada is close to the actual continent of North America. Aside from the Arctic islands","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":8,"total_tokens":28,"completion_tokens":20}}
 
+**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
+
 ## SSH tunnel from your local device
 If you want to run inference from your local device, you can open a SSH tunnel to your cluster environment like the following:
 ```bash
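
The updated README asks users to fill in the model server URL in the example scripts before sending requests. As a sketch of where that URL comes from (the exact file name below is an assumption based on the `.vllm_{model-name}-{model-variant}_url` pattern documented in the `models` README):

```bash
# After the Slurm job starts, the launch script writes the server base URL to a
# hidden file in the model's folder; the path below is an assumed example for Llama 3 8B.
cat models/llama3/.vllm_Meta-Llama-3-8B_url
# Use the printed URL (the http://gpuXXX:XXXX/v1 placeholder) in the example scripts.
```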

examples/README.md

Lines changed: 7 additions & 5 deletions
@@ -1,6 +1,8 @@
 # Examples
-`inference.py`: Python example of sending inference requests to inference server using OpenAI API, make sure to install OpenAI API in your environment.
-
-`inference.sh`: Bash example of sending inference requests to inference server, supports JSON mode
-
-`logits.py`: Python example of getting logits from hosted model.
+- [`inference`](inference): Examples for sending inference requests
+  - [`llm/chat_completions.py`](inference/llm/chat_completions.py): Python example of sending chat completion requests to an OpenAI compatible server
+  - [`llm/completions.py`](inference/llm/completions.py): Python example of sending completion requests to an OpenAI compatible server
+  - [`llm/completions.sh`](inference/llm/completions.sh): Bash example of sending completion requests to an OpenAI compatible server, supports JSON mode
+  - [`vlm/vision_completions.py`](inference/vlm/vision_completions.py): Python example of sending chat completion requests with an image attached to the prompt to an OpenAI compatible server, for vision language models
+- [`logits`](logits): Example for logits generation
+  - [`logits.py`](logits/logits.py): Python example of getting logits from a hosted model
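
The Python examples listed above import the `openai` client package (the previous README noted this dependency explicitly); if it is not already in your environment, installing it is a one-liner:

```bash
# The example scripts only import the openai package.
pip install openai
```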

examples/inference.py

Lines changed: 0 additions & 12 deletions
This file was deleted.

examples/inference.sh

Lines changed: 0 additions & 10 deletions
This file was deleted.
examples/inference/llm/chat_completions.py

Lines changed: 15 additions & 0 deletions

@@ -0,0 +1,15 @@
+from openai import OpenAI
+
+# The url is located in the .vLLM_model-variant_url file in the corresponding model directory.
+client = OpenAI(base_url="http://gpuXXX:XXXX/v1", api_key="EMPTY")
+
+# Update the model path accordingly
+completion = client.chat.completions.create(
+    model="/model-weights/Meta-Llama-3-8B-Instruct",
+    messages=[
+        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+        {"role": "user", "content": "Who are you?"},
+    ]
+)
+
+print(completion)

examples/inference/llm/completions.py

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+from openai import OpenAI
+
+# The url is located in the .vLLM_model-variant_url file in the corresponding model directory.
+client = OpenAI(base_url="http://gpuXXX:XXXX/v1", api_key="EMPTY")
+
+# Update the model path accordingly
+completion = client.completions.create(
+    model="/model-weights/Meta-Llama-3-8B",
+    prompt="Where is the capital of Canada?",
+    max_tokens=20,
+)
+
+print(completion)

examples/inference/llm/completions.sh

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# The url is located in the .vLLM_model-variant_url file in the corresponding model directory.
+export API_BASE_URL=http://gpuXXX:XXXX/v1
+
+# Update the model path accordingly
+curl ${API_BASE_URL}/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "/model-weights/Meta-Llama-3-8B",
+        "prompt": "What is the capital of Canada?",
+        "max_tokens": 20
+    }'

examples/inference/vlm/vision_completions.py

Lines changed: 27 additions & 0 deletions

@@ -0,0 +1,27 @@
+from openai import OpenAI
+
+# The url is located in the .vLLM_model-variant_url file in the corresponding model directory.
+client = OpenAI(base_url="http://gpuXXX:XXXX/v1", api_key="EMPTY")
+
+# Update the model path accordingly
+completion = client.chat.completions.create(
+    model="/model-weights/llava-1.5-13b-hf",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What's in this image?"},
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
+                    },
+                },
+            ],
+        }
+    ],
+    max_tokens=50,
+)
+
+print(completion)
+

examples/logits.py renamed to examples/logits/logits.py

Lines changed: 3 additions & 3 deletions
@@ -1,12 +1,12 @@
 from openai import OpenAI
 
-# The url is located in the .vllm_model-variant_url file in the corresponding model directory.
+# The url is located in the .vLLM_model-variant_url file in the corresponding model directory.
 client = OpenAI(base_url="http://gpuXXX:XXXXX/v1", api_key="EMPTY")
 
 completion = client.completions.create(
-    model="/model-weights/Llama-2-7b-hf",
+    model="/model-weights/Meta-Llama-3-8B",
     prompt="Where is the capital of Canada?",
-    max_tokens=20,
+    max_tokens=1,
     logprobs=32000  # Set to model vocab size to get logits
 )
 

models/README.md

Lines changed: 33 additions & 10 deletions
@@ -1,7 +1,10 @@
 # Environment Variables
 The following environment variables all have default values that are suitable for the Vector cluster environment. You can use flags to modify certain environment variable values.
 
-* **MODEL_NAME**: Name of model family.
+* **MODEL_FAMILY**: Directory name of the model family.
+* **SRC_DIR**: Relative path for the [`src`](../src/) folder.
+* **CONFIG_FILE**: Config file containing default values for some environment variables in the **MODEL_FAMILY** directory.
+* **MODEL_NAME**: Name of the model family according to the actual model weights.
 * **MODEL_VARIANT**: Variant of the model, the variants available are listed in respective model folders. Default variant is bolded in the corresponding README.md file.
 * **MODEL_DIR**: Path to model's directory in vector-inference repo.
 * **VLLM_BASE_URL_FILENAME**: The file to store the inference server URL, this file would be generated after launching an inference server, and it would be located in the corresponding model folder with the name `.vllm_{model-name}-{model-variant}_url`.
@@ -13,13 +16,33 @@
 * **NUM_NODES**: Number of nodes scheduled. Default to suggested resource allocation.
 * **NUM_GPUS**: Number of GPUs scheduled. Default to suggested resource allocation.
 * **JOB_PARTITION**: Type of compute partition. Default to suggested resource allocation.
-* **QOS**: Quality of Service
+* **QOS**: Quality of Service.
+* **TIME**: Max walltime.
 
-# Flags
-* `-p`: Overrides **JOB_PARTITION**.
-* `-N`: Overrides **NUM_NODES**.
-* `-n`: Overrides **NUM_GPUS**.
-* `-q`: Overrides **QOS**.
-* `-d`: Overrides **VLLM_DATA_TYPE**.
-* `-e`: Overrides **VENV_BASE**.
-* `-v`: Overrides **MODEL_VARIANT**
+The following environment variables are only for Vision Language Models:
+
+* **CHAT_TEMPLATE**: The relative path to the chat template if no default chat template is available.
+* **IMAGE_INPUT_TYPE**: Possible choices: `pixel_values`, `image_features`. The image input type passed into vLLM, defaults to `pixel_values`.
+* **IMAGE_TOKEN_ID**: Input ID for the image token. Default value (the HF config value) set according to model.
+* **IMAGE_INPUT_SHAPE**: The biggest image input shape (worst for memory footprint) given an input type. Only used for vLLM's profile_run. Default value set according to model.
+* **IMAGE_FEATURE_SIZE**: The image feature size along the context dimension. Default value set according to model.
+
+# Named Arguments
+NOTE: Arguments like `--num-nodes` or `--model-variant` might not be available for certain model families, because those models fit inside a single node or no other variant is available in `/model-weights` yet. You can manually add these options in the launch scripts if you need them, or make a request to download weights for other variants.
+* `--model-family`: Sets **MODEL_FAMILY**, the available options are the names of each sub-directory in this directory. **This argument MUST be set.**
+* `--model-variant`: Overrides **MODEL_VARIANT**.
+* `--partition`: Overrides **JOB_PARTITION**.
+* `--num-nodes`: Overrides **NUM_NODES**.
+* `--num-gpus`: Overrides **NUM_GPUS**.
+* `--qos`: Overrides **QOS**.
+* `--time`: Overrides **TIME**.
+* `--data-type`: Overrides **VLLM_DATA_TYPE**.
+* `--venv`: Overrides **VENV_BASE**.
+* `--is-vlm`: Specifies that this is a Vision Language Model, no value needed.
+
+The following flags are only available to Vision Language Models:
+
+* `--image-input-type`: Overrides **IMAGE_INPUT_TYPE**.
+* `--image-token-id`: Overrides **IMAGE_TOKEN_ID**.
+* `--image-input-shape`: Overrides **IMAGE_INPUT_SHAPE**.
+* `--image-feature-size`: Overrides **IMAGE_FEATURE_SIZE**.
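
To illustrate how these named arguments compose, here is a sketch of a launch command that overrides several defaults; the partition and walltime values below are placeholders rather than recommended settings, and `--time` is assumed to take a standard Slurm walltime string:

```bash
# Launch a vision language model, overriding a few of the environment
# variables listed above. --model-family is the only required argument.
bash src/launch_server.sh \
    --model-family llava-next \
    --is-vlm \
    --num-gpus 4 \
    --partition a40 \
    --time 04:00:00
```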

models/command-r/config.sh

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+export MODEL_NAME="c4ai-command-r"
+export MODEL_VARIANT="plus"
+export NUM_NODES=2
+export NUM_GPUS=4
+export VLLM_MAX_LOGPROBS=256000
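
This config supplies the per-family defaults that the launch script reads, so launching Command R+ with these settings would presumably just be the following (a sketch, following the same argument pattern shown in the README):

```bash
# The family name matches the directory holding config.sh; the "plus" variant,
# 2 nodes, and 4 GPUs per node come from the defaults above.
bash src/launch_server.sh --model-family command-r
```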
