This guide walks through the process of setting up and running the Qwen2.5-14B-Instruct model using vLLM on AWS EC2 instances, with solutions for common storage and configuration issues.
- AWS EC2 instance with NVIDIA GPU (recommended: g4dn.xlarge or higher)
- NVIDIA drivers installed
- Docker and NVIDIA Container Toolkit installed
- At least 100GB of storage space (preferably more)
Before running vLLM with Hugging Face models, log in to the Hugging Face CLI to provide your access token. A token is required for gated and private models, and logging in also avoids anonymous rate limits when downloading large models such as Qwen2.5-14B-Instruct.
- Install the Hugging Face CLI (if not already installed):
  pip install --upgrade huggingface_hub
- Log in to Hugging Face:
  huggingface-cli login
- Paste your access token when prompted. You can get your token from https://huggingface.co/settings/tokens
- (Optional) Set the token as an environment variable for Docker:
  export HUGGING_FACE_HUB_TOKEN=your_token_here
  This allows Docker containers to access the token for model downloads (see the example below).
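For example, the exported token can be forwarded into the vLLM container with Docker's --env flag. This is a minimal sketch; the model is pulled from Hugging Face on first start and cached inside the container unless you also mount a cache directory:
# Forward the Hugging Face token so the container can download the model
docker run --rm \
  --runtime nvidia \
  --gpus all \
  --network="host" \
  --env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
  vllm/vllm-openai:latest \
  --model "Qwen/Qwen2.5-14B-Instruct" \
  --port 5000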
Use an AWS Deep Learning AMI (DLAMI) with GPU support to simplify driver installation:
# Example instance types
# g4dn.xlarge - 1 GPU, 4 vCPUs, 16 GB RAM
# g4dn.2xlarge - 1 GPU, 8 vCPUs, 32 GB RAM
# g5.xlarge - 1 GPU, 4 vCPUs, 16 GB RAM
Attach a large EBS volume (300GB recommended) to your instance.
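After attaching the volume, format and mount it before using it for Docker storage or model files. A minimal sketch, assuming the new volume appears as /dev/nvme1n1 and is mounted at /mnt/models (both names are examples; check lsblk for the actual device on your instance):
# Identify the newly attached EBS volume (device names vary)
lsblk

# Format it (ext4 here; pick the filesystem you prefer) and mount it
sudo mkfs -t ext4 /dev/nvme1n1
sudo mkdir -p /mnt/models
sudo mount /dev/nvme1n1 /mnt/models

# Make the mount persist across reboots
echo '/dev/nvme1n1 /mnt/models ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab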
Ensure Docker and NVIDIA Container Toolkit are properly installed:
# Check NVIDIA drivers
nvidia-smi
# Check Docker installation
docker --version
# Check NVIDIA Docker integration
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
Run the Docker setup script to point Docker's storage at the larger volume:
./setup_docker.sh
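The exact contents of setup_docker.sh will vary; a rough sketch of what such a script typically needs to do is shown below. It moves Docker's data directory onto the large NVMe volume by writing a daemon.json with a data-root entry (note that this overwrites any existing daemon.json, so merge by hand if you have other settings):
#!/bin/bash
# Relocate Docker's data directory to the large NVMe volume and register the NVIDIA runtime
sudo mkdir -p /opt/dlami/nvme/docker
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "data-root": "/opt/dlami/nvme/docker",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker
docker info | grep "Docker Root Dir"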
# Check Docker's root directory
docker info | grep "Docker Root Dir"
# Should show: Docker Root Dir: /opt/dlami/nvme/docker
Start the vLLM server with the provided script:
./run_vllm.sh Qwen/Qwen2.5-7B-Instruct
If you encounter space issues despite configuring Docker to use the larger volume:
- Verify Docker is using the correct storage location:
  docker info | grep "Docker Root Dir"
- Clean Docker resources completely (to see what is consuming the space first, see the command after this list):
  docker system prune -a -f --volumes
- Check available space on both volumes:
  df -h
- Try pulling the image separately:
  docker pull vllm/vllm-openai:latest
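Before pruning, docker system df shows how much space images, containers, and local volumes are actually using, which helps confirm whether Docker or something else is filling the disk:
# Summarize Docker disk usage by images, containers, and volumes
docker system df

# Add -v for a per-image and per-volume breakdown
docker system df -v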
If you encounter issues loading the GGUF model:
- Make sure you're using absolute paths for volume mounts (a full GGUF example is sketched after this list):
  CURRENT_DIR=$(pwd)
  MODEL_DIR="${CURRENT_DIR}/models"
  CONFIG_DIR="${CURRENT_DIR}/config"
- Try the direct Hugging Face approach if GGUF isn't working:
  docker run -it \
    --runtime nvidia \
    --gpus all \
    --network="host" \
    --ipc=host \
    -v "${CONFIG_DIR}:/config" \
    vllm/vllm-openai:latest \
    --model "Qwen/Qwen2.5-14B-Instruct-AWQ" \
    --quantization awq \
    --host "0.0.0.0" \
    --port 5000 \
    --gpu-memory-utilization 0.9 \
    --served-model-name "VLLMQwen2.5-14B" \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    --max-model-len 8192 \
    --generation-config /config
- Check the vLLM documentation for specific GGUF loading requirements: vLLM GGUF Support
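If you do want to serve a local GGUF file, vLLM's GGUF support (still experimental) can load it when --model points at the file inside the container; pass --tokenizer with the original Hugging Face repo, since GGUF files do not ship a full tokenizer config. The filename below is only an example of such a quantized file:
# Serve a local GGUF file mounted from the host (filename is illustrative)
docker run -it \
  --runtime nvidia \
  --gpus all \
  --network="host" \
  -v "${MODEL_DIR}:/models" \
  vllm/vllm-openai:latest \
  --model /models/qwen2.5-14b-instruct-q4_k_m.gguf \
  --tokenizer "Qwen/Qwen2.5-14B-Instruct" \
  --host "0.0.0.0" \
  --port 5000 \
  --served-model-name "VLLMQwen2.5-14B"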
If Docker can't access the GPUs:
- Verify NVIDIA drivers are working:
  nvidia-smi
- Check the NVIDIA Container Toolkit configuration:
  sudo nano /etc/docker/daemon.json
  Ensure it contains the NVIDIA runtime configuration (if it is missing, see the commands after this list).
- Restart Docker:
  sudo systemctl restart docker
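If the runtime entry is missing from daemon.json, the NVIDIA Container Toolkit can regenerate it rather than editing the file by hand (assuming the toolkit's nvidia-ctk utility is installed):
# Re-register the NVIDIA runtime in /etc/docker/daemon.json
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Confirm the runtime is listed
docker info | grep -i runtimes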
Once the server is running, you can make API calls to generate text:
curl http://localhost:5000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "VLLMQwen2.5-14B",
"prompt": "Write a short poem about artificial intelligence:",
"max_tokens": 256,
"temperature": 0.7
}'
For chat-style requests, use the chat completions endpoint:
curl http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "VLLMQwen2.5-14B",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are the three laws of robotics?"}
],
"max_tokens": 256,
"temperature": 0.7
}'
To list the models the server exposes:
curl http://localhost:5000/v1/models
To check that the server is healthy:
curl http://localhost:5000/health
For streaming responses (similar to how ChatGPT provides tokens incrementally):
curl http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "VLLMQwen2.5-14B",
"messages": [
{"role": "user", "content": "Explain quantum computing in simple terms"}
],
"max_tokens": 256,
"temperature": 0.7,
"stream": true
}'
You can also access the OpenAPI documentation by opening http://localhost:5000/docs in your browser, which will show you all the available endpoints and parameters.
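When testing streaming from the command line, add -N so curl does not buffer its output; the response arrives as server-sent events, one data: line per chunk, ending with data: [DONE]:
curl -N http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "VLLMQwen2.5-14B",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}],
    "stream": true
  }'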
For easier interaction with the vLLM API, several utility scripts have been created:
Create a script to test the chat completion API with pretty formatting:
#!/bin/bash
# Test vLLM chat completion API
curl http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-14B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello, how are you?"}
],
"temperature": 0.7,
"max_tokens": 100
}' | jq
Save as test-query.sh and make it executable with chmod +x test-query.sh.
Create a script to list available models with readable output:
#!/bin/bash
# List available models in vLLM server
curl http://localhost:5000/v1/models | jq
Save as list-models.sh and make it executable with chmod +x list-models.sh.
A simple web interface using Gradio can be created to interact with vLLM:
Save the script as gui-demo.sh and make it executable with chmod +x gui-demo.sh.
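The exact contents of gui-demo.sh will vary; as a rough sketch, a wrapper like the following would match the usage shown below. It assumes the gradio and requests Python packages are installed and sends single-turn requests to keep the example short:
#!/bin/bash
# Minimal Gradio chat client for the vLLM OpenAI-compatible API (illustrative sketch)
MODEL="Qwen2.5-14B-Instruct"
API_URL="http://localhost:5000/v1"
TEMPERATURE="0.7"
PORT="7860"

# Flag parsing to match the usage shown in this guide
while [[ $# -gt 0 ]]; do
  case "$1" in
    --model) MODEL="$2"; shift 2 ;;
    --api-url) API_URL="$2"; shift 2 ;;
    --temperature) TEMPERATURE="$2"; shift 2 ;;
    --port) PORT="$2"; shift 2 ;;
    *) echo "Unknown option: $1"; exit 1 ;;
  esac
done

MODEL="$MODEL" API_URL="$API_URL" TEMPERATURE="$TEMPERATURE" PORT="$PORT" python3 - <<'EOF'
import os
import requests
import gradio as gr

model = os.environ["MODEL"]
api_url = os.environ["API_URL"]
temperature = float(os.environ["TEMPERATURE"])
port = int(os.environ["PORT"])

def chat(message, history):
    # Single-turn request; conversation history is not forwarded in this sketch
    resp = requests.post(
        f"{api_url}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": message}],
            "temperature": temperature,
            "max_tokens": 512,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

gr.ChatInterface(chat).launch(server_name="0.0.0.0", server_port=port)
EOF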
To run the web interface:
./gui-demo.sh
You can customize the settings with command-line parameters:
./gui-demo.sh --model "Qwen2.5-14B-Instruct" --api-url "http://localhost:5000/v1" --temperature 0.7 --port 7860
Several third-party UI options are also available for vLLM:
- nextjs-vllm-ui - A beautiful ChatGPT-like interface
  - GitHub: https://github.com/yoziru/nextjs-vllm-ui
  - Run with Docker:
    docker run --rm -d -p 3000:3000 -e VLLM_URL=http://host.docker.internal:5000 ghcr.io/yoziru/nextjs-vllm-ui:latest
- Open WebUI - A full-featured web interface that works with vLLM
  - Can be configured to use vLLM as the backend instead of Ollama
  - Example Docker command:
    docker run -d -p 3000:8080 \
      --name open-webui \
      --restart always \
      --env=OPENAI_API_BASE_URL=http://<your-ip>:5000/v1 \
      --env=OPENAI_API_KEY=your-api-key \
      --env=ENABLE_OLLAMA_API=false \
      ghcr.io/open-webui/open-webui:main
- vlm-ui - A simple Gradio-based interface designed for Vision Language Models
  - GitHub: https://github.com/sammcj/vlm-ui
Adjust these parameters in the run_vllm.sh script based on your hardware:
- --gpu-memory-utilization: Value between 0 and 1 (default: 0.9)
- --max-num-batched-tokens: Increase for higher throughput, decrease if out of memory
- --max-num-seqs: Maximum number of sequences in a batch
- --max-model-len: Maximum sequence length
- --tensor-parallel-size: Set to the number of GPUs if using multiple GPUs
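As a worked example, on an instance with two GPUs and limited memory headroom, a tuned serving command might look like the following (all values are illustrative and should be adjusted for your hardware):
docker run -it \
  --runtime nvidia \
  --gpus all \
  --network="host" \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model "Qwen/Qwen2.5-14B-Instruct-AWQ" \
  --quantization awq \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 128 \
  --port 5000 \
  --served-model-name "VLLMQwen2.5-14B"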