This guide walks through the process of setting up and running the Qwen2.5-14B-Instruct model using vLLM on AWS EC2 instances, with solutions for common storage and configuration issues.
- AWS EC2 instance with NVIDIA GPU (recommended: g4dn.xlarge or higher)
- NVIDIA drivers installed
- Docker and NVIDIA Container Toolkit installed
- At least 100GB of storage space (preferably more)
Before running vLLM with Hugging Face models, log in to the Hugging Face CLI to provide your access token. A token is required for gated and private models, and logging in also avoids anonymous rate limits when downloading large models such as Qwen2.5-14B-Instruct.
- Install the Hugging Face CLI (if not already installed):
  pip install --upgrade huggingface_hub
- Log in to Hugging Face:
  huggingface-cli login
- Paste your access token when prompted. You can get your token from https://huggingface.co/settings/tokens
- (Optional) Set the token as an environment variable for Docker:
  export HUGGING_FACE_HUB_TOKEN=your_token_here
  This allows Docker containers to access the token for model downloads (see the example below).
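For example, the exported token can be forwarded into the vLLM container with Docker's --env flag. This is a minimal sketch; the model is pulled from Hugging Face on first start and cached inside the container unless you also mount a cache directory:
# Forward the Hugging Face token so the container can download the model
docker run --rm \
  --runtime nvidia \
  --gpus all \
  --network="host" \
  --env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
  vllm/vllm-openai:latest \
  --model "Qwen/Qwen2.5-14B-Instruct" \
  --port 5000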
Use an AWS Deep Learning AMI (DLAMI) with GPU support to simplify driver installation:
# Example instance types
# g4dn.xlarge - 1 GPU, 4 vCPUs, 16 GB RAM
# g4dn.2xlarge - 1 GPU, 8 vCPUs, 32 GB RAM
# g5.xlarge - 1 GPU, 4 vCPUs, 16 GB RAM
Attach a large EBS volume (300GB recommended) to your instance.
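After attaching the volume, format and mount it before using it for Docker storage or model files. A minimal sketch, assuming the new volume appears as /dev/nvme1n1 and is mounted at /mnt/models (both names are examples; check lsblk for the actual device on your instance):
# Identify the newly attached EBS volume (device names vary)
lsblk

# Format it (ext4 here; pick the filesystem you prefer) and mount it
sudo mkfs -t ext4 /dev/nvme1n1
sudo mkdir -p /mnt/models
sudo mount /dev/nvme1n1 /mnt/models

# Make the mount persist across reboots
echo '/dev/nvme1n1 /mnt/models ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab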
Ensure Docker and NVIDIA Container Toolkit are properly installed:
# Check NVIDIA drivers
nvidia-smi
# Check Docker installation
docker --version
# Check NVIDIA Docker integration
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
Run the Docker setup script to point Docker's storage at the larger volume:
./setup_docker.sh
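The exact contents of setup_docker.sh will vary; a rough sketch of what such a script typically needs to do is shown below. It moves Docker's data directory onto the large NVMe volume by writing a daemon.json with a data-root entry (note that this overwrites any existing daemon.json, so merge by hand if you have other settings):
#!/bin/bash
# Relocate Docker's data directory to the large NVMe volume and register the NVIDIA runtime
sudo mkdir -p /opt/dlami/nvme/docker
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "data-root": "/opt/dlami/nvme/docker",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker
docker info | grep "Docker Root Dir"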
# Check Docker's root directory
docker info | grep "Docker Root Dir"
# Should show: Docker Root Dir: /opt/dlami/nvme/docker
Start the vLLM server with the provided script:
./run_vllm.sh Qwen/Qwen2.5-7B-Instruct
If you encounter space issues despite configuring Docker to use the larger volume:
- Verify Docker is using the correct storage location:
  docker info | grep "Docker Root Dir"
- Clean Docker resources completely (to see what is consuming the space first, see the command after this list):
  docker system prune -a -f --volumes
- Check available space on both volumes:
  df -h
- Try pulling the image separately:
  docker pull vllm/vllm-openai:latest
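Before pruning, docker system df shows how much space images, containers, and local volumes are actually using, which helps confirm whether Docker or something else is filling the disk:
# Summarize Docker disk usage by images, containers, and volumes
docker system df

# Add -v for a per-image and per-volume breakdown
docker system df -v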
If you encounter issues loading the GGUF model:
- Make sure you're using absolute paths for volume mounts (a full GGUF example is sketched after this list):
  CURRENT_DIR=$(pwd)
  MODEL_DIR="${CURRENT_DIR}/models"
  CONFIG_DIR="${CURRENT_DIR}/config"
- Try the direct Hugging Face approach if GGUF isn't working:
  docker run -it \
    --runtime nvidia \
    --gpus all \
    --network="host" \
    --ipc=host \
    -v "${CONFIG_DIR}:/config" \
    vllm/vllm-openai:latest \
    --model "Qwen/Qwen2.5-14B-Instruct-AWQ" \
    --quantization awq \
    --host "0.0.0.0" \
    --port 5000 \
    --gpu-memory-utilization 0.9 \
    --served-model-name "VLLMQwen2.5-14B" \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    --max-model-len 8192 \
    --generation-config /config
- Check the vLLM documentation for specific GGUF loading requirements: vLLM GGUF Support
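If you do want to serve a local GGUF file, vLLM's GGUF support (still experimental) can load it when --model points at the file inside the container; pass --tokenizer with the original Hugging Face repo, since GGUF files do not ship a full tokenizer config. The filename below is only an example of such a quantized file:
# Serve a local GGUF file mounted from the host (filename is illustrative)
docker run -it \
  --runtime nvidia \
  --gpus all \
  --network="host" \
  -v "${MODEL_DIR}:/models" \
  vllm/vllm-openai:latest \
  --model /models/qwen2.5-14b-instruct-q4_k_m.gguf \
  --tokenizer "Qwen/Qwen2.5-14B-Instruct" \
  --host "0.0.0.0" \
  --port 5000 \
  --served-model-name "VLLMQwen2.5-14B"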
If Docker can't access the GPUs:
- Verify NVIDIA drivers are working:
  nvidia-smi
- Check the NVIDIA Container Toolkit configuration:
  sudo nano /etc/docker/daemon.json
  Ensure it contains the NVIDIA runtime configuration (if it is missing, see the commands after this list).
- Restart Docker:
  sudo systemctl restart docker
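If the runtime entry is missing from daemon.json, the NVIDIA Container Toolkit can regenerate it rather than editing the file by hand (assuming the toolkit's nvidia-ctk utility is installed):
# Re-register the NVIDIA runtime in /etc/docker/daemon.json
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Confirm the runtime is listed
docker info | grep -i runtimes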
Once the server is running, you can make API calls to generate text:
curl http://localhost:5000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "VLLMQwen2.5-14B",
"prompt": "Write a short poem about artificial intelligence:",
"max_tokens": 256,
"temperature": 0.7
}'
For chat-style requests, use the chat completions endpoint:
curl http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "VLLMQwen2.5-14B",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are the three laws of robotics?"}
],
"max_tokens": 256,
"temperature": 0.7
}'
To list the models the server exposes:
curl http://localhost:5000/v1/models
To check that the server is healthy:
curl http://localhost:5000/health
For streaming responses (similar to how ChatGPT provides tokens incrementally):
curl http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "VLLMQwen2.5-14B",
"messages": [
{"role": "user", "content": "Explain quantum computing in simple terms"}
],
"max_tokens": 256,
"temperature": 0.7,
"stream": true
}'
You can also access the OpenAPI documentation by opening http://localhost:5000/docs in your browser, which will show you all the available endpoints and parameters.
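When testing streaming from the command line, add -N so curl does not buffer its output; the response arrives as server-sent events, one data: line per chunk, ending with data: [DONE]:
curl -N http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "VLLMQwen2.5-14B",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}],
    "stream": true
  }'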
For easier interaction with the vLLM API, several utility scripts have been created:
Create a script to test the chat completion API with pretty formatting:
#!/bin/bash
# Test vLLM chat completion API
curl http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-14B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello, how are you?"}
],
"temperature": 0.7,
"max_tokens": 100
}' | jq
Save as test-query.sh and make it executable with chmod +x test-query.sh.
Create a script to list available models with readable output:
#!/bin/bash
# List available models in vLLM server
curl http://localhost:5000/v1/models | jq
Save as list-models.sh and make it executable with chmod +x list-models.sh.
A simple web interface using Gradio can be created to interact with vLLM:
Save the script as gui-demo.sh and make it executable with chmod +x gui-demo.sh.
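The exact contents of gui-demo.sh will vary; as a rough sketch, a wrapper like the following would match the usage shown below. It assumes the gradio and requests Python packages are installed and sends single-turn requests to keep the example short:
#!/bin/bash
# Minimal Gradio chat client for the vLLM OpenAI-compatible API (illustrative sketch)
MODEL="Qwen2.5-14B-Instruct"
API_URL="http://localhost:5000/v1"
TEMPERATURE="0.7"
PORT="7860"

# Flag parsing to match the usage shown in this guide
while [[ $# -gt 0 ]]; do
  case "$1" in
    --model) MODEL="$2"; shift 2 ;;
    --api-url) API_URL="$2"; shift 2 ;;
    --temperature) TEMPERATURE="$2"; shift 2 ;;
    --port) PORT="$2"; shift 2 ;;
    *) echo "Unknown option: $1"; exit 1 ;;
  esac
done

MODEL="$MODEL" API_URL="$API_URL" TEMPERATURE="$TEMPERATURE" PORT="$PORT" python3 - <<'EOF'
import os
import requests
import gradio as gr

model = os.environ["MODEL"]
api_url = os.environ["API_URL"]
temperature = float(os.environ["TEMPERATURE"])
port = int(os.environ["PORT"])

def chat(message, history):
    # Single-turn request; conversation history is not forwarded in this sketch
    resp = requests.post(
        f"{api_url}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": message}],
            "temperature": temperature,
            "max_tokens": 512,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

gr.ChatInterface(chat).launch(server_name="0.0.0.0", server_port=port)
EOF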
To run the web interface:
./gui-demo.sh
You can customize the settings with command-line parameters:
./gui-demo.sh --model "Qwen2.5-14B-Instruct" --api-url "http://localhost:5000/v1" --temperature 0.7 --port 7860
Several third-party UI options are also available for vLLM:
- nextjs-vllm-ui - A beautiful ChatGPT-like interface
  - GitHub: https://github.com/yoziru/nextjs-vllm-ui
  - Run with Docker:
    docker run --rm -d -p 3000:3000 -e VLLM_URL=http://host.docker.internal:5000 ghcr.io/yoziru/nextjs-vllm-ui:latest
- Open WebUI - A full-featured web interface that works with vLLM
  - Can be configured to use vLLM as the backend instead of Ollama
  - Example Docker command:
    docker run -d -p 3000:8080 \
      --name open-webui \
      --restart always \
      --env=OPENAI_API_BASE_URL=http://<your-ip>:5000/v1 \
      --env=OPENAI_API_KEY=your-api-key \
      --env=ENABLE_OLLAMA_API=false \
      ghcr.io/open-webui/open-webui:main
- vlm-ui - A simple Gradio-based interface designed for Vision Language Models
  - GitHub: https://github.com/sammcj/vlm-ui
Adjust these parameters in the run_vllm.sh script based on your hardware:
- --gpu-memory-utilization: Value between 0 and 1 (default: 0.9)
- --max-num-batched-tokens: Increase for higher throughput, decrease if out of memory
- --max-num-seqs: Maximum number of sequences in a batch
- --max-model-len: Maximum sequence length
- --tensor-parallel-size: Set to the number of GPUs if using multiple GPUs
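As a worked example, on an instance with two GPUs and limited memory headroom, a tuned serving command might look like the following (all values are illustrative and should be adjusted for your hardware):
docker run -it \
  --runtime nvidia \
  --gpus all \
  --network="host" \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model "Qwen/Qwen2.5-14B-Instruct-AWQ" \
  --quantization awq \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 128 \
  --port 5000 \
  --served-model-name "VLLMQwen2.5-14B"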