vLLM is a toolkit and library for large language model (LLM) inference and serving. It implements the PagedAttention algorithm, which reduces memory consumption and increases throughput by allocating key and value cache memory dynamically in GPU memory. vLLM also incorporates many recent LLM acceleration and quantization algorithms. In addition, AMD implements high-performance custom kernels and modules in vLLM to further enhance performance.
This Docker image packages vLLM with PyTorch for an AMD Instinct™ MI300X accelerator. It includes:
- ✅ ROCm™ 6.2.1
- ✅ vLLM 0.6.3
- ✅ PyTorch 2.5.0
- ✅ Tuning files (.csv format)
With this Docker image, users can quickly validate the expected inference performance numbers on the MI300X accelerator. We also provide tips and techniques to help users get optimal performance with popular AI models.
Use the following instructions to reproduce the benchmark results on an MI300X accelerator with a prebuilt vLLM Docker image.
To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU might hang until the periodic balancing is finalized. For further details, refer to the AMD Instinct MI300X system optimization guide.
# disable automatic NUMA balancing
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
# check if NUMA balancing is disabled (returns 0 if disabled)
cat /proc/sys/kernel/numa_balancing
0
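The echo command above takes effect immediately but does not survive a reboot. If you want the setting to persist, one common approach is a sysctl drop-in file; the file name below is an arbitrary choice, not something required by this image.
# optional: persist the setting across reboots (file name is an arbitrary choice)
echo 'kernel.numa_balancing=0' | sudo tee /etc/sysctl.d/99-numa-balancing.conf
sudo sysctl --system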
The following commands pull the Docker image from Docker Hub and launch a new Docker container named unified_docker_vllm.
docker pull rocm/vllm-dev:vllm-20241009-tuned # TODO: update to the final public image
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name unified_docker_vllm rocm/vllm-dev:vllm-20241009-tuned
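Once inside the container, you can optionally confirm that the accelerators are visible before benchmarking. rocm-smi ships with ROCm, and the PyTorch check relies on the PyTorch build bundled in the image.
# optional sanity checks inside the container
rocm-smi                                                      # list the visible MI300X devices
python3 -c "import torch; print(torch.cuda.device_count())"   # should report the number of GPUs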
Some environment variables enhance the performance of the vLLM kernels and PyTorch's TunableOp on the MI300X accelerator. The Docker image is already preconfigured with these performance settings. See the AMD Instinct MI300X workload optimization guide for more information.
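As an illustration of the kind of settings involved, the variables below are real PyTorch TunableOp and vLLM switches, but the values and the tuning-file path are placeholders; the image already ships with its own preconfigured values, so you normally do not need to set anything.
# illustrative only: the image is already preconfigured, and these values are placeholders
export PYTORCH_TUNABLEOP_ENABLED=1                             # turn on PyTorch TunableOp GEMM tuning
export PYTORCH_TUNABLEOP_TUNING=0                              # reuse existing tuning results (for example, the bundled .csv tuning files) instead of re-tuning
export PYTORCH_TUNABLEOP_FILENAME=/workspace/tuned_gemm.csv    # hypothetical path; point this at the tuning file you want to use
export VLLM_USE_TRITON_FLASH_ATTN=0                            # example vLLM switch selecting the non-Triton flash-attention path on ROCm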
To optimize vLLM performance, we recommend using the multiprocessing distributed executor backend by adding the --distributed-executor-backend mp argument to the server command line.
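For example, a server launch of the following shape passes that flag; the model and tensor-parallel size are placeholders rather than values required by this image.
# illustrative server launch with the multiprocessing executor backend (model and TP size are placeholders)
python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 8 \
    --distributed-executor-backend mp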
Copy the performance benchmarking scripts from GitHub to a local directory.
git clone https://github.com/seungrokj/unified_docker_benchmark_public # TODO: this repo will be also available at https://github.com/ROCm/MAD soon
cd unified_docker_benchmark_public
Use the following command and variables to run the benchmark tests.
./vllm_benchmark_report.sh -s $test_option -m $model_repo -g $num_gpu -d $datatype
- Note: The input sequence length, output sequence length, and tensor parallel (TP) size are already configured. You don't need to specify them with this script.
- Note: If you encounter the following error, you need to pass a Hugging Face token that has been granted access to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
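Alternatively, if you prefer to authenticate once inside the container, the Hugging Face CLI that ships with the huggingface_hub package accepts the same token.
# optional alternative: log in with the Hugging Face CLI
huggingface-cli login --token $HF_TOKEN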
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $model_repo (float16) | meta-llama/Meta-Llama-3.1-8B-Instruct | Llama 3.1 8B |
| | meta-llama/Meta-Llama-3.1-70B-Instruct | Llama 3.1 70B |
| | meta-llama/Meta-Llama-3.1-405B-Instruct | Llama 3.1 405B |
| | meta-llama/Llama-2-7b-chat-hf | Llama 2 7B |
| | meta-llama/Llama-2-70b-chat-hf | Llama 2 70B |
| | mistralai/Mixtral-8x7B-Instruct-v0.1 | Mixtral 8x7B |
| | mistralai/Mixtral-8x22B-Instruct-v0.1 | Mixtral 8x22B |
| | mistralai/Mistral-7B-Instruct-v0.3 | Mistral 7B |
| | Qwen/Qwen2-7B-Instruct | Qwen2 7B |
| | Qwen/Qwen2-72B-Instruct | Qwen2 72B |
| | core42/jais-13b-chat | JAIS 13B |
| | core42/jais-30b-chat-v3 | JAIS 30B |
| $model_repo (float8) | amd/Meta-Llama-3.1-8B-Instruct-FP8-KV | Llama 3.1 8B |
| | amd/Meta-Llama-3.1-70B-Instruct-FP8-KV | Llama 3.1 70B |
| | amd/Meta-Llama-3.1-405B-Instruct-FP8-KV | Llama 3.1 405B |
| | amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV | Mixtral 8x7B |
| | amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV | Mixtral 8x22B |
| $num_gpu | 1 or 8 | Number of GPUs |
| $datatype | float16, float8 | Data type |
Here are some examples and the test results:
- Benchmark example - latency
Use these commands to benchmark the latency of the Llama 3.1 8B model on one GPU with the float16 and float8 data types.
./vllm_benchmark_report.sh -s latency -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16
./vllm_benchmark_report.sh -s latency -m amd/Meta-Llama-3.1-8B-Instruct-FP8-KV -g 1 -d float8
You can find the float16 latency report at ./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_latency_report.csv and the float8 latency report at ./reports_float8/summary/Meta-Llama-3.1-8B-Instruct-FP8-KV_latency_report.csv.
- Benchmark example - throughput
Use these commands to benchmark the throughput of the Llama 3.1 8B model on one GPU with the float16 and float8 data types.
./vllm_benchmark_report.sh -s throughput -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16
./vllm_benchmark_report.sh -s throughput -m amd/Meta-Llama-3.1-8B-Instruct-FP8-KV -g 1 -d float8
You can find the float16 throughput report at ./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_throughput_report.csv and the float8 throughput report at ./reports_float8/summary/Meta-Llama-3.1-8B-Instruct-FP8-KV_throughput_report.csv.
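If you want a quick look at a report without leaving the shell, a standard CSV pretty-print works; the path below is the float16 throughput report named above.
# pretty-print the summary CSV in the terminal
column -s, -t < ./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_throughput_report.csv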
The throughput report includes two metrics:
- throughput_tot = requests * (input lengths + output lengths) / elapsed_time
- throughput_gen = requests * output lengths / elapsed_time
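For example, with 500 requests, an input length of 128 tokens, an output length of 128 tokens, and an elapsed time of 60 seconds (illustrative numbers, not measured results), throughput_tot = 500 * (128 + 128) / 60 ≈ 2133 tokens/s and throughput_gen = 500 * 128 / 60 ≈ 1067 tokens/s.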
For an overview of the optional performance features of vLLM with ROCm software, see https://github.com/ROCm/vllm/blob/main/ROCm_performance.md.
To learn more about the options for latency and throughput benchmark scripts, see https://github.com/ROCm/vllm/tree/main/benchmarks.
To learn how to run LLM models from Hugging Face or your own model, see the Using ROCm for AI section of the ROCm documentation.
To learn how to optimize inference on LLMs, see the Fine-tuning LLMs and inference optimization section of the ROCm documentation.
For a list of other ready-made Docker images for ROCm, see the ROCm Docker image support matrix.
Your use of this application is subject to the terms of the applicable component-level license identified below. To the extent any subcomponent in this container requires an offer for corresponding source code, AMD hereby makes such an offer for corresponding source code form, which will be made available upon request. By accessing and using this application, you are agreeing to fully comply with the terms of this license. If you do not agree to the terms of this license, do not access or use this application.
The application is provided in a container image format that includes the following separate and independent components:
Package | License | URL |
---|---|---|
Ubuntu | Creative Commons CC-BY-SA Version 3.0 UK License | Ubuntu Legal |
ROCm | Custom/MIT/Apache V2.0/UIUC OSL | ROCm Licensing Terms |
PyTorch | Modified BSD | PyTorch License |
vLLM | Apache License 2.0 | vLLM License |
The information contained herein is for informational purposes only and is subject to change without notice. In addition, any stated support is planned and is also subject to change. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD's products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale.
© 2024 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Instinct, Radeon Instinct, ROCm, and combinations thereof are trademarks of Advanced Micro Devices, Inc.
Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc. in the United States and/or other countries. Docker, Inc. and other parties may also have trademark rights in other terms used herein. Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
All other trademarks and copyrights are property of their respective owners and are only mentioned for informative purposes.