# Welcome to vLLM

![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM" class="no-scaled-link" width="60%" }

Easy, fast, and cheap LLM serving for everyone


vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8 (see the sketch after this list)
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill
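
Most of these optimizations surface as plain engine arguments. The snippet below is a minimal, illustrative sketch assuming recent vLLM engine-argument names, not a recommended configuration; the AWQ checkpoint name is only a placeholder.

```python
# Illustrative sketch: loading a quantized checkpoint with chunked prefill enabled.
# The model name is a placeholder; substitute any AWQ-quantized checkpoint.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # placeholder AWQ-quantized model
    quantization="awq",                    # quantization scheme to load (GPTQ, AWQ, FP8, ...)
    enable_chunked_prefill=True,           # split long prefills into smaller scheduling chunks
    gpu_memory_utilization=0.90,           # fraction of GPU memory PagedAttention may manage
)

print(llm.generate("Hello, my name is")[0].outputs[0].text)
```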

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models (see the sketch after this list)
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPUs, and AWS Trainium and Inferentia accelerators
  • Prefix caching support
  • Multi-LoRA support
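
As a concrete taste of that flexibility, here is a minimal offline-inference sketch using vLLM's `LLM` and `SamplingParams` classes; the model name and prompts are placeholders chosen for illustration.

```python
# Minimal offline-inference sketch: load a HuggingFace model and sample from it.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# SamplingParams selects the decoding behavior (temperature, nucleus sampling, n parallel samples, ...).
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# LLM pulls the model weights straight from the HuggingFace Hub (placeholder model shown).
llm = LLM(model="facebook/opt-125m")

# generate() batches the prompts through the engine and returns one RequestOutput per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```

For online serving, the same model can be exposed through the OpenAI-compatible API server (for example via the `vllm serve` command) and queried with any OpenAI client pointed at the local endpoint.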

For more information, check out the following: