# Welcome to vLLM

![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM" class="no-scaled-link" width="60%" }

Easy, fast, and cheap LLM serving for everyone


vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8 (see the sketch after this list)
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill
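
Most of these optimizations surface as plain engine arguments. The snippet below is a minimal, illustrative sketch assuming recent vLLM engine-argument names, not a recommended configuration; the AWQ checkpoint name is only a placeholder.

```python
# Illustrative sketch: loading a quantized checkpoint with chunked prefill enabled.
# The model name is a placeholder; substitute any AWQ-quantized checkpoint.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # placeholder AWQ-quantized model
    quantization="awq",                    # quantization scheme to load (GPTQ, AWQ, FP8, ...)
    enable_chunked_prefill=True,           # split long prefills into smaller scheduling chunks
    gpu_memory_utilization=0.90,           # fraction of GPU memory PagedAttention may manage
)

print(llm.generate("Hello, my name is")[0].outputs[0].text)
```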

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models (see the sketch after this list)
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPUs, and AWS Trainium and Inferentia accelerators
  • Prefix caching support
  • Multi-LoRA support
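
As a concrete taste of that flexibility, here is a minimal offline-inference sketch using vLLM's `LLM` and `SamplingParams` classes; the model name and prompts are placeholders chosen for illustration.

```python
# Minimal offline-inference sketch: load a HuggingFace model and sample from it.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# SamplingParams selects the decoding behavior (temperature, nucleus sampling, n parallel samples, ...).
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# LLM pulls the model weights straight from the HuggingFace Hub (placeholder model shown).
llm = LLM(model="facebook/opt-125m")

# generate() batches the prompts through the engine and returns one RequestOutput per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```

For online serving, the same model can be exposed through the OpenAI-compatible API server (for example via the `vllm serve` command) and queried with any OpenAI client pointed at the local endpoint.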

For more information, check out the following: