A hyperparameter optimization framework for vLLM serving with local and optional Ray execution backends, built with Optuna.
Note: This is a maintained fork
This repository is a fork of the openshift-psap/auto-tuning-vllm project; we are grateful to the original authors for providing the foundation this fork builds upon. The fork exists to adapt the original framework to requirements that differ from the upstream project's scope:
- Simpler deployment for single-node scenarios - Ray is optional, not required
- Testing infrastructure to support safe evolution of the codebase
- Active maintenance for our production use cases (dependency updates, bug fixes)
- Feature expansion as needed for our workloads (additional inference engines, benchmark tools)
- 🎯 Flexible Backends: Run locally (default) or optionally on Ray clusters
- 📊 Benchmarking: Built-in GuideLLM support
- 🗄️ Flexible Storage: SQLite for local use, PostgreSQL for production (optional)
- ⚙️ Easy Configuration: YAML-based study and parameter configuration
- 📈 Multi-Objective: Support for throughput vs latency trade-offs
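Multi-objective here means trading off throughput (maximized) against latency (minimized) rather than collapsing them into one score. As a rough sketch of the underlying idea, independent of this tool's internals, the interesting trials are the Pareto-optimal ones: trials no other trial beats on both objectives at once.

```python
def pareto_front(trials):
    """Return the Pareto-optimal (throughput, latency) pairs.

    Throughput is maximized, latency is minimized. A trial is dominated
    if some other trial is at least as good on both objectives and
    strictly better on at least one.
    """
    front = []
    for t, l in trials:
        dominated = any(
            t2 >= t and l2 <= l and (t2 > t or l2 < l)
            for t2, l2 in trials
        )
        if not dominated:
            front.append((t, l))
    return front

# Illustrative (made-up) trial results: (tokens/s, p99 latency ms)
trials = [(120, 80), (150, 95), (100, 60), (140, 90), (90, 100)]
print(pareto_front(trials))
```

Here (90, 100) is dominated by (120, 80) and drops out; the remaining trials each win on at least one objective. Optuna, which this framework is built on, applies the same dominance logic when a study declares multiple optimization directions.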
For a detailed starter guide, see the Quick Start Guide.
Install the base package for local execution. Add the optional ray extra only if you want distributed execution.
# Clone the maintained fork
git clone https://github.com/InseeFrLab/auto-tuning-vllm.git
cd auto-tuning-vllm
# Basic installation (local execution only)
pip install -e .
# Optional: Install with Ray support for distributed execution
pip install -e ".[ray]"
# Optional: Install with PostgreSQL support
pip install -e ".[postgresql]"

# Run optimization study locally (default backend)
auto-tune-vllm optimize --config config.yaml --max-concurrent-trials 2
# Run optimization study on Ray
auto-tune-vllm optimize --config config.yaml --backend ray --venv-path ./venv --max-concurrent-trials 2
# Resume interrupted study
auto-tune-vllm resume --study-name study_35884
# Stream live logs
auto-tune-vllm logs --study-name study_35884

- Quick Start Guide - Get running in 5 minutes
- Configuration Reference - Complete YAML configuration guide
- Ray Cluster Setup - For distributed optimization (optional)
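For orientation only, a study configuration in this kind of framework typically names the study, points at a storage backend, declares the parameter search space, and selects a benchmark. The YAML below is a hypothetical sketch: every field name is an illustrative assumption, not the actual schema — the Configuration Reference documents the real format.

```yaml
# Hypothetical sketch — field names are illustrative, not the real schema.
study:
  name: my_vllm_study
  storage: sqlite:///optuna.db   # or a postgresql:// URL
parameters:
  max_num_seqs:                  # example vLLM serving knob
    type: int
    min: 64
    max: 512
benchmark:
  provider: guidellm
objectives:
  - maximize: throughput
  - minimize: latency
```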
- Python 3.10+
- NVIDIA GPU with CUDA support (for running vLLM)
- SQLite (included) or PostgreSQL (optional)
Core dependencies are installed with pip install -e .; Ray support is optional and available via pip install -e ".[ray]".
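Both storage options are addressed through SQLAlchemy-style database URLs, the format Optuna accepts for study storage. A minimal sketch of the two URL shapes (the helper functions are illustrative, not part of this project's API):

```python
def sqlite_url(path):
    """SQLite storage URL: note the three slashes before a relative path."""
    return f"sqlite:///{path}"

def postgres_url(user, password, host, db, port=5432):
    """PostgreSQL storage URL in SQLAlchemy form."""
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"

print(sqlite_url("optuna_study.db"))
# e.g. for a shared production database:
print(postgres_url("tuner", "secret", "db.internal", "studies"))
```

SQLite needs no server and suits single-node runs; PostgreSQL lets multiple workers (for example, Ray trial runners on different nodes) write to one shared study.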
This fork is actively being improved. Current work in progress:
- Add comprehensive test suite
- Expand CI/CD to run tests, not just linting
- Dependency hygiene - pin versions, reduce heavy core dependencies
- Improve CLI error messages and validation
- Support for speculative decoding parameters
- Additional benchmark providers beyond GuideLLM
- Support for alternative inference engines (e.g., SGLang)
- Better parameter validation against vLLM CLI args
This fork welcomes contributions. Priority areas:
- Testing - Adding tests for existing functionality
- Documentation - Improving guides and examples
- Core stability - Bug fixes and edge case handling
Apache License 2.0 - see LICENSE file for details.