A simple solution for benchmarking vLLM, SGLang, and TensorRT-LLM on Modal. ⏱️
To install Stopwatch, run the following from the root of the repository:
pip install -e .
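After installing, you can sanity-check that the CLI is on your path (this assumes the standard --help flag, which most Python command-line tools expose):
# Print the available stopwatch subcommands.
stopwatch --help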
To run a single benchmark, you can use the provision-and-benchmark command, which will save your results to a local file.
For example, to run a synchronous (one request after another) benchmark with vLLM and save the results to results.json:
LLM_SERVER_TYPE=vllm
MODEL=meta-llama/Llama-3.1-8B-Instruct
OUTPUT_PATH=results.json
stopwatch provision-and-benchmark $MODEL --output-path $OUTPUT_PATH --llm-server-type $LLM_SERVER_TYPE
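For a quick look at the output without writing any code, you can list the top-level keys of the results file with jq. This is just a schema-agnostic peek and assumes jq is installed; it makes no assumptions about what the results file contains:
# List the top-level keys of the benchmark results.
jq 'keys' $OUTPUT_PATH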
Or, to run a fixed-rate (e.g. 5 requests per second) multi-GPU benchmark with SGLang:
GPU_COUNT=4
GPU_TYPE=H100
LLM_SERVER_TYPE=sglang
RATE_TYPE=constant
REQUESTS_PER_SECOND=5
stopwatch provision-and-benchmark $MODEL --output-path $OUTPUT_PATH --gpu "$GPU_TYPE:$GPU_COUNT" --llm-server-type $LLM_SERVER_TYPE --rate-type $RATE_TYPE --rate $REQUESTS_PER_SECOND --llm-server-config "{\"extra_args\": [\"--tp-size\", \"$GPU_COUNT\"]}"
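The escaped JSON passed to --llm-server-config can be error-prone to write by hand. One way to avoid the manual escaping is to build it with jq first; this is a sketch that assumes jq is installed and produces the same {"extra_args": ["--tp-size", ...]} payload as above:
# Build the server config JSON with jq to avoid manual escaping.
LLM_SERVER_CONFIG=$(jq -cn --arg tp "$GPU_COUNT" '{extra_args: ["--tp-size", $tp]}')
stopwatch provision-and-benchmark $MODEL --output-path $OUTPUT_PATH --gpu "$GPU_TYPE:$GPU_COUNT" --llm-server-type $LLM_SERVER_TYPE --rate-type $RATE_TYPE --rate $REQUESTS_PER_SECOND --llm-server-config "$LLM_SERVER_CONFIG"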
Or, to run a throughput (as many requests as the server can handle) test with TensorRT-LLM:
LLM_SERVER_TYPE=tensorrt-llm
RATE_TYPE=throughput
stopwatch provision-and-benchmark $MODEL --output-path $OUTPUT_PATH --llm-server-type $LLM_SERVER_TYPE --rate-type $RATE_TYPE
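To compare engines side by side, the same throughput settings can be reused in a small shell loop. This is only a sketch that repeats the pattern shown above with each engine's default configuration; whether those defaults suit your model and GPU is yours to verify:
# Run the same throughput test against each supported server type,
# writing one results file per engine.
for SERVER in vllm sglang tensorrt-llm; do
  stopwatch provision-and-benchmark $MODEL --output-path "results-$SERVER.json" --llm-server-type $SERVER --rate-type throughput
done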
To profile vLLM with the PyTorch profiler, use the following command:
MODEL=meta-llama/Llama-3.1-8B-Instruct
NUM_REQUESTS=10
OUTPUT_PATH=trace.json.gz
stopwatch profile $MODEL --output-path $OUTPUT_PATH --num-requests $NUM_REQUESTS
Once profiling is done, the trace will be saved to trace.json.gz, which you can open and visualize at https://ui.perfetto.dev.
Keep in mind that generated traces can get very large, so it is recommended to send only a few requests while profiling.
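To check how large a trace is before uploading it, and roughly how many events it contains, you can inspect the gzipped file from the shell. The size check works on any gzipped file; the event count assumes the trace uses the Chrome trace event format (a top-level traceEvents array), which is what the PyTorch profiler typically emits:
# Report the compressed and uncompressed sizes of the trace.
gzip -l $OUTPUT_PATH
# Count trace events, assuming a top-level "traceEvents" array.
gunzip -c $OUTPUT_PATH | jq '.traceEvents | length'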
We welcome contributions, including those that add tuned benchmarks to our collection. See the CONTRIBUTING file and the Getting Started document for more details on contributing to Stopwatch.
Stopwatch is available under the MIT license. See the LICENSE file for more details.