
[Bug] Different behavior benchmarking w/ request-rate-range vs. separate request-rates #2470

Open

Mutinifni opened this issue Dec 12, 2024 · 1 comment
Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Plotting a rate-vs-latency graph using --request-rate-range versus individual request rates shows different behavior. Since bench_serving already includes a warmup, I'm not sure why this is happening -- maybe the warmup needs to be more substantial? It is also a bit surprising to see latency spike and then drop rather than a knee. If something in my methodology is incorrect, please let me know.

Note that I've only tested this with random prompt processing, using an input size of 128 and an output size of 1. I ran the benchmarks twice and saw different behavior both times.

Here are the graphs:

[rate-vs-latency plots: request-rate-range sweep vs. individual request rates]
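Regarding the warmup question above: if the built-in warmup turns out to be too small, one possible workaround (purely a sketch, not something I have run; the prompt count here is arbitrary) would be to send a throwaway bench_serving pass against the same server before each measured run and discard its results:

# Hypothetical warm-up pass: same workload shape, results ignored
python -m sglang.bench_serving \
    --disable-tqdm \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 1 \
    --random-range-ratio 1.0 \
    --num-prompts 200 > /dev/null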

Reproduction

Model: DeepSeek-V2-Lite

First, I ran with an infinite request rate to find the max request rate, which was ~165 requests per second. Based on this, I ran the following commands:

Server started using:

model_path="/ssd/models/deepseek-ai/DeepSeek-V2-Lite"
python -m sglang.launch_server \
    --model-path $model_path \
    --log-level critical \
    --trust-remote-code \
    --disable-radix-cache \
    --mem-fraction-static 0.9 \
    --tp-size 8

Request-rate-range command:

python -m sglang.bench_serving \
    --disable-tqdm \
    --dataset-name random \
    --multi --request-rate-range 80,260,10 \
    --random-input-len 128 \
    --random-output-len 1 \
    --random-range-ratio 1.0 \
    --num-prompts 1000

Individual request rate command (server restarted for each request rate value; see the wrapper sketch below):

python -m sglang.bench_serving \
    --disable-tqdm \
    --dataset-name random \
    --multi --request-rate-range $request_rate \
    --random-input-len 128 \
    --random-output-len 1 \
    --random-range-ratio 1.0 \
    --num-prompts 1000
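
The per-rate runs can be driven by a wrapper along these lines (only a sketch, not the exact script I used: the rates are assumed to mirror the sweep above, and the restart/wait logic is simplified):

model_path="/ssd/models/deepseek-ai/DeepSeek-V2-Lite"
for request_rate in $(seq 80 10 250); do
    # Launch a fresh server for each request rate
    python -m sglang.launch_server \
        --model-path $model_path \
        --log-level critical \
        --trust-remote-code \
        --disable-radix-cache \
        --mem-fraction-static 0.9 \
        --tp-size 8 &
    server_pid=$!
    sleep 120  # crude wait for model load; polling a health endpoint would be better

    python -m sglang.bench_serving \
        --disable-tqdm \
        --dataset-name random \
        --multi --request-rate-range $request_rate \
        --random-input-len 128 \
        --random-output-len 1 \
        --random-range-ratio 1.0 \
        --num-prompts 1000

    # Tear the server down before the next rate
    kill $server_pid
    wait $server_pid 2>/dev/null
done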

Environment

Python: 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.77
CUDA Driver Version: 565.57.01
PyTorch: 2.5.1+cu124
sglang: 0.4.0.post1
flashinfer: 0.1.6+cu121torch2.4
triton: 3.1.0
transformers: 4.46.3
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.11.9
fastapi: 0.115.5
hf_transfer: 0.1.8
huggingface_hub: 0.26.3
interegular: 0.3.3
modelscope: 1.20.1
orjson: 3.10.12
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.3
multipart: 0.0.19
zmq: 26.2.0
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.56.1
anthropic: 0.40.0
decord: 0.6.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8   CPU Affinity     NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS    24-47    1               N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS    24-47    1               N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     NODE   0-23     0               N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     NODE   0-23     0               N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS    72-95    3               N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS    72-95    3               N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS    48-71    2               N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS    48-71    2               N/A
NIC0    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     SYS     SYS     NODE
NIC3    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     SYS     SYS     NODE
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS
NIC7    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS
NIC8    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8


Hypervisor vendor: Microsoft
ulimit soft: 1024
Mutinifni (Author) commented:
Likely related: benchmark performance numbers are not very consistent, even with offline benchmarks.

For example, I ran this same command twice and saw a substantial performance difference (~20%):

python -m sglang.bench_offline_throughput \
    --model-path /ssd/models/deepseek-ai/DeepSeek-V2-Lite \
    --trust-remote-code \
    --disable-radix-cache \
    --mem-fraction-static 0.9 \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 1 \
    --random-range-ratio 1.0 \
    --num-prompts 1000 \
    --tp-size 8

First run:

Successful requests:                     1000
Benchmark duration (s):                  6.48
Total input tokens:                      128000
Total generated tokens:                  1000
Request throughput (req/s):              154.30
Input token throughput (tok/s):          19750.00
Output token throughput (tok/s):         154.30
Total token throughput (tok/s):          19904.30

Second run:

Backend:                                 engine
Successful requests:                     1000
Benchmark duration (s):                  5.48
Total input tokens:                      128000
Total generated tokens:                  1000
Request throughput (req/s):              182.50
Input token throughput (tok/s):          23360.05
Output token throughput (tok/s):         182.50
Total token throughput (tok/s):          23542.55

It also varies in subsequent runs.
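
A hypothetical way to quantify this variance (not something I have run systematically) is to repeat the identical command several times and compare the reported throughput lines:

# Repeat the same offline benchmark and collect the throughput numbers (illustrative only)
for i in 1 2 3 4 5; do
    python -m sglang.bench_offline_throughput \
        --model-path /ssd/models/deepseek-ai/DeepSeek-V2-Lite \
        --trust-remote-code \
        --disable-radix-cache \
        --mem-fraction-static 0.9 \
        --dataset-name random \
        --random-input-len 128 \
        --random-output-len 1 \
        --random-range-ratio 1.0 \
        --num-prompts 1000 \
        --tp-size 8 | tee offline_run_$i.log
done
grep "Total token throughput" offline_run_*.log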
