
[Bug] Different behavior benchmarking w/ request-rate-range vs. separate request-rates #2470

Open

Mutinifni opened this issue Dec 12, 2024 · 1 comment
Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Plotting a rate-vs-latency graph using --request-rate-range versus individual request rates shows different behavior. Since bench_serving already includes a warmup, I'm not sure why this is happening -- maybe the warmup needs to be more substantial? It is also a bit surprising to see latency spike and then drop rather than a knee. If something in my methodology is incorrect, please let me know.

Note that I've only tested this with random prompt processing, using an input size of 128 and an output size of 1. I ran the benchmarks twice and saw different behavior both times.

Here are the graphs:

[rate-vs-latency plots: request-rate-range sweep vs. individual request rates]
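Regarding the warmup question above: if the built-in warmup turns out to be too small, one possible workaround (purely a sketch, not something I have run; the prompt count here is arbitrary) would be to send a throwaway bench_serving pass against the same server before each measured run and discard its results:

# Hypothetical warm-up pass: same workload shape, results ignored
python -m sglang.bench_serving \
    --disable-tqdm \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 1 \
    --random-range-ratio 1.0 \
    --num-prompts 200 > /dev/null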

Reproduction

Model: DeepSeek-V2-Lite

First, I ran with an infinite request rate to find the max request rate, which was ~165 requests per second. Based on this, I ran the following commands:

Server started using:

model_path="/ssd/models/deepseek-ai/DeepSeek-V2-Lite"
python -m sglang.launch_server \
    --model-path $model_path \
    --log-level critical \
    --trust-remote-code \
    --disable-radix-cache \
    --mem-fraction-static 0.9 \
    --tp-size 8

Request-rate-range command:

python -m sglang.bench_serving \
    --disable-tqdm \
    --dataset-name random \
    --multi --request-rate-range 80,260,10 \
    --random-input-len 128 \
    --random-output-len 1 \
    --random-range-ratio 1.0 \
    --num-prompts 1000

Individual request rate command (server restarted for each request rate value; see the wrapper sketch below):

python -m sglang.bench_serving \
    --disable-tqdm \
    --dataset-name random \
    --multi --request-rate-range $request_rate \
    --random-input-len 128 \
    --random-output-len 1 \
    --random-range-ratio 1.0 \
    --num-prompts 1000
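
The per-rate runs can be driven by a wrapper along these lines (only a sketch, not the exact script I used: the rates are assumed to mirror the sweep above, and the restart/wait logic is simplified):

model_path="/ssd/models/deepseek-ai/DeepSeek-V2-Lite"
for request_rate in $(seq 80 10 250); do
    # Launch a fresh server for each request rate
    python -m sglang.launch_server \
        --model-path $model_path \
        --log-level critical \
        --trust-remote-code \
        --disable-radix-cache \
        --mem-fraction-static 0.9 \
        --tp-size 8 &
    server_pid=$!
    sleep 120  # crude wait for model load; polling a health endpoint would be better

    python -m sglang.bench_serving \
        --disable-tqdm \
        --dataset-name random \
        --multi --request-rate-range $request_rate \
        --random-input-len 128 \
        --random-output-len 1 \
        --random-range-ratio 1.0 \
        --num-prompts 1000

    # Tear the server down before the next rate
    kill $server_pid
    wait $server_pid 2>/dev/null
done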

Environment

Python: 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.77
CUDA Driver Version: 565.57.01
PyTorch: 2.5.1+cu124
sglang: 0.4.0.post1
flashinfer: 0.1.6+cu121torch2.4
triton: 3.1.0
transformers: 4.46.3
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.11.9
fastapi: 0.115.5
hf_transfer: 0.1.8
huggingface_hub: 0.26.3
interegular: 0.3.3
modelscope: 1.20.1
orjson: 3.10.12
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.3
multipart: 0.0.19
zmq: 26.2.0
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.56.1
anthropic: 0.40.0
decord: 0.6.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8   CPU Affinity     NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS    24-47    1               N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS    24-47    1               N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     NODE   0-23     0               N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     NODE   0-23     0               N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS    72-95    3               N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS    72-95    3               N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS    48-71    2               N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS    48-71    2               N/A
NIC0    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     SYS     SYS     NODE
NIC3    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     SYS     SYS     NODE
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS
NIC7    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS
NIC8    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8


Hypervisor vendor: Microsoft
ulimit soft: 1024
Mutinifni (Author) commented:
Likely related: benchmark performance numbers are not very consistent, even with offline benchmarks.

For example, I ran this same command twice and saw a substantial performance difference (~20%):

python -m sglang.bench_offline_throughput \
    --model-path /ssd/models/deepseek-ai/DeepSeek-V2-Lite \
    --trust-remote-code \
    --disable-radix-cache \
    --mem-fraction-static 0.9 \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 1 \
    --random-range-ratio 1.0 \
    --num-prompts 1000 \
    --tp-size 8

First run:

Successful requests:                     1000
Benchmark duration (s):                  6.48
Total input tokens:                      128000
Total generated tokens:                  1000
Request throughput (req/s):              154.30
Input token throughput (tok/s):          19750.00
Output token throughput (tok/s):         154.30
Total token throughput (tok/s):          19904.30

Second run:

Backend:                                 engine
Successful requests:                     1000
Benchmark duration (s):                  5.48
Total input tokens:                      128000
Total generated tokens:                  1000
Request throughput (req/s):              182.50
Input token throughput (tok/s):          23360.05
Output token throughput (tok/s):         182.50
Total token throughput (tok/s):          23542.55

It also varies in subsequent runs.
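
A hypothetical way to quantify this variance (not something I have run systematically) is to repeat the identical command several times and compare the reported throughput lines:

# Repeat the same offline benchmark and collect the throughput numbers (illustrative only)
for i in 1 2 3 4 5; do
    python -m sglang.bench_offline_throughput \
        --model-path /ssd/models/deepseek-ai/DeepSeek-V2-Lite \
        --trust-remote-code \
        --disable-radix-cache \
        --mem-fraction-static 0.9 \
        --dataset-name random \
        --random-input-len 128 \
        --random-output-len 1 \
        --random-range-ratio 1.0 \
        --num-prompts 1000 \
        --tp-size 8 | tee offline_run_$i.log
done
grep "Total token throughput" offline_run_*.log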
