[Bug] Eagle2 has an unstable sampling rate under concurrent requests #2537

coolhok · 2024-12-21T06:42:48Z

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
5. Please use English, otherwise it will be closed.

Describe the bug

Definition: req_avg_sample_tokens = completion_tokens / completion_tokens_wo_jump_forward (printed as sample_tokens in the logs below).
1) When the same request is repeated 10 times sequentially, completion_tokens and req_avg_sample_tokens are very stable.

client log
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=493, sample_tokens = 2.581151832460733

2) When two terminals are opened and both clients send the same request simultaneously, completion_tokens and req_avg_sample_tokens become very unstable, and the sampling rate falls short of the expected value under concurrency.

client 1 log
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=435, sample_tokens = 2.377049180327869
completion_tokens=499, sample_tokens = 2.3990384615384617
completion_tokens=450, sample_tokens = 2.393617021276596
completion_tokens=493, sample_tokens = 2.5412371134020617
completion_tokens=450, sample_tokens = 2.4861878453038675
completion_tokens=435, sample_tokens = 2.4166666666666665
completion_tokens=435, sample_tokens = 2.443820224719101
completion_tokens=435, sample_tokens = 2.403314917127072

client 2 log
completion_tokens=448, sample_tokens = 2.448087431693989
completion_tokens=435, sample_tokens = 2.403314917127072
completion_tokens=435, sample_tokens = 2.3262032085561497
completion_tokens=435, sample_tokens = 2.364130434782609
completion_tokens=435, sample_tokens = 2.3138297872340425
completion_tokens=493, sample_tokens = 2.465
completion_tokens=435, sample_tokens = 2.403314917127072
completion_tokens=435, sample_tokens = 2.3513513513513513
completion_tokens=435, sample_tokens = 2.3513513513513513
completion_tokens=493, sample_tokens = 2.581151832460733
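
To make the spread in these logs easier to compare, here is a minimal sketch that parses log lines in the format printed above and summarizes them. The helper name `summarize` and the regular expression are illustrative assumptions, not part of sglang; the sample data is the client 1 log copied from this report.

```python
import re
from statistics import mean

# Matches lines such as: completion_tokens=493, sample_tokens = 2.581151832460733
LINE_RE = re.compile(r"completion_tokens=(\d+), sample_tokens = ([0-9.]+)")


def summarize(log_text):
    """Summarize completion_tokens and sample_tokens from a client log (hypothetical helper)."""
    rows = [(int(tokens), float(ratio)) for tokens, ratio in LINE_RE.findall(log_text)]
    tokens = [t for t, _ in rows]
    ratios = [r for _, r in rows]
    return {
        "requests": len(rows),
        "distinct_completion_tokens": sorted(set(tokens)),
        "ratio_min": min(ratios),
        "ratio_max": max(ratios),
        "ratio_mean": mean(ratios),
    }


# Client 1 log from the concurrent run, copied from this report.
CLIENT_1_LOG = """
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=493, sample_tokens = 2.581151832460733
completion_tokens=435, sample_tokens = 2.377049180327869
completion_tokens=499, sample_tokens = 2.3990384615384617
completion_tokens=450, sample_tokens = 2.393617021276596
completion_tokens=493, sample_tokens = 2.5412371134020617
completion_tokens=450, sample_tokens = 2.4861878453038675
completion_tokens=435, sample_tokens = 2.4166666666666665
completion_tokens=435, sample_tokens = 2.443820224719101
completion_tokens=435, sample_tokens = 2.403314917127072
"""

summary = summarize(CLIENT_1_LOG)
print(summary)
```

With temperature=0 and an identical prompt, one would expect a single distinct completion_tokens value, as in the sequential run; the concurrent log instead contains four.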

Reproduction

API server

python3 -m sglang.launch_server --model /mnt/data/model_hub/Qwen2-7B-Instruct --max-prefill-tokens 16384  --trust-remote-code --tp 1 --dp 1  --mem-fraction-static 0.5 --draft-model-path /mnt/data/model_hub/EAGLE-Qwen2-7B-Instruct --num-speculative-steps 4 --eagle-topk 2 --num-draft-tokens 8 --speculative-algorithm EAGLE --disable-radix-cache

client

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

for _ in range(10):
    response = client.chat.completions.create(
        model="/mnt/data/model_hub/Qwen2-7B-Instruct/",
        messages=[
            {"role": "user", "content": "What are the mental triggers in Jeff Walker's Product Launch Formula and \"Launch\" book?"},
        ],
        temperature=0,
        max_tokens=1024,
    )
    completion_tokens = response.usage.completion_tokens
    completion_tokens_wo_jump_forward = response.usage.completion_tokens_wo_jump_forward
    print(f"{completion_tokens=}, sample_tokens = {completion_tokens/completion_tokens_wo_jump_forward}")
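
The two-terminal case could also be reproduced from a single process. The following is a sketch under stated assumptions, not the script used for the logs above: `make_openai_request`, `run_client`, and `run_concurrent` are hypothetical helper names, and the `completion_tokens_wo_jump_forward` usage field is taken from the client script in this report. It runs the same request loop in two threads so that the requests overlap on the server.

```python
from concurrent.futures import ThreadPoolExecutor

PROMPT = "What are the mental triggers in Jeff Walker's Product Launch Formula and \"Launch\" book?"


def make_openai_request(base_url="http://127.0.0.1:30000/v1",
                        model="/mnt/data/model_hub/Qwen2-7B-Instruct/"):
    """Return a zero-argument callable that sends one request and returns its usage."""
    import openai  # imported lazily so the helpers below can be used without it

    client = openai.Client(base_url=base_url, api_key="None")

    def send():
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0,
            max_tokens=1024,
        )
        return response.usage

    return send


def run_client(send, n_requests):
    """Send n_requests sequentially; return (completion_tokens, sample_tokens) pairs."""
    results = []
    for _ in range(n_requests):
        usage = send()
        ratio = usage.completion_tokens / usage.completion_tokens_wo_jump_forward
        results.append((usage.completion_tokens, ratio))
    return results


def run_concurrent(send_factory, n_clients=2, n_requests=10):
    """Run n_clients request loops in parallel threads, one sender per client."""
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        futures = [
            pool.submit(run_client, send_factory(), n_requests)
            for _ in range(n_clients)
        ]
        return [future.result() for future in futures]
```

Against a running server, `run_concurrent(make_openai_request)` returns one list of (completion_tokens, sample_tokens) pairs per client, which can be printed in the same format as the logs in this report.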

Environment

env

Python: 3.10.13 (main, Oct 7 2024, 19:00:16) [GCC 11.4.0]
CUDA available: True
GPU 0 Compute Capability: 8.0
NVCC: Cuda compilation tools, release 12.3, V12.3.
CUDA Driver Version: 1.1.0-caab6d
PyTorch: 2.3.0
sglang: 0.3.4.post2
flashinfer: 0.1.6
triton: 2.2.0
transformers: 4.45.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.25.2
aiohttp: 3.10.10
fastapi: 0.115.2
hf_transfer: Module Not Found
huggingface_hub: 0.26.0
interegular: 0.3.3
packaging: 21.3
PIL: 10.4.0
psutil: 6.1.0
pydantic: 2.9.2
uvicorn: 0.32.0
uvloop: 0.21.0
zmq: 26.2.0
vllm: 0.6.3.dev202+gd47bfb0e.d20241023
multipart: 0.0.12
openai: 1.52.0
anthropic: Module Not Found
litellm: Module Not Found
NVIDIA Topology:
        GPU0    CPU Affinity    NUMA Affinity
GPU0    X       0-15            0

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 102400

P.S. We found that the code at commit "bceff076caa266ea2543ba9237f5eafcd8770ffd" is faster than the latest code.
