Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
5. Please use English, otherwise it will be closed.
Describe the bug
Definition: req_avg_sample_tokens = completion_tokens / completion_tokens_wo_jump_forward
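As an illustrative example (the numbers are not from this report): a request that returns completion_tokens = 300 with completion_tokens_wo_jump_forward = 250 gives req_avg_sample_tokens = 300 / 250 = 1.2.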
1) Repeat the same request 10 times: completion_tokens and req_avg_sample_tokens are very stable.
2) Open 2 terminals and send two requests simultaneously: completion_tokens and req_avg_sample_tokens become very unstable, and the expected sampling rate and concurrency are not met.
Reproduction
client
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

# Send the same request 10 times and report the jump-forward sampling ratio.
for _ in range(10):
    response = client.chat.completions.create(
        model="/mnt/data/model_hub/Qwen2-7B-Instruct/",
        messages=[
            {"role": "user", "content": "What are the mental triggers in Jeff Walker's Product Launch Formula and \"Launch\" book?"},
        ],
        temperature=0,
        max_tokens=1024,
    )
    completion_tokens = response.usage.completion_tokens
    completion_tokens_wo_jump_forward = response.usage.completion_tokens_wo_jump_forward
    print(f"{completion_tokens=}, sample_tokens = {completion_tokens/completion_tokens_wo_jump_forward}")
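For the concurrent case, here is a minimal sketch that uses two threads in place of two terminals; the helper name send_request and the use of threading are illustrative and not from the original report:

import threading

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

def send_request():
    # Same request body as the sequential loop above.
    response = client.chat.completions.create(
        model="/mnt/data/model_hub/Qwen2-7B-Instruct/",
        messages=[
            {"role": "user", "content": "What are the mental triggers in Jeff Walker's Product Launch Formula and \"Launch\" book?"},
        ],
        temperature=0,
        max_tokens=1024,
    )
    completion_tokens = response.usage.completion_tokens
    # completion_tokens_wo_jump_forward is the sglang-specific usage field used above.
    completion_tokens_wo_jump_forward = response.usage.completion_tokens_wo_jump_forward
    print(f"{completion_tokens=}, sample_tokens = {completion_tokens/completion_tokens_wo_jump_forward}")

# Two concurrent requests stand in for the two terminals.
threads = [threading.Thread(target=send_request) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()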
Environment
env
Python: 3.10.13 (main, Oct 7 2024, 19:00:16) [GCC 11.4.0]
CUDA available: True
GPU 0 Compute Capability: 8.0
NVCC: Cuda compilation tools, release 12.3, V12.3.
CUDA Driver Version: 1.1.0-caab6d
PyTorch: 2.3.0
sglang: 0.3.4.post2
flashinfer: 0.1.6
triton: 2.2.0
transformers: 4.45.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.25.2
aiohttp: 3.10.10
fastapi: 0.115.2
hf_transfer: Module Not Found
huggingface_hub: 0.26.0
interegular: 0.3.3
packaging: 21.3
PIL: 10.4.0
psutil: 6.1.0
pydantic: 2.9.2
uvicorn: 0.32.0
uvloop: 0.21.0
zmq: 26.2.0
vllm: 0.6.3.dev202+gd47bfb0e.d20241023
multipart: 0.0.12
openai: 1.52.0
anthropic: Module Not Found
litellm: Module Not Found
NVIDIA Topology:
      GPU0   CPU Affinity   NUMA Affinity
GPU0  X      0-15           0
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 102400
P.S. We find that the code at commit "bceff076caa266ea2543ba9237f5eafcd8770ffd" is faster than the latest code.