Replies: 1 comment 2 replies
-
Great question! A 3-5 second delay makes you think it might not really be streaming. How did you quantize, if I may ask?
-
I am using the built-in HTTP server (`python -m tools.api_server`) and getting responses through client code based on `api_client`. The "Time before first chunk" print always shows a value in the range of 3.5-6 seconds for a text of a few medium-length sentences and around 1.5-3 seconds for a single sentence. Is there any way to decrease it? Otherwise the speed is good.
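Roughly, the measurement looks like this (a simplified sketch rather than the exact code; the endpoint path and payload fields are placeholders, so adjust them to whatever your `tools.api_server` build actually exposes):

```python
import time
import requests

# Placeholder endpoint and payload; adjust to the server's real API.
URL = "http://127.0.0.1:8080/v1/tts"
payload = {
    "text": "A few medium-length sentences of text to synthesize.",
    "streaming": True,
}

start = time.perf_counter()
first_chunk_at = None

with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    with open("out.wav", "wb") as f:
        for chunk in resp.iter_content(chunk_size=4096):
            if not chunk:
                continue
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter()
                # This is the delay in question: time until the first audio bytes arrive.
                print(f"Time before first chunk: {first_chunk_at - start:.2f}s")
            f.write(chunk)

print(f"Total time: {time.perf_counter() - start:.2f}s")
```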
RTX 3060 12 GB, CUDA 12.4, Torch 2.5.1, `--compile` flag enabled; the model is quantized (merged with a LoRA, but it is the same with the stock model); inference speed is in the range of 90-110 it/s. Everything is fast except this few-second delay.
What I tried:
- `--compile` enabled
- `chunk_size` 1024/2048/4096/8192 works pretty much the same in my case
- `chunk_length` less than 200 makes the result worse
- `use_memory_cache` toggle doesn't change the delay (a quick sweep over these settings is sketched below)
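A small sweep over those request-level settings makes it easy to compare their effect on the first-chunk delay (the `chunk_length` and `use_memory_cache` names follow the list above, but the payload shape and endpoint are assumptions, not a documented schema):

```python
import itertools
import time
import requests

URL = "http://127.0.0.1:8080/v1/tts"  # placeholder; match your server
TEXT = "A few medium-length sentences, enough to reproduce the delay."

def time_to_first_chunk(payload):
    """Seconds until the first streamed chunk arrives for a given payload."""
    start = time.perf_counter()
    with requests.post(URL, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                return time.perf_counter() - start
    return float("inf")

# Sweep chunk_length and use_memory_cache; both knobs are the ones tried above.
for chunk_length, cache in itertools.product((100, 200, 300), ("on", "off")):
    payload = {
        "text": TEXT,
        "streaming": True,
        "chunk_length": chunk_length,
        "use_memory_cache": cache,
    }
    print(f"chunk_length={chunk_length} use_memory_cache={cache}: "
          f"{time_to_first_chunk(payload):.2f}s")
```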