Replies: 1 comment 2 replies
-
Great question! A 3-5 second delay makes you think it might not really be streaming. How did you quantize, if I may ask?
-
I am using the built-in HTTP server (`python -m tools.api_server`) and getting responses through client code based on `api_client`. The "Time before first chunk" print always shows a value in the range of 3.5-6 seconds for a text of a few medium-length sentences and around 1.5-3 seconds for a single sentence. Is there any way to decrease it? Otherwise the speed is good.
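Roughly, the measurement looks like this (a simplified sketch rather than the exact code; the endpoint path and payload fields are placeholders, so adjust them to whatever your `tools.api_server` build actually exposes):

```python
import time
import requests

# Placeholder endpoint and payload; adjust to the server's real API.
URL = "http://127.0.0.1:8080/v1/tts"
payload = {
    "text": "A few medium-length sentences of text to synthesize.",
    "streaming": True,
}

start = time.perf_counter()
first_chunk_at = None

with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    with open("out.wav", "wb") as f:
        for chunk in resp.iter_content(chunk_size=4096):
            if not chunk:
                continue
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter()
                # This is the delay in question: time until the first audio bytes arrive.
                print(f"Time before first chunk: {first_chunk_at - start:.2f}s")
            f.write(chunk)

print(f"Total time: {time.perf_counter() - start:.2f}s")
```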
RTX 3060 12 GB, CUDA 12.4, Torch 2.5.1, `--compile` flag enabled; the model is quantized (merged with a LoRA, but it is the same with the stock model); inference speed is in the range of 90-110 it/s. Everything is fast except this few-second delay.
What I tried:
- `--compile` enabled
- `chunk_size` 1024/2048/4096/8192 works pretty much the same in my case
- `chunk_length` less than 200 makes the result worse
- `use_memory_cache` toggle doesn't change the delay (a quick sweep over these settings is sketched below)
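A small sweep over those request-level settings makes it easy to compare their effect on the first-chunk delay (the `chunk_length` and `use_memory_cache` names follow the list above, but the payload shape and endpoint are assumptions, not a documented schema):

```python
import itertools
import time
import requests

URL = "http://127.0.0.1:8080/v1/tts"  # placeholder; match your server
TEXT = "A few medium-length sentences, enough to reproduce the delay."

def time_to_first_chunk(payload):
    """Seconds until the first streamed chunk arrives for a given payload."""
    start = time.perf_counter()
    with requests.post(URL, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                return time.perf_counter() - start
    return float("inf")

# Sweep chunk_length and use_memory_cache; both knobs are the ones tried above.
for chunk_length, cache in itertools.product((100, 200, 300), ("on", "off")):
    payload = {
        "text": TEXT,
        "streaming": True,
        "chunk_length": chunk_length,
        "use_memory_cache": cache,
    }
    print(f"chunk_length={chunk_length} use_memory_cache={cache}: "
          f"{time_to_first_chunk(payload):.2f}s")
```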