
How do I speed up speech generation? #96

Open · 6 tasks done
caixianyu opened this issue Jul 9, 2024 · 3 comments
Labels: help wanted (Extra attention is needed), question (Further information is requested), upstream (Dependency on upstream fixes)

Comments

@caixianyu

Read README.md and dependencies.md

  • I have read the README.md and dependencies.md files

Searched issues and discussions

  • I have confirmed that no previous issue or discussion covers this bug

Checked the Forge version

  • I have confirmed the problem occurs on the latest code or a stable release

Confirm the issue is unrelated to the API

  • I have confirmed the issue is unrelated to the API

Confirm the issue is unrelated to the WebUI

  • I have confirmed the issue is unrelated to the WebUI

Confirm the issue is unrelated to Finetune

  • I have confirmed the issue is unrelated to Finetune

Your issue

Are there any suggestions or methods we could try? Code optimization, model optimization, or adding more hardware? We want to build an online real-time voice reply service, but right now even a short reply of a few characters takes too long to generate.

@zhzLuke96
Member

If you are running a long-lived service, you can enable --compile, which generally gives at least a 1.5x speedup, but you may need to trigger the shape warm-up/precompilation yourself, and the warm-up is fairly slow.
Also, if you have a recent GPU (Turing, Ampere, Ada, or Hopper architecture), you can install flash_attn and start the service with the --flash_attn flag to get an additional speedup.

Beyond that, you can try enabling streaming generation; it is basically usable now and can be interrupted.
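A minimal sketch of what manually triggering the warm-up could look like, assuming the Forge API is reachable at http://localhost:7870 and exposes a /v1/tts endpoint that takes a text parameter (the address, endpoint path, and parameter name are assumptions, not documented behavior; adjust them to the actual server):

```python
# warmup.py - hypothetical warm-up script; base URL, endpoint, and parameters
# are assumptions about the Forge API, not its documented interface.
import requests

BASE_URL = "http://localhost:7870"  # assumed server address

# Send a few requests of different lengths so torch.compile sees the input
# shapes it will encounter in production and compiles them up front.
warmup_texts = [
    "你好",
    "这是一次用于预热编译的测试请求。",
    "This is a longer warm-up sentence used to cover another input shape.",
]

for text in warmup_texts:
    resp = requests.get(f"{BASE_URL}/v1/tts", params={"text": text}, timeout=600)
    print(resp.status_code, len(resp.content), "bytes for:", text[:20])
```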

zhzLuke96 added the question label Jul 9, 2024
@caixianyu
Author

When I enable flash_attn I get this warning: "Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dtype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the with torch.autocast(device_type='torch_device'): decorator, or load the model with the torch_dtype argument. Example: model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)". Also, enabling --compile at the same time prevents the service from starting.
With --compile enabled on its own, what does "trigger the shape warm-up/precompilation yourself" mean? Does it just mean the first generation is slow? Once it is actually running, I do not notice much of a speedup either.

When calling the API via curl with streaming enabled, how do I stream the generated mp3 file?
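For reference, the loading pattern that warning points at looks roughly like this in transformers; the model name is only the one from the warning's own example, and where the equivalent call lives inside Forge is an assumption:

```python
# Sketch of the half-precision + FlashAttention-2 loading the warning suggests.
# "openai/whisper-tiny" is just the example model from the warning text; the
# model Forge actually loads (a Llama-based module) is configured elsewhere.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "openai/whisper-tiny",
    attn_implementation="flash_attention_2",  # requires flash_attn to be installed
    torch_dtype=torch.float16,                # FA2 only supports fp16/bf16
)
```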

@zhzLuke96
Member

zhzLuke96 commented Jul 10, 2024

  • The flash_attn warning is a bit strange; half precision should be enabled by default. The upstream logic there was only updated recently and I ported it over just a few days ago, so there may still be bugs to investigate: [ISSUE] flash_attn f16 warning #97

  • Regarding "enabling --compile at the same time prevents the service from starting": what is the exact error?

  • Warm-up just means the first generation is slow because it has to compile, so you can trigger the compilation manually with a request.

  • As for streaming generation, from the client's point of view there is not much difference between the streaming endpoint and the normal one; both read the same encoded structure (see the sketch after this list for fetching the stream).
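A minimal sketch of consuming the streamed audio over HTTP, assuming a streaming TTS endpoint at /v1/tts with a stream flag (the URL, path, and parameter names are assumptions; the relevant part is reading response chunks as they arrive instead of waiting for the full file):

```python
# stream_tts.py - hypothetical client; the endpoint path and parameters are
# assumptions about the Forge API, not its documented interface.
import requests

url = "http://localhost:7870/v1/tts"  # assumed server address and endpoint
params = {"text": "你好,世界", "stream": True}

with requests.get(url, params=params, stream=True) as resp:
    resp.raise_for_status()
    with open("reply.mp3", "wb") as f:
        # Chunks arrive while the audio is still being generated, so playback
        # (or forwarding to a player) can start before generation finishes.
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                f.write(chunk)
```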

zhzLuke96 added the help wanted and upstream labels Jul 12, 2024