
How do I speed up speech generation? #96

Open · 6 tasks done
caixianyu opened this issue Jul 9, 2024 · 3 comments
Labels: help wanted (Extra attention is needed), question (Further information is requested), upstream (Dependency on upstream fixes)

Comments

@caixianyu

Read README.md and dependencies.md

  • I have read the README.md and dependencies.md files

Searched issues and discussions

  • I have confirmed that no previous issue or discussion covers this bug

Checked the Forge version

  • I have confirmed the problem occurs on the latest code or a stable release

Confirm the issue is unrelated to the API

  • I have confirmed the issue is unrelated to the API

Confirm the issue is unrelated to the WebUI

  • I have confirmed the issue is unrelated to the WebUI

Confirm the issue is unrelated to Finetune

  • I have confirmed the issue is unrelated to Finetune

Your issue

Are there any suggestions or methods we could try? Code optimization, model optimization, or adding more hardware? We want to build an online real-time voice reply service, but right now even a short reply of a few characters takes too long to generate.

@zhzLuke96
Member

If you are running a long-lived service, you can enable --compile, which generally gives at least a 1.5x speedup, but you may need to trigger the shape warm-up/precompilation yourself, and the warm-up is fairly slow.
Also, if you have a recent GPU (Turing, Ampere, Ada, or Hopper architecture), you can install flash_attn and start the service with the --flash_attn flag to get an additional speedup.

Beyond that, you can try enabling streaming generation; it is basically usable now and can be interrupted.
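A minimal sketch of what manually triggering the warm-up could look like, assuming the Forge API is reachable at http://localhost:7870 and exposes a /v1/tts endpoint that takes a text parameter (the address, endpoint path, and parameter name are assumptions, not documented behavior; adjust them to the actual server):

```python
# warmup.py - hypothetical warm-up script; base URL, endpoint, and parameters
# are assumptions about the Forge API, not its documented interface.
import requests

BASE_URL = "http://localhost:7870"  # assumed server address

# Send a few requests of different lengths so torch.compile sees the input
# shapes it will encounter in production and compiles them up front.
warmup_texts = [
    "你好",
    "这是一次用于预热编译的测试请求。",
    "This is a longer warm-up sentence used to cover another input shape.",
]

for text in warmup_texts:
    resp = requests.get(f"{BASE_URL}/v1/tts", params={"text": text}, timeout=600)
    print(resp.status_code, len(resp.content), "bytes for:", text[:20])
```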

zhzLuke96 added the question label Jul 9, 2024
@caixianyu
Author

When I enable flash_attn I get this warning: "Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dtype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the with torch.autocast(device_type='torch_device'): decorator, or load the model with the torch_dtype argument. Example: model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)". Also, enabling --compile at the same time prevents the service from starting.
With --compile enabled on its own, what does "trigger the shape warm-up/precompilation yourself" mean? Does it just mean the first generation is slow? Once it is actually running, I do not notice much of a speedup either.

When calling the API via curl with streaming enabled, how do I stream the generated mp3 file?
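For reference, the loading pattern that warning points at looks roughly like this in transformers; the model name is only the one from the warning's own example, and where the equivalent call lives inside Forge is an assumption:

```python
# Sketch of the half-precision + FlashAttention-2 loading the warning suggests.
# "openai/whisper-tiny" is just the example model from the warning text; the
# model Forge actually loads (a Llama-based module) is configured elsewhere.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "openai/whisper-tiny",
    attn_implementation="flash_attention_2",  # requires flash_attn to be installed
    torch_dtype=torch.float16,                # FA2 only supports fp16/bf16
)
```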

@zhzLuke96
Member

zhzLuke96 commented Jul 10, 2024

  • The flash_attn warning is a bit strange; half precision should be enabled by default. The upstream logic there was only updated recently and I ported it over just a few days ago, so there may still be bugs to investigate: [ISSUE] flash_attn f16 warning #97

  • Regarding "enabling --compile at the same time prevents the service from starting": what is the exact error?

  • Warm-up just means the first generation is slow because it has to compile, so you can trigger the compilation manually with a request.

  • As for streaming generation, from the client's point of view there is not much difference between the streaming endpoint and the normal one; both read the same encoded structure (see the sketch after this list for fetching the stream).
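A minimal sketch of consuming the streamed audio over HTTP, assuming a streaming TTS endpoint at /v1/tts with a stream flag (the URL, path, and parameter names are assumptions; the relevant part is reading response chunks as they arrive instead of waiting for the full file):

```python
# stream_tts.py - hypothetical client; the endpoint path and parameters are
# assumptions about the Forge API, not its documented interface.
import requests

url = "http://localhost:7870/v1/tts"  # assumed server address and endpoint
params = {"text": "你好,世界", "stream": True}

with requests.get(url, params=params, stream=True) as resp:
    resp.raise_for_status()
    with open("reply.mp3", "wb") as f:
        # Chunks arrive while the audio is still being generated, so playback
        # (or forwarding to a player) can start before generation finishes.
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                f.write(chunk)
```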

zhzLuke96 added the help wanted and upstream labels Jul 12, 2024