[Bug] mixtral moe fp16 greedy decode output differ each request #2890
Comments
Because it uses continuous batching, the inference shape probably varies in each forward iteration.
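A hedged illustration (plain NumPy, not LMDeploy code) of the underlying numerical effect: fp16 addition is not associative, so summing the same values in a different order (as happens when continuous batching changes tensor shapes, and with them the reduction order) can give slightly different results.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float16)

# Sum sequentially, accumulating in fp16.
s_forward = np.float16(0)
for v in x:
    s_forward = np.float16(s_forward + v)

# Sum the exact same values in a blocked order, still in fp16.
s_chunked = np.float16(0)
for chunk in x.reshape(64, 64):
    s_chunked = np.float16(s_chunked + chunk.sum(dtype=np.float16))

print(s_forward, s_chunked)  # typically differ in the low bits
```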
I have changed my scripts.
server
client
results:
Two different results, cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435 and b2396445b9553b88c4cafaa992e63a71694eb93bb4ddea1a878421ade6c8c443, show up; I'm investigating it. If the server is restarted, two other outputs, 10a504 and 97437af, come out instead.
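For reference, a hedged sketch of a determinism check like the one described above; the endpoint URL, model id, and prompt are placeholders, not the author's actual scripts.

```python
import hashlib
import requests

URL = "http://localhost:23333/v1/chat/completions"  # lmdeploy's default port
PROMPT = "..."  # put the test prompt here

def query_once():
    resp = requests.post(URL, json={
        "model": "mixtral",  # placeholder: whatever id the api_server serves
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 0.0,  # greedy decoding
        "max_tokens": 256,
    })
    return resp.json()["choices"][0]["message"]["content"]

# Hash each completion; more than one distinct hash means nondeterminism.
hashes = {hashlib.sha256(query_once().encode()).hexdigest() for _ in range(8)}
print(len(hashes), hashes)
```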
Qwen2-57B-A14B-Instruct's output differs after the server is restarted, but stays the same if the server is not restarted. A non-MoE model's output stays the same even after restarting the server, so does some code in the MoE implementation introduce randomness?
cc @lzhangzz
This comment can be ignored.
I couldn't reproduce it. Could you share the test prompt? Also, is the lmdeploy version 0.6.4?
Thanks for the reply. I'll send you the reproducing prompt tomorrow; it's around 1000 tokens long. Tracing it down, I found that in this function, with identical logits, tokens, padded, param_.expert_num, param_.experts_per_token, and param_.norm_topk, and with accum_ cleared, the scales_ output differs on every call.
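A hedged illustration (made-up function and values, not LMDeploy code) of why a tiny diff in router logits matters for MoE: when two router logits are nearly tied, a sub-1e-3 perturbation can flip greedy top-k expert selection, changing both the chosen experts and the normalized scales.

```python
import numpy as np

def topk_experts(router_logits, k=2):
    idx = np.argsort(router_logits)[::-1][:k]       # greedy top-k routing
    w = np.exp(router_logits[idx] - router_logits[idx].max())
    return idx, w / w.sum()                         # normalized expert scales

logits = np.array([1.0, 0.6001, 0.6000, 0.1], dtype=np.float32)
print(topk_experts(logits))                         # picks experts 0 and 1

bumped = logits.copy()
bumped[2] += np.float32(2e-4)                       # tiny numerical perturbation
print(topk_experts(bumped))                         # now picks experts 0 and 2
```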
This was fixed in 0.6.4. On the first server launch, export the tuning results with TM_GEMM_EXPORT=tm_cache; on subsequent launches, import them with TM_GEMM_IMPORT=tm_cache, and the results should no longer change.
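A minimal sketch of that workflow; the `lmdeploy serve api_server` CLI and the model path are assumptions for illustration, while the TM_GEMM_EXPORT/TM_GEMM_IMPORT variable names come from the reply above.

```python
import os
import subprocess

MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # placeholder model path

# First launch: tune GEMM kernels and export the tuning results to ./tm_cache.
subprocess.run(
    ["lmdeploy", "serve", "api_server", MODEL],
    env={**os.environ, "TM_GEMM_EXPORT": "tm_cache"},
)

# Later launches: import the cached results so kernel selection (and hence
# the numerics) stays fixed across restarts.
subprocess.run(
    ["lmdeploy", "serve", "api_server", MODEL],
    env={**os.environ, "TM_GEMM_IMPORT": "tm_cache"},
)
```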
Result of the first launch with TM_GEMM_EXPORT=tm_cache: a
TM_GEMM_IMPORT=tm_cache, first launch
TM_GEMM_IMPORT=tm_cache, second launch
With TP, different ranks may tune to different results, but only rank-0's results are saved, so the imported results on the next launch may differ from those used during tuning.
Thanks everyone, the problem is solved.
Checklist
Describe the bug
Mixtral MoE fp16 greedy decode output differs on each request. With saved inputs, the outputs differ across requests; for Qwen/Qwen2-57B-A14B, however, the outputs are the same for the same input.
Reproduction
Error traceback
No response