[Bug] mixtral moe fp16 greedy decode output differ each request #2890

anaivebird · 2024-12-13T03:55:28Z

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

With Mixtral MoE in fp16, greedy decoding produces a different output on each request.

run--318430-master-0(main)@~$ python3 debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 1f3bf321e81c92c9a24e72a30eab3912aa9116a859bab1a96b17319edeb1a9ec
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: bde5c39ef296a278c9f1dc2f976fec597a1f68ce13580442836690941f4cf394
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 10a504ad4aa3a2f3f76c71d8187dc498011ff4f2e51943db6ee3d6215d0f1d7e
run--318430-master-0(main)@~$ python3 debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 10a504ad4aa3a2f3f76c71d8187dc498011ff4f2e51943db6ee3d6215d0f1d7e
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 2ed1a4627d2ab0e44841c17e7c5cf666bdaea32086efdc3abd933252031c535e
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: dff3577f2aa499d75788f917ad76f81568d9cf2bfbad6bd3bdc1f2b7bf0fb66f
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 41f175778d0a7fdfdd78126f9a2207ace64667a8da49a087fdb4cfc06dc2aae2
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: b2396445b9553b88c4cafaa992e63a71694eb93bb4ddea1a878421ade6c8c443

The input is the same but the output differs. For Qwen/Qwen2-57B-A14B, however, the outputs are the same for the same input.

Reproduction

huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1 --local-dir ./Mixtral-8x7B-Instruct-v0.1
lmdeploy serve api_server /home/work/models/mixtral --backend turbomind --server-port 23333 --model-name mixtral --tp 2

Environment

lmdeploy main, latest version
A100 80G * 2

Error traceback

No response


lvhan028 · 2024-12-13T04:00:40Z

This happens because the server uses continuous batching, so the inference shape probably varies from one forward iteration to the next.
If you need exactly the same response every time, --max-batch-size must be set to 1 when launching the server, and greedy sampling should be enabled when sending requests to it.
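
A likely reason that a different inference shape can change the text is that fp16 addition is not associative: a kernel that reduces the same numbers in a different order can produce slightly different logits, and greedy decoding then picks a different token when two candidates are close. The sketch below only illustrates that arithmetic effect with the standard library. It does not model lmdeploy's kernels, and the helper names `to_fp16` and `fp16_sum` are made up for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_sum(values):
    """Accumulate left to right, rounding to fp16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


# The same four terms, summed in two different orders.
in_order = fp16_sum([4096.0, 1.0, 1.0, -4096.0])
reordered = fp16_sum([4096.0, -4096.0, 1.0, 1.0])

print(in_order, reordered)  # prints "0.0 2.0"
```

Near 4096 the spacing between adjacent fp16 values is 4, so adding 1.0 is rounded away in the first ordering but survives in the second.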

anaivebird · 2024-12-13T05:10:57Z

I have changed my scripts.

1. I have enabled greedy sampling (top_k=1, top_p=1, temperature=1).
2. --max-batch-size is set to 1.

server

lmdeploy serve api_server /home/work/models/mixtral --backend turbomind  --server-port  23333   --model-name mixtral --tp 2 --max-batch-size 1

client

import requests
import json
import hashlib

def get_hash(s):
    hash_object = hashlib.sha256(s.encode('utf-8'))
    return hash_object.hexdigest()

def get_content(index):
    with open('reverse_item.json', 'r') as file:
        items = json.load(file)
    content = items[index]

    headers = {
        'Content-Type': 'application/json',
    }

    json_data = {
        'model': 'mixtral',
        'messages': [
            {
                'role': 'user',
                'content': content,
            },
        ],
        'max_tokens': 1024,
        'top_k': 1,
        'top_p': 1,
        'temperature': 1,
        'seed': 42,
    }

    response = requests.post('http://localhost:23333/v1/chat/completions', headers=headers, json=json_data)
    # Fail loudly on HTTP errors instead of hashing an error payload.
    response.raise_for_status()
    response_data = response.json()

    content_to_output = response_data['choices'][0]['message']['content']
    
    # Print hash values
    print(f"Input Hash: {get_hash(content)}, Output Hash: {get_hash(content_to_output)}")
    
    return {
        'input': content,
        'output': content_to_output
    }

indices = [3]
results = []

for _ in range(5):
    for index in indices:
        result = get_content(index)
        results.append(result)

with open('output.json', 'w') as outfile:
    json.dump(results, outfile, indent=4, ensure_ascii=False)

results:

run--318430-master-0(main)@~$ python3 debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: b2396445b9553b88c4cafaa992e63a71694eb93bb4ddea1a878421ade6c8c443
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
run--318430-master-0(main)@~$ python3 debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
run--318430-master-0(main)@~$ python3 debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: b2396445b9553b88c4cafaa992e63a71694eb93bb4ddea1a878421ade6c8c443
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435

Two distinct outputs appear, cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435 and b2396445b9553b88c4cafaa992e63a71694eb93bb4ddea1a878421ade6c8c443. I'm investigating.

If the server is restarted, two different outputs appear instead: 10a504 and 97437af.
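
When comparing many runs, it can be easier to tally the distinct output hashes than to read the log lines by eye. The following is a minimal sketch of that idea; `tally_outputs` and `report` are hypothetical helpers, and the sample strings stand in for real completions.

```python
import hashlib
from collections import Counter


def sha256_hex(text: str) -> str:
    """Hash a completion the same way the client script does."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def tally_outputs(outputs):
    """Count how often each distinct completion (by SHA-256) occurred."""
    return Counter(sha256_hex(o) for o in outputs)


def report(outputs) -> bool:
    """Print one line per distinct output; return True if all runs agree."""
    counts = tally_outputs(outputs)
    for digest, n in counts.most_common():
        print(f"{digest[:12]}  x{n}")
    return len(counts) == 1


# Placeholder completions: four identical runs and one outlier.
sample = ["answer A", "answer A", "answer B", "answer A", "answer A"]
deterministic = report(sample)
```

Passing the `output` fields collected by the client script to `report` would give the same kind of summary for real responses.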

anaivebird · 2024-12-13T06:16:01Z

Qwen2-57B-A14B-Instruct's output differs after a server restart, but stays the same as long as the server is not restarted.
Qwen1.5-MoE-A2.7B-Chat's output differs after a server restart, but stays the same as long as the server is not restarted.

The output of non-MoE models stays the same even after a server restart. Does some part of the MoE implementation introduce randomness?

lvhan028 · 2024-12-13T07:43:23Z

cc @lzhangzz

anaivebird · 2024-12-16T10:30:12Z

This comment can be ignored.

lzhangzz · 2024-12-19T13:08:32Z

I couldn't reproduce it. Could you share the prompt you tested with?

Also, are you using lmdeploy 0.6.4?

anaivebird · 2024-12-19T17:43:29Z

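> I couldn't reproduce it. Could you share the prompt you tested with?
>
> Also, are you using lmdeploy 0.6.4?

Thanks for the reply.
First, the code I'm running is at the commit "Support Qwen2-MoE models" (https://github.com/InternLM/lmdeploy/pull/2723): https://github.com/InternLM/lmdeploy/commit/d2d4209d148c09356492a04000a878270896178c

Second, I'll send you the reproduction prompt tomorrow. It is about 1000 tokens long.

Finally, here is what I have tracked down so far: with identical logits, identical tokens, padded, param_.expert_num, param_.experts_per_token, and param_.norm_topk, and with accum_ cleared, the scales_ output of this function differs from call to call: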

https://github.com/InternLM/lmdeploy/blob/d2d4209d148c09356492a04000a878270896178c/src/turbomind/models/llama/moe_ffn_layer.cc

/// TODO: fix illegal memory access even if NaN are present in logits
invokeMoeGate_V2(f2n_,
                 en2f_,
                 offsets_,
                 scales_,
                 masks_,
                 accum_,
                 logits_,
                 tokens,
                 padded,
                 param_.expert_num,
                 param_.experts_per_token,
                 param_.norm_topk,
                 stream_);
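
For reference, a top-k MoE gate is conceptually a pure function of the logits and the routing parameters, so calling it repeatedly on identical inputs should give identical scales. The sketch below is a plain-Python reference for one token under that assumption. It is not a port of invokeMoeGate_V2: the function name `moe_gate`, the softmax-then-top-k order, and the tie-breaking rule are illustrative choices.

```python
import math


def moe_gate(logits, experts_per_token, norm_topk):
    """Reference top-k gate for one token: returns (expert_ids, scales)."""
    # Numerically stable softmax over the expert logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Highest probability first; ties go to the lower expert index.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    top = order[:experts_per_token]
    scales = [probs[i] for i in top]
    if norm_topk:
        # Renormalize so the selected experts' weights sum to 1.
        kept = sum(scales)
        scales = [s / kept for s in scales]
    return top, scales


logits = [0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.5, 0.7]
first = moe_gate(logits, experts_per_token=2, norm_topk=True)
repeats = [moe_gate(logits, experts_per_token=2, norm_topk=True) for _ in range(100)]
all_equal = all(r == first for r in repeats)
```

If a real kernel fails this kind of repeat-and-compare check on fixed inputs, the difference has to come from something outside the declared inputs, such as uninitialized or out-of-bounds memory.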

anaivebird · 2024-12-20T06:13:58Z

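The earlier code at this spot had an out-of-bounds access. After fixing it, the output no longer changes as long as the server is not restarted, but it still changes after a server restart, so I need to keep investigating.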

lzhangzz · 2024-12-20T15:16:16Z

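That spot was already fixed in 0.6.4.

The first time you launch the server, use TM_GEMM_EXPORT=tm_cache to export the tuning results. On later launches, use TM_GEMM_IMPORT=tm_cache to import them, and the results should stop changing.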
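
The export-once, import-afterwards workflow can be scripted. The sketch below is one way to do it from Python, assuming the two environment variables take a file path as in the comment above and that the cache file is resolved relative to the working directory (an assumption, not something confirmed in this thread). `launch_env` and `launch` are hypothetical helpers; the server command is the one used earlier in this issue.

```python
import os
import subprocess

# Server command taken from the reproduction steps in this issue.
SERVE_CMD = [
    "lmdeploy", "serve", "api_server", "/home/work/models/mixtral",
    "--backend", "turbomind", "--server-port", "23333",
    "--model-name", "mixtral", "--tp", "2", "--max-batch-size", "1",
]


def launch_env(cache_path, first_run, base_env=None):
    """Return an environment that exports on the first run and imports later."""
    env = dict(os.environ if base_env is None else base_env)
    # Make sure only one of the two variables is set.
    env.pop("TM_GEMM_EXPORT", None)
    env.pop("TM_GEMM_IMPORT", None)
    env["TM_GEMM_EXPORT" if first_run else "TM_GEMM_IMPORT"] = cache_path
    return env


def launch(cache_path="tm_cache"):
    """Start the server, exporting tuning results only if no cache exists yet."""
    first_run = not os.path.exists(cache_path)
    return subprocess.run(SERVE_CMD, env=launch_env(cache_path, first_run), check=True)
```

Calling `launch()` would start the server in the foreground; the first call exports tuning results and later calls reuse them.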

anaivebird · 2024-12-21T08:10:05Z

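On the first launch with TM_GEMM_EXPORT=tm_cache, the result is a.
On several later launches with TM_GEMM_IMPORT=tm_cache, the result is always b.
However, a is not equal to b.
That said, at this point it is no longer much of a problem. Thanks!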

TM_GEMM_EXPORT=tm_cache

$ python debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: aa5ef3215dc518bdcafd6ece86beac455bfb22c55fc0d1a05245a55a943a0e7d
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: aa5ef3215dc518bdcafd6ece86beac455bfb22c55fc0d1a05245a55a943a0e7d
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: aa5ef3215dc518bdcafd6ece86beac455bfb22c55fc0d1a05245a55a943a0e7d
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: aa5ef3215dc518bdcafd6ece86beac455bfb22c55fc0d1a05245a55a943a0e7d

TM_GEMM_IMPORT=tm_cache, first run

$ python debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 11f99af7d9995966a6a18fd12cdd6dd9068d78bf2133897149d47bf876f27887
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 11f99af7d9995966a6a18fd12cdd6dd9068d78bf2133897149d47bf876f27887

TM_GEMM_IMPORT=tm_cache, second run

$ python debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 11f99af7d9995966a6a18fd12cdd6dd9068d78bf2133897149d47bf876f27887
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 11f99af7d9995966a6a18fd12cdd6dd9068d78bf2133897149d47bf876f27887

lzhangzz · 2024-12-21T19:26:53Z

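With TP, different ranks may end up with different tuning results, but only rank 0's results are saved. That is why the results after an import may differ from those of the run that did the tuning.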
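
A toy model may make this easier to follow. It assumes, purely for illustration, that each rank's tuning result can be summarized as a single kernel choice and that importing applies rank 0's saved choice to every rank. The kernel names and the function `effective_kernels` are invented for this sketch and do not correspond to anything in TurboMind.

```python
def effective_kernels(tuned_per_rank, imported):
    """Return the kernel each rank ends up using in this toy model."""
    if imported:
        # Only rank 0's tuning result was saved, so every rank reuses it.
        return [tuned_per_rank[0]] * len(tuned_per_rank)
    # During the tuning (export) run, each rank uses its own result.
    return list(tuned_per_rank)


# Hypothetical tuning outcome for --tp 2: the two ranks disagree.
tuned = ["kernel_a", "kernel_b"]

export_run = effective_kernels(tuned, imported=False)
import_run_1 = effective_kernels(tuned, imported=True)
import_run_2 = effective_kernels(tuned, imported=True)

print(export_run)    # ['kernel_a', 'kernel_b']
print(import_run_1)  # ['kernel_a', 'kernel_a']
```

In this model the import runs agree with each other but not with the export run, which matches the pattern reported above: every import run produced the same hash, and that hash differed from the export run's.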

anaivebird · 2024-12-25T01:45:45Z

Thanks, everyone. The problem is solved.

lvhan028 added the awaiting response label Dec 13, 2024

lvhan028 self-assigned this Dec 13, 2024

anaivebird closed this as completed Dec 25, 2024
