[Bug] mixtral moe fp16 greedy decode output differ each request #2890

Closed
3 tasks done
anaivebird opened this issue Dec 13, 2024 · 12 comments

@anaivebird

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

With Mixtral MoE in fp16, greedy decoding produces a different output on each request, even though the input is identical.

run--318430-master-0(main)@~$ python3 debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 1f3bf321e81c92c9a24e72a30eab3912aa9116a859bab1a96b17319edeb1a9ec
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: bde5c39ef296a278c9f1dc2f976fec597a1f68ce13580442836690941f4cf394
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 10a504ad4aa3a2f3f76c71d8187dc498011ff4f2e51943db6ee3d6215d0f1d7e
run--318430-master-0(main)@~$ python3 debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 10a504ad4aa3a2f3f76c71d8187dc498011ff4f2e51943db6ee3d6215d0f1d7e
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 2ed1a4627d2ab0e44841c17e7c5cf666bdaea32086efdc3abd933252031c535e
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: dff3577f2aa499d75788f917ad76f81568d9cf2bfbad6bd3bdc1f2b7bf0fb66f
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 41f175778d0a7fdfdd78126f9a2207ace64667a8da49a087fdb4cfc06dc2aae2
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: b2396445b9553b88c4cafaa992e63a71694eb93bb4ddea1a878421ade6c8c443

The input is the same, but the outputs differ. For Qwen/Qwen2-57B-A14B, the outputs are identical for the same input.

Reproduction

huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1 --local-dir ./Mixtral-8x7B-Instruct-v0.1
lmdeploy serve api_server /home/work/models/mixtral --backend turbomind  --server-port  23333   --model-name mixtral --tp 2

Environment

lmdeploy main latest version
A100 80G * 2

Error traceback

No response

@lvhan028
Collaborator

This happens because the engine uses continuous batching, so the inference shape probably varies from one forward iteration to the next.
If you need exactly the same response every time, set --max-batch-size to 1 when launching the server and enable greedy sampling when sending requests to it.
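
One common reason a changing shape can alter results is that floating-point addition is not associative: a different reduction order can give a slightly different sum, and greedy decoding can then pick a different token when two logits are nearly tied. The sketch below only illustrates that effect with ordinary Python doubles; it does not model fp16 or the actual TurboMind kernels, and `chunked_sum` and `argmax` are hypothetical helpers written for this example.

```python
def chunked_sum(values, chunk):
    """Reduce fixed-size chunks first, then add the partial sums.

    Different chunk sizes stand in for different batch shapes or tilings.
    """
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)


def argmax(xs):
    """Greedy choice: index of the largest value (first one wins on ties)."""
    return max(range(len(xs)), key=xs.__getitem__)


# The same three numbers, added in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# A near-tie between two "logits": the rounding difference decides the winner.
print(argmax([0.6, left]), argmax([0.6, right]))

# The same list reduced with two different chunk sizes.
values = [0.1] * 10
print(chunked_sum(values, 3), chunked_sum(values, 5))
```

With a batch size of 1 the reduction order is more likely to stay fixed between requests, which is presumably why that setting was suggested.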

@anaivebird
Author

anaivebird commented Dec 13, 2024

I have changed my scripts.

  • 1. I have enabled greedy sampling (top_k=1, top_p=1, temperature=1)
  • 2. max-batch-size is 1

server

lmdeploy serve api_server /home/work/models/mixtral --backend turbomind  --server-port  23333   --model-name mixtral --tp 2 --max-batch-size 1

client

import requests
import json
import hashlib

def get_hash(s):
    hash_object = hashlib.sha256(s.encode('utf-8'))
    return hash_object.hexdigest()

def get_content(index):
    with open('reverse_item.json', 'r') as file:
        items = json.load(file)
    content = items[index]

    headers = {
        'Content-Type': 'application/json',
    }

    json_data = {
        'model': 'mixtral',
        'messages': [
            {
                'role': 'user',
                'content': content,
            },
        ],
        'max_tokens': 1024,
        'top_k': 1,
        'top_p': 1,
        'temperature': 1,
        'seed': 42,
    }

    response = requests.post('http://localhost:23333/v1/chat/completions', headers=headers, json=json_data)
    response_data = response.json()

    content_to_output = response_data['choices'][0]['message']['content']
    
    # Print hash values
    print(f"Input Hash: {get_hash(content)}, Output Hash: {get_hash(content_to_output)}")
    
    return {
        'input': content,
        'output': content_to_output
    }

indices = [3]
results = []

for _ in range(5):
    for index in indices:
        result = get_content(index)
        results.append(result)

with open('output.json', 'w') as outfile:
    json.dump(results, outfile, indent=4, ensure_ascii=False)

results:

run--318430-master-0(main)@~$ python3 debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: b2396445b9553b88c4cafaa992e63a71694eb93bb4ddea1a878421ade6c8c443
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
run--318430-master-0(main)@~$ python3 debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
run--318430-master-0(main)@~$ python3 debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: b2396445b9553b88c4cafaa992e63a71694eb93bb4ddea1a878421ade6c8c443
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435

Two distinct outputs appear, cb3a230e8df78807c131f0656f33b30b097fa8b14b25d263d5fd8d3feeffa435 and b2396445b9553b88c4cafaa992e63a71694eb93bb4ddea1a878421ade6c8c443. I'm investigating.

After a server restart, two different outputs appear: 10a504 and 97437af.

[image]
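
When comparing runs like these, it can help to tally the distinct output hashes and find where two outputs first diverge. The following is a minimal sketch of that idea; `summarize` and `first_divergence` are hypothetical helpers, and the sample strings are made-up placeholders, not real model output.

```python
import hashlib
from collections import Counter


def get_hash(s):
    """SHA-256 hex digest of a string, as in the client script."""
    return hashlib.sha256(s.encode('utf-8')).hexdigest()


def summarize(outputs):
    """Return (hash, count) pairs, most frequent first."""
    return Counter(get_hash(o) for o in outputs).most_common()


def first_divergence(a, b):
    """Index of the first differing character, or None if the strings match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One string is a prefix of the other, or they are equal.
    return None if len(a) == len(b) else min(len(a), len(b))


# Placeholder outputs standing in for the 'output' fields of output.json.
outputs = ["the quick brown fox", "the quick brown fox", "the quick red fox"]
for digest, count in summarize(outputs):
    print(digest[:8], count)
print(first_divergence(outputs[0], outputs[2]))
```

To apply it to the real data, one could load `output.json` and pass `[r['output'] for r in results]` to `summarize`.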

@anaivebird
Author

anaivebird commented Dec 13, 2024

Qwen2-57B-A14B-Instruct's output differs after a server restart, but stays the same as long as the server is not restarted.
Qwen1.5-MoE-A2.7B-Chat behaves the same way: its output differs after a server restart, but stays the same otherwise.

The output of non-MoE models stays the same even after a server restart. So does some code in the MoE implementation introduce randomness?

@lvhan028
Collaborator

cc @lzhangzz

@anaivebird
Author

anaivebird commented Dec 16, 2024

This comment can be ignored.

@lzhangzz
Collaborator

I couldn't reproduce it. Could you share the prompt you tested with?

Also, are you using lmdeploy 0.6.4?

@anaivebird
Author

anaivebird commented Dec 19, 2024

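> I couldn't reproduce it. Could you share the prompt you tested with?
>
> Also, are you using lmdeploy 0.6.4?

Thanks for the reply.
First, the code I'm running is at the commit "Support Qwen2-MoE models" (https://github.com/InternLM/lmdeploy/pull/2723): https://github.com/InternLM/lmdeploy/commit/d2d4209d148c09356492a04000a878270896178c

Second, I'll send you the reproducing prompt tomorrow. It is about 1000 tokens long.

Third, my investigation points to the function below. With identical logits, identical tokens, padded, param_.expert_num, param_.experts_per_token, and param_.norm_topk, and with accum_ cleared, the scales_ output still differs from call to call.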

https://github.com/InternLM/lmdeploy/blob/d2d4209d148c09356492a04000a878270896178c/src/turbomind/models/llama/moe_ffn_layer.cc

/// TODO: fix illegal memory access even if NaN are present in logits
invokeMoeGate_V2(f2n_,
                 en2f_,
                 offsets_,
                 scales_,
                 masks_,
                 accum_,
                 logits_,
                 tokens,
                 padded,
                 param_.expert_num,
                 param_.experts_per_token,
                 param_.norm_topk,
                 stream_);
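
For comparison, a deterministic reference for generic top-k MoE gating can be sketched in a few lines. This is an assumption about what a gate of this kind typically computes (softmax over expert logits, top-k selection, optional renormalization), not a port of `invokeMoeGate_V2`; the function name `moe_gate_reference` is hypothetical, and only the parameter names `experts_per_token` and `norm_topk` are borrowed from the call above.

```python
import math


def moe_gate_reference(logits, experts_per_token, norm_topk=True):
    """Generic top-k gating for a single token.

    Returns (expert_ids, scales). Pure function: identical inputs always
    give identical outputs, which is the property the kernel should share.
    """
    # Numerically stable softmax over the expert logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k experts by probability; ties are broken by the lower expert index.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    expert_ids = order[:experts_per_token]
    scales = [probs[i] for i in expert_ids]

    # Optionally renormalize so the selected scales sum to 1.
    if norm_topk:
        s = sum(scales)
        scales = [x / s for x in scales]
    return expert_ids, scales


logits = [0.0, 2.0, 1.0, -1.0]
first = moe_gate_reference(logits, experts_per_token=2)
second = moe_gate_reference(logits, experts_per_token=2)
print(first)
print(first == second)  # True
```

If a kernel fed the same inputs returns scales that vary between calls, the variation has to come from something outside the listed arguments, such as uninitialized or out-of-bounds memory.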

@anaivebird
Author

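[image]
The earlier code at this spot had an out-of-bounds access. After fixing it, the output no longer changes as long as the server is not restarted, but it still changes after a server restart, so this needs further investigation.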

@lzhangzz
Collaborator

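This spot was already fixed in 0.6.4.

When you start the server for the first time, export the tuning results with TM_GEMM_EXPORT=tm_cache. On later launches, import them with TM_GEMM_IMPORT=tm_cache. The results should then stop changing.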
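
The export-then-import workflow can be sketched as a small launcher helper. It reuses the serve command and the two environment variables quoted in this thread. The helper `build_launch` and its arguments are hypothetical, and the sketch only prints the command lines instead of starting a server.

```python
import os
import shlex

# Serve command copied from the reproduction steps in this thread.
SERVE_CMD = [
    "lmdeploy", "serve", "api_server", "/home/work/models/mixtral",
    "--backend", "turbomind", "--server-port", "23333",
    "--model-name", "mixtral", "--tp", "2", "--max-batch-size", "1",
]


def build_launch(mode, cache_path="tm_cache"):
    """Return (cmd, env) for an 'export' (first launch) or 'import' (later) run."""
    if mode not in ("export", "import"):
        raise ValueError("mode must be 'export' or 'import'")
    env = dict(os.environ)
    # Make sure only one of the two variables is set for this launch.
    env.pop("TM_GEMM_EXPORT", None)
    env.pop("TM_GEMM_IMPORT", None)
    env["TM_GEMM_" + mode.upper()] = cache_path
    return list(SERVE_CMD), env


for mode in ("export", "import"):
    cmd, env = build_launch(mode)
    key = "TM_GEMM_" + mode.upper()
    print(f"{key}={env[key]} {shlex.join(cmd)}")
```

To actually start the server, the returned pair could be passed to `subprocess.Popen(cmd, env=env)`.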

@anaivebird
Author

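On the first launch with TM_GEMM_EXPORT=tm_cache, the result is a.
On later launches with TM_GEMM_IMPORT=tm_cache, across several server restarts, the result is always b.
However, a is not equal to b.
That said, at this point it is no longer a big problem. Thanks!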

TM_GEMM_EXPORT=tm_cache

$ python debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: aa5ef3215dc518bdcafd6ece86beac455bfb22c55fc0d1a05245a55a943a0e7d
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: aa5ef3215dc518bdcafd6ece86beac455bfb22c55fc0d1a05245a55a943a0e7d
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: aa5ef3215dc518bdcafd6ece86beac455bfb22c55fc0d1a05245a55a943a0e7d
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: aa5ef3215dc518bdcafd6ece86beac455bfb22c55fc0d1a05245a55a943a0e7d

TM_GEMM_IMPORT=tm_cache, first launch

$ python debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 11f99af7d9995966a6a18fd12cdd6dd9068d78bf2133897149d47bf876f27887
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 11f99af7d9995966a6a18fd12cdd6dd9068d78bf2133897149d47bf876f27887

TM_GEMM_IMPORT=tm_cache, second launch

$ python debug.py
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 11f99af7d9995966a6a18fd12cdd6dd9068d78bf2133897149d47bf876f27887
Input Hash: 8ffa1ad131a5cbe407e9cb0ee9733643a072adb70d7db6cfdc55b5dc30408536, Output Hash: 11f99af7d9995966a6a18fd12cdd6dd9068d78bf2133897149d47bf876f27887

@lzhangzz
Collaborator

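With TP, different ranks may end up with different tuning results, but only rank 0's results are saved. The results after the next import may therefore differ from those of the launch that did the tuning.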
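
A toy model can make this explanation concrete. Everything in the sketch is an assumption for illustration: the kernel names, the noisy-timing `tune` function, and the two launch helpers are hypothetical and do not reflect how TurboMind actually tunes or stores its GEMM cache.

```python
import random

KERNELS = ["kernel_a", "kernel_b"]  # hypothetical candidate kernels


def tune(rng):
    """Pick the kernel with the lowest noisy timing (stand-in for autotuning)."""
    timings = {k: 1.0 + rng.gauss(0, 0.05) for k in KERNELS}
    return min(timings, key=timings.get)


def tuning_launch(num_ranks, seed):
    """Each rank tunes on its own; only rank 0's choice goes into the cache."""
    choices = [tune(random.Random(seed + rank)) for rank in range(num_ranks)]
    cache = choices[0]
    return choices, cache


def import_launch(num_ranks, cache):
    """Every rank uses the imported choice, so later launches agree with each other."""
    return [cache] * num_ranks


choices, cache = tuning_launch(num_ranks=2, seed=0)
print("tuning launch:", choices)
print("import launch:", import_launch(2, cache))
```

Whenever the ranks disagree during tuning, the import launch runs a different mix of kernels than the tuning launch did, while all import launches match one another. That is consistent with the a versus b observation reported above.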

@anaivebird
Author

Thank you all, the problem is solved.
