Inconsistent results between HuggingFace Transformers and vllm #1069
Comments
I got a similar issue: I can't reproduce the results from HF inference with all default parameters except top_k=30, top_p=0.75, max_tokens=1024.
I got the same problem with baichuan-13B.
Nope, I cannot reproduce the results compared to HF when running greedy decoding.
Same problem with yi-34b-chat (three quantized models: the official yi-34b-chat-4bits, plus the AWQ and GPTQ versions from TheBloke).
Still not resolved from what I'm seeing.
I observed a discrepancy between Hugging Face and vLLM. I'm currently using version 0.3.0 due to NCCL issues (which I'm working on resolving). In my tests with the Mistral and Mixtral-8x7B models, I found discrepancies when using the bfloat16 data type. While both the vLLM and Hugging Face results seem reasonable, shouldn't we be getting identical outputs with the same settings (no sampling, top_k=1, etc.)? Interestingly, switching the data type to float16 produces identical results in both cases.
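For reference, a minimal sketch of the kind of dtype comparison described above, assuming one of the models mentioned (the checkpoint name and prompt are illustrative; the calls are the standard transformers / vLLM generation APIs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "mistralai/Mistral-7B-v0.1"  # illustrative checkpoint
PROMPT = "The first law of thermodynamics states that"

# Hugging Face side: greedy decoding (do_sample=False) in bfloat16.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
hf_model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="cuda"
)
inputs = tokenizer(PROMPT, return_tensors="pt").to("cuda")
hf_output = hf_model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(hf_output[0], skip_special_tokens=True))

# vLLM side with the same dtype; per the comment above, switching both sides
# to float16 (dtype="float16" / torch.float16) reportedly makes the outputs match.
llm = LLM(model=MODEL, dtype="bfloat16")
params = SamplingParams(temperature=0.0, max_tokens=128)
print(llm.generate([PROMPT], params)[0].outputs[0].text)
```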
This issue has been persisting for half a year without being resolved or even having its cause identified, which is quite frustrating.
How should I set the API parameters to achieve the same effect as the vllm_model.generate_greedy method used in test_models.py?
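In case it helps, a sketch of what this might look like through the OpenAI-compatible server, assuming generate_greedy amounts to temperature 0 (the server URL, model name, and prompt below are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # whatever model the server is running
    prompt="San Francisco is a",
    temperature=0.0,                   # greedy: always take the most likely token
    max_tokens=128,
)
print(response.choices[0].text)
```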
Same problem when I use Qwen1.5; the outputs are very different between HuggingFace Transformers and vLLM.
How is this supposed to help? vLLM provides invaluably better code than HF, but I noticed that the models' outputs are of lower quality most of the time, to the point that they become unusable. Are we doing something wrong? If not, is there any plan to look into this?
Same issue here. I have compared vLLM 0.6.1.post2 against the transformers pipeline greedy-decoding result (bf16), and the outputs differ. Does anyone have experience producing exactly matching results with greedy decoding, or is it just unachievable?
I have finally figured out my issue after a couple of days, so I am commenting here in case this helps anyone in the future. The issue for me was that if your model has a generation_config.json, Hugging Face will override its default sampling arguments with it, while vLLM ignores this config. This affected my greedy sampling too, because of a repetition penalty I had in the config. Once I copied the parameters from the model's generation_config.json, the outputs were much closer (but not exact, likely just due to different implementations of things).
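A minimal sketch of that workaround, assuming a locally downloaded model directory and the usual generation_config.json fields (temperature, top_p, top_k, repetition_penalty); the exact keys depend on your model:

```python
import json
from vllm import LLM, SamplingParams

MODEL_DIR = "/path/to/your/model"  # local directory containing generation_config.json

# Read the sampling defaults that Hugging Face would apply automatically;
# as noted above, vLLM does not pick these up on its own.
with open(f"{MODEL_DIR}/generation_config.json") as f:
    gen_cfg = json.load(f)

params = SamplingParams(
    temperature=gen_cfg.get("temperature", 1.0),
    top_p=gen_cfg.get("top_p", 1.0),
    top_k=gen_cfg.get("top_k", -1),  # -1 disables top-k in vLLM
    repetition_penalty=gen_cfg.get("repetition_penalty", 1.0),
    max_tokens=256,
)

llm = LLM(model=MODEL_DIR)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```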
@hidude562 that's very helpful! I added it in #10805, thanks!
I'm getting inconsistent results between HF and vLLM with llama2-7b when running greedy decoding:
HF version:
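A minimal sketch of an HF greedy-decoding run for llama2-7b (the checkpoint name, prompt, and token budget below are assumptions for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

prompt = "The capital of France is"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy
print(tokenizer.decode(output[0], skip_special_tokens=True))
```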
which yields:
vllm version:
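And the corresponding vLLM sketch, assuming temperature=0 for greedy decoding with the same assumed checkpoint and prompt:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", dtype="float16")
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy

outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)
```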
which yields: