In our current GPU-benchmarking scripts, we always use the prompt "Hi Hi Hi ..." to test model performance. The deepseek-coder-7b model always returns a response long enough to reach the maximum length, so we can simply generate an input of the desired length by changing the number of "Hi" tokens in the request and use the max-length parameter in the query to get the desired output length.
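For reference, the current strategy boils down to something like the following sketch. The endpoint URL, model name, and payload field names (with `max_tokens` standing in for the max-length parameter) are placeholders for illustration, not the actual benchmarking script:

```python
import requests

# Sketch of the current strategy: repeat "Hi" to control input length and cap
# the output with a max-length-style parameter. URL, model name, and field
# names below are placeholders and may differ from the real script.
def hi_prompt_request(num_hi: int, max_output_tokens: int,
                      url: str = "http://localhost:8000/v1/completions"):
    prompt = " ".join(["Hi"] * num_hi)       # desired input length
    payload = {
        "model": "deepseek-coder-7b",
        "prompt": prompt,
        "max_tokens": max_output_tokens,     # desired output length
    }
    resp = requests.post(url, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()
```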
However, we found that this benchmarking method does not work with the 33b model, which returns a very short response to such a prompt, so our current benchmarking strategy no longer works.
We need to improve our existing benchmarking script to make it general enough to work with any model. The current idea is:
1. Create a dataset, send all of its prompts to the model, and record the length of each response.
2. Write a program that filters prompts with different input-output patterns from the dataset and use the filtered prompts for benchmarking tests (see the sketch after this list).
3. Automate the above process and run it before our current benchmarking script.
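A minimal sketch of steps 1-2, assuming an OpenAI-style completions endpoint; the URL, model name, response fields, whitespace-based length counting, and bucket boundaries are all illustrative assumptions rather than the real implementation:

```python
import json
import requests

# Hypothetical pre-benchmarking step: send every prompt in a dataset to the
# model, record input/output lengths, then keep prompts that fall into
# distinct input-output buckets. Endpoint, model name, and response fields
# are assumptions, not the actual script.
URL = "http://localhost:8000/v1/completions"   # placeholder endpoint
MODEL = "deepseek-coder-33b"

def collect_lengths(prompts):
    records = []
    for prompt in prompts:
        resp = requests.post(URL, json={
            "model": MODEL,
            "prompt": prompt,
            "max_tokens": 2048,                 # generous cap; assumption
        }, timeout=600)
        resp.raise_for_status()
        text = resp.json()["choices"][0]["text"]
        records.append({
            "prompt": prompt,
            "input_len": len(prompt.split()),   # rough token-count proxy
            "output_len": len(text.split()),
        })
    return records

def filter_patterns(records, buckets=((0, 128), (128, 512), (512, 2048))):
    # Keep one representative prompt per (input bucket, output bucket) pair
    # so the benchmark set covers different input-output patterns.
    chosen = {}
    for r in records:
        in_b = next((b for b in buckets if b[0] <= r["input_len"] < b[1]), None)
        out_b = next((b for b in buckets if b[0] <= r["output_len"] < b[1]), None)
        if in_b and out_b and (in_b, out_b) not in chosen:
            chosen[(in_b, out_b)] = r
    return list(chosen.values())

if __name__ == "__main__":
    prompts = [line.strip() for line in open("prompts.txt") if line.strip()]
    selected = filter_patterns(collect_lengths(prompts))
    with open("benchmark_prompts.json", "w") as f:
        json.dump(selected, f, indent=2)
```

Keeping one representative prompt per bucket pair keeps the benchmark set small while still covering distinct input-output patterns; step 3 would simply wrap a script like this so it runs before the existing benchmark.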
Steps to Reproduce
Expected behavior
We expect to use real prompts with different input-output patterns for benchmarking tests.
Environment
- LLM used: deepseek-coder-33b