Details about evaluation. #16

HBin013 opened this issue Apr 21, 2025 · 0 comments

Hi there!
First of all, thank you for open-sourcing your great work!
I am trying to reproduce some of the evaluations in the paper.
Most of the hyperparameters can be found in the repo, but a few things are still unclear to me:

  1. For the pass rates at different generation lengths in Figure 2 (DeepSeek-R1-1.5B), which experimental setting was used? Did you generate responses separately for each generation length, each with the corresponding maximum generation length limit, or did you generate responses once with the maximum generation length (32768) and then truncate them to the different lengths for evaluation?

  2. I noticed that neither your evaluation script nor DeepScaleR’s (now called rllm) explicitly sets a seed. During reproduction, my results differed significantly from DeepScaleR’s: on some benchmarks my pass@1 over 16 samples is much lower, while on others it is significantly higher. They also differ notably from your results. Aside from hardware-related differences, I noticed that in vLLM v0.7.3 the default seed is 0, whereas in v0.8.x the default is None, which could lead to unexpected results. Which version of vLLM did you use for evaluation, or what seed was used?

  3. For Table 1 in the paper, which generation parameter settings did you use?
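
To make the second option in question 1 concrete, here is a minimal sketch of what I mean by "generate once, then truncate", with pass@1 taken as the mean correctness over the samples of one problem. The names `truncate`, `pass_at_1`, and `toy_is_correct` are hypothetical and are not taken from this repo or from rllm. The toy responses are made up, and real grading would work on decoded model output rather than on these placeholder tokens.

```python
from typing import Callable, Sequence


def truncate(tokens: Sequence[str], budget: int) -> list[str]:
    """Keep only the first `budget` tokens of one sampled response."""
    return list(tokens[:budget])


def pass_at_1(
    samples: Sequence[Sequence[str]],
    budget: int,
    is_correct: Callable[[Sequence[str]], bool],
) -> float:
    """Mean correctness over all samples of one problem after truncation."""
    graded = [is_correct(truncate(sample, budget)) for sample in samples]
    return sum(graded) / len(graded)


def toy_is_correct(tokens: Sequence[str]) -> bool:
    """Hypothetical grader: correct if the reference answer appears."""
    return "answer=4" in tokens


# Four made-up responses to one problem, each generated once at full length.
samples = [
    ["think", "think", "answer=4"],
    ["think", "answer=4"],
    ["think", "think", "think", "answer=5"],
    ["answer=4"],
]

# Re-grade the same responses under several length budgets.
for budget in (1, 2, 3, 4):
    print(budget, pass_at_1(samples, budget, toy_is_correct))
```

The other option would instead run generation once per budget with that budget as the maximum generation length, so the graded responses would come from separate sampling runs.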

Thanks again for your awesome work! Looking forward to your reply.
