Hi there!
First of all, thank you for open sourcing your great work!
I am working on reproducing some of the evaluation in the paper.
Although most of the hyper-parameters can be found in the repo, a few things are still unclear to me:
For the evaluation of pass rates at different generation lengths in Figure 2 of DeepSeek-R1-1.5B, which experimental setting was used: generating responses separately for each generation length with a corresponding maximum generation length limit, or generating responses once with the maximum generation length (32768) and then truncating the responses to different lengths for evaluation?
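To make the second setting concrete, here is roughly what I mean by "truncate-once" evaluation. This is just a sketch of my understanding, not your pipeline: the model id is assumed to be the distilled 1.5B checkpoint, and `check_answer` is a hypothetical verifier.

```python
# Sketch of the "truncate once" setting: generate each response a single time at the
# 32768-token budget, then re-score the same response truncated to shorter budgets.
from transformers import AutoTokenizer

# Assumed checkpoint; substitute whatever model is actually evaluated.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

def truncate_to_budget(response: str, budget: int) -> str:
    """Keep only the first `budget` tokens of an already-generated response."""
    ids = tokenizer(response, add_special_tokens=False).input_ids
    return tokenizer.decode(ids[:budget])

def pass_rate_at(full_responses, answers, budget, check_answer):
    """full_responses: responses generated once with max_tokens=32768.
    check_answer: hypothetical verifier returning True if the (possibly truncated)
    response still ends with a correct final answer."""
    hits = sum(
        check_answer(truncate_to_budget(r, budget), a)
        for r, a in zip(full_responses, answers)
    )
    return hits / len(full_responses)
```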
I noticed that neither your evaluation script nor DeepScaleR’s (now called rllm) evaluation script explicitly specifies a seed. During reproduction, I observed significant differences in evaluation results compared to DeepScaleR. On some benchmarks, my pass@1 over 16 samples is much lower, while on others it is significantly higher. There are also notable differences compared to your results. Aside from hardware-related differences, I noticed that in vLLM v0.7.3, the default seed is 0, whereas in v0.8.x versions, the default is None. This could lead to unexpected results. May I ask which version of vLLM you used for evaluation, or what seed was used?
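For reference, this is how I currently pin the seed on my side so that results do not depend on the vLLM version's default. The model id and the sampling values below are placeholders (the actual generation settings are exactly what I'm asking about), but the `seed` arguments are real vLLM parameters as far as I know.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # assumed checkpoint
    seed=0,  # engine-level seed, matching the old v0.7.3 default
)

sampling_params = SamplingParams(
    n=16,              # 16 samples per problem for averaged pass@1
    temperature=0.6,   # placeholder values, not your reported settings
    top_p=0.95,
    max_tokens=32768,
    seed=0,            # per-request seed, supported in recent vLLM versions
)

outputs = llm.generate(["<prompt here>"], sampling_params)
```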
For Table 1 in the paper, what generation parameter settings did you use?
Thanks again for your awesome work! Looking forward to your reply.