Hi there!
First of all, thank you for open sourcing your great work!
I am working on reproducing some of the evaluation in the paper.
Although most of the hyper-parameters can be found in the repo, a few things are still unclear to me:
For the evaluation of pass rates at different generation lengths in Figure 2 of DeepSeek-R1-1.5B, which experimental setting was used: generating responses separately for each generation length with a corresponding maximum generation length limit, or generating responses once with the maximum generation length (32768) and then truncating the responses to different lengths for evaluation?
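To make the second setting concrete, here is roughly what I mean by "truncate-once" evaluation. This is just a sketch of my understanding, not your pipeline: the model id is assumed to be the distilled 1.5B checkpoint, and `check_answer` is a hypothetical verifier.

```python
# Sketch of the "truncate once" setting: generate each response a single time at the
# 32768-token budget, then re-score the same response truncated to shorter budgets.
from transformers import AutoTokenizer

# Assumed checkpoint; substitute whatever model is actually evaluated.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

def truncate_to_budget(response: str, budget: int) -> str:
    """Keep only the first `budget` tokens of an already-generated response."""
    ids = tokenizer(response, add_special_tokens=False).input_ids
    return tokenizer.decode(ids[:budget])

def pass_rate_at(full_responses, answers, budget, check_answer):
    """full_responses: responses generated once with max_tokens=32768.
    check_answer: hypothetical verifier returning True if the (possibly truncated)
    response still ends with a correct final answer."""
    hits = sum(
        check_answer(truncate_to_budget(r, budget), a)
        for r, a in zip(full_responses, answers)
    )
    return hits / len(full_responses)
```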
I noticed that neither your evaluation script nor DeepScaleR’s (now called rllm) evaluation script explicitly specifies a seed. During reproduction, I observed significant differences in evaluation results compared to DeepScaleR. On some benchmarks, my pass@1 over 16 samples is much lower, while on others it is significantly higher. There are also notable differences compared to your results. Aside from hardware-related differences, I noticed that in vLLM v0.7.3, the default seed is 0, whereas in v0.8.x versions, the default is None. This could lead to unexpected results. May I ask which version of vLLM you used for evaluation, or what seed was used?
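For reference, this is how I currently pin the seed on my side so that results do not depend on the vLLM version's default. The model id and the sampling values below are placeholders (the actual generation settings are exactly what I'm asking about), but the `seed` arguments are real vLLM parameters as far as I know.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # assumed checkpoint
    seed=0,  # engine-level seed, matching the old v0.7.3 default
)

sampling_params = SamplingParams(
    n=16,              # 16 samples per problem for averaged pass@1
    temperature=0.6,   # placeholder values, not your reported settings
    top_p=0.95,
    max_tokens=32768,
    seed=0,            # per-request seed, supported in recent vLLM versions
)

outputs = llm.generate(["<prompt here>"], sampling_params)
```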
For Table 1 in the paper, what generation parameter settings did you use?
Thanks again for your awesome work! Looking forward to your reply.