Inference output format does not match reward function #489

ryan-minato · 2025-03-07T09:51:18Z

I noticed that when DeepSeek R1 is actually served for inference, it does not seem to use the <think>...</think><answer>...</answer> format. Instead, only the reasoning process is enclosed in <think> tags, and the final output is placed directly after </think>. The same format also appears in the reasoning trajectories in open-r1/OpenR1-Math-220k.

Should the format reward function be modified to require only the <think>...</think> block, rather than also expecting <answer> tags?

https://github.com/huggingface/open-r1/blob/6660a477eca71bf8d94c59cd2e458cf0ff6e1f80/src/open_r1/rewards.py#L67C1-L72C56
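
To make the question concrete, here is a minimal sketch of what a relaxed check could look like. It is not the code in rewards.py: the function name think_only_format_reward and the pattern are hypothetical, and it assumes completions arrive as a list of message lists whose first message has a "content" string.

```python
import re

# Hypothetical pattern (an assumption, not the open-r1 one): a <think>...</think>
# block at the very start, followed by at least one character of final output.
THINK_ONLY_PATTERN = re.compile(
    r"^<think>(?P<reasoning>.*?)</think>(?P<answer>.+)$", re.DOTALL
)


def think_only_format_reward(completions, **kwargs):
    """Return 1.0 for completions shaped like <think>...</think> + answer, else 0.0.

    Assumes each completion is a list of messages and that the first
    message is a dict with a "content" string.
    """
    rewards = []
    for completion in completions:
        content = completion[0]["content"].strip()
        match = THINK_ONLY_PATTERN.match(content)
        if match is None:
            rewards.append(0.0)
            continue
        answer = match.group("answer")
        # Require a non-empty answer that does not open or close another think block.
        ok = (
            answer.strip() != ""
            and "<think>" not in answer
            and "</think>" not in answer
        )
        rewards.append(1.0 if ok else 0.0)
    return rewards


if __name__ == "__main__":
    samples = [
        [{"content": "<think>\n2 + 2 = 4\n</think>\nThe answer is 4."}],
        [{"content": "The answer is 4."}],
    ]
    print(think_only_format_reward(samples))
```

Under these assumptions, a completion that still wraps its output in <answer> tags after </think> would also pass, so existing trajectories in the old format would not be penalized.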
