I noticed that when DeepSeek R1 is actually served, it does not seem to use the `<think>...</think><answer>...</answer>` format. Instead, only the reasoning process is enclosed in the `<think>` tag, and the final output is placed directly after `</think>`. The same discrepancy appears in the reasoning trajectories in open-r1/OpenR1-Math-220k.
Should the format reward function be modified to match only the content within the `<think>` tag, rather than also expecting an `<answer>` tag?
https://github.com/huggingface/open-r1/blob/6660a477eca71bf8d94c59cd2e458cf0ff6e1f80/src/open_r1/rewards.py#L67C1-L72C56
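
To make the question concrete, here is a minimal sketch of what a think-only format check could look like. It is not the current code in `rewards.py`: the function name `think_only_format_reward` is hypothetical, and for simplicity it assumes each completion is passed as a plain string, which may not match the input shape the trainer actually uses.

```python
import re

# Hypothetical pattern: the text must open with <think>, and everything after
# the first </think> is treated as the final answer.
_THINK_THEN_ANSWER = re.compile(r"^<think>(.*?)</think>(.*)$", re.DOTALL)


def think_only_format_reward(texts):
    """Return 1.0 for each text shaped like <think>...</think>ANSWER, else 0.0.

    Assumption: `texts` is a list of plain strings, one per completion.
    """
    rewards = []
    for text in texts:
        match = _THINK_THEN_ANSWER.match(text)
        if match is None:
            rewards.append(0.0)
            continue
        reasoning, answer = match.group(1), match.group(2)
        well_formed = (
            "<think>" not in reasoning      # no nested opening tag
            and "<think>" not in answer     # no second reasoning block
            and "</think>" not in answer    # no stray closing tag
            and answer.strip() != ""        # the answer must not be empty
        )
        rewards.append(1.0 if well_formed else 0.0)
    return rewards
```

Note that this sketch still accepts completions that wrap the answer in `<answer>` tags, since it only constrains the `<think>` block; whether that leniency is desirable is part of the question.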