I ran the commands shown in the README to reproduce the results of SimpleRL-Reason.
My training dynamics are as follows (I slightly modified the config to remove the format reward, since I observed that it stayed at zero throughout training):
Accuracy does increase substantially, which suggests the training process may be correct. However, the completion length keeps decreasing, whereas SimpleRL-Reason reports that it should start increasing after around 20 training steps.
Can anyone explain this?
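For anyone comparing setups, here is a minimal sketch of what dropping the format reward can look like, assuming a TRL-style GRPO run where reward functions are passed as a list of callables. The model name, dataset, and the `accuracy_reward` body below are placeholders for illustration, not this repo's actual code:

```python
# Hypothetical sketch: dropping the format reward from a TRL-style GRPO run.
# `accuracy_reward` is a stand-in for the repo's real reward function, and the
# model/dataset names are placeholders.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def accuracy_reward(completions, **kwargs):
    # Placeholder: a real implementation would extract each completion's final
    # answer, compare it with the ground truth, and return 1.0 / 0.0 per sample.
    return [0.0 for _ in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # stand-in dataset

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # stand-in model
    reward_funcs=[accuracy_reward],        # format reward removed: accuracy only
    args=GRPOConfig(output_dir="grpo-no-format"),
    train_dataset=dataset,
)
trainer.train()
```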
Same here; the format reward stays at 0 the whole time. Note, though, that the format reward in this repo differs from the one in SimpleRL-Reason. Could that difference be the cause?
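For context, here is a minimal sketch of the kind of tag-matching format reward used in these repos. The exact regex below is an assumption (the real patterns differ between this repo and SimpleRL-Reason, which is precisely the difference being asked about), but it shows why the reward can sit at zero:

```python
import re

# Sketch of a tag-based format reward, assuming the expected output format is
# "<think>...</think><answer>...</answer>". The exact pattern (e.g. required
# newlines) varies between repos, which can change whether it ever matches.
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(completions, **kwargs):
    """Return 1.0 for completions matching the expected format, else 0.0."""
    contents = [c[0]["content"] if isinstance(c, list) else c for c in completions]
    return [1.0 if FORMAT_PATTERN.match(text) else 0.0 for text in contents]

# A base model that never emits the expected tags scores 0.0 on every sample,
# so the logged format reward stays flat at zero.
print(format_reward(["The answer is 42."]))                            # [0.0]
print(format_reward(["<think>reasoning</think><answer>42</answer>"]))  # [1.0]
```

If the system prompt never instructs the model to produce these tags, or the regex demands formatting (such as exact newlines) that the model does not emit, every completion fails the match and the reward contributes nothing to training.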