Correct me if I'm wrong, but most of these problems should not have negative solutions, yet I see over a hundred negative target values. The gsm8k file only has 2 negative examples.
Thanks
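For anyone who wants to reproduce the count, here is a minimal sketch. It assumes the benchmark file is JSON Lines with one record per line and a numeric `target` field; the helper name `count_negative_targets` and the sample records are made up for illustration and are not taken from the repository.

```python
import json


def count_negative_targets(lines, key="target"):
    """Return (negative, total) over an iterable of JSON Lines strings.

    Assumption: every record stores its answer under `key` as a number
    (or a numeric string).
    """
    negative = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        if float(record[key]) < 0:
            negative += 1
    return negative, total


# Hypothetical sample records, only to show the expected shape.
sample = [
    '{"input": "q1", "target": 12.0}',
    '{"input": "q2", "target": -3.5}',
    '{"input": "q3", "target": 0}',
]
print(count_negative_targets(sample))  # (1, 3)
```

Since an open file object iterates line by line, the same function can be pointed at a local copy of the benchmark with `count_negative_targets(open(path))`, where `path` is wherever the file was saved.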
Hi @Madd0g,
Thank you for your interest in our work!
Your observation is correct. Since the GSM-Hard benchmark was created automatically, it may contain negative target values or "unnatural" positive values.
Unfortunately, we do not have the resources to manually annotate all examples, so we assume a penalty of roughly 5%-10% in performance for every model and prompting approach evaluated on this benchmark. Since this penalty is similar across all approaches, we believe the relative comparison between approaches is the right thing to measure.