Hey, we just went through your MedCalc-Bench paper; this is really cool work.

Using the same Llama 3 8B Instruct model (open-ended, zero-shot), we actually get 79.66% accuracy instead, which suggests there might be a problem with the current `check_correctness` function in `evaluation/evaluate.py`.

Could you take a look at this by chance? Attached are all the outputs and the validation of the results.

Best,
Shan Chen

og_3.csv
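Discrepancies like this usually come down to how the final answer is compared against the ground truth. As a point of reference, here is a minimal sketch of a tolerance-based numeric check; the function name mirrors the one in `evaluation/evaluate.py`, but the body, the `rel_tol` parameter, and the string fallback are assumptions for illustration, not the repository's actual logic:

```python
import math

def check_correctness(predicted: str, ground_truth: str, rel_tol: float = 0.05) -> bool:
    """Hypothetical checker: a prediction counts as correct if it falls
    within a relative tolerance of the ground truth. Small changes to the
    tolerance or rounding rules can shift aggregate accuracy noticeably."""
    try:
        pred, truth = float(predicted), float(ground_truth)
    except ValueError:
        # Non-numeric answers (e.g., dates) fall back to exact string match.
        return predicted.strip().lower() == ground_truth.strip().lower()
    if truth == 0:
        return math.isclose(pred, 0.0, abs_tol=1e-9)
    return abs(pred - truth) / abs(truth) <= rel_tol
```

Two evaluators that differ only in this tolerance can report visibly different accuracy on the same outputs, which is why comparing the two `check_correctness` implementations directly is the first thing to check.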
Thanks for your comment. Did you use the exact prompts and settings that we provided? Looking at your outputs, the model is not returning JSON, so it seems you are running inference in a different setting than the one we provided. Have you also used a different `check_correctness` implementation?
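This matters because the evaluation pipeline first has to extract the answer from the model's response before scoring it. A minimal sketch of that extraction step, assuming the prompt asks for a JSON object with an "answer" field (the key name and the brace-matching heuristic are assumptions, not the repository's exact schema):

```python
import json
import re
from typing import Optional

def extract_answer(response: str) -> Optional[str]:
    """Hypothetical extraction: pull the final answer out of the JSON object
    the prompt asks the model to emit. Returns None if no JSON is found."""
    # Grab the outermost {...} span in case the model wraps the JSON in prose.
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    answer = parsed.get("answer")
    return None if answer is None else str(answer)
```

If the model never emits JSON, a step like this fails on every row, and any scorer that then falls back to looser matching will produce accuracy numbers that are not comparable to the paper's.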