
Potential problem with the evaluation #5

Open
shan23chen opened this issue Jan 31, 2025 · 1 comment

Comments

@shan23chen

Hey, we just went through your MedCalc-Bench paper; this is really cool work.

Using the same Llama 3 8B Instruct model in the open-ended, zero-shot setting, we actually achieve 79.66% accuracy instead, which suggests there might be a problem with the current check_correctness function in evaluation/evaluate.py.

Could you take a look at this?

Attached are all of the outputs and the validation of the results.

Best,
Shan Chen

og_3.csv

@nikhilk7153
Collaborator

nikhilk7153 commented Jan 31, 2025

Hi,

Thanks for your comment. I just wanted to ask whether you used the exact prompts and settings that we provided. Looking at your outputs, it seems the model is not producing JSON, which suggests you are running inference in a different setting than the one we provided. Have you also used a different check_correctness implementation?
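For context on why the output format matters: if the evaluation first parses the model's JSON response and then scores the extracted answer, a plain-text completion can be marked wrong (or require a different extractor) even when the numeric value is correct, which alone could explain a large accuracy gap. The sketch below illustrates that failure mode only; the "answer" field name and the tolerance-based comparison are assumptions for illustration, not the actual logic of check_correctness in evaluation/evaluate.py.

```python
import json
from typing import Optional


def extract_answer(model_output: str) -> Optional[str]:
    """Try to pull a hypothetical 'answer' field from a JSON-formatted response.

    Returns None if the output is not valid JSON or lacks the field, which is
    how a correct-but-unparsed answer could end up scored as incorrect.
    """
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    return parsed.get("answer")


def is_correct(predicted: Optional[str], ground_truth: float, rel_tol: float = 0.05) -> bool:
    """Hypothetical numeric check: prediction within a relative tolerance of the truth."""
    if predicted is None:
        return False
    try:
        value = float(str(predicted).strip())
    except ValueError:
        return False
    return abs(value - ground_truth) <= rel_tol * abs(ground_truth)


# A JSON completion is scored normally; a free-text completion fails extraction
# even though it contains the right number.
print(is_correct(extract_answer('{"answer": "21.3"}'), 21.3))  # True
print(is_correct(extract_answer('The answer is 21.3'), 21.3))  # False (not JSON)
```

This is why running inference with the repository's provided prompts (which instruct the model to respond in JSON) and its check_correctness implementation matters when comparing accuracy numbers.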
