
Potential problem with the evaluation #5

Open
shan23chen opened this issue Jan 31, 2025 · 1 comment

Comments

@shan23chen

Hey, we just went through your MedCalc-Bench paper; this is really cool work.

Using the same Llama 3 8B Instruct model in the open-ended, zero-shot setting, we actually achieve 79.66% accuracy instead, which suggests there might be a problem with the current check_correctness function in evaluation/evaluate.py.

Could you take a look at this?

Attached are all of the outputs and the validation of the results.

Best,
Shan Chen

og_3.csv

@nikhilk7153
Collaborator

nikhilk7153 commented Jan 31, 2025

Hi,

Thanks for your comment. I just wanted to ask whether you used the exact prompts and settings that we provided. Looking at your outputs, it seems the model is not producing JSON, which suggests you are running inference in a different setting than the one we provided. Have you also used a different check_correctness implementation?
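For context on why the output format matters: if the evaluation first parses the model's JSON response and then scores the extracted answer, a plain-text completion can be marked wrong (or require a different extractor) even when the numeric value is correct, which alone could explain a large accuracy gap. The sketch below illustrates that failure mode only; the "answer" field name and the tolerance-based comparison are assumptions for illustration, not the actual logic of check_correctness in evaluation/evaluate.py.

```python
import json
from typing import Optional


def extract_answer(model_output: str) -> Optional[str]:
    """Try to pull a hypothetical 'answer' field from a JSON-formatted response.

    Returns None if the output is not valid JSON or lacks the field, which is
    how a correct-but-unparsed answer could end up scored as incorrect.
    """
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    return parsed.get("answer")


def is_correct(predicted: Optional[str], ground_truth: float, rel_tol: float = 0.05) -> bool:
    """Hypothetical numeric check: prediction within a relative tolerance of the truth."""
    if predicted is None:
        return False
    try:
        value = float(str(predicted).strip())
    except ValueError:
        return False
    return abs(value - ground_truth) <= rel_tol * abs(ground_truth)


# A JSON completion is scored normally; a free-text completion fails extraction
# even though it contains the right number.
print(is_correct(extract_answer('{"answer": "21.3"}'), 21.3))  # True
print(is_correct(extract_answer('The answer is 21.3'), 21.3))  # False (not JSON)
```

This is why running inference with the repository's provided prompts (which instruct the model to respond in JSON) and its check_correctness implementation matters when comparing accuracy numbers.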
