Given the answers obtained in CORE-345 and the ground-truth answers of the reference dataset, we should determine, for each question, how well the two match.
We will adopt an LLM-as-a-judge approach, using three different LLMs and a majority vote to decide whether each answer is satisfactory.
We will follow the process below (see the sketches after the list for a possible implementation):

1. Consider the input question, the ground-truth answer, and the generated answer.
2. Prompt the LLM to judge whether the generated answer correctly answers the question, given the ground-truth answer. Note that CRAG allows answers of type "I don’t know" and "invalid question", which must be accounted for in the evaluation.
3. Repeat with the other LLMs (these calls can be made in parallel).
4. Decide whether the answer is good or not by a majority vote of the different judges.
5. Compute the global metric (fraction of correct answers), and the same metric broken down by:
   - domain
   - question_type
   - answer type ("valid", "invalid", "no_answer")
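A minimal sketch of steps 1–4, assuming a hypothetical `Judge` callable per LLM; how each judge is actually backed (which LLM client, which exact prompt) is not specified in this issue:

```python
from collections import Counter
from typing import Callable, List

# A judge takes (question, ground_truth, generated) and returns "correct"
# or "incorrect". Each judge would wrap a call to one of the three LLMs.
Judge = Callable[[str, str, str], str]

# Hypothetical prompt template; the real wording would be tuned separately.
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Ground-truth answer: {ground_truth}
Generated answer: {generated}

The generated answer may legitimately be "I don't know" or state that the
question is invalid; grade those cases accordingly.
Reply with exactly one word: correct or incorrect."""


def majority_vote(question: str, ground_truth: str, generated: str,
                  judges: List[Judge]) -> bool:
    """Ask each judge LLM for a verdict and accept the generated answer
    if a strict majority of judges says 'correct'."""
    verdicts = [judge(question, ground_truth, generated) for judge in judges]
    counts = Counter(v.strip().lower() for v in verdicts)
    return counts["correct"] > len(judges) / 2


# Usage with three placeholder judges (in practice, each would send
# JUDGE_PROMPT to a different LLM and parse the reply).
judges = [lambda q, gt, gen: "correct",
          lambda q, gt, gen: "correct",
          lambda q, gt, gen: "incorrect"]
print(majority_vote("Who wrote Hamlet?", "William Shakespeare",
                    "Shakespeare", judges))  # True (2 of 3 say correct)
```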
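For step 5, one possible aggregation with pandas, assuming the per-question results are collected with (hypothetical) columns `domain`, `question_type`, `answer_type`, and the boolean `correct` produced by the majority vote:

```python
import pandas as pd

# Toy per-question results; in practice this table comes from running the
# judges over the whole reference dataset.
results = pd.DataFrame({
    "domain": ["movies", "movies", "sports"],
    "question_type": ["simple", "comparison", "simple"],
    "answer_type": ["valid", "no_answer", "invalid"],
    "correct": [True, False, True],
})

# Global metric: fraction of correct answers.
print("overall accuracy:", results["correct"].mean())

# Same metric broken down by domain, question_type, and answer_type.
for column in ["domain", "question_type", "answer_type"]:
    print(results.groupby(column)["correct"].mean())
```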