Given the answers obtained in CORE-345 and the ground-truth answers of the reference dataset, we should determine, for each question, how well the two match.
We will adopt an LLM-as-a-judge approach, using three different LLMs and a majority vote to decide whether each answer is satisfactory.
We will follow the process below (see the sketches after the list for a possible implementation):

1. Consider the input question, the ground-truth answer, and the generated answer.
2. Prompt the LLM to judge whether the generated answer correctly answers the question, given the ground-truth answer. Note that CRAG allows answers of type "I don’t know" and "invalid question", which must be accounted for in the evaluation.
3. Repeat with the other LLMs (these calls can be made in parallel).
4. Decide whether the answer is good or not by a majority vote of the different judges.
5. Compute the global metric (fraction of correct answers), and the same metric broken down by:
   - domain
   - question_type
   - answer type ("valid", "invalid", "no_answer")
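A minimal sketch of steps 1–4, assuming a hypothetical `Judge` callable per LLM; how each judge is actually backed (which LLM client, which exact prompt) is not specified in this issue:

```python
from collections import Counter
from typing import Callable, List

# A judge takes (question, ground_truth, generated) and returns "correct"
# or "incorrect". Each judge would wrap a call to one of the three LLMs.
Judge = Callable[[str, str, str], str]

# Hypothetical prompt template; the real wording would be tuned separately.
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Ground-truth answer: {ground_truth}
Generated answer: {generated}

The generated answer may legitimately be "I don't know" or state that the
question is invalid; grade those cases accordingly.
Reply with exactly one word: correct or incorrect."""


def majority_vote(question: str, ground_truth: str, generated: str,
                  judges: List[Judge]) -> bool:
    """Ask each judge LLM for a verdict and accept the generated answer
    if a strict majority of judges says 'correct'."""
    verdicts = [judge(question, ground_truth, generated) for judge in judges]
    counts = Counter(v.strip().lower() for v in verdicts)
    return counts["correct"] > len(judges) / 2


# Usage with three placeholder judges (in practice, each would send
# JUDGE_PROMPT to a different LLM and parse the reply).
judges = [lambda q, gt, gen: "correct",
          lambda q, gt, gen: "correct",
          lambda q, gt, gen: "incorrect"]
print(majority_vote("Who wrote Hamlet?", "William Shakespeare",
                    "Shakespeare", judges))  # True (2 of 3 say correct)
```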
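For step 5, one possible aggregation with pandas, assuming the per-question results are collected with (hypothetical) columns `domain`, `question_type`, `answer_type`, and the boolean `correct` produced by the majority vote:

```python
import pandas as pd

# Toy per-question results; in practice this table comes from running the
# judges over the whole reference dataset.
results = pd.DataFrame({
    "domain": ["movies", "movies", "sports"],
    "question_type": ["simple", "comparison", "simple"],
    "answer_type": ["valid", "no_answer", "invalid"],
    "correct": [True, False, True],
})

# Global metric: fraction of correct answers.
print("overall accuracy:", results["correct"].mean())

# Same metric broken down by domain, question_type, and answer_type.
for column in ["domain", "question_type", "answer_type"]:
    print(results.groupby(column)["correct"].mean())
```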