Retrieval + generation eval: metrics #3563

Open
jacopo-chevallard opened this issue Jan 28, 2025 — with Linear · 1 comment

jacopo-chevallard commented Jan 28, 2025

Given the answers obtained in CORE-345 and the ground-truth answers of the reference dataset, we should determine, for each question, how well the two match.

We will adopt an LLM-as-a-judge approach: three different LLMs each judge the answer, and a majority vote decides whether it is satisfactory.

We will follow the process below:

  • Consider the input question, ground-truth answer, and generated answer
  • Prompt the LLM to judge whether the generated answer correctly answers the question, given the ground-truth answer. Note that CRAG allows answers of type "I don’t know" and "invalid question", which should be accounted for in the evaluation
  • Repeat for the other LLMs (can do in parallel)
  • Decide whether the answer is good or not by majority vote of the different judges
  • Compute the global metric (fraction of correct answers), and the same metric broken down by
    • domain
    • question_type
    • answer type ("valid", "invalid", "no_answer")
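
The process above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: each judge is assumed to be a callable that takes a prompt and returns `True` when it accepts the answer, and the record keys `question`, `ground_truth`, `generated`, and `answer_type` are hypothetical names (only `domain` and `question_type` appear in the issue). The prompt wording is also just a placeholder.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

# Hypothetical judge interface: prompt in, True if the judge accepts the answer.
Judge = Callable[[str], bool]


def build_prompt(question: str, ground_truth: str, generated: str) -> str:
    """Assemble the judging prompt (placeholder wording)."""
    return (
        "Decide whether the generated answer correctly answers the question, "
        "given the ground-truth answer.\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Generated answer: {generated}\n"
        'If the ground truth is "I don\'t know" or "invalid question", the '
        "generated answer is correct only if it says the same.\n"
        "Reply with CORRECT or INCORRECT."
    )


def majority_vote(votes: Iterable[bool]) -> bool:
    """True when strictly more than half of the judges accept."""
    votes = list(votes)
    return 2 * sum(votes) > len(votes)


def judge_example(example: dict, judges: list[Judge]) -> bool:
    """Query all judges in parallel and combine their votes."""
    prompt = build_prompt(
        example["question"], example["ground_truth"], example["generated"]
    )
    with ThreadPoolExecutor(max_workers=len(judges)) as pool:
        votes = list(pool.map(lambda judge: judge(prompt), judges))
    return majority_vote(votes)


def accuracy_by(results: list[dict], key: str) -> dict:
    """Fraction of correct answers per distinct value of `key`."""
    totals, correct = defaultdict(int), defaultdict(int)
    for row in results:
        totals[row[key]] += 1
        correct[row[key]] += int(row["correct"])
    return {value: correct[value] / totals[value] for value in totals}


def evaluate(examples: list[dict], judges: list[Judge]) -> dict:
    """Global accuracy plus breakdowns by domain, question type, answer type."""
    results = [{**ex, "correct": judge_example(ex, judges)} for ex in examples]
    n_correct = sum(row["correct"] for row in results)
    report = {"global": n_correct / len(results) if results else 0.0}
    for key in ("domain", "question_type", "answer_type"):
        report[key] = accuracy_by(results, key)
    return report
```

With three judges a strict majority means at least two accepting votes, so ties cannot occur. In practice each judge callable would wrap an LLM call and parse its CORRECT/INCORRECT reply into a boolean.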

jacopo-chevallard self-assigned this Jan 28, 2025

linear bot commented Jan 28, 2025

jacopo-chevallard changed the title from "Retrieval + generation: mertrics" to "Retrieval + generation eval: metrics" Jan 28, 2025