This repo curates recent research on LLM judges for automated evaluation; a minimal sketch of the core judging pattern appears after the paper list below.
> [!TIP]
> ⚖️ Check out Verdict, our in-house library for hassle-free implementations of the papers below!
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment
- Benchmarking Foundation Models with Language-Model-as-an-Examiner
- ScaleEval: Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
- Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate
- Debating with More Persuasive LLMs Leads to More Truthful Answers
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges
- HALU-J: Critique-Based Hallucination Judge
- MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents
- Lynx: An Open Source Hallucination Evaluation Model
- A STRONGREJECT for Empty Jailbreaks (Sections C.4 & C.5)
- OR-Bench: An Over-Refusal Benchmark for Large Language Models (Sections A.3 & A.11)
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
- On Scalable Oversight with Weak LLMs Judging Strong LLMs
- Debate Helps Supervise Unreliable Experts
- Great Models Think Alike and this Undermines AI Oversight
- LLM Critics Help Catch LLM Bugs
- JudgeBench: A Benchmark for Evaluating LLM-based Judges
- RewardBench: Evaluating Reward Models for Language Modeling
- Evaluating Large Language Models at Evaluating Instruction Following
- Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
- From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
- Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
- The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
- ReIFE: Re-evaluating Instruction-Following Evaluation
- Large Language Models are not Fair Evaluators
- Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
- Large Language Models are Inconsistent and Biased Evaluators
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
🚧 Coming Soon -- Stay tuned!
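To make the recipe shared by many of the papers above concrete, here is a minimal single-answer-grading sketch in the spirit of MT-Bench / G-Eval style judging. This is not the Verdict API or any paper's exact prompt: the OpenAI-compatible client, the `gpt-4o` model name, and the rubric text are illustrative assumptions.

```python
# Minimal LLM-as-a-judge sketch (single-answer grading).
# Assumptions: the `openai` Python package is installed, OPENAI_API_KEY is set,
# and "gpt-4o" is available; the rubric below is paraphrased for illustration.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Evaluate the quality of the
assistant's response to the user question below. Consider helpfulness,
relevance, accuracy, and level of detail. Briefly explain your reasoning,
then give a verdict on the last line in the format: Rating: [[X]]
where X is an integer from 1 to 10.

[Question]
{question}

[Assistant's Response]
{answer}"""


def judge(question: str, answer: str, model: str = "gpt-4o") -> tuple[int | None, str]:
    """Ask the judge model to grade one answer; return (rating, raw critique)."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0.0,  # keep grading as deterministic as the API allows
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    critique = completion.choices[0].message.content or ""
    match = re.search(r"Rating:\s*\[\[(\d+)\]\]", critique)
    return (int(match.group(1)) if match else None, critique)


if __name__ == "__main__":
    score, critique = judge(
        "What causes tides?",
        "Tides are caused mainly by the Moon's gravity pulling on Earth's oceans.",
    )
    print(score)
```

The structured `Rating: [[X]]` verdict makes scores easy to parse at scale; several papers in the list (e.g., on fairness, option-order sensitivity, and inconsistency) study biases such as position and verbosity effects that a production judge should additionally control for.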
Have a paper to add? Found a mistake? 🧐
- Open a pull request or submit an issue! Contributions are welcome. 🙌
- Questions? Reach out to [email protected].