If you've ever wondered how to make sure an LLM performs well on your specific task, this guide is for you! It covers the different ways you can evaluate a model, guidance on designing your own evaluations, and tips and tricks from practical experience.
Whether you're working with production models, a researcher, or a hobbyist, I hope you'll find what you need; and if not, open an issue (to suggest improvements or missing resources) and I'll update the guide!
- Beginner user: If you don't know anything about evaluation, you should start with the Basics sections in each chapter before diving deeper. You'll also find explanations of important LLM topics in General knowledge: for example, how model inference works and what tokenization is.
- Advanced user: The more practical sections are the Tips and Tricks ones and the Troubleshooting chapter. You'll also find interesting things in the Designing sections.
In the text, links prefixed by ⭐ are ones I really enjoyed and recommend reading.
If you want an intro on the topic, you can read this blog on how and why we do evaluation!
- Basics
- Getting a Judge-LLM
- Designing your evaluation prompt
- Evaluating your evaluator
- What about reward models
- Tips and tricks
The most densely practical part of this guide.
These are mostly beginner guides to LLM basics, but will still contain some tips and cool references!
If you're an advanced user, I suggest skimming to the Going further sections.
You'll also find examples as Jupyter notebooks, to get more hands-on experience with evaluation if that's how you learn!
- Comparing task formulations during evaluation: This notebook walks you through how to define prompt variations for a single task, run the evaluations, and analyse the results.
- contents/automated-benchmarks/Metrics: Description of automatic metrics
- contents/Introduction: Why do we need to do evaluation?
- contents/Thinking about evaluation: What are the high level things you always need to consider when building your task?
- contents/Troubleshooting/Troubleshooting ranking: Why comparing models is hard
Links I like
This guide has been heavily inspired by the ML Engineering Guidebook by Stas Bekman! Thanks for this cool resource!
Many thanks also to all the people who inspired this guide through discussions, either at events or online, notably but not limited to:
- 🤝 Luca Soldaini, Kyle Lo and Ian Magnusson (Allen AI), Max Bartolo (Cohere), Kai Wu (Meta), Swyx and Alessio Fanelli (Latent Space Podcast), Hailey Schoelkopf (EleutherAI), Martin Signoux (OpenAI), Moritz Hardt (Max Planck Institute), Ludwig Schmidt (Anthropic)
- 🔥 community users of the Open LLM Leaderboard and lighteval, who often raised very interesting points in discussions
- 🤗 people at Hugging Face, like Lewis Tunstall, Omar Sanseviero, Arthur Zucker, Hynek Kydlíček, Guilherme Penedo and Thom Wolf
- and of course my team ❤️ doing evaluation and leaderboards, Nathan Habib and Alina Lozovskaya.
@misc{fourrier2024evaluation,
author = {Clémentine Fourrier and The Hugging Face Community},
title = {LLM Evaluation Guidebook},
year = {2024},
journal = {GitHub repository},
url = {https://github.com/huggingface/evaluation-guidebook}
}