New tasks and evaluation code for LLMs #160

Merged · 146 commits into main · Dec 19, 2023

Conversation

OyvindTafjord (Contributor)

This PR includes a number of additions associated with LLM evaluations:

New tasks under catwalk/dependencies/lm_eval: arc_easy:mc, arc_challenge:mc, case_hold:mc, csqa, eurlex, naturalqs_short_open, unfair_tos, scitldr, social_iqa, xsum.

New detailed metrics for ranked classification in RankedClassificationMetrics.
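
The precise metrics implemented in RankedClassificationMetrics are not reproduced in this description. As a rough illustration of what ranked-classification metrics typically compute, the sketch below derives raw and length-normalized accuracy from per-choice log-likelihoods; the function and field names are assumptions for illustration, not catwalk's API.

```python
# Illustrative sketch only; names and fields are assumptions, not catwalk's API.
from typing import Dict, List


def ranked_classification_metrics(predictions: List[Dict]) -> Dict[str, float]:
    """Each prediction holds per-choice summed log-likelihoods, per-choice
    token/character counts, and the index of the gold answer choice."""
    acc_raw = acc_per_token = acc_per_char = 0.0
    for pred in predictions:
        logprobs = pred["choice_logprobs"]       # summed log p(choice | context)
        n_tokens = pred["choice_num_tokens"]     # token count per choice
        n_chars = pred["choice_num_chars"]       # character count per choice
        gold = pred["gold_index"]
        choices = range(len(logprobs))

        # Raw ranking: the choice with the highest total log-likelihood wins.
        acc_raw += float(max(choices, key=lambda i: logprobs[i]) == gold)
        # Length-normalized rankings reduce the bias toward short choices.
        acc_per_token += float(max(choices, key=lambda i: logprobs[i] / n_tokens[i]) == gold)
        acc_per_char += float(max(choices, key=lambda i: logprobs[i] / n_chars[i]) == gold)

    n = len(predictions)
    return {
        "acc_raw": acc_raw / n,
        "acc_per_token": acc_per_token / n,
        "acc_per_char": acc_per_char / n,
    }
```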

New task for perplexity scoring over a set of jsonl files (catwalk/tasks/perplexity_jsonl.py).
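
The interface of the perplexity_jsonl task itself is not shown here. Purely as a standalone illustration of perplexity scoring over a jsonl file with a Hugging Face causal LM, the sketch below averages token-level negative log-likelihood across documents; the file path, the "text" field, and the gpt2 model are placeholder assumptions.

```python
# Standalone illustration; file path, "text" field, and model are placeholder assumptions.
import json
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

total_nll, total_tokens = 0.0, 0
with open("data.jsonl") as f:                     # hypothetical input file
    for line in f:
        text = json.loads(line)["text"]           # hypothetical field name
        ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
        if ids.shape[1] < 2:
            continue
        with torch.no_grad():
            # With labels=input_ids the model returns the mean cross-entropy
            # over the ids.shape[1] - 1 predicted tokens.
            loss = model(ids, labels=ids).loss
        total_nll += loss.item() * (ids.shape[1] - 1)
        total_tokens += ids.shape[1] - 1

print("perplexity:", math.exp(total_nll / total_tokens))
```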

New model type "lm::" for the general types of tasks handled by current decoder-only LLMs (catwalk/models/language_model.py).
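
The code in catwalk/models/language_model.py is not reproduced in this description. The sketch below shows the core operation such a decoder-only model wrapper performs for ranked-classification tasks: scoring each answer choice as a continuation of the context and picking the highest-scoring one. The model, question, and helper name are illustrative assumptions.

```python
# Sketch of scoring answer choices with a decoder-only LM; not catwalk's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


def choice_logprob(context: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, conditioned on the context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    # The log-prob of the token at position t is read from the logits at position t - 1.
    return sum(
        logprobs[0, t - 1, full_ids[0, t]].item()
        for t in range(ctx_len, full_ids.shape[1])
    )


question = "Which gas do plants absorb from the atmosphere?\nAnswer:"
choices = [" carbon dioxide", " oxygen", " nitrogen", " water vapor"]
scores = [choice_logprob(question, c) for c in choices]
print(choices[scores.index(max(scores))])
```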

Script for running LLM evaluations (catwalk/run_lm_eval.py).
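
The command-line interface of run_lm_eval.py is not listed in this description. Purely as a hypothetical outline of the kind of driver loop such a script provides, the sketch below parses a model spec and task names, evaluates each task, and writes a metrics file; all argument names and the evaluate stub are placeholders, not catwalk's actual CLI.

```python
# Hypothetical driver outline; argument names and the evaluate stub are placeholders,
# not catwalk's actual CLI or evaluation pipeline.
import argparse
import json


def evaluate(model_spec: str, task_name: str) -> dict:
    """Stub standing in for the real model/task evaluation; returns dummy metrics."""
    return {"model": model_spec, "task": task_name, "acc_raw": None}


def main() -> None:
    parser = argparse.ArgumentParser(description="Run LLM evaluations (illustrative sketch).")
    parser.add_argument("--model", required=True, help='model spec, e.g. an "lm::" model')
    parser.add_argument("--task", nargs="+", required=True, help="one or more task names")
    parser.add_argument("--output", default="metrics.json", help="where to write metrics")
    args = parser.parse_args()

    results = {t: evaluate(args.model, t) for t in args.task}
    with open(args.output, "w") as out:
        json.dump(results, out, indent=2)
    print(json.dumps(results, indent=2))


if __name__ == "__main__":
    main()
```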

OyvindTafjord and others added 30 commits (April 12, 2023 17:44):
Add eval script with more verbose output and on-the-fly models
Add option to show a few model inputs
Fix missing file for num_model_inputs
AkshitaB merged commit b9cc7df into main on Dec 19, 2023; 17 checks passed.