New tasks and evaluation code for LLMs #160

Merged · 146 commits into main · Dec 19, 2023

Conversation

OyvindTafjord (Contributor)

This PR includes a number of additions associated with LLM evaluations:

New tasks under catwalk/dependencies/lm_eval: arc_easy:mc, arc_challenge:mc, case_hold:mc, csqa, eurlex, naturalqs_short_open, unfair_tos, scitldr, social_iqa, xsum.

New detailed metrics for ranked classification in RankedClassificationMetrics.
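
The precise metrics implemented in RankedClassificationMetrics are not reproduced in this description. As a rough illustration of what ranked-classification metrics typically compute, the sketch below derives raw and length-normalized accuracy from per-choice log-likelihoods; the function and field names are assumptions for illustration, not catwalk's API.

```python
# Illustrative sketch only; names and fields are assumptions, not catwalk's API.
from typing import Dict, List


def ranked_classification_metrics(predictions: List[Dict]) -> Dict[str, float]:
    """Each prediction holds per-choice summed log-likelihoods, per-choice
    token/character counts, and the index of the gold answer choice."""
    acc_raw = acc_per_token = acc_per_char = 0.0
    for pred in predictions:
        logprobs = pred["choice_logprobs"]       # summed log p(choice | context)
        n_tokens = pred["choice_num_tokens"]     # token count per choice
        n_chars = pred["choice_num_chars"]       # character count per choice
        gold = pred["gold_index"]
        choices = range(len(logprobs))

        # Raw ranking: the choice with the highest total log-likelihood wins.
        acc_raw += float(max(choices, key=lambda i: logprobs[i]) == gold)
        # Length-normalized rankings reduce the bias toward short choices.
        acc_per_token += float(max(choices, key=lambda i: logprobs[i] / n_tokens[i]) == gold)
        acc_per_char += float(max(choices, key=lambda i: logprobs[i] / n_chars[i]) == gold)

    n = len(predictions)
    return {
        "acc_raw": acc_raw / n,
        "acc_per_token": acc_per_token / n,
        "acc_per_char": acc_per_char / n,
    }
```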

New task for perplexity scoring over a set of jsonl files (catwalk/tasks/perplexity_jsonl.py).
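
The interface of the perplexity_jsonl task itself is not shown here. Purely as a standalone illustration of perplexity scoring over a jsonl file with a Hugging Face causal LM, the sketch below averages token-level negative log-likelihood across documents; the file path, the "text" field, and the gpt2 model are placeholder assumptions.

```python
# Standalone illustration; file path, "text" field, and model are placeholder assumptions.
import json
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

total_nll, total_tokens = 0.0, 0
with open("data.jsonl") as f:                     # hypothetical input file
    for line in f:
        text = json.loads(line)["text"]           # hypothetical field name
        ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
        if ids.shape[1] < 2:
            continue
        with torch.no_grad():
            # With labels=input_ids the model returns the mean cross-entropy
            # over the ids.shape[1] - 1 predicted tokens.
            loss = model(ids, labels=ids).loss
        total_nll += loss.item() * (ids.shape[1] - 1)
        total_tokens += ids.shape[1] - 1

print("perplexity:", math.exp(total_nll / total_tokens))
```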

New model type "lm::" for the general types of tasks handled by current decoder-only LLMs (catwalk/models/language_model.py).
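
The code in catwalk/models/language_model.py is not reproduced in this description. The sketch below shows the core operation such a decoder-only model wrapper performs for ranked-classification tasks: scoring each answer choice as a continuation of the context and picking the highest-scoring one. The model, question, and helper name are illustrative assumptions.

```python
# Sketch of scoring answer choices with a decoder-only LM; not catwalk's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


def choice_logprob(context: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, conditioned on the context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    # The log-prob of the token at position t is read from the logits at position t - 1.
    return sum(
        logprobs[0, t - 1, full_ids[0, t]].item()
        for t in range(ctx_len, full_ids.shape[1])
    )


question = "Which gas do plants absorb from the atmosphere?\nAnswer:"
choices = [" carbon dioxide", " oxygen", " nitrogen", " water vapor"]
scores = [choice_logprob(question, c) for c in choices]
print(choices[scores.index(max(scores))])
```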

Script for running LLM evaluations (catwalk/run_lm_eval.py).
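
The command-line interface of run_lm_eval.py is not listed in this description. Purely as a hypothetical outline of the kind of driver loop such a script provides, the sketch below parses a model spec and task names, evaluates each task, and writes a metrics file; all argument names and the evaluate stub are placeholders, not catwalk's actual CLI.

```python
# Hypothetical driver outline; argument names and the evaluate stub are placeholders,
# not catwalk's actual CLI or evaluation pipeline.
import argparse
import json


def evaluate(model_spec: str, task_name: str) -> dict:
    """Stub standing in for the real model/task evaluation; returns dummy metrics."""
    return {"model": model_spec, "task": task_name, "acc_raw": None}


def main() -> None:
    parser = argparse.ArgumentParser(description="Run LLM evaluations (illustrative sketch).")
    parser.add_argument("--model", required=True, help='model spec, e.g. an "lm::" model')
    parser.add_argument("--task", nargs="+", required=True, help="one or more task names")
    parser.add_argument("--output", default="metrics.json", help="where to write metrics")
    args = parser.parse_args()

    results = {t: evaluate(args.model, t) for t in args.task}
    with open(args.output, "w") as out:
        json.dump(results, out, indent=2)
    print(json.dumps(results, indent=2))


if __name__ == "__main__":
    main()
```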

OyvindTafjord and others added 30 commits (April 12, 2023 17:44):
Add eval script with more verbose output and on-the-fly models
Add option to show a few model inputs
Fix missing file for num_model_inputs
AkshitaB merged commit b9cc7df into main on Dec 19, 2023; 17 checks passed.