All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
## Unreleased

- Added `torch.no_grad()` around model calls in `language_model.py`
- Prevent crashes with a more robust stop token for `greedy_until` in `language_model.py`
## v1.0.0rc0 - 2023-12-19
- Support for OPT-175B (AI2 only)
- New detailed metrics for ranked classification in `RankedClassificationMetrics`.
- New task for perplexity scoring over a set of jsonl files.
- New model type "lm:" for general types of tasks handled by decoder-only language models.
- New `run_lm_eval.py` script.
- Fixed the way we compute SQuAD metrics.
- Fixed wikitext on GPT2
- Fixed lambada on GPT2
- Fixed the implementation of MultiRC
## v0.2.2 - 2023-01-27
- Changed the package name to ai2-catwalk to avoid a name conflict on PyPI.
## v0.2.1 - 2023-01-26
- Fixed the release process
## v0.2.0 - 2022-12-02
- MetaICLTask now supports fewer than 16 shots and only supports getting the test split
- Set default logging level to `"WARNING"` instead of `"ERROR"` when invoking `python -m catwalk`
- Changed MetaICLModel formatting to always preserve whitespace, to reproduce MetaICL results
- Improved speed of rank classification models by aggregating sequence logits on GPU rather than on CPU
- The promptsource templates now live directly inside of Catwalk. This avoids dependency issues.
- Promptsource now applies the templates in parallel across all CPUs.
- Replaced a dependency on `lmeval` with a copy of the source code
- Adds the ability to train models
- Few-shot abilities
- P3 tasks
- Encoder-only QA models
- SQuAD and SQuADShifts tasks
- Adds a new MetaICLTask that supports all evaluation tasks in that benchmark
- Adds a new MetaICLModel that replicates the formatting and truncation used by MetaICL for few shot evaluation
- An option for rank classification to average log likelihoods by token length
- Adds support for inference with IA3 adapters loaded from a file on decoder-only ranked classification models
- Add support for MetaICL's race-high and numer_sense tasks
- Adds QA task support for autoregressive models (previously only available with the Eleuther task format)
- Adds QA task support for T5 models
- Optional `random_subsample_seed` for PredictStep
- Added MRQA task
- Added the ability to train `HFAutoModel`
- Added the ability for `HFAutoModel` to run NLI tasks
- Adds the ability to back off to auto `device_map` on out-of-memory errors for ranked classification models
- Format conversions for a number of multiple choice models
- Added an experiment config that trains many models on many tasks
- Added promptsource support
- Added support for soft prompts
- Added more models, T0 variants of T5 and Eleuther variants of GPT
- Added support for Hugging Face's accelerate project, but only for inference
- Promptsource now supports few-shot ICL.
- The training step now supports early stopping.
- Compatibility with the latest version of torchmetrics
- Fixed progress bar for HFAutoModel QA evaluation
- Fixed bug causing few-shot to use more than the specified number of shots
- Fixed bug in `cached_transformers.get()` that prevented using the `override_weights_file` arg
- Fixed the `load_weights` arg in `cached_transformers.get()`, which was documented but not implemented
- Fixed support for training with OPT models
- Countless tweaks to `FinetuneStep`
- Some models insert special tokens where they should not. This fixes that.
- Metrics were messy for classification tasks. They are still messy, but less so.
- Applied workaround for T5 bug in huggingface tokenizers
- Fixed fine-tuning T5 ranked classification models
- Fixed the names of the T5 1.1 models
- Cached transformers now take `kwargs` into account.
- Fixed various tasks: WSC, TriviaQA, Race, HeadQA
- Fixed the case where different promptsource templates produce different numbers of answer choices
- `tqdm` has to be closed or it'll start printing a bunch of newlines.
## v0.1.0 - 2022-06-10
- Catwalk is now Open Source and has a release process.
- Catwalk