Commit d132c09

Release prep (#727)

* Rewrite README to reflect current conditions.
* More updates.
* Incorporating Barbara and Kurt's suggestions.
1 parent b366ae3 commit d132c09

File tree

1 file changed (+89, -45 lines)


README.md

Lines changed: 89 additions & 45 deletions
````diff
@@ -11,31 +11,40 @@ Run safety benchmarks against AI models and view detailed reports showing how we
 
 ## Background
 
-This is a [MLCommons project](https://mlcommons.org/ai-safety), part of the [AI Safety Working Group](https://mlcommons.org/working-groups/ai-safety/ai-safety/).
-The project is at an early stage, and this should be considered a proof of concept. Results are not intended to
-indicate actual levels of AI system safety. You can see sample benchmarks [here](https://mlcommons.org/benchmarks/ai-safety/) and read the white paper [here](https://arxiv.org/abs/2404.12241).
-
-ModelBench is closely related to [ModelGauge](https://github.com/mlcommons/modelgauge). ModelGauge does most of the
-work of running Tests against SUTs (systems under test, that is machine learning models) and then using annotators
-to measure each response. This project, ModelBench, aggregates those measures, relates them to specific Harms, rolls
-those Harms up into Benchmarks, and produces reports. If you are looking to run a benchmark for your model, start by
-[adding a SUT](https://github.com/mlcommons/modelgauge/blob/main/docs/tutorial_suts.md) that works with ModelGauge.
+This is a [MLCommons project](https://mlcommons.org/ai-safety),
+part of the [AI Risk & Reliability Working
+Group](https://mlcommons.org/ai-risk-and-reliability/).
+The project is at an early stage. You can see sample benchmarks
+[here](https://ailuminate.mlcommons.org/benchmarks/) and our 0.5 white paper
+[here](https://arxiv.org/abs/2404.12241).
+
+This project now contains both ModelGauge and ModelBench. ModelGauge does
+most of the work of running Tests against SUTs (systems under test,
+that is machine learning models and related tech) and then using
+annotators to measure each response. ModelBench aggregates those measures,
+relates them to specific Hazards, rolls those Hazards up into Benchmarks, and
+produces reports. If you are looking to run a benchmark for your model,
+start by [adding a SUT](docs/add-a-sut.md) that works with ModelGauge.
 
 ## Requirements
 
-The current benchmark uses LlamaGuard to evaluate the safety of responses. For now you will need a [TogetherAI](https://www.together.ai/)
-account to use it. For 0.5, all included models are also run via TogetherAI. If you're adding a SUT, you can use
-any service you like, including hosting it yourself.
+The current public practice benchmark uses LlamaGuard to
+evaluate the safety of responses. For now you will need a
+[Together AI](https://www.together.ai/) account to use it. For 1.0, we test
+models on a variety of services; if you want to duplicate our benchmarks
+you will need accounts with those services as well. If you're adding a
+SUT, you can use any service you like, including hosting it yourself.
 
-Note that running a full benchmark for all included models via TogetherAI currently takes about a week. Depending
-on response time, running your own SUT may be faster. We aim to speed things up substantially for 1.0. However, you
-can get lower-fidelity reports in minutes by running a benchmark with fewer items via the `--max-instances` or
-`-m` flag.
+Note that running a full benchmark to match our public set takes
+several days. Depending on response time, running your own SUT may be
+faster. However, you can get lower-fidelity reports in minutes by running
+a benchmark with fewer items via the `--max-instances` or `-m` flag.
````
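As a concrete illustration of the `-m` flag described in the added lines above, the quick, lower-fidelity run that this README demonstrates later limits each test to 10 items:

```shell
# Low-fidelity run: only 10 items per test, finishing in minutes rather than days.
poetry run modelbench benchmark -m 10
```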
````diff
 
 ## Installation
 
-Since this is under heavy development, the best way to run it is to check it out from GitHub. However, you can also
-install ModelBench as a CLI tool or library to use in your own projects.
+Since this is under heavy development, the best way to run it is to
+check it out from GitHub. However, you can also install ModelBench as
+a CLI tool or library to use in your own projects.
 
 ### Install ModelBench with [Poetry](https://python-poetry.org/) for local development.
 
@@ -57,8 +66,10 @@ cd modelbench
 poetry install
 ```
 
-At this point you may optionally do `poetry shell` which will put you in a virtual environment that uses the installed packages
-for everything. If you do that, you don't have to explicitly say `poetry run` in the commands below.
+At this point you may optionally do `poetry shell` which will put you in a
+virtual environment that uses the installed packages for everything. If
+you do that, you don't have to explicitly say `poetry run` in the
+commands below.
 
 ### Install ModelBench from PyPI
 
````
````diff
@@ -77,15 +88,17 @@ poetry run pytest tests
 
 ## Trying It Out
 
-We encourage interested parties to try it out and give us feedback. For now, ModelBench is just a proof of
-concept, but over time we would like others to be able both test their own models and to create their own
-tests and benchmarks.
+We encourage interested parties to try it out and give us feedback. For
+now, ModelBench is mainly focused on us running our own benchmarks,
+but over time we would like others to be able both to test their own models
+and to create their own tests and benchmarks.
 
 ### Running Your First Benchmark
 
-Before running any benchmarks, you'll need to create a secrets file that contains any necessary API keys and other sensitive information.
-Create a file at `config/secrets.toml` (in the current working directory if you've installed ModelBench from PyPi).
-You can use the following as a template.
+Before running any benchmarks, you'll need to create a secrets file that
+contains any necessary API keys and other sensitive information. Create a
+file at `config/secrets.toml` (in the current working directory if you've
+installed ModelBench from PyPI). You can use the following as a template.
 
 ```toml
 [together]
````
````diff
@@ -101,46 +114,77 @@ Note: Omit `poetry run` in all example commands going forward if you've installe
 poetry run modelbench benchmark -m 10
 ```
 
-You should immediately see progress indicators, and depending on how loaded TogetherAI is,
-the whole run should take about 15 minutes.
+You should immediately see progress indicators, and depending on how
+loaded Together AI is, the whole run should take about 15 minutes.
 
 > [!IMPORTANT]
 > Sometimes, running a benchmark will fail due to temporary errors caused by network issues, API outages, etc. While we are working
 > toward handling these errors gracefully, the current best solution is to simply attempt to rerun the benchmark if it fails.
 
 ### Viewing the Scores
 
-After a successful benchmark run, static HTML pages are generated that display scores on benchmarks and tests.
-These can be viewed by opening `web/index.html` in a web browser. E.g., `firefox web/index.html`.
+After a successful benchmark run, static HTML pages are generated that
+display scores on benchmarks and tests. These can be viewed by opening
+`web/index.html` in a web browser. E.g., `firefox web/index.html`.
+
+Note that the HTML that ModelBench produces is an older version than is available
+on [the website](https://ailuminate.mlcommons.org/). Over time we'll simplify the
+direct ModelBench output to be more straightforward and more directly useful to
+people independently running ModelBench.
+
+### Using the journal
+
+As `modelbench` runs, it logs each important event to the journal. That includes
+every step of prompt processing. You can use that to extract most information
+that you might want about the run. The journal is a zstandard-compressed JSONL
+file, meaning that each line is a valid JSON object.
 
-If you would like to dump the raw scores, you can do:
+There are many tools that can work with those files. In the example below, we
+use [jq](https://jqlang.github.io/jq/), a JSON swiss army knife. For more
+information on the journal, see [the documentation](docs/run-journal.md).
+
+To dump the raw scores, you could do something like this:
 
 ```shell
-poetry run modelbench grid -m 10 > scoring-grid.csv
+zstd -d -c $(ls run/journals/* | tail -1) | jq -rn ' ["sut", "hazard", "score", "reference score"], (inputs | select(.message=="hazard scored") | [.sut, .hazard, .score, .reference]) | @csv'
 ```
 
-To see all raw requests, responses, and annotations, do:
+That will produce CSV for each hazard scored, as well as showing the reference
+score for that hazard.
+
+Or if you'd like to see the processing chain for a specific prompt, you could do:
 
 ```shell
-poetry run modelbench responses -m 10 response-output-dir
+zstd -d -c $(ls run/journals/* | tail -1) | jq -r 'select(.prompt_id=="airr_practice_1_0_41321")'
 ```
-That will produce a series of CSV files, one per Harm, in the given output directory. Please note that many of the
-prompts may be uncomfortable or harmful to view, especially to people with a history of trauma related to one of the
-Harms that we test for. Consider carefully whether you need to view the prompts and responses, limit exposure to
-what's necessary, take regular breaks, and stop if you feel uncomfortable. For more information on the risks, see
-[this literature review on vicarious trauma](https://www.zevohealth.com/wp-content/uploads/2021/08/Literature-Review_Content-Moderators37779.pdf).
+
+That should output a series of JSON objects showing the flow from `queuing item`
+to `item finished`.
````
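As another way to slice the journal, and assuming every journal line carries the `message` field used in the examples above, you could tally event types for the most recent run with something like:

```shell
# Count journal events by message type for the latest run.
zstd -d -c $(ls run/journals/* | tail -1) | jq -r '.message' | sort | uniq -c | sort -rn
```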
````diff
+
+**CAUTION**: Please note that many of the prompts may be uncomfortable or
+harmful to view, especially to people with a history of trauma related to
+one of the hazards that we test for. Consider carefully whether you need
+to view the prompts and responses, limit exposure to what's necessary,
+take regular breaks, and stop if you feel uncomfortable. For more
+information on the risks, see [this literature review on vicarious
+trauma](https://www.zevohealth.com/wp-content/uploads/2021/08/Literature-Review_Content-Moderators37779.pdf).
 
 ### Managing the Cache
 
-To speed up runs, ModelBench caches calls to both SUTs and annotators. That's normally what a benchmark-runner wants.
-But if you have changed your SUT in a way that ModelBench can't detect, like by deploying a new version of your model
-to the same endpoint, you may have to manually delete the cache. Look in `run/suts` for an `sqlite` file that matches
-the name of your SUT and either delete it or move it elsewhere. The cache will be created anew on the next run.
+To speed up runs, ModelBench caches calls to both SUTs and
+annotators. That's normally what a benchmark-runner wants. But if you
+have changed your SUT in a way that ModelBench can't detect, like by
+deploying a new version of your model to the same endpoint, you may
+have to manually delete the cache. Look in `run/suts` for an `sqlite`
+file that matches the name of your SUT and either delete it or move it
+elsewhere. The cache will be created anew on the next run.
````
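A short sketch of that cleanup, using `my-custom-sut.sqlite` as a purely hypothetical file name; substitute whatever file in `run/suts` matches your SUT:

```shell
ls run/suts/                            # find the sqlite cache file named after your SUT
mv run/suts/my-custom-sut.sqlite /tmp/  # hypothetical name; move it aside (or delete it) to force fresh calls
```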
````diff
 
 ### Running the benchmark on your SUT
 
-ModelBench uses the ModelGauge library to discover and manage SUTs. For an example of how you can run a benchmark
-against a custom SUT, check out this [tutorial](https://github.com/mlcommons/modelbench/blob/main/docs/add-a-sut.md).
+ModelBench uses the ModelGauge library to discover
+and manage SUTs. For an example of how you can run
+a benchmark against a custom SUT, check out this
+[tutorial](https://github.com/mlcommons/modelbench/blob/main/docs/add-a-sut.md).
 
 ## Contributing
 
````
