@@ -11,31 +11,40 @@ Run safety benchmarks against AI models and view detailed reports showing how we

## Background

- This is a [MLCommons project](https://mlcommons.org/ai-safety), part of the [AI Safety Working Group](https://mlcommons.org/working-groups/ai-safety/ai-safety/).
- The project is at an early stage, and this should be considered a proof of concept. Results are not intended to
- indicate actual levels of AI system safety. You can see sample benchmarks [here](https://mlcommons.org/benchmarks/ai-safety/) and read the white paper [here](https://arxiv.org/abs/2404.12241).
-
- ModelBench is closely related to [ModelGauge](https://github.com/mlcommons/modelgauge). ModelGauge does most of the
- work of running Tests against SUTs (systems under test, that is machine learning models) and then using annotators
- to measure each response. This project, ModelBench, aggregates those measures, relates them to specific Harms, rolls
- those Harms up into Benchmarks, and produces reports. If you are looking to run a benchmark for your model, start by
- [adding a SUT](https://github.com/mlcommons/modelgauge/blob/main/docs/tutorial_suts.md) that works with ModelGauge.
+ This is a [MLCommons project](https://mlcommons.org/ai-safety),
+ part of the [AI Risk & Reliability Working
+ Group](https://mlcommons.org/ai-risk-and-reliability/).
+ The project is at an early stage. You can see sample benchmarks
+ [here](https://ailuminate.mlcommons.org/benchmarks/) and our 0.5 white paper
+ [here](https://arxiv.org/abs/2404.12241).
+
+ This project now contains both ModelGauge and ModelBench. ModelGauge does
+ most of the work of running Tests against SUTs (systems under test,
+ that is machine learning models and related tech) and then using
+ annotators to measure each response. ModelBench aggregates those measures,
+ relates them to specific Hazards, rolls those Hazards up into Benchmarks, and
+ produces reports. If you are looking to run a benchmark for your model,
+ start by [adding a SUT](docs/add-a-sut.md) that works with ModelGauge.

## Requirements

- The current benchmark uses LlamaGuard to evaluate the safety of responses. For now you will need a [TogetherAI](https://www.together.ai/)
- account to use it. For 0.5, all included models are also run via TogetherAI. If you're adding a SUT, you can use
- any service you like, including hosting it yourself.
+ The current public practice benchmark uses LlamaGuard to
+ evaluate the safety of responses. For now you will need a
+ [Together AI](https://www.together.ai/) account to use it. For 1.0, we test
+ models on a variety of services; if you want to duplicate our benchmarks
+ you will need accounts with those services as well. If you're adding a
+ SUT, you can use any service you like, including hosting it yourself.

- Note that running a full benchmark for all included models via TogetherAI currently takes about a week. Depending
- on response time, running your own SUT may be faster. We aim to speed things up substantially for 1.0. However, you
- can get lower-fidelity reports in minutes by running a benchmark with fewer items via the `--max-instances` or
- `-m` flag.
+ Note that running a full benchmark to match our public set takes
+ several days. Depending on response time, running your own SUT may be
+ faster. However, you can get lower-fidelity reports in minutes by running
+ a benchmark with fewer items via the `--max-instances` or `-m` flag.

## Installation

- Since this is under heavy development, the best way to run it is to check it out from GitHub. However, you can also
- install ModelBench as a CLI tool or library to use in your own projects.
+ Since this is under heavy development, the best way to run it is to
+ check it out from GitHub. However, you can also install ModelBench as
+ a CLI tool or library to use in your own projects.

### Install ModelBench with [Poetry](https://python-poetry.org/) for local development.

@@ -57,8 +66,10 @@ cd modelbench
poetry install
```

- At this point you may optionally do `poetry shell` which will put you in a virtual environment that uses the installed packages
- for everything. If you do that, you don't have to explicitly say `poetry run` in the commands below.
+ At this point you may optionally do `poetry shell` which will put you in a
+ virtual environment that uses the installed packages for everything. If
+ you do that, you don't have to explicitly say `poetry run` in the
+ commands below.
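+
+ For example, the two invocation styles below are equivalent. This sketch uses
+ the quick benchmark run described later in this README and assumes you have
+ already set up your secrets file as shown below:
+
+ ```shell
+ # Inside a `poetry shell` session, installed commands are on your PATH:
+ modelbench benchmark -m 10
+
+ # Outside it, prefix each command with `poetry run`:
+ poetry run modelbench benchmark -m 10
+ ```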

### Install ModelBench from PyPI

@@ -77,15 +88,17 @@ poetry run pytest tests

## Trying It Out

- We encourage interested parties to try it out and give us feedback. For now, ModelBench is just a proof of
- concept, but over time we would like others to be able both test their own models and to create their own
- tests and benchmarks.
+ We encourage interested parties to try it out and give us feedback. For
+ now, ModelBench is mainly focused on us running our own benchmarks,
+ but over time we would like others to be able both to test their own models
+ and to create their own tests and benchmarks.

### Running Your First Benchmark

- Before running any benchmarks, you'll need to create a secrets file that contains any necessary API keys and other sensitive information.
- Create a file at `config/secrets.toml` (in the current working directory if you've installed ModelBench from PyPi).
- You can use the following as a template.
+ Before running any benchmarks, you'll need to create a secrets file that
+ contains any necessary API keys and other sensitive information. Create a
+ file at `config/secrets.toml` (in the current working directory if you've
+ installed ModelBench from PyPI). You can use the following as a template.

```toml
[together]
@@ -101,46 +114,77 @@ Note: Omit `poetry run` in all example commands going forward if you've installe
poetry run modelbench benchmark -m 10
```

- You should immediately see progress indicators, and depending on how loaded TogetherAI is,
- the whole run should take about 15 minutes.
+ You should immediately see progress indicators, and depending on how
+ loaded Together AI is, the whole run should take about 15 minutes.

> [!IMPORTANT]
> Sometimes, running a benchmark will fail due to temporary errors such as network issues, API outages, etc. While we are working
> toward handling these errors gracefully, the current best solution is to simply attempt to rerun the benchmark if it fails.

### Viewing the Scores

- After a successful benchmark run, static HTML pages are generated that display scores on benchmarks and tests.
- These can be viewed by opening `web/index.html` in a web browser. E.g., `firefox web/index.html`.
+ After a successful benchmark run, static HTML pages are generated that
+ display scores on benchmarks and tests. These can be viewed by opening
+ `web/index.html` in a web browser. E.g., `firefox web/index.html`.
+
+ Note that the HTML that ModelBench produces is an older version of what is
+ available on [the website](https://ailuminate.mlcommons.org/). Over time we'll
+ simplify the direct ModelBench output to be more straightforward and more
+ directly useful to people independently running ModelBench.
+
+ ### Using the journal
+
+ As `modelbench` runs, it logs each important event to the journal. That includes
+ every step of prompt processing. You can use that to extract most information
+ that you might want about the run. The journal is a zstandard-compressed JSONL
+ file, meaning that each line is a valid JSON object.
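+
+ For a quick sanity check, you could decompress the most recent journal and
+ look at its first line (this assumes journals are written under
+ `run/journals/`, as in the examples below):
+
+ ```shell
+ # Decompress the newest journal and show its first JSON line.
+ zstd -d -c $(ls run/journals/* | tail -1) | head -1
+ ```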

- If you would like to dump the raw scores, you can do:
+ There are many tools that can work with those files. In the example below, we
+ use [jq](https://jqlang.github.io/jq/), a JSON Swiss Army knife. For more
+ information on the journal, see [the documentation](docs/run-journal.md).
+
+ To dump the raw scores, you could do something like this:

```shell
- poetry run modelbench grid -m 10 > scoring-grid.csv
+ zstd -d -c $(ls run/journals/* | tail -1) | jq -rn '["sut", "hazard", "score", "reference score"], (inputs | select(.message=="hazard scored") | [.sut, .hazard, .score, .reference]) | @csv'
```

- To see all raw requests, responses, and annotations, do:
+ That will produce a row of CSV for each hazard scored, along with the reference
+ score for that hazard.
+
+ Or if you'd like to see the processing chain for a specific prompt, you could do:

```shell
- poetry run modelbench responses -m 10 response-output-dir
+ zstd -d -c $(ls run/journals/* | tail -1) | jq -r 'select(.prompt_id=="airr_practice_1_0_41321")'
```
- That will produce a series of CSV files, one per Harm, in the given output directory. Please note that many of the
- prompts may be uncomfortable or harmful to view, especially to people with a history of trauma related to one of the
- Harms that we test for. Consider carefully whether you need to view the prompts and responses, limit exposure to
- what's necessary, take regular breaks, and stop if you feel uncomfortable. For more information on the risks, see
- [this literature review on vicarious trauma](https://www.zevohealth.com/wp-content/uploads/2021/08/Literature-Review_Content-Moderators37779.pdf).
+
+ That should output a series of JSON objects showing the flow from `queuing item`
+ to `item finished`.
+
+ **CAUTION**: Please note that many of the prompts may be uncomfortable or
+ harmful to view, especially to people with a history of trauma related to
+ one of the hazards that we test for. Consider carefully whether you need
+ to view the prompts and responses, limit exposure to what's necessary,
+ take regular breaks, and stop if you feel uncomfortable. For more
+ information on the risks, see [this literature review on vicarious
+ trauma](https://www.zevohealth.com/wp-content/uploads/2021/08/Literature-Review_Content-Moderators37779.pdf).

### Managing the Cache

- To speed up runs, ModelBench caches calls to both SUTs and annotators. That's normally what a benchmark-runner wants.
- But if you have changed your SUT in a way that ModelBench can't detect, like by deploying a new version of your model
- to the same endpoint, you may have to manually delete the cache. Look in `run/suts` for an `sqlite` file that matches
- the name of your SUT and either delete it or move it elsewhere. The cache will be created anew on the next run.
+ To speed up runs, ModelBench caches calls to both SUTs and
+ annotators. That's normally what a benchmark-runner wants. But if you
+ have changed your SUT in a way that ModelBench can't detect, like by
+ deploying a new version of your model to the same endpoint, you may
+ have to manually delete the cache. Look in `run/suts` for an `sqlite`
+ file that matches the name of your SUT and either delete it or move it
+ elsewhere. The cache will be created anew on the next run.
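+
+ As a sketch, if your SUT were registered under the name `my-sut` (a
+ hypothetical name; yours will differ), clearing its cache might look like this:
+
+ ```shell
+ # See which SUTs have cached calls; file names correspond to SUT names.
+ ls run/suts/
+
+ # Move the stale cache aside (or delete it); it is recreated on the next run.
+ mv run/suts/my-sut.sqlite /tmp/
+ ```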

### Running the benchmark on your SUT

- ModelBench uses the ModelGauge library to discover and manage SUTs. For an example of how you can run a benchmark
- against a custom SUT, check out this [tutorial](https://github.com/mlcommons/modelbench/blob/main/docs/add-a-sut.md).
+ ModelBench uses the ModelGauge library to discover
+ and manage SUTs. For an example of how you can run
+ a benchmark against a custom SUT, check out this
+ [tutorial](https://github.com/mlcommons/modelbench/blob/main/docs/add-a-sut.md).

## Contributing
