"prior probing" for "safety"/nsfw scoring (model rating)

labels: experimental, mechanistic_interpretability, alignment, wip

WIP: https://github.com/dmarx/whats-in-a-name/

generate a bunch of images using a low CFG (classifier-free guidance) scale and ambiguous prompt(s), like "wow", so the outputs reflect the model's prior more than the prompt.

classify the output images as e.g. containing a pretty woman, objectified woman, nudity, genitalia, etc. Each class should be detected independently of the others.
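a minimal sketch of what "independent" detection could look like, assuming some classifier already produces a separate score in [0, 1] per class (the scores, label names, and thresholds below are hypothetical placeholders, not output from a real model). Each label gets its own threshold, so several labels can fire on the same image and the scores don't need to sum to 1:

```python
def detect(scores, thresholds):
    """Turn per-class scores into independent yes/no detections.

    scores:     dict label -> score in [0, 1] (hypothetical classifier output)
    thresholds: dict label -> cutoff for that label alone
    """
    return {label: scores[label] >= cutoff for label, cutoff in thresholds.items()}


# hypothetical scores for a single generated image
scores = {"pretty_woman": 0.9, "objectified_woman": 0.7, "nudity": 0.2, "genitalia": 0.05}
thresholds = {label: 0.5 for label in scores}

detections = detect(scores, thresholds)
print(detections)
```

this is the multi-label setup (one binary decision per class) rather than picking a single best class per image.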

this then gives us, for each class, an empirical likelihood (the fraction of generated images flagged) that the model generates an image satisfying that classification.
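the empirical likelihood here is just a per-class hit rate. A rough sketch, assuming detections are stored as one dict of booleans per image (the function name and the toy data are made up for illustration); it also reports a simple binomial standard error so small samples aren't over-read:

```python
import math


def empirical_rates(detections):
    """detections: list of dicts (label -> bool), one dict per generated image.

    Returns dict label -> (rate, standard_error).
    """
    n = len(detections)
    labels = sorted({label for d in detections for label in d})
    rates = {}
    for label in labels:
        hits = sum(1 for d in detections if d.get(label, False))
        p = hits / n
        se = math.sqrt(p * (1 - p) / n)  # binomial standard error
        rates[label] = (p, se)
    return rates


# toy data: four images, two labels (placeholder values, not real results)
example = [
    {"nudity": False, "pretty_woman": True},
    {"nudity": True, "pretty_woman": True},
    {"nudity": False, "pretty_woman": False},
    {"nudity": False, "pretty_woman": True},
]
rates = empirical_rates(example)
print(rates["nudity"][0], rates["pretty_woman"][0])
```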

tweak CFG and construct level curves to characterize e.g. how "thirsty" a model is, by quantifying its propensity to generate naked ladies.
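one possible way to turn the CFG sweep into something comparable across models, sketched under assumptions: for a single label, compute the hit rate at each CFG scale, then find the CFG scale at which the rate first reaches a chosen level (linear interpolation between sampled scales). The helper names and the toy detections are hypothetical; real inputs would come from the generate-and-classify steps above.

```python
def rate_curve(hits_by_cfg):
    """hits_by_cfg: dict cfg_scale -> list of bool detections for one label.

    Returns [(cfg_scale, hit_rate), ...] sorted by cfg_scale.
    """
    return sorted((cfg, sum(hits) / len(hits)) for cfg, hits in hits_by_cfg.items())


def crossing(curve, level):
    """CFG scale where the rate first reaches `level`, or None if it never does."""
    if curve[0][1] >= level:
        return curve[0][0]
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if y0 < level <= y1:
            # linear interpolation between the two sampled CFG scales
            return x0 + (level - y0) * (x1 - x0) / (y1 - y0)
    return None


# toy detections for one label at three CFG scales (placeholder values)
hits_by_cfg = {
    1.0: [False, False, False, True],
    3.0: [False, True, True, False],
    5.0: [True, True, True, False],
}
curve = rate_curve(hits_by_cfg)
print(curve)
print(crossing(curve, 0.5))
```

repeating `crossing` over several levels and labels gives a family of curves per model; a model that reaches a given rate at a lower CFG scale has a stronger prior toward that class.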
