atkgen: Attack Generation #765

VikasPahuja · 2024-07-01T10:36:41Z

Hi,

Thank you for your excellent work. I have a question. Currently, the ATKGen probe supports a fine-tuned GPT-2 model for auto-redteaming. Are there any plans to develop support for other models, such as fine-tuned versions of OpenAI's models (GPT-3, GPT-4) or LLaMA models? Additionally, are there plans to use this probe for categories beyond toxicity, such as bias and fairness, misinformation, hate speech, or harassment?

leondz · 2024-07-02T15:57:49Z

Maintainer

Yes and yes! We're constrained by two factors: data licensing (training on OpenAI-derived data isn't always easy), and getting a reliable detector for a given category to filter the data with, so that we only include conversations with real "hits".

What kind of category would you like to see?
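
To illustrate the filtering step mentioned above, here is a minimal sketch of keeping only conversations where a detector registers a "hit". This is not garak's actual pipeline or API: the names `toy_detector` and `keep_hits`, the conversation layout, the placeholder keyword, and the 0.5 threshold are all assumptions made for the example. A real detector for a category such as misinformation or harassment would replace the keyword check.

```python
from typing import Callable, List, Tuple

# Hypothetical layout: a conversation is a list of
# (attacker_prompt, target_response) turns.
Turn = Tuple[str, str]
Conversation = List[Turn]


def toy_detector(text: str) -> float:
    """Stand-in detector (assumption): 1.0 if a placeholder keyword appears, else 0.0."""
    flagged = {"badword"}
    words = text.lower().split()
    return 1.0 if any(w in flagged for w in words) else 0.0


def keep_hits(
    conversations: List[Conversation],
    detector: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Conversation]:
    """Keep conversations in which at least one target response scores at or above threshold."""
    return [
        conv
        for conv in conversations
        if any(detector(response) >= threshold for _, response in conv)
    ]


if __name__ == "__main__":
    conversations = [
        [("hi", "hello there")],
        [("say it", "that is a badword indeed")],
    ]
    hits = keep_hits(conversations, toy_detector)
    print(len(hits), "of", len(conversations), "conversations kept")
```

The quality of the kept data depends entirely on the detector: an unreliable one lets non-hits into the training set, which is why a reliable detector per category is the limiting factor.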
