atkgen: Attack Generation #765
Closed
VikasPahuja started this conversation in General
Replies: 1 comment
-
Yes and yes! We're constrained by two factors: data licensing (training on OpenAI-derived data isn't always easy), and getting a reliable detector with which to filter the data for a given category, so that we only include conversations with real "hits". What kind of category would you like to see?
-
Hi,
Thank you for your excellent work. I have a question. Currently, the ATKGen probe uses a fine-tuned GPT-2 model for auto-redteaming. Are there any plans to support other models, such as fine-tuned versions of GPT-3, GPT-4, other OpenAI models, or LLaMA models? Additionally, are there plans to use this probe for categories beyond toxicity, such as bias and fairness, misinformation, hate speech, or harassment?