We introduce AegisLLM, a cooperative multi-agent defense against adversarial attacks and information leakage. In AegisLLM, a structured workflow of autonomous agents - an orchestrator, a deflector, a responder, and an evaluator - collaborates to ensure safe and compliant LLM outputs, while self-improving over time through prompt optimization. We show that scaling the agentic reasoning system at test time - both by incorporating additional agent roles and by leveraging automated prompt optimization (e.g., with DSPy) - substantially enhances robustness without compromising model utility. This test-time defense enables real-time adaptability to evolving attacks without requiring model retraining. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM. On the WMDP unlearning benchmark, AegisLLM achieves near-perfect unlearning with only 20 training examples and fewer than 300 LM calls. On jailbreaking benchmarks, we achieve a 51% improvement over the base model on StrongReject, with a false refusal rate of only 7.9% on PHTest compared to 18-55% for comparable methods. Our results highlight the advantages of adaptive, agentic reasoning over static defenses, establishing AegisLLM as a strong runtime alternative to traditional approaches based on model modifications.
- `configs/`: Contains configuration files for the experiments.
  - `data/`: Configurations for loading data points for the experiments.
  - `defense/`: Configurations for the defense mechanisms (unlearning, jailbreaking) used in the experiments.
  - `base_unl.yaml`: Top-level configuration file for unlearning experiments.
  - `base_jail.yaml`: Top-level configuration file for jailbreak experiments.
- `pipelines/`: Contains the agentic pipelines for the different agentic architectures.
- `utils/`: Logging and LiteLLM utility tools.
- `.env`: Environment variables file storing API keys and other sensitive information. Make sure to create this file and add your API keys.
- `requirements.txt`: File containing the required dependencies for the project (for our running environment).
- `optimize_unlearning_dspy.py`: Script for DSPy optimization on the WMDP and MMLU benchmarks for unlearning.
- `optimize_jailbreak_dspy.py`: Script for DSPy optimization on the StrongReject and FalseRefusal benchmarks for jailbreaking.
- Unlearning Runs:
  - `run_unlearning_mcq.py`: Script for running unlearning experiments on multiple-choice questions (i.e., the WMDP and MMLU benchmarks).
  - `run_unlearning_mt_bench.py`: Script for running unlearning experiments on the MT-Bench benchmark.
  - `run_unlearning_tofu.py`: Script for running unlearning experiments on the TOFU benchmark.
- Jailbreak Runs:
  - `run_jailbreak_baseline.py`: Script for running the jailbreak experiments using the baseline model.
  - `run_jailbreak_llamaguard.py`: Script for running the jailbreak experiments using the Llama-Guard model.
  - `run_jailbreak_dspy.py`: Script for running jailbreak experiments using the AegisLLM pipeline (optimized with DSPy).
- Clone the repository:

  git clone <repository-url>
  cd <repository-directory>

- Modify the `.env` file to include the required API keys for your LLM model(s); a minimal sketch of such a file is shown after this list.

- Configure the `model_provider` and `model_name` values in the `configs/base.yaml` file based on the provided API keys.

- Install the required dependencies:

  pip install -r requirements.txt
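As a reference, a minimal `.env` sketch is shown below. This is an assumption-laden example, not a file shipped with the repository: the exact variable names depend on the provider(s) you configure in `configs/base.yaml`, and the keys shown here follow common LiteLLM conventions.

```bash
# .env -- hypothetical sketch; variable names are assumptions, adjust to your provider(s).
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...
# Optional: custom endpoint for self-hosted or proxied models (name assumed).
# OPENAI_API_BASE=https://your-endpoint.example.com/v1
```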
You can use different modes to benchmark different categories of experiments for the corresponding pipelines. The following modes are supported:
- `base`: Run the pipeline using the base (original) model, without any specific defense mechanisms applied.
- `prompting`: Run the pipeline using the prompting baseline defense mechanism. See Guardrail Baselines for Unlearning in LLMs for details.
- `filtering`: Run the pipeline using the filtering baseline defense mechanism. See Guardrail Baselines for Unlearning in LLMs for details.
- `dspy-base`: Run AegisLLM's pipeline without DSPy optimization.
- `dspy-json`: Run AegisLLM's pipeline with DSPy optimization. You must run the `optimize_unlearning_dspy.py` and `optimize_jailbreak_dspy.py` scripts before using this mode for the unlearning and jailbreaking experiments, respectively.
The `dspy-base` and `dspy-json` modes correspond to our method.
Note: To run any unlearning experiments in the `dspy-json` mode (WMDP/MMLU and MT-Bench), you must first optimize the DSPy pipeline (Orchestrator only) using the following script:

python3 optimize_unlearning_dspy.py +data=wmdp_cyber +defense=unl_wmdp

The resulting optimization is saved in the `dspy_optimized_dir` directory specified in `configs/base_unl.yaml`. You only need to run this optimization once; it can then be reused without limitation for all WMDP/MMLU and MT-Bench experiments.
For the multiple-choice unlearning experiments, choose the mode based on your experimental setup:
python3 run_unlearning_mcq.py +data=<wmdp_cyber | wmdp_bio | wmdp_chem | mmlu> +defense=unl_wmdp mode=<base | prompting | filtering | dspy-base | dspy-json>
Note: Please also see `run_unl_mcq.sh` for an example of how this script can be used in practice from a bash script; a rough illustrative sketch of such a loop follows below.
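The sketch below is an illustrative guess at what such a wrapper script could look like, built only from the command documented above; it is not the contents of `run_unl_mcq.sh` itself.

```bash
#!/bin/bash
# Illustrative sketch only -- see run_unl_mcq.sh in the repository for the real script.
# Sweep the MCQ unlearning benchmarks across all supported modes.
for data in wmdp_cyber wmdp_bio wmdp_chem mmlu; do
  for mode in base prompting filtering dspy-base dspy-json; do
    python3 run_unlearning_mcq.py +data="${data}" +defense=unl_wmdp mode="${mode}"
  done
done
```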
For the MT-Bench unlearning experiments, choose the mode based on your experimental setup. Note that in our experiments we only use MT-Bench together with the WMDP unlearning setup.
python3 run_unlearning_mt_bench.py +data=mt_bench +defense=unl_wmdp mode=<base | prompting | filtering | dspy-base | dspy-json>
For the TOFU unlearning experiments, choose the mode based on your experimental setup; only a limited set of modes is currently supported. A sketch sweeping the forget subsets follows the command below.
python3 run_unlearning_tofu.py +data=tofu +defense=unl_tofu mode=<filtering | dspy-base> data.data.subset=<forget01 | forget05 | forget10>
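For illustration, the loop below sweeps the three TOFU forget subsets over the two supported modes; it simply re-instantiates the command above and is not a script shipped with the repository.

```bash
#!/bin/bash
# Illustrative sweep over the TOFU forget subsets and the supported modes.
for subset in forget01 forget05 forget10; do
  for mode in filtering dspy-base; do
    python3 run_unlearning_tofu.py +data=tofu +defense=unl_tofu mode="${mode}" data.data.subset="${subset}"
  done
done
```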
For the jailbreak experiments, you first need to run Python scripts to generate the data points, which are later evaluated (see Jailbreak Defense Evaluations). You can use one of the following methods (each method supports both the `false_refusal` and `strong_reject` benchmarks):
- Baseline
python3 run_jailbreak_baseline.py +data=<false_refusal | strong_reject> +defense=<jail_false_refusal | jail_strong_reject>
- Llama-Guard
python3 run_jailbreak_llamaguard.py +data=<false_refusal | strong_reject> +defense=<jail_false_refusal | jail_strong_reject>
- Jailbreak (AegisLLM):
Note: To run this experiment in the `dspy-json` mode, you must first optimize the DSPy pipeline (Orchestrator and Evaluator) using the following script:
python3 optimize_jailbreak_dspy.py +data=<false_refusal | strong_reject> +defense=<jail_false_refusal | jail_strong_reject>
Currently, both options for `data` and `defense` perform the same optimization, so you can choose either one, and you only need to run the optimization once. The optimized pipeline is saved in the `dspy_optimized_dir` directory specified in `configs/base_jail.yaml`. In future development, these optimization configs could be separated for each data and defense combination.
Once the optimization is done, run the experiment with the following script (a limited set of modes is currently supported); an end-to-end sketch chaining these commands follows below.
python3 run_jailbreak_dspy.py +data=<false_refusal | strong_reject> +defense=<jail_false_refusal | jail_strong_reject> mode=<dspy-base | dspy-json>
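Putting the pieces together, a typical `dspy-json` run simply chains the documented commands in order: optimize once, then run both benchmarks. The concrete data/defense choices below are picked for illustration only.

```bash
# One-time DSPy optimization (either data/defense pair works; see the note above).
python3 optimize_jailbreak_dspy.py +data=strong_reject +defense=jail_strong_reject

# Run the optimized AegisLLM pipeline on both jailbreak benchmarks.
python3 run_jailbreak_dspy.py +data=strong_reject +defense=jail_strong_reject mode=dspy-json
python3 run_jailbreak_dspy.py +data=false_refusal +defense=jail_false_refusal mode=dspy-json
```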
After running the `false_refusal` or `strong_reject` scripts for the jailbreak experiments, you can evaluate the results using the evaluation script for the corresponding benchmark. Each of the scripts in the Jailbreak Defenses section provides an output JSON/JSONL file path, which you pass as input to the evaluation scripts below; illustrative invocations of both evaluation scripts are sketched at the end of this section.
- False Refusals (PHTest)
For this script, you need to provide the input path of the JSON/JSONL file generated by the experiment run, the judge model name and API access information, and an output path of your choice where the resulting evaluation information is saved. The false_refusal scores are printed to your shell after the run completes. Note that the script supports extra input parameters, which you can inspect with the `--help` flag.
python3 evals/false_refusal/eval_false_refusal.py --input-path <input_path> --output-path <arbitrary_output_path> --model-name <model_name>
- Strong Reject
Similar to the False Refusals evaluation, you need to provide the input path of the JSON/JSONL file generated by the experiment run and an output path of your choice where the resulting evaluation information is saved. The strong_reject scores are printed to your shell after the run completes. Note that the script supports extra input parameters, which you can inspect with the `--help` flag.
python3 evals/strong_reject/eval_strong_reject.py --input-path <input_path> --output-path <arbitrary_output_path>
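For illustration, concrete invocations might look like the following. The input/output paths and the judge model name are hypothetical placeholders; substitute the actual JSON/JSONL path produced by your run and a judge model available through your configured provider.

```bash
# Hypothetical paths and model name -- replace with your own run outputs and judge model.
python3 evals/false_refusal/eval_false_refusal.py \
  --input-path outputs/false_refusal_run.jsonl \
  --output-path outputs/false_refusal_eval.json \
  --model-name gpt-4o

python3 evals/strong_reject/eval_strong_reject.py \
  --input-path outputs/strong_reject_run.jsonl \
  --output-path outputs/strong_reject_eval.json
```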
This project is licensed under the MIT License. See the `LICENSE` file for more details.
Please cite our paper:
@article{cai2025aegisllmscalingagenticsystems,
title={AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security},
author={Zikui Cai and Shayan Shabihi and Bang An and Zora Che and Brian R. Bartoldson and Bhavya Kailkhura and Tom Goldstein and Furong Huang},
year={2025},
eprint={2504.20965},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.20965},
}