We introduce AegisLLM, a cooperative multi-agent defense against adversarial attacks and information leakage. In AegisLLM, a structured workflow of autonomous agents - an orchestrator, a deflector, a responder, and an evaluator - collaborates to ensure safe and compliant LLM outputs, while self-improving over time through prompt optimization. We show that scaling the agentic reasoning system at test time - both by incorporating additional agent roles and by leveraging automated prompt optimization (e.g., with DSPy) - substantially enhances robustness without compromising model utility. This test-time defense enables real-time adaptability to evolving attacks without requiring model retraining. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM. On the WMDP unlearning benchmark, AegisLLM achieves near-perfect unlearning with only 20 training examples and fewer than 300 LM calls. On jailbreaking benchmarks, we achieve a 51% improvement over the base model on StrongReject, with a false refusal rate of only 7.9% on PHTest compared to 18-55% for comparable methods. Our results highlight the advantages of adaptive, agentic reasoning over static defenses, establishing AegisLLM as a strong runtime alternative to traditional approaches based on model modifications.
- `configs/`: Contains configuration files for the experiments.
  - `data/`: Configurations for loading data points for the experiments.
  - `defense/`: Configurations for the defense mechanisms (unlearning, jailbreaking) used in the experiments.
  - `base_unl.yaml`: Top-level configuration file for unlearning experiments.
  - `base_jail.yaml`: Top-level configuration file for jailbreak experiments.
- `pipelines/`: Contains the agentic pipelines for the different agentic architectures.
- `utils/`: Logging and LiteLLM utility tools.
- `.env`: Environment variables file storing API keys and other sensitive information. Make sure to create this file and add your API keys.
- `requirements.txt`: File containing the required dependencies for the project (for our running environment).
- `optimize_unlearning_dspy.py`: Script for DSPy optimization on the WMDP and MMLU benchmarks for unlearning.
- `optimize_jailbreak_dspy.py`: Script for DSPy optimization on the StrongReject and FalseRefusal benchmarks for jailbreaking.
- Unlearning Runs:
  - `run_unlearning_mcq.py`: Script for running unlearning experiments on multiple-choice questions (i.e., the WMDP and MMLU benchmarks).
  - `run_unlearning_mt_bench.py`: Script for running unlearning experiments on the MT-Bench benchmark.
  - `run_unlearning_tofu.py`: Script for running unlearning experiments on the TOFU benchmark.
- Jailbreak Runs:
  - `run_jailbreak_baseline.py`: Script for running the jailbreak experiments using the baseline model.
  - `run_jailbreak_llamaguard.py`: Script for running the jailbreak experiments using the Llama-Guard model.
  - `run_jailbreak_dspy.py`: Script for running jailbreak experiments using the AegisLLM pipeline (optimized with DSPy).
- Clone the repository:

  git clone <repository-url>
  cd <repository-directory>

- Modify the `.env` file to include the required API keys for your LLM model(s); a minimal sketch of such a file is shown after this list.

- Configure the `model_provider` and `model_name` values in the `configs/base.yaml` file based on the provided API keys.

- Install the required dependencies:

  pip install -r requirements.txt
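As a reference, a minimal `.env` sketch is shown below. This is an assumption-laden example, not a file shipped with the repository: the exact variable names depend on the provider(s) you configure in `configs/base.yaml`, and the keys shown here follow common LiteLLM conventions.

```bash
# .env -- hypothetical sketch; variable names are assumptions, adjust to your provider(s).
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...
# Optional: custom endpoint for self-hosted or proxied models (name assumed).
# OPENAI_API_BASE=https://your-endpoint.example.com/v1
```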
You can use different modes to benchmark different categories of experiments for the corresponding pipelines. The following modes are supported:
- `base`: Run the pipeline using the base (original) model, without any specific defense mechanisms applied.
- `prompting`: Run the pipeline using the prompting baseline defense mechanism. See Guardrail Baselines for Unlearning in LLMs for details.
- `filtering`: Run the pipeline using the filtering baseline defense mechanism. See Guardrail Baselines for Unlearning in LLMs for details.
- `dspy-base`: Run AegisLLM's pipeline without DSPy optimization.
- `dspy-json`: Run AegisLLM's pipeline with DSPy optimization. You must run the `optimize_unlearning_dspy.py` and `optimize_jailbreak_dspy.py` scripts before using this mode for the unlearning and jailbreaking experiments, respectively.
The `dspy-base` and `dspy-json` modes correspond to our method.
Note: To run any unlearning experiments in the `dspy-json` mode (WMDP/MMLU and MT-Bench), you must first optimize the DSPy pipeline (Orchestrator only) using the following script:

python3 optimize_unlearning_dspy.py +data=wmdp_cyber +defense=unl_wmdp

The resulting optimization is saved in the `dspy_optimized_dir` directory specified in `configs/base_unl.yaml`. You only need to run this optimization once; it can then be reused without limitation for all WMDP/MMLU and MT-Bench experiments.
For the multiple-choice unlearning experiments, choose the mode based on your experimental setup:
python3 run_unlearning_mcq.py +data=<wmdp_cyber | wmdp_bio | wmdp_chem | mmlu> +defense=unl_wmdp mode=<base | prompting | filtering | dspy-base | dspy-json>
Note: Please also see `run_unl_mcq.sh` for an example of how this script can be used in practice from a bash script; a rough illustrative sketch of such a loop follows below.
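The sketch below is an illustrative guess at what such a wrapper script could look like, built only from the command documented above; it is not the contents of `run_unl_mcq.sh` itself.

```bash
#!/bin/bash
# Illustrative sketch only -- see run_unl_mcq.sh in the repository for the real script.
# Sweep the MCQ unlearning benchmarks across all supported modes.
for data in wmdp_cyber wmdp_bio wmdp_chem mmlu; do
  for mode in base prompting filtering dspy-base dspy-json; do
    python3 run_unlearning_mcq.py +data="${data}" +defense=unl_wmdp mode="${mode}"
  done
done
```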
For the MT-Bench unlearning experiments, choose the mode based on your experimental setup. Note that in our experiments we only use MT-Bench together with the WMDP unlearning setup.
python3 run_unlearning_mt_bench.py +data=mt_bench +defense=unl_wmdp mode=<base | prompting | filtering | dspy-base | dspy-json>
For the TOFU unlearning experiments, choose the mode based on your experimental setup; only a limited set of modes is currently supported. A sketch sweeping the forget subsets follows the command below.
python3 run_unlearning_tofu.py +data=tofu +defense=unl_tofu mode=<filtering | dspy-base> data.data.subset=<forget01 | forget05 | forget10>
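For illustration, the loop below sweeps the three TOFU forget subsets over the two supported modes; it simply re-instantiates the command above and is not a script shipped with the repository.

```bash
#!/bin/bash
# Illustrative sweep over the TOFU forget subsets and the supported modes.
for subset in forget01 forget05 forget10; do
  for mode in filtering dspy-base; do
    python3 run_unlearning_tofu.py +data=tofu +defense=unl_tofu mode="${mode}" data.data.subset="${subset}"
  done
done
```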
For the jailbreak experiments, you first need to run Python scripts to generate the data points, which are later evaluated (see Jailbreak Defense Evaluations). You can use one of the following methods (each method supports both the `false_refusal` and `strong_reject` benchmarks):
- Baseline
python3 run_jailbreak_baseline.py +data=<false_refusal | strong_reject> +defense=<jail_false_refusal | jail_strong_reject>
- Llama-Guard
python3 run_jailbreak_llamaguard.py +data=<false_refusal | strong_reject> +defense=<jail_false_refusal | jail_strong_reject>
- Jailbreak (AegisLLM):
Note: To run this experiment in the `dspy-json` mode, you must first optimize the DSPy pipeline (Orchestrator and Evaluator) using the following script:
python3 optimize_jailbreak_dspy.py +data=<false_refusal | strong_reject> +defense=<jail_false_refusal | jail_strong_reject>
Currently, both options for `data` and `defense` perform the same optimization, so you can choose either one, and you only need to run the optimization once. The optimized pipeline is saved in the `dspy_optimized_dir` directory specified in `configs/base_jail.yaml`. In future development, these optimization configs could be separated for each data and defense combination.
Once the optimization is done, run the experiment with the following script (a limited set of modes is currently supported); an end-to-end sketch chaining these commands follows below.
python3 run_jailbreak_dspy.py +data=<false_refusal | strong_reject> +defense=<jail_false_refusal | jail_strong_reject> mode=<dspy-base | dspy-json>
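Putting the pieces together, a typical `dspy-json` run simply chains the documented commands in order: optimize once, then run both benchmarks. The concrete data/defense choices below are picked for illustration only.

```bash
# One-time DSPy optimization (either data/defense pair works; see the note above).
python3 optimize_jailbreak_dspy.py +data=strong_reject +defense=jail_strong_reject

# Run the optimized AegisLLM pipeline on both jailbreak benchmarks.
python3 run_jailbreak_dspy.py +data=strong_reject +defense=jail_strong_reject mode=dspy-json
python3 run_jailbreak_dspy.py +data=false_refusal +defense=jail_false_refusal mode=dspy-json
```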
After running the `false_refusal` or `strong_reject` scripts for the jailbreak experiments, you can evaluate the results using the evaluation script for the corresponding benchmark. Each of the scripts in the Jailbreak Defenses section provides an output JSON/JSONL file path, which you pass as input to the evaluation scripts below; illustrative invocations of both evaluation scripts are sketched at the end of this section.
- False Refusals (PHTest)
For this script, you need to provide the input path of the JSON/JSONL file generated by the experiment run, the judge model name and API access information, and an output path of your choice where the resulting evaluation information is saved. The false_refusal scores are printed to your shell after the run completes. Note that the script supports extra input parameters, which you can inspect with the `--help` flag.
python3 evals/false_refusal/eval_false_refusal.py --input-path <input_path> --output-path <arbitrary_output_path> --model-name <model_name>
- Strong Reject
Similar to the False Refusals evaluation, you need to provide the input path of the JSON/JSONL file generated by the experiment run and an output path of your choice where the resulting evaluation information is saved. The strong_reject scores are printed to your shell after the run completes. Note that the script supports extra input parameters, which you can inspect with the `--help` flag.
python3 evals/strong_reject/eval_strong_reject.py --input-path <input_path> --output-path <arbitrary_output_path>
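For illustration, concrete invocations might look like the following. The input/output paths and the judge model name are hypothetical placeholders; substitute the actual JSON/JSONL path produced by your run and a judge model available through your configured provider.

```bash
# Hypothetical paths and model name -- replace with your own run outputs and judge model.
python3 evals/false_refusal/eval_false_refusal.py \
  --input-path outputs/false_refusal_run.jsonl \
  --output-path outputs/false_refusal_eval.json \
  --model-name gpt-4o

python3 evals/strong_reject/eval_strong_reject.py \
  --input-path outputs/strong_reject_run.jsonl \
  --output-path outputs/strong_reject_eval.json
```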
This project is licensed under the MIT License. See the `LICENSE` file for more details.
Please cite our paper:
@article{cai2025aegisllmscalingagenticsystems,
title={AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security},
author={Zikui Cai and Shayan Shabihi and Bang An and Zora Che and Brian R. Bartoldson and Bhavya Kailkhura and Tom Goldstein and Furong Huang},
year={2025},
eprint={2504.20965},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.20965},
}