We propose the Alignment Tipping Process (ATP), a critical post-deployment risk specific to self-evolving LLM agents. ATP describes how continual real-world interaction can cause agents to gradually abandon their initial alignment constraints in favor of self-interested, reward-maximizing behaviors. We formalize ATP through two complementary paradigms: Self-Interested Exploration, which captures individual behavioral drift induced by repeated high-reward deviations, and Imitative Strategy Diffusion, which models the spread of deviant strategies across multi-agent systems. Experimental results demonstrate that alignment degrades rapidly during self-evolution, with aligned models converging toward unaligned states. Current reinforcement learning-based alignment techniques offer only fragile protection against this tipping process, revealing that model alignment is dynamic and vulnerable to feedback-driven decay.
We provide training and testing workflows for Self-Interested Exploration (Role-play Scenario) and Imitative Strategy Diffusion in this repository.
cd src
conda create -n atp python=3.12 -y
conda activate atp
bash install.sh
Notes: install.sh installs flash-attn by default. Make sure your CUDA toolkit version is above 12; otherwise flash-attn cannot be installed properly. You can also run the workflow without flash-attn (which is only used in GRPO training) by commenting out line 22 in config/sa_llm_grpo.yaml and config/ma_llm_grpo.yaml.
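A quick sanity check before running the installer (nvcc ships with the CUDA toolkit; this check is a convenience and not part of install.sh):

```bash
# Confirm the CUDA toolkit version that flash-attn will build against (should be above 12)
nvcc --version
# If it is older, run the workflow without flash-attn by commenting out line 22
# in config/sa_llm_grpo.yaml and config/ma_llm_grpo.yaml as described above
```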
Scripts use Hydra for configuration. Always run from the src directory.
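The commands below illustrate the two Hydra mechanisms used throughout this README; the specific override values shown here are the ones documented in the sections that follow:

```bash
cd src
# Swap the whole config file with --config-name (e.g., the MA GRPO setup)
python scripts/llm_grpo.py --config-name ma_llm_grpo
# Override individual config groups or values with key=value
python scripts/dpo_sa_gen.py model=llama31
```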
- Files containing sa correspond to Self-Interested Exploration (Role-play Scenario).
- Files containing ma correspond to Imitative Strategy Diffusion.
src/scripts/dpo_sa_gen.py # Collect SA DPO raw pairs
src/scripts/dpo_ma_gen.py # Collect MA DPO raw pairs
src/scripts/convert_dpo.py # Convert raw pairs to LlamaFactory DPO preference format
src/scripts/llm_grpo.py # GRPO training (default SA, switch to MA via --config-name ma_llm_grpo)
src/scripts/test_sa.py # SA evaluation
src/scripts/test_ma.py # MA evaluation
Default config: src/config/dpo_sa_gen.yaml (dataset: src/config/dataset/atp_sa_llm.yaml, model: src/config/model/qwen3_8B.yaml).
cd src
python scripts/dpo_sa_gen.py
Hydra override example (using Llama 3.1):
python scripts/dpo_sa_gen.py model=llama31
Output: results/sa_llm/dpo_train_data/<model_name>/<timestamp>.json
cd src
python scripts/convert_dpo.py ../results/sa_llm/dpo_train_data/<model_name>/<timestamp>.json -o ../data/sa_llm/dpo_train_data/<model_name>.json
Follow the LLaMA-Factory instructions to install its environment.
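A minimal install sketch, assuming LLaMA-Factory's standard setup (its own README is the authoritative reference):

```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```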
Append the generated data's info to LLaMA-Factory's data/dataset_info.json.
For example:
{
...,
"arc_dpo_llama_sa_llm": {
"file_name": ".../ATP/data/sa_llm/dpo_train_data/<model_name>.json",
"ranking": true,
"columns": {
"prompt": "instruction",
"query": "input",
"chosen": "chosen",
"rejected": "rejected"
}
}
}
Then you can use the config files in ATP/llama-factory-config.
For example:
cd LLaMA-Factory
llamafactory-cli train .../ATP/llama-factory-config/llama3_lora_dpo_arc.yaml
Default config: src/config/sa_llm_grpo.yaml. Trains Qwen3-8B by default.
cd src
python scripts/llm_grpo.py
Override training parameters (e.g., to train Llama 3.1-8B):
python scripts/llm_grpo.py model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"
Note: If you used DeepSpeed during GRPO training (enabled by default), run zero_to_fp32.py in src/training_output/<checkpoint_dir> to convert the weights to Hugging Face format.
python zero_to_fp32.py <checkpoint_dir> <output_dir> --safe_serialization
Fill the trained model paths into the corresponding model configs, then run the test.
- DPO (LoRA): edit src/config/model/qwen3_8B_sa_dpo.yaml or src/config/model/llama31_sa_dpo.yaml and set
  lora_path: <your_dpo_lora_path>
- GRPO: edit src/config/model/qwen3_8B_sa_grpo.yaml or src/config/model/llama31_sa_grpo.yaml and set
  model_id: <your_grpo_ckpt_path>
- Switch the default model in src/config/test_sa.yaml or use Hydra overrides (example: switch to DPO):
  defaults:
    - model: qwen3_8B_sa_dpo
    - _self_

Run the test:
cd ATP/src
python scripts/test_sa.py
# or override model from CLI
python scripts/test_sa.py model=qwen3_8B_sa_dpo
Results (editable in yaml): results/sa_llm/<model_name>-r<rounds>/
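For instance, to run the SA evaluation with the GRPO-tuned Qwen3-8B config listed above and then inspect the output directory:

```bash
python scripts/test_sa.py model=qwen3_8B_sa_grpo
ls ../results/sa_llm/
```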
Default config: src/config/dpo_ma_gen.yaml (dataset: src/config/dataset/atp_ma_llm.yaml).
cd src
python scripts/dpo_ma_gen.py
Output: results/ma_llm/dpo_train_data/<timestamp>.json
python scripts/convert_dpo.py ../results/ma_llm/dpo_train_data/<timestamp>.json -o ../data/ma_llm/dpo_train_data/ma_dpo.json
Same as the Self-Interested Exploration (Role-play Scenario) section above. Remember to use the converted data data/ma_llm/dpo_train_data/ma_dpo.json to train DPO.
Same as the Self-Interested Exploration (Role-play Scenario) section above, except switch to the Imitative Strategy Diffusion GRPO config via --config-name ma_llm_grpo:
cd src
python scripts/llm_grpo.py --config-name ma_llm_grpo
Fill the trained model paths into the corresponding model configs, then run the test.
- DPO (LoRA): edit src/config/model/qwen3_8B_ma_dpo.yaml and set
  lora_path: <your_dpo_lora_path>
- GRPO: edit src/config/model/qwen3_8B_ma_grpo.yaml and set
  model_id: <your_grpo_ckpt_path>
- Switch the default model in src/config/test_ma.yaml or use Hydra overrides (example: switch to GRPO):
  defaults:
    - model: qwen3_8B_ma_grpo
    - _self_

Run the test:
cd src
python scripts/test_ma.py
# or override model from CLI
python scripts/test_ma.py model=qwen3_8B_ma_grpo
Results: results/ma_llm/<model_name>-r<max_rounds>-n<n_agents>/
test_ma.yaml also defines the evaluation grid via thresholds_abs and reward_sets. The defaults reproduce the paper's settings.
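To see the composed evaluation grid before editing it, Hydra's standard --cfg flag prints the resolved config and exits without running the evaluation:

```bash
cd src
# Print the composed test_ma config (including thresholds_abs and reward_sets), then exit
python scripts/test_ma.py --cfg job
```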
Datasets: src/config/dataset/atp_sa_llm.yaml, src/config/dataset/atp_ma_llm.yaml
Rewards: src/config/reward/atp_sa.yaml, src/config/reward/atp_ma.yaml
Training: src/config/train/grpo.yaml, src/config/sa_llm_grpo.yaml, src/config/ma_llm_grpo.yaml, llama-factory-config/*.yaml
Models: src/config/model/qwen3_8B.yaml and *_sa_dpo.yaml/*_sa_grpo.yaml/*_ma_dpo.yaml/*_ma_grpo.yaml
Testing: src/config/test_sa.yaml, src/config/test_ma.yaml
If you use this repository or results, please cite the paper:
@article{han2025alignment,
title={Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails},
author={Han, Siwei and Liu, Jiaqi and Su, Yaofeng and Duan, Wenbo and Liu, Xinyuan and Xie, Cihang and Bansal, Mohit and Ding, Mingyu and Zhang, Linjun and Yao, Huaxiu},
journal={arXiv preprint arXiv:2510.04860},
year={2025}
}