Alignment Tipping Process (ATP)

Overview

We propose the Alignment Tipping Process (ATP), a critical post-deployment risk specific to self-evolving LLM agents. ATP describes how continual real-world interaction can cause agents to gradually abandon their initial alignment constraints in favor of self-interested, reward-maximizing behaviors. We formalize ATP through two complementary paradigms: Self-Interested Exploration, which captures individual behavioral drift induced by repeated high-reward deviations, and Imitative Strategy Diffusion, which models the spread of deviant strategies across multi-agent systems. Experimental results demonstrate that alignment degrades rapidly during self-evolution, with aligned models converging toward unaligned states. Current reinforcement learning–based alignment techniques offer only fragile protection against this tipping process, revealing that model alignment is dynamic and vulnerable to feedback-driven decay.


We provide training and testing workflows for Self-Interested Exploration (Role-play Scenario) and Imitative Strategy Diffusion in this repository.

Environment Setup

cd src
conda create -n atp python=3.12 -y
conda activate atp
bash install.sh

Note: install.sh installs flash-attn by default. Make sure your CUDA toolkit version is 12 or above; otherwise flash-attn cannot be installed properly. You can also run the workflow without flash-attn (it is only used in GRPO training) by commenting out line 22 in config/sa_llm_grpo.yaml and config/ma_llm_grpo.yaml.
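
As a quick sanity check before deciding whether to keep flash-attn, you can print which CUDA version your PyTorch build targets. This is an illustrative sketch only (it assumes PyTorch is already installed in the atp environment):

import torch  # assumes PyTorch is already installed in the atp environment

cuda_version = torch.version.cuda  # CUDA version of the PyTorch build, e.g. "12.4"; None for CPU-only builds
if cuda_version is None:
    print("CPU-only PyTorch build detected; skip flash-attn by commenting it out in the GRPO configs.")
elif int(cuda_version.split(".")[0]) < 12:
    print(f"CUDA {cuda_version} is below 12; flash-attn is unlikely to build, so consider skipping it.")
else:
    print(f"CUDA {cuda_version} detected; flash-attn should install normally.")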

Directory and Naming Conventions

Scripts use Hydra for configuration. Always run from the src directory.

  • Files containing sa correspond to Self-Interested Exploration (Role-play Scenario).
  • Files containing ma correspond to Imitative Strategy Diffusion.
src/scripts/dpo_sa_gen.py     # Collect SA DPO raw pairs
src/scripts/dpo_ma_gen.py     # Collect MA DPO raw pairs
src/scripts/convert_dpo.py    # Convert raw pairs to LLaMA-Factory DPO preference format
src/scripts/llm_grpo.py       # GRPO training (default SA, switch to MA via --config-name ma_llm_grpo)
src/scripts/test_sa.py        # SA evaluation
src/scripts/test_ma.py        # MA evaluation
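
All of these scripts follow the same Hydra pattern, which is why the CLI overrides shown throughout this README (model=llama31, --config-name ma_llm_grpo, etc.) work uniformly. A minimal illustrative sketch of that pattern (not the actual script code; the config_path and config_name values are assumptions based on the layout above):

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="../config", config_name="dpo_sa_gen", version_base=None)
def main(cfg: DictConfig) -> None:
    # Every config field can be overridden from the CLI, e.g. `model=llama31`,
    # and the whole config can be swapped with `--config-name ma_llm_grpo`.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()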

Self-Interested Exploration (Role-play Scenario)

Data Collection (DPO raw samples)

Default config: src/config/dpo_sa_gen.yaml (dataset: src/config/dataset/atp_sa_llm.yaml, model: src/config/model/qwen3_8B.yaml).

cd src
python scripts/dpo_sa_gen.py

Hydra override example (using Llama 3.1):

python scripts/dpo_sa_gen.py model=llama31

Output: results/sa_llm/dpo_train_data/<model_name>/<timestamp>.json

Convert to LLaMA-Factory DPO format

cd src
python scripts/convert_dpo.py ../results/sa_llm/dpo_train_data/<model_name>/<timestamp>.json -o ../data/sa_llm/dpo_train_data/<model_name>.json
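
A quick way to spot-check the converted file is to confirm each record carries the fields referenced by the dataset registration below (instruction, input, chosen, rejected). A minimal sketch, assuming the converter writes a JSON list of records:

import json

# Substitute your actual model name; the path matches the -o argument above.
with open("../data/sa_llm/dpo_train_data/<model_name>.json") as f:
    records = json.load(f)

print(len(records), "preference pairs")
print(sorted(records[0].keys()))  # expected: ['chosen', 'input', 'instruction', 'rejected']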

Train DPO with LLaMA-Factory

Follow the LLaMA-Factory instructions to install its environment.

Add an entry for the generated data to LLaMA-Factory's data/dataset_info.json. For example:

{
  ...,
  "arc_dpo_llama_sa_llm": {
    "file_name": ".../ATP/data/sa_llm/dpo_train_data/<model_name>.json",
    "ranking": true,
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "chosen": "chosen",
      "rejected": "rejected"
    }
  }
}

Then you can use the config files in ATP/llama-factory-config. For example:

cd LLaMA-Factory
llamafactory-cli train .../ATP/llama-factory-config/llama3_lora_dpo_arc.yaml

Training (GRPO)

Default config: src/config/sa_llm_grpo.yaml (trains Qwen3-8B).

cd src
python scripts/llm_grpo.py

Override training parameters (e.g., to train Llama-3.1-8B):

python scripts/llm_grpo.py model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"

Note: If you used DeepSpeed during GRPO training (enabled by default), run zero_to_fp32.py in src/training_output/<checkpoint_dir> to convert the weights to Hugging Face format.

python zero_to_fp32.py <checkpoint_dir> <output_dir> --safe_serialization
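
To confirm the converted checkpoint loads as a standard Hugging Face model, a minimal loading sketch (assuming <output_dir> is the directory produced above; copy the tokenizer files from the base model if they are missing):

from transformers import AutoModelForCausalLM, AutoTokenizer

output_dir = "<output_dir>"  # directory produced by zero_to_fp32.py
model = AutoModelForCausalLM.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)  # requires tokenizer files alongside the weights
print(model.config.architectures)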

Testing

Fill in the trained model paths in the corresponding model configs, then run the test.

  1. DPO (LoRA) — edit src/config/model/qwen3_8B_sa_dpo.yaml or src/config/model/llama31_sa_dpo.yaml:
     lora_path: <your_dpo_lora_path>
  2. GRPO — edit src/config/model/qwen3_8B_sa_grpo.yaml or src/config/model/llama31_sa_grpo.yaml:
     model_id: <your_grpo_ckpt_path>
  3. Switch the default model in src/config/test_sa.yaml, or use Hydra overrides (example: DPO):
     defaults:
       - model: qwen3_8B_sa_dpo
       - _self_

Run the test:

cd ATP/src
python scripts/test_sa.py

# or override model from CLI
python scripts/test_sa.py model=qwen3_8B_sa_dpo

Results (output path configurable in the yaml): results/sa_llm/<model_name>-r<rounds>/


Imitative Strategy Diffusion

Data Collection (DPO raw samples)

Default config: src/config/dpo_ma_gen.yaml (dataset: src/config/dataset/atp_ma_llm.yaml).

cd src
python scripts/dpo_ma_gen.py

Output: results/ma_llm/dpo_train_data/<timestamp>.json

Convert to LLaMA-Factory DPO format

python scripts/convert_dpo.py ../results/ma_llm/dpo_train_data/<timestamp>.json -o ../data/ma_llm/dpo_train_data/ma_dpo.json

Training (DPO)

Same as the Self-Interested Exploration (Role-play Scenario) section above. Remember to use the converted data at data/ma_llm/dpo_train_data/ma_dpo.json to train DPO.

Training (GRPO)

Same as the Self-Interested Exploration (Role-play Scenario) section above; just switch to the Imitative Strategy Diffusion GRPO config via --config-name ma_llm_grpo:

cd src
python scripts/llm_grpo.py --config-name ma_llm_grpo

Testing

Fill in the trained model paths in the corresponding model configs, then run the test.

  1. DPO (LoRA) — edit src/config/model/qwen3_8B_ma_dpo.yaml:
     lora_path: <your_dpo_lora_path>
  2. GRPO — edit src/config/model/qwen3_8B_ma_grpo.yaml:
     model_id: <your_grpo_ckpt_path>
  3. Switch the default model in src/config/test_ma.yaml, or use Hydra overrides (example: GRPO):
     defaults:
       - model: qwen3_8B_ma_grpo
       - _self_

Run the test:

cd src
python scripts/test_ma.py

# or override model from CLI
python scripts/test_ma.py model=qwen3_8B_ma_grpo

Results: results/ma_llm/<model_name>-r<max_rounds>-n<n_agents>/

test_ma.yaml also defines the evaluation grid via thresholds_abs and reward_sets. The defaults reproduce the paper's settings.
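
To inspect the grid without editing the file, you can load and print the relevant fields. A minimal sketch, assuming the keys sit at the top level of test_ma.yaml and that you run it from src:

from omegaconf import OmegaConf

cfg = OmegaConf.load("config/test_ma.yaml")  # run from the src directory
print("thresholds_abs:", cfg.get("thresholds_abs"))
print("reward_sets:", cfg.get("reward_sets"))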


Configuration Reference (aligns with the paper)

Datasets:   src/config/dataset/atp_sa_llm.yaml, src/config/dataset/atp_ma_llm.yaml
Rewards:    src/config/reward/atp_sa.yaml, src/config/reward/atp_ma.yaml
Training:   src/config/train/grpo.yaml, src/config/sa_llm_grpo.yaml, src/config/ma_llm_grpo.yaml, llama-factory-config/*.yaml
Models:     src/config/model/qwen3_8B.yaml and *_sa_dpo.yaml/*_sa_grpo.yaml/*_ma_dpo.yaml/*_ma_grpo.yaml
Testing:    src/config/test_sa.yaml, src/config/test_ma.yaml

Citation

If you use this repository or results, please cite the paper:

@article{han2025alignment,
  title={Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails},
  author={Han, Siwei and Liu, Jiaqi and Su, Yaofeng and Duan, Wenbo and Liu, Xinyuan and Xie, Cihang and Bansal, Mohit and Ding, Mingyu and Zhang, Linjun and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2510.04860},
  year={2025}
}
