- Introduction
- Step 0 - Environment Setup
- Step 1 - Generating LLM Annotations
- Step 2 - Model Training
- Step 3 - Automatic Evaluation
Sotopia-RL is an utterance-level, attribution-based, and multi-dimensional social reward design method, trained using single-turn online RL. It achieves state-of-the-art performance on social goal completion tasks in the SOTOPIA benchmark.
We first attribute episode-level rewards for multi-turn social interactions to individual utterances with LLMs. Then, we construct a combined reward that includes multiple dimensions of rewards besides goal completion, allowing us to regularize the optimization process for goal completion. These rewards are used to guide the RL training of social agents.
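As an illustrative sketch only (the dimension names and weights below are placeholders, not the exact configuration used in the paper), the per-utterance training signal can be viewed as a weighted combination of the attributed goal-completion score with the auxiliary reward dimensions:
# Illustrative sketch: combine attributed per-utterance reward dimensions
# into a single scalar. Dimension names and weights are placeholders.
def combined_reward(attributed: dict[str, float], weights: dict[str, float]) -> float:
    return sum(weights.get(dim, 0.0) * score for dim, score in attributed.items())

# Example: one utterance's attributed scores across dimensions.
utterance_scores = {"goal_completion": 0.8, "relationship": 0.5, "knowledge": 0.2}
weights = {"goal_completion": 1.0, "relationship": 0.3, "knowledge": 0.3}
print(combined_reward(utterance_scores, weights))  # 1.01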
@misc{yu2025sotopiarlrewarddesignsocial,
title={Sotopia-RL: Reward Design for Social Intelligence},
author={Haofei Yu and Zhengyang Qi and Yining Zhao and Kolby Nottingham and Keyang Xuan and Bodhisattwa Prasad Majumder and Hao Zhu and Paul Pu Liang and Jiaxuan You},
year={2025},
eprint={2508.03905},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.03905}
}
We recommend using conda to manage a clean Python environment for sotopia-rl.
conda create -n sotopia-rl python=3.10
conda activate sotopia-rl
curl -sSL https://install.python-poetry.org | python3
export PATH="$HOME/.local/bin:$PATH"
poetry install
A Redis database needs to be set up for loading and saving environment/session data before running this repo. For detailed instructions on setting up the Redis database, please refer to this tutorial. Make sure to set the Redis OM URL in the conda environment:
conda env config vars set REDIS_OM_URL="redis://:PASSWORD@server_name:port_num"
Set the OpenAI API key:
conda env config vars set OPENAI_API_KEY="<your_openai_api_key>"
After setting these environment variables, run conda deactivate && conda activate sotopia-rl to apply them.
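To sanity-check these variables before moving on, a quick Python snippet such as the following can be used (it assumes the redis package is available in the environment, which it should be as a dependency of redis-om):
import os
import redis

# Ping the Redis instance referenced by REDIS_OM_URL.
client = redis.from_url(os.environ["REDIS_OM_URL"])
print("Redis reachable:", client.ping())

# Confirm the OpenAI key is visible to the process.
print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))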
To generate LLM annotations, you need to download the original sotopia-pi episodes file from the Hugging Face repository and place it in the data folder. Then run the following command:
cd scripts/annotate
python process_sotopia_pi.py --data_dir /workspace/sotopia-rl/data --input_file sotopia_pi_episodes.jsonl --output_file sotopia_pi_bc_episodes.jsonl
This will generate a new file sotopia_pi_bc_episodes.jsonl in the data folder.
Then run the following command to generate the LLM annotations:
python sample_episodes_and_annotate.py --data_dir /workspace/sotopia-rl/data --llm_name gpt-4o --input_file sotopia_pi_bc_episodes.jsonl --output_file sotopia_pi_bc_episodes_annotated.jsonl
The annotations then need to be further processed into the format required by the training scripts. This can be done by running the following command:
cd ../data_process
python process_annotation_direct_attribution.py --data_dir /workspace/sotopia-rl/data --input_file sotopia_pi_bc_episodes_annotated.jsonl --reward_output_file sotopia_pi_bc_episodes_reward.json --grpo_output_file sotopia_pi_bc_episodes_grpo.json
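To spot-check the processed files before training, you can load them and count the records; the sketch below assumes the output files are standard JSON arrays and does not depend on any particular field names:
import json
import os

data_dir = "/workspace/sotopia-rl/data"
for name in ("sotopia_pi_bc_episodes_reward.json", "sotopia_pi_bc_episodes_grpo.json"):
    with open(os.path.join(data_dir, name)) as f:
        records = json.load(f)
    print(f"{name}: {len(records)} records")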
Before training, make sure you have configured Accelerate correctly. Save your configurations in the following files under scripts/:
accelerate_config_sft.yaml
accelerate_config_rm.yaml
accelerate_config_grpo.yaml
All training scripts should be run from the scripts/ directory.
We first collect self-play dialogue data using GPT-4o in the SOTOPIA environment. Then we fine-tune a Qwen2.5-7B-Instruct model using supervised fine-tuning (SFT) with LoRA.
Conversation data can be collected via Redis.
Use scripts/data_process/serialize.py to extract and serialize the logs.
To load data from Redis and save it as .jsonl:
from sotopia.database.logs import EpisodeLog
from sotopia.database.serialization import episodes_to_jsonl  # export helper (module path may differ across sotopia versions)

# Fetch one or more episodes by primary key and export them to a .jsonl file.
episodes = [EpisodeLog.get(pk="xxxxx")]
episodes_to_jsonl(episodes, "xxxxx.jsonl")
This command performs supervised fine-tuning using LoRA on Qwen2.5-7B-Instruct; --use_lora enables parameter-efficient fine-tuning, and --evaluation_steps 5 runs evaluation every 5 training steps.
export MODEL_PATH="<your_model_path>"
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 accelerate launch \
--config_file ./accelerate_config_sft.yaml \
--main_process_port 29512 \
./train_sft.py \
--model_name $MODEL_PATH \
--learning_rate 1e-4 \
--max_length 4096 \
--train_batch_size 2 \
--val_batch_size 1 \
--accumulation_steps 8 \
--num_epochs 500 \
--use_lora \
--evaluation_steps 5 \
--sft_data_path ../data/sft_data_path.json \
--template_path ../evals/qwen2.5-7b.jinja \
--checkpoint_dir ../sft_checkpoints_qwen2.5-7b
This command trains the utterance-level reward model on Qwen2.5-7B-Instruct using the processed annotation data.
export MODEL_PATH="<your_model_path>"
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 accelerate launch \
--config_file ./accelerate_config_rm.yaml \
--main_process_port 29500 \
./train_rm.py \
--model_name $MODEL_PATH \
--learning_rate 1e-5 \
--max_length 4096 \
--train_batch_size 1 \
--val_batch_size 1 \
--accumulation_steps 8 \
--num_epochs 30 \
--evaluation_steps 50 \
--reward_data_path ../data/rm_data_path.json \
--template_path ../evals/qwen2.5-7b.jinja \
--checkpoint_dir ../rm_checkpoints_qwen2.5-7b
We now use the behavior cloning model to generate self-play dialogues and train the agent using GRPO (Group Relative Policy Optimization).
Conversation data can be collected via Redis after running self-play with the behavior cloning model.
Use scripts/data_process/serialize.py to extract and serialize the logs.
To load data from Redis and save it as .jsonl:
from sotopia.database.logs import EpisodeLog
from sotopia.database.serialization import episodes_to_jsonl  # export helper (module path may differ across sotopia versions)

# Fetch one or more episodes by primary key and export them to a .jsonl file.
episodes = [EpisodeLog.get(pk="xxxxx")]
episodes_to_jsonl(episodes, "xxxxx.jsonl")
export MODEL_PATH="<your_model_path>"
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 accelerate launch \
--config_file ./accelerate_config_grpo.yaml \
--main_process_port 29511 \
./train_grpo.py \
--model_name $MODEL_PATH \
--policy_adapter_path ../sft_checkpoints_qwen2.5-7b/best-checkpoints \
--reward_adapter_path ../rm_checkpoints_qwen2.5-7b/best-checkpoints \
--learning_rate 5e-6 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--grpo_data_path ../data/grpo_data_path.json \
--template_path ../evals/qwen2.5-7b.jinja \
--num_grpo_epochs 2 \
--use_lora_train_grpo \
--num_generations 16 \
--output_dir ../grpo_checkpoints
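For intuition about what GRPO optimizes: each of the --num_generations sampled responses is scored by the reward model and compared against the other responses in its group. The following is a schematic of the group-relative advantage computation, not the trainer's actual code:
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # Normalize each sampled response's reward against its group's mean and std.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# Toy reward-model scores for a few sampled responses to the same prompt.
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))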
We first deploy the SFT and GRPO models with vLLM and the reward model with Django, then evaluate the models using the SOTOPIA-EVAL framework.
For details please see this section.
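As a minimal sketch of the deployment side (assuming a vLLM server has already been started with its OpenAI-compatible endpoint on localhost:8000 and is serving the GRPO checkpoint under a placeholder model name), an agent response can be queried like this:
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the base_url and model name here are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="qwen2.5-7b-grpo",
    messages=[{"role": "user", "content": "You are at a networking event. Introduce yourself and work toward your social goal."}],
    max_tokens=256,
)
print(response.choices[0].message.content)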