
DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments

"The journey of strategic games is endless, where agents evolve to master the art of decision-making. This release represents our first step towards that vision."

📄 Paper • 🌐 Project Page • 🎮 Game Wiki • 💻 Code

📖 Introduction

DSGBench is a novel strategic game benchmark designed to evaluate the performance of LLM-based agents in strategic planning, real-time decision making, adaptability, and multi-agent interactions. The benchmark encompasses six high-dynamics, complex strategy games, including StarCraft II, Civilization, Street Fighter III, and others. It provides fine-grained metrics for comprehensive evaluation, along with detailed decision trajectory analysis and trajectory datasets.

DSGBench Overview

🌟 Highlights

  • Strategic Games: DSGBench fills an important gap in evaluating agents in complex strategic environments. It reflects the agent's overall capabilities in real-world applications, beyond just individual skill tests.

  • Scenario Diversity: DSGBench offers a variety of scenario settings, allowing comprehensive testing of agent adaptability and generalization. An ideal agent should excel in diverse environments and demonstrate cross-scenario generalization rather than optimizing for a single task.

  • Fine-grained Evaluation Metrics and Decision Trajectory Analysis: DSGBench employs detailed evaluation metrics that cover strategic planning, real-time decision-making, adaptability, and multi-agent interactions. These fine-grained metrics provide deeper quantitative analysis of agent performance, offering insights into strengths and weaknesses across different dimensions, as opposed to relying solely on traditional win-rate metrics.

๐ŸŒ Environment Overview

Game Environments

The evaluation environment for this project consists of six dynamic, complex strategy games.

Environment | Imperfect Info | Strategic & Tactical | Dynamic Space | Real-time vs. Turn-based | More Info
--- | --- | --- | --- | --- | ---
StarCraft II | ✔ | ✔ | ✔ | Real-time | scene setting, prompt
Civilization | ✔ | ✔ | ✔ | Turn-based | scene setting, prompt
Street Fighter III | ✘ | ✘ | ✘ | Real-time | scene setting, prompt
Diplomacy | ✘ | ✔ | ✘ | Turn-based | scene setting, prompt
Werewolf | ✔ | ✔ | ✘ | Turn-based | scene setting, prompt, visual
Stratego | ✔ | ✔ | ✘ | Turn-based | scene setting, prompt

Repository Structure

games: This is the game environment directory, which contains the 6 games, including StarCraft II, Civilization, Street Fighter III, etc. Each game uses the OpenAI Gym interface; its main functions are as follows:

step(action): Updates the environment with an action, returns the next agent observation, the reward for taking that action, whether the environment terminated or truncated due to the latest action, and information about the step in the environment, i.e. metrics and debug information.
reset(): Resets the environment to its initial state, needed before calling step. Returns the first agent observation for an episode and information.
render(): Renders the environment to help visualize what the agent sees, example modes are "human", "rgb_array", "ansi" (for text).
close(): Closes the game environment.
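For illustration, a minimal interaction loop over this Gym-style interface might look like the sketch below (make_env and agent_policy are hypothetical placeholders; each game defines its own observation and action formats):

# illustrative sketch only: make_env() and agent_policy() are hypothetical placeholders
env = make_env()                      # construct a game environment, e.g. one of the six games above
obs, info = env.reset()               # reset must be called before the first step
done = False
while not done:
    action = agent_policy(obs)        # the LLM-based agent picks an action from the observation
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated    # episode ends when terminated or truncated
env.close()                           # release the game environment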

agent_manager: This is the agent management folder, which implements the agent, prompt, and LLM interfaces of various games.

configs: This is the configuration folder, used to configure the evaluation game scene parameters and the LLM parameters.

agent_eval: This is the evaluation folder, where the evaluation metrics are calculated.

utils: This folder contains common utility and logging functions.

create_yaml.py: This script generates specific configurations for multiple LLMs based on the basic configuration under configs/eval_config_base.

multiprocess_eval_tasks.py: the entry script for running the evaluation.

🚀 Quick Start


Step 1: Prerequisites

  • Operating System: Windows 11 with WSL. Windows is required because Blizzard has not released the latest StarCraft II client for Linux; WSL is used to run Civilization and Street Fighter III.
  • Docker management: Docker Desktop, which provides a simple and intuitive way to install and use Docker on Windows, allowing you to easily build, test, and run containers.
  • Python: Python 3.9.6. Dependencies can be installed by running:
# create virtual environment
conda create --name dsgbench python=3.9.6 
conda activate dsgbench
cd DSGBench

# install PromptCoder
git clone https://github.com/dhh1995/PromptCoder
cd PromptCoder
pip install -e . -i https://pypi.tuna.tsinghua.edu.cn/simple
cd ..

# install other PyPI dependencies
# note: if installation is slow, you can use the Tsinghua mirror (`-i https://pypi.tuna.tsinghua.edu.cn/simple`) to speed it up
pip install -r pip_requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

# reinstall websocket and websocket-client
pip uninstall websocket
pip uninstall websocket-client
pip install websocket==0.2.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install websocket-client==1.8.0 -i https://pypi.tuna.tsinghua.edu.cn/simple

Step 2: Game Setup

This section details the setup for the required games and game environments.

👉 StarCraft II

1. Install StarCraft II. StarCraft II is a classic game developed by Blizzard. You can download Battle.net from the blizzard app page.

For players in mainland China, the Chinese servers are no longer available, so you will need to download StarCraft II through other channels; see this video: video, or search online for instructions.

2. Maps: copy the map file Ancient Cistern LE.SC2Map (games/starcraft2/Maps/Ancient Cistern LE.SC2Map) into the StarCraft II\Maps folder of your StarCraft II installation (if the Maps folder does not exist, create it).

👉 Civilization

Civilization requires an additional Docker image as its game engine:

# 1. Pull the Civilization game engine image
docker pull akon123456/freeciv-web
docker tag akon123456/freeciv-web:latest freeciv/freeciv-web:latest
# Start the freeciv-web service in the conda environment:
conda activate dsgbench
start_freeciv_web_service
# to stop the freeciv-web service:
stop_freeciv_web_service

👉 Street Fighter III

Street Fighter III is a classic fighting game series developed by CAPCOM. In this project, we adopt DIAMBRA Arena as the game environment.

# get the DIAMBRA environment
docker pull diambra/engine:v2.2

Download the ROM sfiii3n.zip manually and place it in the ./games/streetfight3/roms directory.

👉 Diplomacy

Diplomacy: to run experiments with exploiter agents, you must first download the necessary weights. Manually download fppi2_params.npz (the link starts the download) and place it in the ./games/welfare_diplomacy/welfare_diplomacy_baselines/network_parameters directory.

👉 Werewolf and Stratego

These two games do not require any additional setup or configuration.

Step 3: Configuration

3.1 LLM API key

LLM configs are stored under configs/llm_configs/. This project currently supports API-based calling, so you need to configure your LLM API key first:

model_name: *****           # specific model name, like "gpt-3.5-turbo-0125"
api_key: sk-*****           # API key
api_url: https://****.com   # LLM API URL

You can configure additional parameters such as max_tokens, timeout, and temperature to tune the model's generation behavior; a sketch follows the model list below. Supported models:

  • llama-3.1-70b
  • deepseek-v25
  • gemini-1.5-flash
  • gpt-3.5
  • gpt-4o
  • gpt-4o-mini
  • o1-mini
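For example, an LLM config with these optional parameters might look like the following sketch (the values and the exact optional field names are assumptions; check the existing files under configs/llm_configs/ for the authoritative keys):

model_name: gpt-4o-mini      # specific model name
api_key: sk-*****            # API key
api_url: https://****.com    # LLM API URL
max_tokens: 1024             # assumed optional field: max tokens per response
temperature: 0.7             # assumed optional field: sampling temperature
timeout: 120                 # assumed optional field: request timeout in seconds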

3.2 Wandb key

We use W&B's tools to quickly track experiments and visualize results. This feature is useful for tracking large model output trajectories, but it is optional and can be configured based on your needs.

To configure your W&B key, modify the tasks_config.py file:

WANDB_API_KEY = "****your_wandb_key****"

Step 4: Configure & Run Tasks

Task Configuration Example

You can configure the tasks to be evaluated in the tasks_config.py file. For instance, if you want to evaluate a specific task, add the task and its corresponding configuration file:

"eval_deepseek-v25_stratego_scene1_deepseek-v25_vs_gpt-4o-mini_random": "eval_deepseek-v25_stratego_scene1_deepseek-v25_vs_gpt-4o-mini_random.yaml"

This task will evaluate a Stratego match between two agents: a deepseek-v25 agent and a gpt-4o-mini agent. The configuration for this task includes settings for the agent models, the number of matches to be played, the output path, and the game settings.

agent:
- agent_model: LLMModel
  agent_model_config: deepseek-v25.yaml
  agent_model_config_opp: gpt-4o-mini.yaml
  agent_model_opp: LLMModel
  agent_name: StrategoAgent
  agent_prompt: StategoPrompt
  agent_prompt_opp: StategoPrompt
eval:
  num_matches: 10
  output_path: ./output/deepseek-v25/stratego/scene1/deepseek-v25_vs_gpt-4o-mini_random
  weave_prj_name: eval_deepseek-v25_stratego_scene1_deepseek-v25_vs_gpt-4o-mini_random
game:
  game_init: random
  game_mode: agent_vs_agent
  game_name: StrategoMultiAgentEnv

Explanation of Variables:

  • agent: Defines the two agents involved in the task. The agent_model and agent_model_opp represent the models of the two agents. Each agent has a corresponding prompt file (e.g., StrategoPrompt) that provides the context for the agent's behavior.

  • eval: Specifies the number of matches (num_matches) to be played, where the results will be stored (output_path), and the name of the task (weave_prj_name) for tracking the evaluation.

  • game: Specifies the type of game to be used for evaluation. The game is initialized in a random state (game_init: random), and the mode is set to agent_vs_agent, meaning two agents will compete against each other.

Note: Some games, such as StarCraft II and Stratego, may require more time for evaluation. Please adjust your task selection based on the time you have available.

Run tasks: Once your tasks are configured, you can start running them with the following command:

diambra run --images.no-pull -r [absolute roms path of Street Fighter III] python multiprocess_eval_tasks.py
# example
# diambra run --images.no-pull -r D:\ubuntu\jinxin_work\Agent_Eval\llm-colosseum-main\llm-colosseum-main\roms python multiprocess_eval_tasks.py

Note: This command executes the evaluation tasks with the necessary ROMs for Street Fighter III. Make sure to replace [absolute roms path of Street Fighter III] with the actual path to your ROMs directory.

The run command will execute all the tasks configured in the tasks_config.py file, not just a single task. The reason the command starts with diambra run is due to compatibility with Street Fighter III, which requires this command structure. Make sure to configure your tasks properly before running.

Step 5: Calculate Model Ability Score

After completing the tasks, you can calculate the model's ability score by running the following command:

# install dsgbench model ability calculation library
pip install dsgbench-ability-calc==0.2.0
python calc_score.py

🚀 Start with Docker Compose

Step 1: Prerequisites

Download the DSGBench disk image files manually from the download link, then concatenate the data.img_* parts into a single data.img file.
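For example, on Linux or WSL (assuming the downloaded parts are named data.img_* as above):

cat data.img_* > data.img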

Note: The image size is over 60GB. Due to game environment compatibility issues, Linux image packaging is not currently supported. We are working to fix this and will provide Linux support as soon as possible.

Step 2: Directory Structure

|--dsgbench
    |--data                      #  the Windows Docker disk files
        |--data.img              #  the Windows system disk image
        |--windows.*             #  the Windows system disk files
    |--docker-compose.yml        #  the docker compose file
    |--.diambra                  #  the game files of Street Fighter III

Step 3: Prepare Docker Environment

cd dsgbench

# Get the Civilization game environment
docker pull akon123456/freeciv-web
docker tag akon123456/freeciv-web:latest freeciv/freeciv-web:latest

# Get the DIAMBRA environment
docker pull diambra/engine:v2.2

# Get the Windows environment
docker pull dockurr/windows:4.14

Step 4: Create Network and Launch

# Create DSG network
docker network create --subnet=172.22.0.0/24 dsg_network

# Launch Docker Compose
docker compose up -d

# Access via browser: http://localhost:8006

Step 5: Run Evaluation Tasks

# After opening a CMD window in the Windows environment
activate dsgbench
cd C:\deploy\DSGBench

# Run tasks
python multiprocess_eval_tasks.py

# Calculate model ability score
python calc_score.py

๐Ÿ—‚๏ธ Result Structure

The results include two main types: wandb trace and txt result.

wandb result

Each time you run the tasks, the wandb project URL will be printed. Wandb mainly records the metric trends of the agent's interactions with the environment.

txt result

|--ability_score
    |--model_ability_scores.csv       #  the final score of the model
    |--game_ability_scores.csv        #  the score of the model on each game
    |--calculation_steps.json         #  the score calculation process
    |--ability_detail_score           #  the model's scores in different game scenarios
        |--civ.csv                    
        |--starcraft2.csv    
        |--stratego.csv    
        |--streetfight3.csv    
        |--welfare_diplomacy.csv    
        |--werewolf.csv    
|--[model_name]
    |--[game_name]
        |--[scene_name]
            |--[diff_opp]
                |--*.log              # run log; contains the input and output of each LLM call
                |--*.json             # match trajectory
                |--*.csv              # result for each match
             

Note: The structure represents a hierarchical view of the results, with ability_score as the main folder that includes model and game-specific performance data. Each game or scene has its detailed logs and results, separated by opponent types and scenarios.
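If you want to inspect the aggregated scores programmatically, a minimal sketch (assuming pandas is installed and the script is run from the directory that contains ability_score) could look like this:

import pandas as pd

# final ability score of each model
model_scores = pd.read_csv("ability_score/model_ability_scores.csv")
# per-game ability scores of each model
game_scores = pd.read_csv("ability_score/game_ability_scores.csv")

print(model_scores.head())
print(game_scores.head())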

🔮 Future Work

Strategic games represent an infinite arena where LLM-based agents can truly demonstrate their potential. Through the dynamic interactions with opponents, agents not only learn and adapt but also develop increasingly complex strategies - much like how humans master the art of strategy. While this benchmark marks an important first step, there are several promising directions we envision for future exploration:

  1. Creating a Unified Dataset for Strategic Game Scenarios
    To perform in-depth strategic game evaluations, we will build a standardized trajectory dataset that covers various strategic scenarios. These datasets will provide rich training resources for trajectory-based supervised fine-tuning (SFT) and reinforcement learning (RL) feedback, contributing to more robust model development.

  2. Developing a Hierarchical Evaluation Framework
    Evaluation should not be a one-time check but rather an ongoing process. We plan to introduce a gradient hierarchical evaluation method, which gradually increases task complexity from basic tasks to more intricate scenarios. This progressive approach will enable a deeper understanding of how models adapt and evolve in diverse situations.

  3. Implementing the Elo Rating System
    The competitive nature of the games requires the integration of the Elo rating system, a dynamic ranking mechanism that more accurately reflects the relative abilities of models. This system helps avoid the biases introduced by static scoring and provides a more realistic competitive environment for model evaluation.

  4. Developing a Unified Reinforcement Learning Framework for LLM-based Agents
    We aim to develop a unified reinforcement learning framework that integrates opponent modeling and strategic decision-making, enabling agents to better understand opponents and adapt their strategies in complex game scenarios, continuously enhancing the capabilities of LLM-based agents.

🔄 Extending DSGBench

The extension specifications for DSGBench will be released soon, allowing the community to contribute new game environments and evaluation scenarios. Stay tuned!

📄 License

The DSGBench codebase is licensed under the MIT License.

๐Ÿค Acknowledgement

This repo benefits from TextStarCraft2, CivRealm, welfare-diplomacy, Stratego Env, llm-colosseum, and werewolf_arena. Thanks for their wonderful work.

📧 Contact

For questions and discussions, please contact: [email protected]

📚 Citation

@misc{tang2025dsgbenchdiversestrategicgame,
      title={DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments}, 
      author={Wenjie Tang and Yuan Zhou and Erqiang Xu and Keyan Cheng and Minne Li and Liquan Xiao},
      year={2025},
      eprint={2503.06047},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.06047}, 
}
