
Commit 4f089a5

[docs]: upload PPO performance
1 parent c0871ca commit 4f089a5

5 files changed: +63 −34 lines changed
Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
 data/*
 !data/plot_*.py
-
+!data/happo_learning_curve_simple_tag_v3_s23.png
 models/*
Binary image file (676 KB)

HAPPO-MAPPO_Continous_Heterogeneous/data/plot_reward.py

Lines changed: 9 additions & 16 deletions

@@ -3,6 +3,7 @@
 import os
 import glob
 import argparse
+import re
 
 def plot_rewards(file_path=None, data_dir=None, show=True, save=True):
     """
@@ -32,7 +33,7 @@ def plot_rewards(file_path=None, data_dir=None, show=True, save=True):
     # If file not specified, find the latest CSV file
     if file_path is None:
         # Changed: support auto-discovery of the latest HAPPO reward file
-        csv_files = glob.glob(os.path.join(data_dir, "happo_rewards_*.csv"))
+        csv_files = glob.glob(os.path.join(data_dir, "happo_rewards_simple_tag_v3_n1_s23_2025-09-24_20-12.csv"))
         if not csv_files:
             print(f"Error: No HAPPO reward CSV files found in directory {data_dir}")
             return
@@ -54,20 +55,12 @@ def plot_rewards(file_path=None, data_dir=None, show=True, save=True):
     filename = os.path.basename(file_path)
     parts = filename.split('_')
 
-    # Changed: better filename parsing logic
     algorithm = "HAPPO"  # default algorithm name
-    env_name = "unknown"
-    agents = "?"
-    seed = "?"
-
-    if filename.startswith("happo_rewards_"):
-        # Format: happo_rewards_{env_name}_n{number}_s{seed}_{timestamp}.csv
-        try:
-            env_name = parts[2]  # simple_tag_v3
-            agents = parts[3][1:] if parts[3].startswith('n') else parts[3]  # strip the 'n' prefix
-            seed = parts[4][1:] if parts[4].startswith('s') else parts[4]  # strip the 's' prefix
-        except IndexError:
-            pass  # use the defaults
+    env_name = "simple_tag_v3"
+    # Extract the seed value from the filename
+    seed_match = re.search(r"_s(\d+)_", filename)
+    if seed_match:
+        seed = seed_match.group(1)
 
     # Create chart with better styling
     plt.figure(figsize=(12, 8))
@@ -76,7 +69,7 @@ def plot_rewards(file_path=None, data_dir=None, show=True, save=True):
     plt.ylabel('Evaluation Reward', fontsize=12)
 
     # Changed: English title with a clearer format
-    plt.title(f'{algorithm} Learning Curve | Env: {env_name} | Agents: {agents} | Seed: {seed}',
+    plt.title(f'{algorithm} Learning Curve | Env: {env_name} | Seed: {seed}',
               fontsize=14, fontweight='bold')
     plt.grid(True, linestyle='--', alpha=0.3)
 
@@ -118,7 +111,7 @@ def plot_rewards(file_path=None, data_dir=None, show=True, save=True):
     plots_dir = os.path.join(current_dir)
     os.makedirs(plots_dir, exist_ok=True)
     # Changed: clearer output filename
-    plt_filename = os.path.join(plots_dir, f"{algorithm.lower()}_learning_curve_{env_name}_n{agents}_s{seed}.png")
+    plt_filename = os.path.join(plots_dir, f"{algorithm.lower()}_learning_curve_{env_name}_s{seed}.png")
     plt.savefig(plt_filename, dpi=300, bbox_inches='tight')
     print(f"Chart saved to: {plt_filename}")
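In short, the reworked parsing no longer splits the filename positionally: the environment name is hardcoded and the seed is pulled out with a regex. A minimal sketch of that logic, using the same sample filename that is now hardcoded in the glob pattern above; the resulting plot name matches the PNG whitelisted in the first file of this commit:

```python
import re

# Sample CSV name: the one now hardcoded in the glob pattern above.
filename = "happo_rewards_simple_tag_v3_n1_s23_2025-09-24_20-12.csv"

algorithm = "HAPPO"
env_name = "simple_tag_v3"   # hardcoded in the new version of the script

seed = "?"                   # fallback; the committed code omits this default
seed_match = re.search(r"_s(\d+)_", filename)  # matches "_s23_"
if seed_match:
    seed = seed_match.group(1)                 # -> "23"

print(f"{algorithm.lower()}_learning_curve_{env_name}_s{seed}.png")
# happo_learning_curve_simple_tag_v3_s23.png
```

Note the fallback default in the sketch: since the commit deletes the old `seed = "?"` line and only assigns `seed` inside the `if seed_match:` branch, a filename without an `_s<digits>_` segment would leave `seed` undefined when the title string is built.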

README.md

Lines changed: 24 additions & 6 deletions

@@ -11,7 +11,7 @@
 | [动手学强化学习](./动手学强化学习/) | ![状态](https://img.shields.io/badge/状态-参考实现-informational) | ![完成度](https://img.shields.io/badge/完成度-100%25-brightgreen) | ![技术](https://img.shields.io/badge/技术-DQN到DDPG-blue) | [README](./动手学强化学习/README.md) |
 | [MADDPG_Continous](./MADDPG_Continous/) | ![状态](https://img.shields.io/badge/状态-已完成-success) | ![完成度](https://img.shields.io/badge/完成度-100%25-brightgreen) | ![技术](https://img.shields.io/badge/技术-连续MADDPG-blue) | [中文文档](./MADDPG_Continous/README.md#项目特色) |
 | [MATD3_Continous](./MATD3_Continous/) | ![状态](https://img.shields.io/badge/状态-已完成-success) | ![完成度](https://img.shields.io/badge/完成度-100%25-brightgreen) | ![技术](https://img.shields.io/badge/技术-连续MATD3-blue) | [中文文档](./MATD3_Continous/readme.md) |
-
+| [HAPPO-MAPPO_Continous_Heterogeneous](./HAPPO-MAPPO_Continous_Heterogeneous/) | ![状态](https://img.shields.io/badge/状态-已完成-success) | ![完成度](https://img.shields.io/badge/完成度-95%25-brightgreen) | ![技术](https://img.shields.io/badge/技术-PPO异构智能体-blue) | [中文文档](./HAPPO-MAPPO_Continous_Heterogeneous/Readme.md) |
 
 ## 学习路径与项目关联
 本仓库中的项目构成了一条从基础强化学习到多智能体强化学习的完整学习路径:
@@ -49,7 +49,7 @@
 #### 参考资源
 - [赵老师强化学习课程](https://www.bilibili.com/video/BV1sd4y167NS)
 - [强化学习的数学原理](https://github.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning)
-#### 代码位置 [`赵老师强化学习代码仓库: ./RL_Learning-main`](./RL_Learning-main/scripts)
+#### 代码位置 [`赵老师强化学习代码仓库: ./RL_Learning-main`](./RL_Learning-main/scripts)
 
 #### 更新日志
 
@@ -98,13 +98,13 @@
 </div>
 
 
-#### 实现进度
+##### 实现进度
 | 算法 | 状态 | 位置 | 核心组件 |
 |----------------|--------|----------------------|----------------------------------|
 | MADDPG | ✅ 1.0 | `agents/maddpg/` | MADDPG_agent, DDPG_agent, buffer |
 | Independent RL | ⏳ 待完成 | `agents/independent/`| IndependentRL (计划中) |
 | Centralized RL | ⏳ 待完成 | `agents/centralized/`| CentralizedRL (计划中) |
-#### 代码位置 [`./MADDPG_Continous`](./MADDPG_Continous)
+##### 代码位置 [`./MADDPG_Continous`](./MADDPG_Continous)
 
 
 #### 3.2 MATD3_Continous:多智能体双延迟深度确定性策略梯度算法
@@ -123,17 +123,35 @@
 <p><strong>MATD3算法在simple_tag_env环境中的奖励收敛曲线</strong></p>
 </div>
 
-#### MATD3 vs MADDPG
+##### MATD3 vs MADDPG
 MATD3对标准MADDPG进行了以下关键增强:
 
 1. **双Q网络设计**: 减少对动作值的过估计
 2. **延迟策略更新**: 提高训练稳定性
 3. **目标策略平滑**: 通过在目标动作中加入噪声防止过拟合
 4. **自适应噪声调整**: 根据训练进度动态调整探索噪声
 
-#### 代码位置 [`./MATD3_Continous`](./MATD3_Continous)
+##### 代码位置 [`./MATD3_Continous`](./MATD3_Continous)
+
+#### 3.3 MAPPO-HAPPO算法:支持同构/异构智能体的多智能体近端策略优化
+
+实现了两种基于PPO的多智能体算法:MAPPO(多智能体近端策略优化)和HAPPO(异构智能体近端策略优化),为连续动作空间和异构智能体环境提供了解决方案。
+
+<div align="center">
+<img src="./HAPPO-MAPPO_Continous_Heterogeneous/data/happo_learning_curve_simple_tag_v3_s23.png" alt="HAPPO算法表现" width="80%"/>
+<p><strong>HAPPO算法特点:支持异构智能体协作与竞争,每个智能体可以有不同的观察维度</strong></p>
+</div>
+
+##### HAPPO/MAPPO的优势
 
+1. **无需采用确定性策略**:基于PPO,使用随机策略,减轻过拟合
+2. **异构智能体支持**:HAPPO特别支持不同观察维度和能力的异构智能体
+3. **训练稳定性**:PPO的截断机制提供更稳定的训练过程
+4. **采样效率**:通过多回合更新提高样本利用效率
+5. **超参数鲁棒性**:对超参数选择不那么敏感
 
+##### 代码位置 [`./MAPPO_Continous_Homogeneous`](./MAPPO_Continous_Homogeneous)
+##### 代码位置 [`./HAPPO-MAPPO_Continous_Heterogeneous`](./HAPPO-MAPPO_Continous_Heterogeneous)
 
 ## 进行中的项目
 - **MARL**: 基于深度强化学习的多智能体协作与协调
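The caption added above highlights that HAPPO handles agents with different observation dimensions. In practice this usually just means one actor network per agent, each sized to its own observation space. A minimal, illustrative PyTorch sketch; the agent names and dimensions are hypothetical and not taken from this project's code:

```python
import torch
import torch.nn as nn

# Hypothetical per-agent observation sizes (simple_tag_v3-style names).
obs_dims = {"adversary_0": 12, "adversary_1": 12, "agent_0": 10}
act_dim = 5

# One actor per agent, each with its own input width.
actors = nn.ModuleDict({
    name: nn.Sequential(
        nn.Linear(obs_dim, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, act_dim), nn.Tanh(),  # e.g. mean of a Gaussian policy
    )
    for name, obs_dim in obs_dims.items()
})

# Each agent forwards only its own observation through its own network.
obs = {name: torch.randn(1, dim) for name, dim in obs_dims.items()}
actions = {name: actors[name](obs[name]) for name in obs_dims}
for name, a in actions.items():
    print(name, a.shape)  # torch.Size([1, 5]) for every agent
```

A centralized critic can still consume the concatenated observations of all agents, which is the usual centralized-training, decentralized-execution arrangement for MAPPO/HAPPO-style methods.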

README_en.md

Lines changed: 29 additions & 11 deletions

@@ -12,7 +12,7 @@ This repository contains multiple projects related to Reinforcement Learning (RL
 | [Hands-on RL](./动手学强化学习/) | ![Status](https://img.shields.io/badge/status-reference-informational) | ![Completion](https://img.shields.io/badge/completion-100%25-brightgreen) | ![Tech](https://img.shields.io/badge/tech-DQN%20to%20DDPG-blue) | [README](./动手学强化学习/README.md) |
 | [MADDPG_Continous](./MADDPG_Continous/) | ![Status](https://img.shields.io/badge/status-completed-success) | ![Completion](https://img.shields.io/badge/completion-100%25-brightgreen) | ![Tech](https://img.shields.io/badge/tech-continuous%20MADDPG-blue) | [README](./MADDPG_Continous/README_EN.md) |
 | [MATD3_Continous](./MATD3_Continous/) | ![Status](https://img.shields.io/badge/status-completed-success) | ![Completion](https://img.shields.io/badge/completion-100%25-brightgreen) | ![Tech](https://img.shields.io/badge/tech-continuous%20MATD3-blue) | [README](./MATD3_Continous/readme_en.md) |
-
+| [HAPPO-MAPPO_Continous_Heterogeneous](./HAPPO-MAPPO_Continous_Heterogeneous/) | ![Status](https://img.shields.io/badge/status-completed-success) | ![Completion](https://img.shields.io/badge/completion-95%25-brightgreen) | ![Tech](https://img.shields.io/badge/tech-PPO%20Heterogeneous-blue) | [Documentation](./HAPPO-MAPPO_Continous_Heterogeneous/Readme_en.md) |
 ## Learning Path and Project Connections
 
 The projects in this repository form a complete learning path from basic reinforcement learning to multi-agent reinforcement learning:
@@ -52,8 +52,7 @@ Reproduction of Professor Shiyu Zhao's reinforcement learning course code from W
 - [Professor Zhao's Reinforcement Learning Course](https://www.bilibili.com/video/BV1sd4y167NS)
 - [Mathematical Foundation of Reinforcement Learning](https://github.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning)
 
-#### Code Location
-[Professor Zhao's RL Code Repository: ./RL_Learning-main](./RL_Learning-main/scripts)
+#### Code Location [Professor Zhao's RL Code Repository: ./RL_Learning-main](./RL_Learning-main/scripts)
 
 #### Update Log
 **2024.6.7**
@@ -78,8 +77,7 @@ Reproduction and expansion of the code from the book "Hands-on Reinforcement Lea
 #### Learning Path
 This section demonstrates the learning path from basic DQN to DDPG, and then to MADDPG, laying the foundation for understanding multi-agent reinforcement learning.
 
-#### Code Location
-[./动手学强化学习](./动手学强化学习/)
+#### Code Location [./动手学强化学习](./动手学强化学习/)
 
 #### References
 - [Hands-on Reinforcement Learning](https://hrl.boyuai.com/chapter/2/dqn%E7%AE%97%E6%B3%95)
@@ -107,15 +105,14 @@ Personal implementation of the MADDPG algorithm based on the latest version of t
 <p><strong>Reward convergence curve of MADDPG algorithm in simple_tag_v3 environment</strong></p>
 </div>
 
-#### Implementation Progress
+##### Implementation Progress
 | Algorithm | Status | Location | Core Components |
 |----------------|--------|----------------------|----------------------------------|
 | MADDPG | ✅ 1.0 | `agents/maddpg/` | MADDPG_agent, DDPG_agent, buffer |
 | Independent RL | ⏳ Planned | `agents/independent/`| IndependentRL (planned) |
 | Centralized RL | ⏳ Planned | `agents/centralized/`| CentralizedRL (planned) |
 
-#### Code Location
-[./MADDPG_Continous](./MADDPG_Continous)
+##### Code Location [./MADDPG_Continous](./MADDPG_Continous)
 
 #### 3.2 MATD3_Continous: Multi-Agent Twin Delayed Deep Deterministic Policy Gradient Algorithm
 
@@ -134,16 +131,37 @@ Multi-agent extension version of the TD3 algorithm (MATD3: Twin Delayed Deep Det
 <p><strong>Reward convergence curve of MATD3 algorithm in simple_tag_v3 environment</strong></p>
 </div>
 
-#### MATD3 vs MADDPG
+##### MATD3 vs MADDPG
 MATD3 enhances standard MADDPG with these key improvements:
 
 1. **Double Q-Network Design**: Reduces overestimation of action values
 2. **Delayed Policy Updates**: Improves training stability
 3. **Target Policy Smoothing**: Prevents overfitting by adding noise to target actions
 4. **Adaptive Noise Adjustment**: Dynamically adjusts exploration noise based on training progress
 
-#### Code Location
-[./MATD3_Continous](./MATD3_Continous)
+##### Code Location [./MATD3_Continous](./MATD3_Continous)
+
+
+#### 3.3 HAPPO-MAPPO: Supporting Heterogeneous Agents in Multi-Agent Proximal Policy Optimization
+
+Implementation of two PPO-based multi-agent algorithms: MAPPO (Multi-Agent Proximal Policy Optimization) and HAPPO (Heterogeneous-Agent Proximal Policy Optimization), providing solutions for continuous action spaces and heterogeneous-agent environments.
+
+<div align="center">
+<img src="./HAPPO-MAPPO_Continous_Heterogeneous/data/happo_learning_curve_simple_tag_v3_s23.png" alt="HAPPO Algorithm Performance" width="45%"/>
+<p><strong>HAPPO Algorithm Features: Supports heterogeneous agent cooperation and competition, where each agent can have different observation dimensions</strong></p>
+</div>
+
+##### Advantages of HAPPO/MAPPO
+
+1. **No Need for Deterministic Policies**: Based on PPO with stochastic policies, reducing overfitting
+2. **Heterogeneous Agent Support**: HAPPO specifically supports heterogeneous agents with different observation dimensions and capabilities
+3. **Training Stability**: PPO's clipping mechanism provides a more stable training process
+4. **Sample Efficiency**: Improves sample utilization through multi-epoch updates
+5. **Hyperparameter Robustness**: Less sensitive to hyperparameter selection
+
+##### Code Location [`./MAPPO_Continous_Homogeneous`](./MAPPO_Continous_Homogeneous)
+##### Code Location [`./HAPPO-MAPPO_Continous_Heterogeneous`](./HAPPO-MAPPO_Continous_Heterogeneous)
+
 
 ## Ongoing Projects
 - **MARL**: Multi-agent cooperation and coordination based on deep reinforcement learning
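Both READMEs attribute the training stability to PPO's clipping mechanism, i.e. the standard clipped surrogate objective. The sketch below is illustrative only; the function name, parameter names, and the 0.2 clip coefficient are assumptions, not taken from this repository's training code.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (to be minimized).

    Illustrative only: names and the clip coefficient are assumptions,
    not values read from this repository.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)           # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# Toy usage: ratios pushed beyond 1 +/- eps stop contributing extra gradient,
# which is the stability property the README refers to.
lp_old = torch.zeros(4)
lp_new = torch.tensor([0.5, 0.1, -0.3, 0.0])
adv = torch.tensor([1.0, -0.5, 2.0, 0.3])
print(ppo_clip_loss(lp_new, lp_old, adv))
```

Clipping the probability ratio keeps any single update from moving the policy too far from the one that collected the data, which is also what makes the multi-epoch sample reuse mentioned in advantage 4 workable.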
