
Commit 075733b

feat: 🐛 fix some bugs in PPO and added some figures
1 parent 97fdc55 commit 075733b

6 files changed (+24, -10 lines)
[Binary image files (figures) added; 55.9 KB shown, previews not loaded]

ProximalPolicyOptimization(PPO)/model.py

Lines changed: 5 additions & 4 deletions
@@ -18,11 +18,11 @@ def __init__(self, n_features, n_actions) -> None:
         super().__init__()

         self.net = nn.Sequential(
-            nn.Linear(n_features, 128),
+            nn.Linear(n_features, 256),
             nn.ReLU(),
-            nn.Linear(128, 128),
+            nn.Linear(256, 256),
             nn.ReLU(),
-            nn.Linear(128, n_actions)
+            nn.Linear(256, n_actions)
         )

     def forward(self, observation: np.ndarray):

@@ -101,8 +101,9 @@ def learn(self, total_timesteps):
                 timesteps_so_far, i, actor_loss.item(), critic_loss.item()))
             # Step 8 Finally end for
             timesteps_so_far += np.sum(batch_lens)
+            ep_sum_rewards = [sum(rw) for rw in batch_r]
             self.sw.add_scalar("avg_reward", np.mean(
-                np.concatenate(batch_r)), timesteps_so_far)
+                ep_sum_rewards), timesteps_so_far)
         # the function to collect data

     def rollout(self):
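A note on the logging fix above: the old code averaged the concatenated per-step rewards (mean reward per timestep), while the new code averages the summed return of each episode. A minimal sketch of the difference, with made-up reward lists standing in for a real `batch_r`:

```python
import numpy as np

# Hypothetical batch_r: one list of per-step rewards per collected episode.
batch_r = [[1.0, 1.0, 1.0],  # episode 1, return 3.0
           [2.0]]            # episode 2, return 2.0

# Old logging: mean reward per timestep over the whole batch.
per_step_mean = np.mean(np.concatenate(batch_r))   # (1+1+1+2) / 4 = 1.25

# New logging, as in the commit: mean episodic return.
ep_sum_rewards = [sum(rw) for rw in batch_r]       # [3.0, 2.0]
avg_return = np.mean(ep_sum_rewards)               # 2.5

print(per_step_mean, avg_return)
```

Mean episodic return is the quantity usually reported for these Gym tasks, which is presumably why the "avg_reward" scalar was switched to it.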

ProximalPolicyOptimization(PPO)/readme.md

Lines changed: 10 additions & 0 deletions
@@ -1,5 +1,13 @@
 # PPO on-policy algorithm - OPENAI BASELINE

+BipedalWalker-v3
+
+![BipedalWalker-v3](./lossBipedalWalker-v3.png)
+
+LunarLanderContinuous-v2
+
+![lunarlander](./LunarLander.png)
+
 <https://medium.com/@eyyu/coding-ppo-from-scratch-with-pytorch-part-1-4-613dfc1b14c8>

 - Mainly addresses the step-size problem of actor-critic training; ppo-clip is implemented here

@@ -26,6 +34,8 @@ on: the agent that interacts with the environment is the agent we are learning; off: not

 The results are very poor, because the "occasional wins" are not enough to fully correct the network parameters, whereas off-policy DQN with PER can learn from successful experiences many times, so for this Pendulum-v1 it just doesn't work well.

+It doesn't work for the lunar lander either; it heads toward a local minimum (flies up into the sky and never comes back down).
+
 # TrustRegionPolicyOptimization

 - Both the TRPO algorithm (Trust Region Policy Optimization) and the PPO algorithm (Proximal Policy Optimization) belong to the MM (Minorize-Maximization) family of algorithms
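Since the readme states that ppo-clip is what is implemented, a minimal PyTorch sketch of the clipped surrogate objective may help as a reference; the tensor names (`new_log_probs`, `old_log_probs`, `advantages`) and the clip range `clip_eps` are illustrative, not taken from model.py:

```python
import torch

def ppo_clip_loss(new_log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate actor loss (to be minimized)."""
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped and clipped surrogate terms.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two; negate for a loss.
    return -torch.min(surr1, surr2).mean()
```

Clipping the probability ratio to [1 - clip_eps, 1 + clip_eps] is what bounds the policy update step size that the readme refers to.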

ProximalPolicyOptimization(PPO)/test.py

Lines changed: 6 additions & 3 deletions
@@ -4,8 +4,8 @@
 import numpy as np
 from tensorboardX import SummaryWriter
 if __name__ == "__main__":
-    model_dir = "./ProximalPolicyOptimization(PPO)/saved_models/ACTOR 2021-12-14 16-45-2.pth"
-    writer = SummaryWriter("./ProximalPolicyOptimization/logs")
+    model_dir = "./ProximalPolicyOptimization(PPO)/saved_models/ACTOR 2022-5-20 15-54-15.pth"
+    writer = SummaryWriter("./ProximalPolicyOptimization(PPO)/logs")
     env = gym.make("Pendulum-v1")
     ppo = model.PPO(env, writer)
     ppo.actor.load_state_dict(t.load(model_dir))

@@ -18,6 +18,9 @@
         env.render()
         a = ppo.actor.forward(t.FloatTensor(s)).detach().numpy()

-        s_, r, _, _ = env.step(a)
+        s_, r, done, _ = env.step(a)
         s = s_
         print("action:{},reward:{}".format(a, r))
+
+        if done:
+            s = env.reset()
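The test.py change wires `done` through so the evaluation loop resets the environment between episodes. A sketch of the same pattern with per-episode return tracking, assuming the classic Gym step API (obs, reward, done, info) used in the scripts above; the random-action policy stands in for the trained actor:

```python
import gym
import numpy as np

env = gym.make("Pendulum-v1")

s = env.reset()
ep_return, ep_returns = 0.0, []
for _ in range(1000):
    a = env.action_space.sample()   # placeholder for ppo.actor.forward(...)
    s, r, done, _ = env.step(a)     # classic Gym API, as in test.py
    ep_return += r
    if done:                        # episode finished: record return and reset
        ep_returns.append(ep_return)
        ep_return = 0.0
        s = env.reset()

env.close()
print("mean episodic return:", np.mean(ep_returns))
```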

ProximalPolicyOptimization(PPO)/train.py

Lines changed: 3 additions & 3 deletions
@@ -2,7 +2,7 @@
 import model
 from tensorboardX import SummaryWriter
 writer = SummaryWriter("./ProximalPolicyOptimization(PPO)/logs")
-env = gym.make("LunarLanderContinuous-v2")
+env = gym.make("Pendulum-v1")
 ppo = model.PPO(env, writer)
-ppo.learn(1000000)
-ppo.save_model("./ProximalPolicyOptimization/saved_models")
+ppo.learn(1500000)
+ppo.save_model("./ProximalPolicyOptimization(PPO)/saved_models")

0 commit comments
