This section is about policy gradient methods, including the simple policy gradient method and trust region policy optimization (TRPO). The different methods are implemented in PyTorch on the CartPole-v0 environment from the OpenAI Gym module. The hyperparameters are taken from the corresponding algorithm papers.
Policy gradient methods directly approximate the policy function, which maps states to actions. Compared with value function approximation, they have the following advantages:
- Parameterization is simpler and convergence is easier
- Feasible even when the action space is huge or infinite
- A stochastic policy naturally provides exploration
Of course, they also have some shortcomings:
- They tend to converge to a local optimum
- Evaluating a single policy has high variance
The objective is to optimize the expectation of rewards, and the policy gradient is computed via importance sampling. Normally, a stochastic policy can be expressed as a fixed (deterministic) part plus a random part. Different function approximators can be used for the fixed part, and the random part can be a normal distribution. The loss function is as follows
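As a concrete illustration, here is a minimal PyTorch sketch (my own, not the repo's exact code, and assuming a continuous action space) of such a Gaussian policy and the corresponding policy gradient loss:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    # The mean network is the "fixed part"; a learned log-std supplies the random part.
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mu = self.mean(state)
        std = self.log_std.exp().expand_as(mu)
        return torch.distributions.Normal(mu, std)

def policy_gradient_loss(policy, states, actions, returns):
    # L(theta) = -E[log pi_theta(a|s) * R]; minimizing it ascends the expected return
    log_prob = policy(states).log_prob(actions).sum(dim=-1)
    return -(log_prob * returns).mean()
```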
One training curve is as follows
The theory of TRPO can be found in the original paper. Here, I want to show the main steps and explain the update method:
- Sample actions and trajectories
- Calculate the mean KL divergence and the Fisher-vector product (see the sketch after this list)
- Construct the surrogate loss and run a line search along the direction found by the conjugate gradient algorithm
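A minimal sketch of the mean KL divergence and the Fisher-vector product, assuming the GaussianPolicy interface above and an `old_dist` built from detached parameters (not the repo's exact code):

```python
import torch

def mean_kl(policy, old_dist, states):
    # old_dist must come from detached parameters of the data-collecting policy
    new_dist = policy(states)
    return torch.distributions.kl_divergence(old_dist, new_dist).sum(dim=-1).mean()

def fisher_vector_product(policy, old_dist, states, v, damping=0.1):
    # Hessian-vector product of the mean KL, computed by double backpropagation
    kl = mean_kl(policy, old_dist, states)
    grads = torch.autograd.grad(kl, policy.parameters(), create_graph=True)
    flat_grad = torch.cat([g.view(-1) for g in grads])
    grad_v = (flat_grad * v).sum()
    hvp = torch.autograd.grad(grad_v, policy.parameters())
    flat_hvp = torch.cat([h.contiguous().view(-1) for h in hvp])
    return flat_hvp + damping * v  # damping keeps the conjugate gradient solve stable
```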
Update steps (a sketch of the first two steps follows this list):
- Calculate the advantage, the surrogate loss, and the policy gradient
- If the gradient is (numerically) zero, return
- Update theta, then try to update the value function and the policy network
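A minimal sketch of those first two steps, assuming precomputed `advantages` (e.g. returns minus a value baseline, normalized) and the log-probabilities `old_log_prob` from the data-collecting policy; the names are my own:

```python
import torch

def surrogate_loss(policy, states, actions, advantages, old_log_prob):
    # importance-sampled objective: E[(pi_theta / pi_old) * A]
    log_prob = policy(states).log_prob(actions).sum(dim=-1)
    return (torch.exp(log_prob - old_log_prob) * advantages).mean()

def flat_policy_gradient(policy, states, actions, advantages, old_log_prob):
    loss = surrogate_loss(policy, states, actions, advantages, old_log_prob)
    grads = torch.autograd.grad(loss, policy.parameters())
    g = torch.cat([grad.view(-1) for grad in grads])
    if g.norm() < 1e-8:  # "if no gradient, return"
        return None
    return g
```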
Line search: a candidate step is accepted only when the ratio of the actual improvement in the surrogate loss to ei exceeds an acceptance threshold, where ei is the expected improvement (the step fraction times the inner product of the policy gradient and the full step).
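A minimal sketch of the conjugate gradient solve and this backtracking line search (helper names such as `eval_surrogate` and the value `accept_ratio=0.1` are my own assumptions):

```python
import torch

def conjugate_gradient(Avp, g, iters=10, tol=1e-10):
    # solve A x = g approximately, where Avp(v) returns the Fisher-vector product A v
    x = torch.zeros_like(g)
    r, p = g.clone(), g.clone()
    rdotr = r.dot(r)
    for _ in range(iters):
        Ap = Avp(p)
        alpha = rdotr / p.dot(Ap)
        x += alpha * p
        r -= alpha * Ap
        new_rdotr = r.dot(r)
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

def line_search(eval_surrogate, theta_old, full_step, expected_improve,
                max_backtracks=10, accept_ratio=0.1):
    # eval_surrogate(theta) returns the surrogate value (to maximize) at flat parameters theta
    value_old = eval_surrogate(theta_old)
    for frac in [0.5 ** i for i in range(max_backtracks)]:
        theta_new = theta_old + frac * full_step
        actual_improve = eval_surrogate(theta_new) - value_old
        ei = frac * expected_improve  # ei: expected improvement at this step fraction
        if ei > 0 and actual_improve / ei > accept_ratio:
            return True, theta_new
    return False, theta_old
```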
One training curve on CartPole-v0 is as follows:
From my experience, TRPO is sensitive to the quality of the data. Based on the PPO paper and my experiments, TRPO can perform very differently across trials because it uses a hard constraint on the KL divergence. To improve on TRPO, PPO was introduced.
PPO is simpler and more stable than TRPO. First, define the probability ratio between the new policy and the old policy.
To modify the objective so that it penalizes changes to the policy that move this ratio away from 1, the paper proposes the clipped surrogate objective and the fixed and adaptive KL penalty coefficient methods (a sketch of the clipped objective follows the list below):
- Clipped surrogate objective
- Fixed KL and adaptive KL penalty coefficient
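A minimal sketch of the clipped surrogate objective, using the clip range epsilon = 0.2 reported in the PPO paper and the GaussianPolicy interface from above:

```python
import torch

def ppo_clip_loss(policy, states, actions, advantages, old_log_prob, eps=0.2):
    log_prob = policy(states).log_prob(actions).sum(dim=-1)
    ratio = torch.exp(log_prob - old_log_prob)   # ratio of new to old policy probability
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # take the elementwise minimum, then negate because optimizers minimize
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```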
For the adaptive KL penalty, we compute the mean KL divergence d as shown above, and then adjust the penalty coefficient: if d falls well below the target KL, the coefficient is halved; if it rises well above the target, the coefficient is doubled.
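A minimal sketch of that coefficient update, following the 1.5x thresholds and the halving/doubling rule from the PPO paper:

```python
def adapt_kl_coefficient(beta, d, d_targ):
    # d is the measured mean KL divergence; d_targ is the target KL
    if d < d_targ / 1.5:
        beta = beta / 2.0
    elif d > d_targ * 1.5:
        beta = beta * 2.0
    return beta
```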
The following graphs show the learning rewards:
This method is designed especially for continuous control with deep reinforcement learning. The main goal is to learn a deterministic policy from an exploratory behavior policy. From my experience, it does not converge faster than PPO, and it is sometimes sensitive to the quality of the data.
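As my own illustration (not the repo's code), a minimal sketch of forming an exploratory behavior policy by adding Gaussian noise to a deterministic actor:

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

def behavior_action(actor, state, noise_std=0.1):
    # deterministic action plus exploration noise, clipped to the action bounds
    with torch.no_grad():
        action = actor(state)
    action = action + noise_std * torch.randn_like(action)
    return action.clamp(-actor.max_action, actor.max_action)
```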
Learning curve:
To train the model better, a data filter was introduced to filter out dirty data and obtain a better model.
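As one possible form of such a filter (my assumption of what is meant, not necessarily the repo's implementation), here is a minimal running mean/std state normalizer:

```python
import numpy as np

class RunningStateFilter:
    # normalize observations with running statistics and clip outliers
    def __init__(self, shape, clip=10.0):
        self.n = 0
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.clip = clip

    def __call__(self, x):
        x = np.asarray(x, dtype=np.float64)
        self.n += 1
        delta = x - self.mean
        self.mean = self.mean + delta / self.n                         # Welford's online mean
        self.var = self.var + (delta * (x - self.mean) - self.var) / self.n
        std = np.sqrt(self.var) + 1e-8
        return np.clip((x - self.mean) / std, -self.clip, self.clip)
```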