This section is about policy gradient methods, including the simple policy gradient method and trust region policy optimization (TRPO). The different methods are implemented in PyTorch on the CartPole-v0 environment from the OpenAI Gym module. The hyperparameters are taken from the corresponding algorithm papers.
Policy gradient methods directly approximate the policy function, which maps states to actions. Compared with value function approximation, they have the following advantages:
- Parameterization is simpler and convergence is easier
- Feasible even when the action space is huge or infinite
- A stochastic policy naturally provides exploration
Of course, they also have some shortcomings:
- They tend to converge to a local optimum
- Evaluating a single policy has high variance
The objective is to optimize the expectation of rewards, and the policy gradient is computed via importance sampling. Normally, a stochastic policy can be expressed as a fixed (deterministic) part plus a random part. Different function approximators can be used for the fixed part, and the random part can be a normal distribution. The loss function is as follows
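As a concrete illustration, here is a minimal PyTorch sketch (my own, not the repo's exact code, and assuming a continuous action space) of such a Gaussian policy and the corresponding policy gradient loss:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    # The mean network is the "fixed part"; a learned log-std supplies the random part.
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mu = self.mean(state)
        std = self.log_std.exp().expand_as(mu)
        return torch.distributions.Normal(mu, std)

def policy_gradient_loss(policy, states, actions, returns):
    # L(theta) = -E[log pi_theta(a|s) * R]; minimizing it ascends the expected return
    log_prob = policy(states).log_prob(actions).sum(dim=-1)
    return -(log_prob * returns).mean()
```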
One training curve is as follows
The theory of TRPO can be found in the original paper. Here, I want to show the main steps and explain the update method:
- Sample actions and trajectories
- Calculate the mean KL divergence and the Fisher-vector product (see the sketch after this list)
- Construct the surrogate loss and run a line search along the direction found by the conjugate gradient algorithm
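A minimal sketch of the mean KL divergence and the Fisher-vector product, assuming the GaussianPolicy interface above and an `old_dist` built from detached parameters (not the repo's exact code):

```python
import torch

def mean_kl(policy, old_dist, states):
    # old_dist must come from detached parameters of the data-collecting policy
    new_dist = policy(states)
    return torch.distributions.kl_divergence(old_dist, new_dist).sum(dim=-1).mean()

def fisher_vector_product(policy, old_dist, states, v, damping=0.1):
    # Hessian-vector product of the mean KL, computed by double backpropagation
    kl = mean_kl(policy, old_dist, states)
    grads = torch.autograd.grad(kl, policy.parameters(), create_graph=True)
    flat_grad = torch.cat([g.view(-1) for g in grads])
    grad_v = (flat_grad * v).sum()
    hvp = torch.autograd.grad(grad_v, policy.parameters())
    flat_hvp = torch.cat([h.contiguous().view(-1) for h in hvp])
    return flat_hvp + damping * v  # damping keeps the conjugate gradient solve stable
```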
Update steps (a sketch of the first two steps follows this list):
- Calculate the advantage, the surrogate loss, and the policy gradient
- If the gradient is (numerically) zero, return
- Update theta, then try to update the value function and the policy network
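A minimal sketch of those first two steps, assuming precomputed `advantages` (e.g. returns minus a value baseline, normalized) and the log-probabilities `old_log_prob` from the data-collecting policy; the names are my own:

```python
import torch

def surrogate_loss(policy, states, actions, advantages, old_log_prob):
    # importance-sampled objective: E[(pi_theta / pi_old) * A]
    log_prob = policy(states).log_prob(actions).sum(dim=-1)
    return (torch.exp(log_prob - old_log_prob) * advantages).mean()

def flat_policy_gradient(policy, states, actions, advantages, old_log_prob):
    loss = surrogate_loss(policy, states, actions, advantages, old_log_prob)
    grads = torch.autograd.grad(loss, policy.parameters())
    g = torch.cat([grad.view(-1) for grad in grads])
    if g.norm() < 1e-8:  # "if no gradient, return"
        return None
    return g
```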
Line search: a candidate step is accepted only when the ratio of the actual improvement in the surrogate loss to ei exceeds an acceptance threshold, where ei is the expected improvement (the step fraction times the inner product of the policy gradient and the full step).
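A minimal sketch of the conjugate gradient solve and this backtracking line search (helper names such as `eval_surrogate` and the value `accept_ratio=0.1` are my own assumptions):

```python
import torch

def conjugate_gradient(Avp, g, iters=10, tol=1e-10):
    # solve A x = g approximately, where Avp(v) returns the Fisher-vector product A v
    x = torch.zeros_like(g)
    r, p = g.clone(), g.clone()
    rdotr = r.dot(r)
    for _ in range(iters):
        Ap = Avp(p)
        alpha = rdotr / p.dot(Ap)
        x += alpha * p
        r -= alpha * Ap
        new_rdotr = r.dot(r)
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

def line_search(eval_surrogate, theta_old, full_step, expected_improve,
                max_backtracks=10, accept_ratio=0.1):
    # eval_surrogate(theta) returns the surrogate value (to maximize) at flat parameters theta
    value_old = eval_surrogate(theta_old)
    for frac in [0.5 ** i for i in range(max_backtracks)]:
        theta_new = theta_old + frac * full_step
        actual_improve = eval_surrogate(theta_new) - value_old
        ei = frac * expected_improve  # ei: expected improvement at this step fraction
        if ei > 0 and actual_improve / ei > accept_ratio:
            return True, theta_new
    return False, theta_old
```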
One training curve on CartPole-v0 is as follows:
From my experience, TRPO is sensitive to the quality of the data. Based on the PPO paper and my experiments, TRPO can perform very differently across trials because it uses a hard constraint on the KL divergence. To improve on TRPO, PPO was introduced.
PPO is simpler and more stable than TRPO. First, define the probability ratio between the new policy and the old policy.
To modify the objective so that it penalizes changes to the policy that move this ratio away from 1, the paper proposes the clipped surrogate objective and the fixed and adaptive KL penalty coefficient methods (a sketch of the clipped objective follows the list below):
- Clipped surrogate objective
- Fixed KL and adaptive KL penalty coefficient
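A minimal sketch of the clipped surrogate objective, using the clip range epsilon = 0.2 reported in the PPO paper and the GaussianPolicy interface from above:

```python
import torch

def ppo_clip_loss(policy, states, actions, advantages, old_log_prob, eps=0.2):
    log_prob = policy(states).log_prob(actions).sum(dim=-1)
    ratio = torch.exp(log_prob - old_log_prob)   # ratio of new to old policy probability
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # take the elementwise minimum, then negate because optimizers minimize
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```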
For the adaptive KL penalty, we compute the mean KL divergence d as shown above, and then adjust the penalty coefficient: if d falls well below the target KL, the coefficient is halved; if it rises well above the target, the coefficient is doubled.
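A minimal sketch of that coefficient update, following the 1.5x thresholds and the halving/doubling rule from the PPO paper:

```python
def adapt_kl_coefficient(beta, d, d_targ):
    # d is the measured mean KL divergence; d_targ is the target KL
    if d < d_targ / 1.5:
        beta = beta / 2.0
    elif d > d_targ * 1.5:
        beta = beta * 2.0
    return beta
```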
The following graphs show the learning rewards:
This method is designed especially for continuous control with deep reinforcement learning. The main goal is to learn a deterministic policy from an exploratory behavior policy. From my experience, it does not converge faster than PPO, and it is sometimes sensitive to the quality of the data.
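As my own illustration (not the repo's code), a minimal sketch of forming an exploratory behavior policy by adding Gaussian noise to a deterministic actor:

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

def behavior_action(actor, state, noise_std=0.1):
    # deterministic action plus exploration noise, clipped to the action bounds
    with torch.no_grad():
        action = actor(state)
    action = action + noise_std * torch.randn_like(action)
    return action.clamp(-actor.max_action, actor.max_action)
```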
Learning curve:
To train the model better, a data filter was introduced to filter out dirty data and obtain a better model.
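As one possible form of such a filter (my assumption of what is meant, not necessarily the repo's implementation), here is a minimal running mean/std state normalizer:

```python
import numpy as np

class RunningStateFilter:
    # normalize observations with running statistics and clip outliers
    def __init__(self, shape, clip=10.0):
        self.n = 0
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.clip = clip

    def __call__(self, x):
        x = np.asarray(x, dtype=np.float64)
        self.n += 1
        delta = x - self.mean
        self.mean = self.mean + delta / self.n                         # Welford's online mean
        self.var = self.var + (delta * (x - self.mean) - self.var) / self.n
        std = np.sqrt(self.var) + 1e-8
        return np.clip((x - self.mean) / std, -self.clip, self.clip)
```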