- Implemented for discrete action space
- Basic policy gradient with Monte Carlo return
- To try the two forms of the policy gradient theorem, modify `Actor._discount_and_norm_rewards()` in `policy_gradient.py` (see the first sketch after this list)
- Current version: rewards received after the action (an equivalent form with less variance)
- Uncomment the alternative: total reward of the episode (the basic form of the formula)
- Use a moving average of episode returns as the baseline (see the baseline sketch below)
- Run `run_CartPole.py` to play the CartPole balancing game in OpenAI Gym (an environment-loop sketch is given below)
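A minimal sketch of the two return computations; only the method name `_discount_and_norm_rewards()` comes from the repo, while the body and the `gamma`/`ep_rewards` attributes are illustrative assumptions:

```python
import numpy as np

class Actor:
    def __init__(self, gamma=0.99):
        self.gamma = gamma          # discount factor (illustrative hyperparameter)
        self.ep_rewards = []        # rewards collected during the current episode

    def _discount_and_norm_rewards(self):
        """One weight per time step for the Monte Carlo policy gradient update."""
        rewards = np.asarray(self.ep_rewards, dtype=np.float64)

        # Form 1 (current version): reward-to-go, i.e. each action is credited
        # only with the rewards that come after it -- same expectation, less variance.
        returns = np.zeros_like(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + self.gamma * running
            returns[t] = running

        # Form 2 (basic version of the formula): uncomment to credit every action
        # with the total discounted return of the whole episode instead.
        # returns = np.full_like(rewards, returns[0])

        # Normalize so the gradient scale stays stable across episodes.
        returns -= returns.mean()
        std = returns.std()
        if std > 0:
            returns /= std
        return returns
```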
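A sketch of the moving-average baseline; the function name and window size are assumptions, not the repo's actual code:

```python
import numpy as np

def moving_average_baseline(episode_returns, window=20):
    """Average of the most recent episode returns, used as a baseline."""
    recent = episode_returns[-window:]
    return float(np.mean(recent)) if recent else 0.0

# Illustrative usage: subtract the baseline from each step's return before
# feeding the result into the policy gradient update.
#   episode_returns.append(sum(ep_rewards))
#   baseline = moving_average_baseline(episode_returns)
#   advantages = discounted_returns - baseline
```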
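The script name `run_CartPole.py` comes from the repo; the loop below is only a sketch of the environment interaction it drives (classic `gym` API, pre-0.26), with a random policy standing in for the trained actor so the snippet runs on its own:

```python
import gym

env = gym.make("CartPole-v0")
for episode in range(10):
    state = env.reset()                            # classic gym: reset returns the observation only
    done, episode_return = False, 0.0
    while not done:
        # run_CartPole.py would sample the action from the learned policy here
        action = env.action_space.sample()
        state, reward, done, _ = env.step(action)  # classic gym: 4-tuple step result
        episode_return += reward
    print(f"episode {episode}: return {episode_return}")
env.close()
```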
- Uses the clipped surrogate objective (sketched after this list)
- Baseline given by the state value approximated by a critic
- PPO class to be implemented
- Based on this paper
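A minimal sketch of the clipped surrogate objective with a critic-estimated state-value baseline; the repo's framework is not stated here, so PyTorch and all names below are assumptions:

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, written as a loss to minimize."""
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (element-wise minimum) bound, averaged over the batch.
    return -torch.min(unclipped, clipped).mean()

# The advantage uses the critic's state value as the baseline:
#   advantage_t = return_t - V(s_t)
# where V is trained by regression on observed returns, e.g.
#   value_loss = torch.nn.functional.mse_loss(critic(states), returns)
```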