- Simple version implemented; try it on the CartPole game
- Further improvements to try:
  - Dueling network structure for the function approximator, based on this paper
  - Double Q-learning to avoid overestimation, based on this paper (see the sketch below)
  - Prioritized experience replay
- Implemented for a discrete action space
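A minimal sketch of how the double Q-learning target can be formed, assuming hypothetical `q_online` and `q_target` callables that return per-action Q-values for a batch of states (names and signatures are illustrative, not taken from this repo):

```python
import numpy as np

def double_q_targets(rewards, next_states, dones, q_online, q_target, gamma=0.99):
    """TD targets with double Q-learning: the online network picks the greedy
    next action, the frozen target network evaluates it, which reduces the
    overestimation bias of plain Q-learning."""
    q_next_online = q_online(next_states)   # (batch, n_actions), online network
    q_next_target = q_target(next_states)   # (batch, n_actions), target network
    greedy_actions = np.argmax(q_next_online, axis=1)
    evaluated = q_next_target[np.arange(len(greedy_actions)), greedy_actions]
    # `dones` is assumed to be a 0/1 array marking terminal transitions.
    return rewards + gamma * (1.0 - dones) * evaluated
```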
- Basic policy gradient with Monte Carlo return
- To try the two forms of the policy gradient theorem, modify `Actor._discount_and_norm_rewards()` in `policy_gradient.py` (see the sketch after this list):
  - Current version: rewards after the action (an equivalent form with lower variance)
  - Uncomment the alternative to use the total reward of the episode (the basic form of the formula)
- Uses a moving average of episodes' returns as the baseline
- Run `run_CartPole.py` to play the CartPole balancing game in OpenAI Gym.
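The difference between the two forms can be pictured with a small NumPy sketch (illustrative only; the argument names and the `use_rewards_after_action` flag are assumptions, not the repo's actual code):

```python
import numpy as np

def discount_and_norm_rewards(episode_rewards, gamma=0.99, use_rewards_after_action=True):
    """Per-step credit for the policy gradient, normalized to reduce variance."""
    returns = np.zeros(len(episode_rewards), dtype=np.float64)
    if use_rewards_after_action:
        # Each step is credited only with the discounted rewards that follow
        # its action (the equivalent, lower-variance form).
        running = 0.0
        for t in reversed(range(len(episode_rewards))):
            running = episode_rewards[t] + gamma * running
            returns[t] = running
    else:
        # Every step shares the total discounted return of the whole episode
        # (the basic form of the formula).
        returns[:] = sum(r * gamma ** t for t, r in enumerate(episode_rewards))
    returns -= returns.mean()
    returns /= returns.std() + 1e-8
    return returns
```

The moving-average baseline mentioned above would simply be subtracted from `returns` before the normalization step.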
- Uses the clipped surrogate objective (see the sketch after this list)
- Baseline given by the state value approximated by the critic
- based on this paper
- Use `MAX_STEPS` to control the maximum length of an episode (2000 by default)
  - Note that `MAX_STEPS` places an upper bound on the total reward per episode
- Next things to be done: extend this algorithm to
  - Multi-process/multi-thread training (multiple actors, single critic; less correlation between experiences)
  - A continuous action selection model
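A minimal PyTorch-style sketch of the clipped surrogate loss with the critic's state value as the baseline (function and argument names are illustrative, and the repo's actual framework may differ):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negated clipped surrogate objective, suitable for a minimizer.

    `advantages` would typically be the discounted return minus the state
    value predicted by the critic (the baseline mentioned above)."""
    ratio = torch.exp(new_log_probs - old_log_probs)                    # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))
```

Because rollouts are cut off after `MAX_STEPS` steps, the total reward an episode can collect is bounded by that setting, as noted above.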
Tested on OpenAI Gym games:
- Trained an agent on the LunarLander task for 5000 episodes; see the hyperparameters in `experiment_results/RL_set`.
- Trained an agent on the MountainCar game with `MAX_STEPS` set to 5000, so each episode can run for at most 5000 time steps; see the detailed hyperparameter settings in `experiment_results/RL_set`.
- Trained an agent on the CartPole game with `MAX_STEPS` set to 5000, so the maximum obtainable total reward is 5000; see the detailed hyperparameter settings in `experiment_results/RL_set`.