title | layout | filename
---|---|---
Home | template | home
For this project, we learned about deep Q-learning for training an agent to complete tasks with reinforcement learning. Specifically, we studied the Deep Q Network (DQN) model introduced in the DQN paper by DeepMind. To consolidate our understanding of this topic, we implemented Q-learning and deep Q-learning algorithms for applications including Tic-Tac-Toe, Mancala, path finding, and environments in the OpenAI Gym (Mountain Car, Lunar Lander, and Atari Breakout).
Reinforcement learning is applicable to a wide range of problems in computer science and robotics. We are interested in taking a broad survey of the field and then focusing on deep reinforcement learning. We hope to learn as much about the topic as we can in the time given, and to use that knowledge to implement a deep RL agent that optimizes reward in increasingly complex environments, culminating with an Atari game.
Q-learning is a method for mapping out the reward space of a finite Markov decision process environment.
Bellman equation (Source): Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
The Q-learning agent learns an action-value function, Q, which directly approximates q*, the optimal action-value function for the environment.
At any given state, a random draw with probability epsilon decides whether to take a random valid action or to take the action that maximizes the expected reward according to an internal table of values (the Q table). The state is then updated based on the agent's action and the opponent's response, and another draw is made for the new state. When the game reaches completion, a reward based on whether the agent won or lost is propagated back through the prior state-action pairs, mapping out the expected reward of each action at each state, Q. The table is populated by playing the game, so over time the agent gets better and better at maximizing its reward.
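As a rough illustration of this loop, here is a minimal tabular sketch in Python. The environment interface (`env.reset()`, `env.step()`, `env.valid_actions()`) is hypothetical and only stands in for a game like Tic-Tac-Toe where the opponent's move is folded into the step; the learning rate `alpha` and discount `gamma` are illustrative values, not the exact ones used in our notebooks.

```python
import random
from collections import defaultdict

def run_episode(env, Q, epsilon=0.1, alpha=0.1, gamma=0.9):
    """One episode of tabular Q-learning with epsilon-greedy exploration.

    The end-of-game reward is propagated back through the visited
    (state, action) pairs, as described in the text above.
    """
    history = []                      # (state, action) pairs visited this episode
    state, done, reward = env.reset(), False, 0.0
    while not done:
        # Epsilon-greedy draw: explore with probability epsilon, else exploit Q.
        if random.random() < epsilon:
            action = random.choice(env.valid_actions(state))
        else:
            action = max(env.valid_actions(state), key=lambda a: Q[(state, a)])
        history.append((state, action))
        state, reward, done = env.step(action)   # opponent's move folded into step()

    # Propagate the terminal reward back through the episode history.
    target = reward
    for s, a in reversed(history):
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        target *= gamma
    return Q

Q = defaultdict(float)   # Q table, keyed by (state, action)
```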
While working on our first goal, we found that in the simple examples we were able to store a complete Q table of all state-action pairs. Additionally, those environments are simple enough that the agent reaches the goal state relatively quickly, so propagating the reward back through the history of states does not take much time. These benefits do not hold for the more complicated environments in the OpenAI Gym, which led us to the Deep Q Network model: a neural network that predicts the Q value instead of looking it up in a table.
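The core idea is to replace the table lookup with a network that outputs one Q value per action, trained against a one-step temporal-difference target. The sketch below is a generic PyTorch version of that idea, not our exact notebook code; the hidden-layer size and the use of a separate target network are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q value per action, replacing the Q table."""
    def __init__(self, n_inputs, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def td_loss(q_net, target_net, batch, gamma=0.99):
    """One-step temporal-difference loss on a batch of transitions."""
    states, actions, rewards, next_states, dones = batch
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1 - dones)
    return nn.functional.mse_loss(q_pred, q_target)
```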
While deep Q-learning approximates the Q function with a neural network, the policy gradient approach optimizes directly in policy space. Concretely, a policy gradient network outputs action probabilities given the current state, while a Q network outputs an estimate of expected future reward for each action given the current state. It has been shown that policy gradients can outperform DQN when tuned well. Policy gradients are also considered more widely applicable than DQNs, especially in situations where the Q function is too complex to be learned.
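To make the contrast concrete, here is the standard REINFORCE-style objective in PyTorch: each action's log-probability under the policy is weighted by the return that followed it. This is the textbook form of the policy gradient loss, shown only as a sketch; it assumes the per-step returns have already been computed and is not necessarily the exact loss used in our implementations.

```python
import torch

def policy_gradient_loss(log_probs, returns):
    """REINFORCE-style objective: weight each step's log pi(a_t | s_t)
    by the return that followed it, and minimize the negative."""
    log_probs = torch.stack(log_probs)                       # one 0-dim tensor per step
    returns = torch.as_tensor(returns, dtype=torch.float32)  # return following each step
    return -(log_probs * returns).sum()
```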
John Lambert has written a great post on the math behind policy gradients, which can be found here
See Blog post 1
The implementation of a path finding agent using Q learning can be found in this Google Colab Notebook.
The implementation of training an RL agent to play Mountain Car using DQN can be found in this Google Colab Notebook. While tuning the DQN model, we found that expanding the single-state input to include 4 history states drastically improves the convergence rate.
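One simple way to build that 4-state input is to keep a rolling buffer of the most recent observations and concatenate them into the network input. The snippet below is a minimal sketch assuming the classic Gym API (where `reset()` returns the observation directly), not the exact code in the notebook.

```python
import numpy as np
from collections import deque
import gym

env = gym.make("MountainCar-v0")

# Keep the 4 most recent observations so the network sees short-term history.
history = deque(maxlen=4)
obs = env.reset()
for _ in range(4):
    history.append(obs)          # pad the buffer with the initial state

def stacked_state(history):
    return np.concatenate(history)   # shape: 4 * observation_dim

state = stacked_state(history)
# After each step, append the new observation and restack:
# next_obs, reward, done, info = env.step(action)
# history.append(next_obs)
# next_state = stacked_state(history)
```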
This environment is a simulated "lunar landing," where the agent is tasked with landing a vehicle on a randomized "surface of the moon" using 3 engines. This is a standard environment in the OpenAI Gym, and information about the reward schema can be found at the link above.
One important note: to speed up training, I limited the length of episodes from 1000 frames to 400 frames. This is still plenty of time for the agent to land, but it cuts down on unproductive time spent hovering above the lunar surface.
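Capping the episode length amounts to ending the rollout early once a frame budget is hit. A minimal sketch of that loop, assuming the classic Gym API for LunarLander-v2 and a placeholder random action in place of the learned policy:

```python
import gym

env = gym.make("LunarLander-v2")
MAX_FRAMES = 400   # down from the default cap of 1000 frames

obs, done, frame = env.reset(), False, 0
while not done and frame < MAX_FRAMES:
    action = env.action_space.sample()        # placeholder for the policy's action
    obs, reward, done, info = env.step(action)
    frame += 1
```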
The lunar lander policy network is a feed-forward network made up of a num_input->16 linear layer, a rectified linear unit, a 16->num_output linear layer, and finally a softmax function (sketched after the hyperparameter list below). The network is trained with the Adam optimizer, with the loss for each step computed as the reward following the (state, action) pair multiplied by P(action|state).
Hyperparameters are as follows:
- Learning rate: 0.001
- Gamma: 0.999
- Batch size: 2
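For reference, here is a minimal PyTorch sketch of the network and optimizer described above. The 8-dimensional observation and 4 discrete actions come from LunarLander-v2; the action-sampling lines are illustrative rather than a copy of our implementation.

```python
import torch
import torch.nn as nn

n_inputs = 8      # LunarLander-v2 observation size
n_outputs = 4     # number of discrete actions

# Feed-forward policy network: linear -> ReLU -> linear -> softmax.
policy = nn.Sequential(
    nn.Linear(n_inputs, 16),
    nn.ReLU(),
    nn.Linear(16, n_outputs),
    nn.Softmax(dim=-1),
)

optimizer = torch.optim.Adam(policy.parameters(), lr=0.001)

# Sampling an action from the policy's output distribution:
# probs = policy(torch.as_tensor(state, dtype=torch.float32))
# action = torch.distributions.Categorical(probs).sample()
```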
This agent converged to a reward of ~100 after roughly 9000 episodes. Further training did improve the model, but only slightly. Below is a GIF of the agent landing the vehicle after 21000 training episodes; for more discussion of this model's results, see blog post 2.
Atari Breakout is the classic example from DeepMind's seminal 2013 paper on deep reinforcement learning. The game is ubiquitous; Google even turned its image search into an Atari Breakout game for the game's 37th anniversary (which happened to coincide with the year the DeepMind paper was published). We attempted to train a policy gradient agent to play this classic game.
The first step in this model is preprocessing the images from the OpenAI Gym emulator. The Atari game provides a 210 × 160 pixel full-color image, which we grayscale and downsample to 105x80, then crop to 80x80. 4 sequential images are stacked and used as input to the policy network. The Atari policy network is the same as the one used in DeepMind's paper. From the paper: "The first hidden layer convolves 16 8 × 8 filters with stride 4 with the input image and applies a rectifier nonlinearity. The second hidden layer convolves 32 4 × 4 filters with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully-connected and consists of 256 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action." We added a softmax function as a final step.
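A rough PyTorch sketch of this pipeline is shown below. The grayscale conversion (channel average) and the exact crop window are assumptions for illustration; the convolutional layer sizes follow the paper's description, with the fully-connected input size worked out for our 80x80 crop, and the 4 outputs correspond to Breakout's valid actions.

```python
import numpy as np
import torch
import torch.nn as nn

def preprocess(frame):
    """Convert a 210x160x3 Atari frame to an 80x80 grayscale image in [0, 1]."""
    gray = frame.mean(axis=2)            # crude grayscale: average the RGB channels
    small = gray[::2, ::2]               # downsample by 2 -> 105x80
    cropped = small[17:97, :]            # crop rows to 80x80 (assumed crop window)
    return cropped.astype(np.float32) / 255.0

# Four preprocessed frames are stacked along the channel axis -> input shape (4, 80, 80).
policy = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 16 8x8 filters, stride 4
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 32 4x4 filters, stride 2
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 256), nn.ReLU(),                  # 8x8 feature map for an 80x80 input
    nn.Linear(256, 4),                                      # one output per valid Breakout action
    nn.Softmax(dim=-1),                                     # softmax added for the policy-gradient agent
)
```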
Unfortunately, we haven't been able to get this agent to train. One downside of deep reinforcement learning is the sensitivity of models to hyperparameters, and we haven't found a combination of parameters that works well. I've found that too high a learning rate results in an unstable model whose loss grows exponentially, but too low a learning rate doesn't seem to minimize the loss over time. Below is a plot of the reward of the model with a learning rate of 0.00003; even given 8 hours to train over ~32000 episodes, the model doesn't improve at all.
One of our implementations of training an RL agent to play Breakout with DQN can be viewed in this Google Colab Notebook. After training for 12000 episodes, at which point we ran out of RAM in Google Colab, the best score the agent achieved was 3 per life. This is shown in the figure below.
Even though the agent's Breakout score didn't improve at all after 12000 episodes, the average of the maximum Q values for each episode did converge. This is shown in the figure below.
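For clarity, this metric is just the per-episode average of max_a Q(s, a) over the states the agent visited. A small sketch of how it can be computed, assuming the states have been collected into an array:

```python
import torch

def avg_max_q(q_net, states):
    """Average of max_a Q(s, a) over the states visited in an episode,
    the convergence metric plotted above."""
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(states, dtype=torch.float32))
    return q_values.max(dim=1).values.mean().item()
```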
This behavior is similar to what is described in the DQN paper: there, the model's score improved slowly, while the Q values converged much more quickly and consistently. The paper also trained the agent for 10 million frames, which is far longer than our training run. As a next step, it would be interesting to see whether we get a similar result on the score after continuing to train the model for a comparable duration.
This was a fascinating project, and we both took away a lot from it. We found deep reinforcement learning to be quite a difficult technique to implement well, even given the great tools at our disposal. At the same time, it's quite evident that deep RL is an incredibly powerful tool.
We plan to continue working on the Atari implementation with both policy gradient and DQN approaches for the foreseeable future; a working agent would be a great achievement. We're also interested in creating other robotic simulations in the OpenAI Gym framework to train further models.