It's an off-policy method, since it behaves with an exploring policy while evaluating the greedy policy. Below I discuss the per-decision importance sampling version of the algorithm, since it should have lower variance than the vanilla off-policy per-episode importance sampling.
The update:
In the above, if
Since
to avoid division by zero errors in the incremental implementation of the algorithm. This also has the interpretation that these trajectories are still pretty unlikely under the target policy.
Based on the update above, the first
As
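As a concrete illustration of the kind of return discussed above, here is a minimal sketch of a per-decision importance-sampled n-step return with a greedy target policy, an epsilon-greedy behaviour policy, and importance ratios lower bounded by a small constant. The function names, the flat integer state representation, and the exact form of the return are assumptions for this sketch and may differ from what `n_step/n_step_qlearning.py` actually implements.

```python
import numpy as np


def importance_ratio(Q, s, a, eps, num_actions, min_rho):
    """rho = pi(a|s) / b(a|s) for a greedy target pi and an eps-greedy behaviour b."""
    greedy = a == int(np.argmax(Q[s]))
    pi = 1.0 if greedy else 0.0
    b = (1.0 - eps) + eps / num_actions if greedy else eps / num_actions
    return max(pi / b, min_rho)  # lower bound keeps the ratio strictly positive


def per_decision_nstep_return(Q, window, gamma=0.95, eps=0.1, min_rho=1e-3):
    """Per-decision importance-sampled n-step return for Q-learning (sketch).

    `window` is a list of (state, action, reward, next_state) transitions
    collected with the eps-greedy behaviour policy; states are assumed to be
    flat integer indices into Q of shape (n_states, n_actions).  The return
    bootstraps from the greedy value of the last next_state (assumed
    non-terminal here), and no ratio is needed for the first action because
    Q(s_t, a_t) already conditions on the action actually taken.
    """
    num_actions = Q.shape[-1]
    *_, last_next = window[-1]
    G = np.max(Q[last_next])  # greedy bootstrap, no correction needed
    for k in range(len(window) - 1, -1, -1):
        _, _, r, _ = window[k]
        if k == len(window) - 1:
            G = r + gamma * G
        else:
            # The ratio of the action taken one step later weights the tail return.
            s_next, a_next, _, _ = window[k + 1]
            rho = importance_ratio(Q, s_next, a_next, eps, num_actions, min_rho)
            G = r + gamma * rho * G
    return G


# The corresponding update for the transition at the start of the window:
# s0, a0 = window[0][0], window[0][1]
# Q[s0, a0] += alpha * (per_decision_nstep_return(Q, window) - Q[s0, a0])
```

With a window of length one this reduces to the ordinary 1-step Q-learning target, since the bootstrap is already the greedy value and no ratios are involved.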
Finally, since in Q-learning policy evaluation overlaps with policy improvement, it is possible to get maximisation bias. This can be fixed by doing double learning - using two value functions. We then alternate updates in which we move one function's value at the current state-action pair towards a target formed from the reward and the other function's value at the next state, where the next action is selected greedily according to the function being updated.
Intuitively, you are updating your beliefs about your favourite action using an independent second opinion of its value, so an accidental overestimate does not get to confirm itself.
Mathematically, given two value functions $Q_1$ and $Q_2$, one of the two updates is

$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma\, Q_2\big(S_{t+1}, \arg\max_a Q_1(S_{t+1}, a)\big) - Q_1(S_t, A_t) \Big],$$

and then you alternate this and the analogous version for $Q_2$, with the roles of the two functions swapped.
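A minimal tabular sketch of this scheme, assuming states are flat integer indices into numpy arrays `Q1` and `Q2` of shape `(n_states, n_actions)`; this is not taken from the repository (double learning is marked as a TODO further below), just an illustration of the update.

```python
import numpy as np


def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=0.95, rng=None):
    """One tabular double Q-learning update (sketch).

    With probability one half we update Q1, selecting the next action with Q1
    but evaluating it with Q2, and vice versa, so each function's greedy pick
    is scored by an independent estimate (terminal-state handling omitted).
    """
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q1[s_next]))      # action selection by Q1
        target = r + gamma * Q2[s_next, a_star]  # evaluation by Q2
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        a_star = int(np.argmax(Q2[s_next]))      # action selection by Q2
        target = r + gamma * Q1[s_next, a_star]  # evaluation by Q1
        Q2[s, a] += alpha * (target - Q2[s, a])
```

Updating only one of the two functions per step keeps action selection and evaluation decoupled, which is what counteracts the maximisation bias.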
The experiments are based on the Sutton and Barto Dyna-Q gridworld example. Q values are always initialised at 0, and rewards are 0 for all transitions except the one reaching the goal state. The start is always at (2, 0), the discount is 0.95, the learning rate is 0.1, and epsilon is 0.1. In the ascii diagrams below, "." represents a state whose values haven't been updated; the arrows "<, >, v, ^" correspond to the actions to go west, east, south and north respectively; "|" corresponds to a wall state and "G" is the goal. Unless stated explicitly, all models train for 100 episodes.
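For reference, the setup described above could be collected into a small configuration like the one below; the grid dimensions are read off the 6x9 ascii diagrams that follow, and the field names are hypothetical rather than the ones used by the scripts in this repo.

```python
# Hypothetical experiment configuration mirroring the setup described above.
CONFIG = {
    "grid_shape": (6, 9),    # rows x columns, as in the ascii diagrams
    "start_state": (2, 0),
    "discount": 0.95,        # gamma
    "learning_rate": 0.1,    # alpha
    "epsilon": 0.1,          # exploration rate of the behaviour policy
    "q_init": 0.0,           # all action values start at zero
    "num_episodes": 100,     # unless stated otherwise
}
```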
This corresponds to vanilla Q-learning. The command to run the experiment is:
$ python n_step/n_step_qlearning.py num_steps=1 seed=0
The learned greedy policy is:
. . . . . . . | G
v . | v < . . | ^
v < | v . v v | ^
v < | > > > > > ^
> > > ^ ^ | ^ ^ <
> ^ ^ < . > > ^ .
Evaluating the greedy policy from the start state (2, 0), we get:
. . . . . . . | G
. . | . . . . | ^
v . | . . . . | ^
v . | > > > > > ^
> > > ^ . | . . .
. . . . . . . . .
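The evaluation above simply follows the learned greedy policy from the start state until it reaches the goal. A sketch of such a rollout, with an assumed deterministic `env_step` transition function (not something from this repo), could look like:

```python
import numpy as np


def greedy_rollout(Q, env_step, start_state, goal_state, max_steps=1000):
    """Follow the greedy policy w.r.t. Q from start_state and record the path.

    `env_step(state, action) -> next_state` is an assumed deterministic
    gridworld transition function.  The rollout stops at the goal or after
    max_steps, which guards against loops when the greedy policy is not optimal.
    """
    state, path = start_state, [start_state]
    for _ in range(max_steps):
        if state == goal_state:
            break
        action = int(np.argmax(Q[state]))  # greedy action selection
        state = env_step(state, action)
        path.append(state)
    return path
```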
Similar to the experiments in the book, 1-step Q-learning settles on an optimal policy after 21 episodes. Note that different seeds may change the final path found; with seed=0
it happened to find an optimal path, but since the Q-learning guarantees are asymptotic, in general we cannot guarantee convergence to the optimal policy in a finite number of steps.
The episode length vs number of episodes plot is:
This is the 10-step Q-learning algorithm, invoked with:
$ python n_step/n_step_qlearning.py num_steps=10 seed=0
The learned policy is:
> > > v v . . | G
^ < | v v v < | ^
^ v | > > v < | ^
^ < | ^ > > > > ^
> ^ < v < | ^ < <
> > > > > > > ^ .
Here we see that more states were explored and had their values updated. The convergence to a path is also quicker, taking only 7 episodes. The path on which the algorithm settled, though, is not optimal:
> > > v . . . | G
^ . | v . . . | ^
^ . | > > v . | ^
. . | . . > > > ^
. . . . . | . . .
. . . . . . . . .
The reasons for converging to a suboptimal path in a finite number of steps may be:
- We trained for a finite number of episodes and did not decay the learning rate, so the Robbins-Monro conditions do not hold (they are spelled out after this list).
- I lower bounded the importance weights, which introduces bias into the per-decision estimates.
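For reference, the Robbins-Monro conditions mentioned in the first point require the step sizes $\alpha_t$ used for a given state-action pair to satisfy

$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty,$$

which a constant learning rate of 0.1 violates (the squared step sizes do not sum to a finite value), while a decaying schedule such as $\alpha_t = 1/t$ satisfies both.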
Another issue may be overestimation bias, which could, in theory, be fixed by double learning (TODO).
The episode lengths versus number of episodes plot is shown below: