This is an alternative to Q-learning in which the state-action value function Q is not learned directly; instead, it is represented as the sum of an advantage function A and a state value function V (i.e., Q(s, a) = A(s, a) + V(s)). Some may consider this decomposition superfluous, but we will experiment first and see where it gets us.
Note that the only similarity with that work is the representation of Q as A + V; the update (learning) algorithm here is different.
The code will be here soon!
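Until the code is published, here is a minimal sketch of the Q = A + V decomposition as a critic network. It assumes a PyTorch implementation; the class name, layer sizes, and example dimensions are hypothetical and only illustrate how Q is assembled from the two heads, not the update rule used in this repository.

```python
import torch
import torch.nn as nn


class AVCritic(nn.Module):
    """Critic that represents Q(s, a) as A(s, a) + V(s). Hypothetical sketch."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Advantage branch A(s, a): conditioned on both state and action.
        self.advantage_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        # State-value branch V(s): conditioned on the state only.
        self.value_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        advantage = self.advantage_net(torch.cat([state, action], dim=-1))
        value = self.value_net(state)
        return advantage + value  # Q(s, a) = A(s, a) + V(s)


if __name__ == "__main__":
    # Dimensions chosen to match HalfCheetah-v4 (17-dim observation, 6-dim action).
    critic = AVCritic(state_dim=17, action_dim=6)
    state = torch.randn(32, 17)
    action = torch.randn(32, 6)
    q_values = critic(state, action)
    print(q_values.shape)  # torch.Size([32, 1])
```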
Trained for 3 000 000 timesteps
AV (Ant-v4) | SAC (Ant-v1) | TD3 (Ant-v1) |
---|---|---|
7200 | 6000 | 6000 |
Ant-v4.mp4
Trained for 3 000 000 timesteps
AV (HalfCheetah-v4) | SAC (HalfCheetah-v1) | TD3 (HalfCheetah-v1) |
---|---|---|
17000 | 16000 | 12000 |
HalfCheetah-v4.mp4
HalfCheetah-v4-r2.mp4
Trained for 300 000 timesteps
AV (Swimmer-v4) | SAC (Swimmer-v0) | TD3 (Swimmer-v0) |
---|---|---|
270 | 40 | 40 |