🇦 🇻 🇱 🇪 🇦 🇷 🇳 🤖

Advantage (A) and state-value (V) function learning

This is an alternative to Q-learning: instead of learning the state-action value function Q directly, Q is represented as the sum of an advantage function A and a state-value function V (i.e., Q(s, a) = V(s) + A(s, a)). Some may consider this decomposition superfluous, but we will experiment first and see where it gets us.

Also, please note that the only similarity with that work is in representing Q as A + V; the update (learning) algorithm here is different.

The code will be here soon!
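
Until then, here is a minimal illustrative sketch of one way to realize the Q(s, a) = V(s) + A(s, a) decomposition with two network heads. This is not the repository's implementation; the class name, layer sizes, and the choice of PyTorch are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AVNetwork(nn.Module):
    """Illustrative only: Q(s, a) represented as V(s) + A(s, a).

    Not the repository's implementation; names and sizes are placeholders.
    """

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        # State-value head V(s): depends only on the state.
        self.v = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        # Advantage head A(s, a): depends on the state and the action.
        self.a = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        v = self.v(obs)                            # V(s)
        a = self.a(torch.cat([obs, act], dim=-1))  # A(s, a)
        return v + a                               # Q(s, a) = V(s) + A(s, a)
```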

gymnasium [MuJoCo]
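
The benchmarks below use the standard gymnasium MuJoCo tasks. For reference, a minimal interaction loop looks like this (requires `gymnasium[mujoco]`; the random action is only a stand-in for a trained AV policy):

```python
import gymnasium as gym

# Any of the tasks reported below: "Ant-v4", "HalfCheetah-v4", "Swimmer-v4".
env = gym.make("Ant-v4")

obs, info = env.reset(seed=0)
for _ in range(1000):
    action = env.action_space.sample()  # stand-in for a trained AV policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```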

Ant-v4

Trained for 3 000 000 timesteps

| AV (Ant-v4) | SAC (Ant-v1) | TD3 (Ant-v1) |
| --- | --- | --- |
| 7200 | 6000 | 6000 |
Ant-v4.mp4

HalfCheetah-v4

Trained for 3 000 000 timesteps

| AV (HalfCheetah-v4) | SAC (HalfCheetah-v1) | TD3 (HalfCheetah-v1) |
| --- | --- | --- |
| 17000 | 16000 | 12000 |
HalfCheetah-v4.mp4
HalfCheetah-v4-r2.mp4

Swimmer-v4

Trained for 300 000 timesteps

| AV (Swimmer-v4) | SAC (Swimmer-v0) | TD3 (Swimmer-v0) |
| --- | --- | --- |
| 270 | 40 | 40 |
Swimmer-v4.mp4