
Translating on/off policy #164

Jul 1, 2022 · 1 comment · 3 replies

The Q-Learning algorithm is called an off-policy algorithm because the policy being
trained is not necessarily the one being executed: in the previous code example, the
policy being executed (the exploration policy) is completely random, while the policy
being trained will always choose the actions with the highest Q-Values. Conversely,
the Policy Gradients algorithm is an on-policy algorithm: it explores the world using
the policy being trained.
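
To make the distinction in the quoted passage concrete, here is a minimal sketch of off-policy Q-Learning. It is not the book's code example: the 4-state chain environment, the `step` helper, and the hyperparameters are all assumptions made up for illustration. The executed policy picks actions uniformly at random, while the update target uses the max over next-state Q-Values, i.e. the greedy policy that is actually being learned.

```python
import random

# Hypothetical toy environment (not from the book): a chain of states 0..3.
# Action 0 moves left, action 1 moves right; reaching state 3 ends the episode
# with reward 1.
N_STATES, N_ACTIONS, GOAL = 4, 2, 3


def step(state, action):
    """Deterministic transition for the toy chain."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            # Behavior (exploration) policy: completely random.
            action = rng.randrange(N_ACTIONS)
            next_state, reward = step(state, action)
            # Target policy: greedy. The max ignores what the random
            # behavior policy will actually do next, hence "off-policy".
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_values = q_learning()
# Greedy policy read off the learned Q-Values for the non-terminal states.
greedy_policy = [
    max(range(N_ACTIONS), key=lambda a: q_values[s][a]) for s in range(GOAL)
]
print(greedy_policy)
```

Even though the agent only ever acts at random, the greedy policy recovered from the Q-Values is a different policy from the one that was executed. An on-policy method would instead collect its experience by acting with the policy it is training.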

Perhaps "(không) dựa trên chính sách" (literally "(not) based on the policy")?

cuongvng Jul 3, 2022
Collaborator Author

@minhduc0711
cuongvng Jul 7, 2022
Collaborator Author

Answer selected by cuongvng