Bug: noisy-net layer #97
@mysl, it seems you are right.
As a quick fix, to disable the noisy_net layer one can pass the policy kwarg explicitly; mind tuning entropy regularisation accordingly.
My understanding is that in the paper's algorithm for NoisyNet-DQN (Appendix C.1), noise is sampled on every environment step, while for NoisyNet-A3C (Appendix C.2), noise is sampled once per rollout batch. So in this implementation, maybe we should use a placeholder for the noise and sample it outside of the network?
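For illustration, here is a minimal NumPy sketch (not btgym code, and simplified to independent rather than factorised noise) of a noisy linear layer in the style of Fortunato et al., where the noise tensors are supplied by the caller. This is the NumPy analogue of feeding the noise through a placeholder: the resampling schedule is decided entirely outside the network.

```python
import numpy as np

rng = np.random.default_rng(0)

class NoisyLinear:
    """Noisy linear layer with externally supplied noise.

    Instead of sampling eps inside the forward pass, the caller passes
    eps_w, eps_b explicitly, so how often the noise is resampled (per
    step, per rollout, per train batch) is controlled by the caller.
    """
    def __init__(self, in_dim, out_dim, sigma0=0.5):
        bound = 1.0 / np.sqrt(in_dim)
        self.mu_w = rng.uniform(-bound, bound, (in_dim, out_dim))
        self.mu_b = rng.uniform(-bound, bound, out_dim)
        self.sigma_w = np.full((in_dim, out_dim), sigma0 / np.sqrt(in_dim))
        self.sigma_b = np.full(out_dim, sigma0 / np.sqrt(in_dim))

    def sample_noise(self):
        # Independent Gaussian noise; factorised noise would also work here.
        return (rng.standard_normal(self.mu_w.shape),
                rng.standard_normal(self.mu_b.shape))

    def forward(self, x, eps_w, eps_b):
        w = self.mu_w + self.sigma_w * eps_w
        b = self.mu_b + self.sigma_b * eps_b
        return x @ w + b

layer = NoisyLinear(4, 2)
eps_w, eps_b = layer.sample_noise()   # e.g. sample once per rollout
x = np.ones((1, 4))
y1 = layer.forward(x, eps_w, eps_b)
y2 = layer.forward(x, eps_w, eps_b)   # same noise -> identical output
```

With a TF graph the same idea would use `tf.placeholder` tensors for `eps_w`/`eps_b`, fed via `feed_dict` both when acting and when training.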
Yes, but isn't that solely in the context of gradient estimation (the train pass)?
Yes, if we need to fix the noise at the time of data acquisition (see above); no, if the noise is to be fixed for the train batch only (then we can infer the size and sample in-graph).
I think the noise should be fixed when collecting experience as well, since A3C is an on-policy algorithm. This also seems to agree with the pseudocode (line 7) in the paper.
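The per-rollout convention can be sketched as follows (hypothetical helper names, not btgym's actual runner API): sample the noise once before the rollout starts, reuse it for every action, and return it with the rollout so the train pass can replay the very same perturbation.

```python
import random

class DummyNoisyPolicy:
    """Stand-in policy; real code would sample eps for every noisy layer."""
    def __init__(self):
        self.noise_samples = 0

    def sample_noise(self):
        self.noise_samples += 1
        return random.gauss(0.0, 1.0)   # one scalar stands in for all eps tensors

    def act(self, obs, noise):
        # Deterministic given (obs, noise): same noise -> same behaviour policy.
        return int(obs + noise > 0)

def collect_rollout(env_step, policy, rollout_len, obs=0.0):
    """Fix the noise for the whole rollout, as in NoisyNet-A3C pseudocode line 7."""
    noise = policy.sample_noise()          # sampled once, NOT once per step
    rollout = []
    for _ in range(rollout_len):
        action = policy.act(obs, noise)
        obs, reward = env_step(action)
        rollout.append((obs, action, reward))
    # Returning the noise lets the train pass replay the exact perturbation
    # that generated the data, keeping the on-policy gradient unbiased even
    # if another thread resamples the layer's noise in the meantime.
    return rollout, noise

policy = DummyNoisyPolicy()
rollout, noise = collect_rollout(lambda a: (a - 0.5, 1.0), policy, rollout_len=20)
```

Storing the noise in the rollout record is also what makes the scheme robust to interleaved runner and trainer threads.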
Yes, indeed.
Due to time limitations, the expected time to fix the issue is four to five days.
TODO checklist:
- btgym.algorithms.rollout.Rollout:
- btgym.algorithms.policy.base.BaseAacPolicy:
- btgym.algorithms.runner:
- btgym.algorithms.aac.BaseAac:
Hi @Kismuz,
I was reading the paper "Noisy Networks for Exploration" and have a question about its usage in btgym. The paper says: "As A3C is an on-policy algorithm the gradients are unbiased when noise of the network is consistent for the whole roll-out. Consistency among action value functions is ensured by letting the noise be the same throughout each rollout."
It looks to me that the current implementation in btgym cannot ensure "the noise is the same throughout each rollout", because the training steps and environment steps are executed in different threads and could be interleaved. Or am I missing anything? Thanks!