[Bug] Local variable 'values' not updated in the callback for the last timestep #864

Closed · rajcscw opened this issue Apr 15, 2022 · 6 comments · Fixed by #1660
Labels: bug

Comments

rajcscw commented Apr 15, 2022

🐛 Bug

In the rollout collection of PPO, the callback's reference to the local variable values is not updated after the value for the last timestep has been predicted:

values = self.policy.predict_values(obs_as_tensor(new_obs, self.device))

Due to this, at the end of the rollout, when callback.on_rollout_end() is called, locals["values"] may not match the actual final value of the local variable values.

This might lead to undesirable effects for users relying on the last values in the callback.
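
For illustration, here is a minimal sketch (not from the original report; the class name and print statement are hypothetical) of a callback that reads the rollout locals. With the current code, the values it sees in _on_rollout_end() still correspond to the last obs inside the loop, not to the freshly predicted values for new_obs:

    from stable_baselines3.common.callbacks import BaseCallback

    class InspectValuesCallback(BaseCallback):
        def _on_step(self) -> bool:
            return True

        def _on_rollout_end(self) -> None:
            # Without the fix, this tensor is the prediction for the last obs
            # inside the loop, not for new_obs, because predict_values(new_obs)
            # runs after the final callback.update_locals(...) call
            values = self.locals["values"]
            print("values seen by the callback:", values.flatten()[:3])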

One simple fix is to update the locals after self.policy.predict_values(...) by calling callback.update_locals(locals()) again, as shown below. Further, it would also be better not to overwrite values but to use a new variable such as last_values or next_values, to make it clear that it holds the predicted values for new_obs, whereas values previously corresponded to obs (keeping it in sync with the other variables rewards, actions, infos, etc.).

    with th.no_grad():
        # Compute value for the last timestep
        next_values = self.policy.predict_values(obs_as_tensor(new_obs, self.device))

    # Update the callback's locals
    callback.update_locals(locals())

    rollout_buffer.compute_returns_and_advantage(last_values=values, dones=dones)

    callback.on_rollout_end()

Expected behavior

Local variables inside the callback function are in sync with the actual local variables.

System Info

OS: Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-debian-buster-sid #1 SMP Wed Oct 13 17:20:51 UTC 2021
Python: 3.7.11
Stable-Baselines3: 1.4.0
PyTorch: 1.10.2+cu102
GPU Enabled: True
Numpy: 1.21.5
Gym: 0.19.0

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)
rajcscw added the bug label on Apr 15, 2022
Miffyli (Collaborator) commented Apr 15, 2022

Hey! Thanks for raising this. Next time remember to include minimal code to reproduce things ;)

I agree with the fix, although I would not change the variable names, in case somebody has relied on these variables in the past (also, the last_values=values would then need to be updated). @araffin, any callback-specific comments on this?

araffin (Member) commented Apr 15, 2022

Hello,

This might lead to undesirable effects for users relying on the last values in the callback.

I'm curious to know more about your use-case. A callback may not be the appropriate way to do it.

Otherwise, yes calling callback.update_locals(locals()) should solve the issue.

rajcscw (Author) commented Apr 15, 2022

@Miffyli @araffin Thanks for getting back. Just to be clear: we can introduce a new variable next_values for the line self.policy.predict_values(obs_as_tensor(new_obs, self.device)) and then update the locals in the callback. That way, values will not be changed and users who relied on it in the past are not affected.

Regarding my use-case: once the rollout is completed, I have to re-adjust the rewards in the rollout buffer and re-compute the advantages and returns (for this, I need last_dones and last_values, which I can grab from callback.locals). The only way I can do this without modifying SB3 code is to include this logic inside callback._on_rollout_end(), right? Or is there another way?
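
To make this use-case concrete, a rough sketch of such a callback could look like the following (assuming the locals are up to date as proposed above; the class name and the reward-adjustment line are hypothetical placeholders):

    from stable_baselines3.common.callbacks import BaseCallback

    class RewardAdjustCallback(BaseCallback):
        def _on_step(self) -> bool:
            return True

        def _on_rollout_end(self) -> None:
            buffer = self.locals["rollout_buffer"]
            last_values = self.locals["values"]  # needs to be in sync (this issue)
            dones = self.locals["dones"]
            # Hypothetical reward adjustment, e.g. a bonus computed from the policy
            buffer.rewards = buffer.rewards + 0.0
            # Recompute returns and advantages with the adjusted rewards
            buffer.compute_returns_and_advantage(last_values=last_values, dones=dones)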

araffin (Member) commented Apr 15, 2022

That's what I thought; you should fork SB3 for your use-case, otherwise you will compute the advantage twice and end up with hacks and side effects here and there if you use callbacks.
Or, if you just need to adjust the rewards, you can use a gym wrapper.
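
For completeness, a reward-only adjustment via a wrapper could look roughly like this (a sketch using gym.RewardWrapper; the class name and scaling factor are just examples):

    import gym

    class ScaledRewardWrapper(gym.RewardWrapper):
        def reward(self, reward):
            # Hypothetical shaping; replace with the desired adjustment
            return 0.1 * reward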

rajcscw (Author) commented Apr 15, 2022

True, I agree that using callbacks is kind of hacky. However, I think it is the only option if I do not want to fork SB3, because I work on custom envs that are compatible with SB3 out of the box (so I can plug in different algorithms easily). Also, there is a cyclic dependency between the env (reward) and the policy, since the policy is required to adjust the rewards. And since I use SubprocVecEnv for parallel rollouts, it is not efficient to send the policy to the env (reward function) frequently, due to communication overhead.

Therefore, a callback is the perfect place: I have access to both the policy and the env, and can adjust the rewards, advantages, etc.

So while working on this, I noticed that the values were out of sync (the problem I mentioned above). It would be sufficient to just call callback.update_locals(locals()) before callback.on_rollout_end(). What do you think @Miffyli @araffin? I can open a PR for this.

araffin (Member) commented Apr 17, 2022

If you use a callback for accessing both the model and the env, this is the same as forking, but more hacky...

Since I use SubprocVecEnv for parallel rollouts, it is not efficient to send the policy to the env (reward function) frequently, due to communication overhead.

You could do that in collect_rollouts() then.

What do you think @Miffyli @araffin? I can open a PR for this.

Please do =) (should be a one line change, please don't forget the changelog)
