
Generalize rewards manager #221

Closed
wants to merge 5 commits

Conversation


@lorenwel (Contributor) commented on Jan 29, 2024

Description

Generalizes the reward manager to support multiple "reward groups", each made up of multiple reward terms, instead of a single group.
The change is backwards compatible with the previous setup: if only plain reward terms are added to the config, the behavior is unchanged.

Fixes #220
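
For illustration, a minimal sketch of the backwards-compatible configuration idea, assuming a plain-dict representation; the term names, group names, and structure below are made up for this example and are not the actual config classes touched by this PR:

```python
# Flat config (previous behavior): all terms belong to one implicit group and the
# manager keeps returning a single summed reward tensor.
flat_rewards_cfg = {
    "track_lin_vel": {"weight": 1.0},
    "action_rate": {"weight": -0.01},
}

# Grouped config (this PR): each group is reduced separately, so the manager can
# expose one reward per group, e.g. a task reward plus a CMDP-style cost or a
# separate signal per critic.
grouped_rewards_cfg = {
    "task": {
        "track_lin_vel": {"weight": 1.0},
    },
    "constraints": {
        "action_rate": {"weight": -0.01},
    },
}
```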

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update (Maybe, not sure)

Checklist

  • I have run the pre-commit checks with ./orbit.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have run all the tests with ./orbit.sh --test and they pass
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@Mayankm96 added the enhancement (New feature or request) label on Jan 29, 2024

@Mayankm96 (Contributor) commented on Jan 31, 2024

Thanks for this MR. I will review it in the coming days and add feedback. We probably also want to do this for the termination manager.

The main blocking question: with this change, do we expect the Env class to return a dict of rewards? Or should it still return a single tensor and put all the miscellaneous groups into extras? The latter seems closer to the Gymnasium definitions.
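
To make the two options concrete, a small sketch; the group names and the `reward_groups` extras key are hypothetical, not part of the PR:

```python
import torch

num_envs = 4

# Option A: env.step() returns a dict with one reward tensor per group.
rewards_as_dict = {
    "task": torch.zeros(num_envs),
    "constraints": torch.zeros(num_envs),
}

# Option B (closer to Gymnasium): env.step() keeps returning a single reward
# tensor, and the remaining groups are reported through the extras/info dict.
reward_as_tensor = torch.zeros(num_envs)
extras = {"reward_groups": {"constraints": torch.zeros(num_envs)}}
```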


@lorenwel (Contributor, Author) commented on Feb 7, 2024

As currently implemented, it returns a single tensor if there are no reward groups and a dict of tensors when there are multiple.
I chose this because it can be hard to decide which group is the "main reward". For a CMDP with a single reward and a single cost this is easy, but for setups like multi-critic, where you have a dedicated critic network per reward term, it is unclear how the result should be split up.
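
A minimal sketch of that return behavior, assuming the per-group reward sums have already been computed; the names are illustrative and not the PR's code:

```python
import torch


def reduce_reward_groups(
    group_rewards: dict[str, torch.Tensor],
) -> torch.Tensor | dict[str, torch.Tensor]:
    """Return a single tensor when only the implicit default group exists,
    otherwise return one summed reward tensor per group."""
    if len(group_rewards) == 1:
        return next(iter(group_rewards.values()))
    return group_rewards
```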

I'm not sure how much downstream complexity this would add for the multi-critic case in learning frameworks other than rsl_rl. There, it was a fairly straightforward modification to the OnPolicyRunner to move all tensors in the dict to GPU/CPU (plus some logging changes).
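
As a rough illustration of the kind of runner-side change described (not the actual rsl_rl code), handling either return type when moving rewards between devices could look like this:

```python
import torch


def rewards_to(
    rewards: torch.Tensor | dict[str, torch.Tensor], device: str
) -> torch.Tensor | dict[str, torch.Tensor]:
    """Move a single reward tensor or a dict of per-group reward tensors to `device`."""
    if isinstance(rewards, dict):
        return {name: value.to(device) for name, value in rewards.items()}
    return rewards.to(device)


# e.g. rewards_to(rewards, "cpu") before storing them in the rollout buffer
```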

In general, I find it a bit confusing when information that is central to learning is put into the "extras" dictionary; depending on the situation, it suddenly becomes a core input.
The same already happens with the observations, which I find confusing, but that's a separate issue.


@lorenwel (Contributor, Author) commented on Jun 7, 2024

@Fe1777 Thank you for the updated PR description.
@Mayankm96 Any update on whether this is still something you are interested in and, if so, whether the architecture is fine? I also haven't received a response in #220 since January.

We've been using this internally since then and it works fine for us.
I'm happy to resolve the conflicts if you are generally fine with the architecture.

@lorenwel (Contributor, Author) commented

Superseded by #729
