
[Question] Some questions about the PPO_lag algorithm #348

Open · 3 tasks done
tjruan opened this issue Aug 16, 2024 · 1 comment
Labels: question (Further information is requested)

Comments

tjruan commented Aug 16, 2024

Required prerequisites

Questions

Hello Omnisafe team, thank you very much for your contribution. I have run into some points of confusion that I hope you can clear up for me; I would appreciate it!
The original PPO algorithm uses the CLIP objective function. The documentation states that the surrogate loss function for the PPOLag algorithm is:
[screenshot: the PPOLag surrogate loss equation from the documentation]
Does this equation represent an advantage function that combines rewards and costs?
Does this L in PPOLag replace A(s, a) in PPO?
[screenshot: the PPO clipped objective using A(s, a)]
If so, could you point me to where the clipping is implemented in the PPOLag code?
I would greatly appreciate it if you could answer my questions.

tjruan added the question (Further information is requested) label on Aug 16, 2024
Gaiejj (Member) commented Aug 18, 2024

Does this equation represent an advantage function that combines rewards and costs?

Yes, this is an objective function that considers both reward advantage and cost advantage simultaneously.
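For reference, here is a sketch of how this combined objective is usually written for PPOLag, with \lambda the Lagrange multiplier and A^R, A^C the reward and cost advantages (the exact notation in the documentation may differ slightly):

    L(s, a) = \frac{A^{R}(s, a) - \lambda \, A^{C}(s, a)}{1 + \lambda}

The 1 / (1 + \lambda) factor is a normalization that keeps the scale of the surrogate comparable as \lambda grows.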

Does this L in PPOLag replace A(s, a) in PPO?

Yes, that is correct.

If so, could you point me to where the clipping is implemented in the PPOLag code?

The implementation of PPOLag simply replaces A(s, a) with the objective function weighted by the Lagrange multiplier. Therefore, the clipping operation can be found in the PPO implementation of omnisafe, specifically in lines 69 to 76 of the loss calculation function in PPO:

        # Importance sampling ratio between the new and old policies.
        ratio = torch.exp(logp_ - logp)
        # Clip the ratio to [1 - clip, 1 + clip], as in standard PPO.
        ratio_cliped = torch.clamp(
            ratio,
            1 - self._cfgs.algo_cfgs.clip,
            1 + self._cfgs.algo_cfgs.clip,
        )
        # Pessimistic (min) clipped surrogate; `adv` is the advantage passed in,
        # which for PPOLag is already the Lagrange-weighted combination.
        loss = -torch.min(ratio * adv, ratio_cliped * adv).mean()
        # Entropy bonus to encourage exploration.
        loss -= self._cfgs.algo_cfgs.entropy_coef * distribution.entropy().mean()
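
For completeness, here is a minimal sketch of how the Lagrange-weighted advantage that replaces A(s, a) can be formed before it is passed to the clipped loss above. The function name and signature are illustrative assumptions, not the exact omnisafe API; see the PPOLag source for the authoritative version:

    import torch

    def lagrange_weighted_advantage(
        adv_r: torch.Tensor,           # reward advantage A^R(s, a)
        adv_c: torch.Tensor,           # cost advantage A^C(s, a)
        lagrangian_multiplier: float,  # current value of the multiplier lambda
    ) -> torch.Tensor:
        # Penalize the reward advantage by lambda times the cost advantage,
        # then rescale by 1 / (1 + lambda) to keep the surrogate's magnitude stable.
        penalty = lagrangian_multiplier
        return (adv_r - penalty * adv_c) / (1 + penalty)

    # Usage: the result plays the role of `adv` in the clipped loss shown above.
    adv = lagrange_weighted_advantage(torch.randn(8), torch.randn(8), lagrangian_multiplier=0.5)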
