Support ADK for RL training #791

jialei777 · 2025-07-10T16:48:02Z

jialei777
Jul 10, 2025

Is your feature request related to a problem? Please describe.

I would like to use repo to train an agent

Describe the solution you'd like

ADK seems to be the right way for agentic framework. It will be nice to support ADK during GRPO training.

Describe alternatives you've considered

Alternative approaches will inevitably introduce gap between training and inference/serving, so they are not ideal.

snowmanwwg · 2025-07-21T00:08:29Z

snowmanwwg
Jul 21, 2025
Collaborator

Hi Jialei, my brief read on ADK tells me that it is actually orthogonal to what Nemo RL does - Nemo RL is a training FW that outputs a model that has reasoning capabilities. That model can be used with ADK to build and deploy as an agent. As long as ADK is model checkpoint agnostic, it should work. Let me know if my understanding is correct.

Alternatively, what do you see as the "to-dos" for Nemo RL to support ADK?

0 replies

jialei777 · 2025-07-21T18:47:23Z

jialei777
Jul 21, 2025
Author

Thank you @snowmanwwg for the feedback. My understanding is ADK is becoming the standard way to deploy model as agent. While we can train model with Nemo RL and then deploy to ADK, there is gap as the rollout in nemo rl is different from the inference time with ADK (e.g., different way of determine stop criteria for multi-turn convo). So quality might be sub-optimal when the usecase is with ADK. This is the main motivation for the request.

As the todo items, I am thinking a 2-step procedure.

I have start prototyping with ADK implementation in feat: Enable simulated user for multi-turn GRPO #606. Basically, I am adding an ADK "environment", where the simulated user response message from each policy rollout message. This can be viewed as a simple attempt but not completely closing the gap between training and inference.
A more involved and aligned between training (rollout) and inference approach is to replace the whole multi-turn rollout logic to ADK agentic logic. This means in whole rollout loop (i.e., for turn in range(max_turn): loop) will be replaced by ADK logic (exactly the same as in inference time) and ADK returns the whole conversation for reward computation.

I understand the potential huge code change and re-design of the repo structure. That is why I want to point this out and start discussion as early as possible.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support ADK for RL training #791

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Support ADK for RL training #791

Uh oh!

jialei777 Jul 10, 2025

Replies: 2 comments

Uh oh!

snowmanwwg Jul 21, 2025 Collaborator

Uh oh!

jialei777 Jul 21, 2025 Author

jialei777
Jul 10, 2025

snowmanwwg
Jul 21, 2025
Collaborator

jialei777
Jul 21, 2025
Author