feat: Enable simulated user for multi-turn GRPO [new] #732
Conversation
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
terrykong left a comment
copy-pasted review from #606
).input_ids[0]

# Tokenize the raw content from the environment into chat format if needed
env_role = env_output.observations[i]["role"].lower()
@SahilJain314 can you review this logic?
@SahilJain314 would you help take a look at the rollout logic? Here is more discussion on why I added this logic for the multi-turn case:
#682 (comment)
The "<|begin_of_text|>" here seems highly specific to a particular model/tokenizer etc. Is there a way to make this general? Further, we currently handle base model training as a special case of chat-model training where the chat template is just concatenation. Does the .removeprefix here handle this case?
What do you suggest? I cannot find a good way to handle multi-turn conversations -- removing "<|begin_of_text|>" is a hacky way to make sure the conversation history is correct. Please check out the discussion here: #682 (comment)
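For context, the removeprefix hack being discussed here, generalized to use the tokenizer's own BOS string rather than a hard-coded "<|begin_of_text|>", would look roughly like the sketch below. The helper name is hypothetical and this is not the PR's actual code.

```python
# Illustrative sketch only: strip whatever BOS string the tokenizer itself defines,
# so models/tokenizers whose chat template adds no BOS (e.g. the base-model
# "concatenation" template case) pass through unchanged.
def strip_bos_prefix(text: str, tokenizer) -> str:
    bos = getattr(tokenizer, "bos_token", None)
    if bos and text.startswith(bos):
        return text[len(bos):]
    return text
```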
@SahilJain314 updated with a custom chat template for multi-turn conversation. I think it is a super-hacky solution, but at least it works for both single-turn and multi-turn. Please let me know if you have better ideas.
Why wouldn't this include the BOS token?
I updated the chat template so that it does not add a BOS by default; the BOS is added only in the first message. This way, I don't break the single-turn logic and also make sure the concatenated tokens for multi-turn are correct.
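A minimal sketch of that idea, assuming a Hugging Face-style Jinja chat template: the template itself never emits a BOS, and the BOS is rendered only with the first message, so per-turn renderings can be concatenated without duplicating it. The role markers below are Llama-3-style placeholders for illustration, not necessarily what this PR uses.

```python
# Illustrative only: a chat template that emits the BOS solely for the first message.
CUSTOM_CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{% if loop.first %}{{ bos_token }}{% endif %}"
    "<|start_header_id|>{{ message['role'] }}<|end_header_id|>\n\n"
    "{{ message['content'] }}<|eot_id|>"
    "{% endfor %}"
)
# tokenizer.chat_template = CUSTOM_CHAT_TEMPLATE  # set before calling apply_chat_template
```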
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
@SahilJain314 & @terrykong would you take another look and let me know if it is good to merge?
@SahilJain314 & @terrykong -- gentle ping
@terrykong according to our discussion offline, I feel this PR is good and we just need @SahilJain314 to sign off?
Yea, I'd like @SahilJain314 to bless this change for the multi-turn rollout.
@jialei777 As such, this is what we've got: only in the multi-turn case, do `x = apply_chat_template(string, add_special_String=False)`. Here, we'd apply the chat_template as per the tokenizer's default, but then tokenize the BOS string (or find the BOS token id) and remove it from the front in a 'soft' way: that is, we check for its existence before removing it and don't fail if it isn't there. With this, we should hit all of the following:
With this change, we should be good to merge.
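A small sketch of the 'soft' removal described above, assuming a Hugging Face tokenizer; the helper name and call site are illustrative, not this PR's actual code.

```python
# Sketch of the proposed approach: render the chat template as usual, then drop a
# leading BOS token id only if it is actually present, so templates that never add
# a BOS (e.g. base-model "concatenation" templates) pass through unchanged.
def soft_strip_bos(token_ids: list[int], tokenizer) -> list[int]:
    bos_id = getattr(tokenizer, "bos_token_id", None)
    if bos_id is not None and token_ids and token_ids[0] == bos_id:
        return token_ids[1:]
    return token_ids

# Applied only to the later turns of a multi-turn rollout, e.g.:
# turn_ids = tokenizer.apply_chat_template(turn_messages, tokenize=True)
# turn_ids = soft_strip_bos(turn_ids, tokenizer)
```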
Signed-off-by: Jialei Chen <[email protected]>
Done. @SahilJain314 would you take another look?
Signed-off-by: Jialei Chen <[email protected]>
@terrykong / @SahilJain314 would you take a look at this? It has been weeks... If it is not something you are interested in, I will close the PR.
Hi @jialei777. Sorry for the delay on our side. I plan to take a look at your PR later this week to shepherd it in.
@jialei777 sorry for the delay. @ahmadki will help review and merge this.
Thank you for the PR @jialei777, and sorry for the delay from our side. Once done, I'll confirm the convergence graph you attached and we should be good to go.
Hey @ahmadki, thank you so much for picking this up. Your change looks good, please feel free to merge.
@ahmadki what is the status here? If still relevant, can we try to get this PR merged?
Closing in favor of #1412; it should be merged as soon as I confirm the convergence graph. @jialei777 Do you mind adding copyright headers to the new files:
Closing again in favor of #1412
Recreating the old PR #606, since I messed up the DCO check by merging main.
What does this PR do?
Add a simple example of multi-turn GRPO using ADK.
Issues
List issues that this PR closes (syntax):
Usage
Training reward:
[training reward plot attached]
Validation acc:
[validation accuracy plot attached]
Before your PR is "Ready for review"
Pre checks:
Additional Information