
Conversation

@jialei777

Recreating the old PR #606, since I messed up with DCO by merging main.

What does this PR do ?

Add a simple example of multi-turn GRPO using ADK.

Issues

List issues that this PR closes:

Usage

uv run python examples/run_grpo_unique_numbers_w_adk.py

Training reward: (plot attached)
Validation acc: (plot attached)

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Signed-off-by: Jialei Chen <[email protected]>
@jialei777 jialei777 marked this pull request as ready for review July 23, 2025 22:49
Signed-off-by: Jialei Chen <[email protected]>
Contributor

@terrykong terrykong left a comment


copy-pasted review from #606

).input_ids[0]

# Tokenize the raw content from the environment into chat format if needed
env_role = env_output.observations[i]["role"].lower()
Contributor


@SahilJain314 can you review this logic?

Author


@SahilJain314 would you help take a look at the rollout logic? Here is more discussion on why I added this logic for the multi-turn case:
#682 (comment)

Contributor


The "<|begin_of_text|>" here seems highly specific to a particular model/tokenizer etc. Is there a way to make this general? Further, we currently handle base model training as a special case of chat-model training where the chat template is just concatenation. Does the .removeprefix here handle this case?
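For reference, a minimal sketch of the string-level behavior being questioned (the `strip_bos` helper and the Llama-3-style marker are illustrative, not code from the PR). `str.removeprefix` is a no-op when the prefix is absent, so text from a plain-concatenation (base-model) template passes through unchanged, though the marker itself remains model-specific:

```python
BOS = "<|begin_of_text|>"  # Llama-3-style marker; specific to one tokenizer family

def strip_bos(text: str) -> str:
    # str.removeprefix (Python 3.9+) is a no-op when the prefix is absent,
    # so base-model "templates" that just concatenate strings are unaffected.
    return text.removeprefix(BOS)

print(strip_bos(BOS + "user: hello"))  # BOS stripped
print(strip_bos("user: hello"))        # unchanged
```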

Author


What do you suggest? I could not find a good way to handle multi-turn conversation -- removing "<|begin_of_text|>" is a hacky way to make sure the conversation history is correct. Please check out the discussion here: #682 (comment)

Author

@jialei777 jialei777 Jul 28, 2025


@SahilJain314 updated with a custom chat template for multi-turn conversation. I think it is a super-hacky solution, but at least it works for both single-turn and multi-turn. Please let me know if you have better ideas.

Contributor


Why wouldn't this include the BOS token?

Author


I updated the chat template so that it does not add a BOS token by default; instead, the BOS is added with the first message. This way I don't break the single-turn logic and also make sure the concatenated tokens for multi-turn are correct.
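A toy illustration of the invariant described above (the `<bos>`/`<start_of_turn>` markers and the `apply_template` helper are made up for this sketch, Gemma-flavored, not the PR's actual template): if BOS is emitted only with the first chunk, then rendering a conversation turn by turn concatenates to exactly the single-shot render of the full history.

```python
def apply_template(messages, add_bos):
    # BOS is emitted only when explicitly requested, i.e. for the first chunk;
    # later per-turn renders pass add_bos=False so nothing is duplicated.
    out = "<bos>" if (add_bos and messages) else ""
    for m in messages:
        out += f"<start_of_turn>{m['role']}\n{m['content']}<end_of_turn>\n"
    return out

history = [
    {"role": "user", "content": "pick a number"},
    {"role": "model", "content": "7"},
    {"role": "user", "content": "another one"},
]

# Turn-by-turn rendering: BOS on the first chunk only.
chunks = apply_template(history[:1], add_bos=True) + apply_template(history[1:], add_bos=False)
# Single-shot rendering of the whole history.
full = apply_template(history, add_bos=True)
assert chunks == full
```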

Signed-off-by: Jialei Chen <[email protected]>
@terrykong terrykong requested a review from SahilJain314 July 28, 2025 20:55
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
@jialei777 jialei777 requested a review from terrykong July 29, 2025 21:01
@jialei777
Author

@SahilJain314 & @terrykong would you take another look and let me know if it is good to merge?

@jialei777
Author

@SahilJain314 & @terrykong -- gentle ping

@jialei777
Author

@terrykong according to our discussion offline, I feel this PR is good and we just need @SahilJain314 to sign off?

@terrykong
Contributor

Yeah, I'd like @SahilJain314 to bless this change for the multi-turn rollout.

@SahilJain314
Contributor

@jialei777
Discussed with @terrykong and I think we've arrived at a reasonable solution:
The problem we have right now is that the chat templating logic used here is specific to Gemma via the config-rewrite of the chat template (making the bos_token's existence configurable). While this works for Gemma, it would probably break in a lot of places for other models and is also very odd for users.

As such, this is what we've got:

Do this only in the multi-turn case:

x = apply_chat_template(string, add_special_tokens=False)
x = remove_if_exists_from_front(x, tokenizer.bos_token_id)  # maybe use tokenize(tokenizer.bos_token)?

Here, we'd apply the chat_template as per the tokenizer's default, then tokenize the BOS string (or find the BOS token id) and remove it from the front in a 'soft' way: that is, we check for its existence before removing it and don't fail if it isn't there.

With this, we should hit all of the following:

  • non-chat models still work (a chat template exists and the BOS token will be removed if present)
  • Gemma (with its non-configurable default-BOS chat template) works, since it still specifies the BOS token, so we can purge it. This is an improvement over the previous version of this PR, which hard-coded the Gemma BOS token.
  • non-Gemma: either a configurable or a non-configurable BOS token will work, since we do an existence check.

With this change, we should be good to merge.
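The two-line proposal above can be fleshed out as follows (a sketch; `remove_if_exists_from_front` mirrors the pseudocode name and is not an existing transformers API):

```python
def remove_if_exists_from_front(token_ids, bos_token_id):
    # 'Soft' removal: strip the BOS id only if it is actually the first token;
    # tokenizers with no BOS at all (bos_token_id is None) pass through untouched.
    if bos_token_id is not None and token_ids and token_ids[0] == bos_token_id:
        return token_ids[1:]
    return token_ids

BOS_ID = 2  # hypothetical id, for illustration only
print(remove_if_exists_from_front([2, 10, 11], BOS_ID))  # BOS stripped
print(remove_if_exists_from_front([10, 11], BOS_ID))     # no-op
print(remove_if_exists_from_front([10, 11], None))       # tokenizer has no BOS
```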

Signed-off-by: Jialei Chen <[email protected]>
@jialei777
Author

Done. @SahilJain314 would you take another look?

Signed-off-by: Jialei Chen <[email protected]>
@jialei777
Author

@terrykong / @SahilJain314 would you take a look at this? it has been weeks... If it is not something you are interested in, I will close the PR.

@terrykong
Contributor

Hi @jialei777 . Sorry for the delay on our side. I plan to take a look at your PR later this week to shepherd it in.

@euronymous-aithal
Contributor

@jialei777 sorry for the delay. @ahmadki to help review and merge this

@ahmadki ahmadki requested review from ahmadki and removed request for SahilJain314 September 16, 2025 18:32
@ahmadki
Member

ahmadki commented Sep 22, 2025

Thank you for the PR @jialei777 , and sorry for the delay from our side.
The PR looks good to merge; all previous comments have been addressed. But it has several merge conflicts and is failing some tests, since it is two months old.
To speed things up, I fixed these issues in ahmadki/simulated-user-rec - do you mind taking a look? Then feel free to either merge or pick up my changes into your PR.

Once done, I'll confirm the convergence graph you attached and we should be good to go.

@jialei777
Author

hey @ahmadki, thank you so much for picking this up. Your change looks good, please feel free to merge.

@ashors1
Contributor

ashors1 commented Oct 22, 2025

@ahmadki what is the status here? If still relevant, can we try to get this PR merged?

@ahmadki
Member

ahmadki commented Oct 22, 2025

Closing in favor of #1412 - should be merged as soon as I confirm the convergence graph

@jialei777 Do you mind adding copyright headers to the new files:

./examples/run_grpo_unique_numbers_w_adk.py
./tests/unit/environments/test_simulated_user.py
./nemo_rl/environments/simulated_user/prompt.py
./nemo_rl/environments/simulated_user/unique_numbers.py
./nemo_rl/environments/simulated_user/adk_utils.py

@ahmadki
Member

ahmadki commented Nov 22, 2025

Closing again in favor of #1412

@ahmadki ahmadki closed this Nov 22, 2025