feat: Enable simulated user for multi-turn GRPO [new] #732
Conversation
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
terrykong left a comment
copy-pasted review from #606
).input_ids[0]

# Tokenize the raw content from the environment into chat format if needed
env_role = env_output.observations[i]["role"].lower()
@SahilJain314 can you review this logic?
@SahilJain314 would you help take a look at the rollout logic? Here is more discussion on why I added this logic for the multi-turn case:
#682 (comment)
The "<|begin_of_text|>" here seems highly specific to a particular model/tokenizer etc. Is there a way to make this general? Further, we currently handle base model training as a special case of chat-model training where the chat template is just concatenation. Does the .removeprefix here handle this case?
What do you suggest? I cannot find a good way to handle multi-turn conversations -- removing "<|begin_of_text|>" is a hacky way to make sure the conversation history is correct. Please check out the discussion here: #682 (comment)
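For context, the removeprefix hack being discussed here, generalized to use the tokenizer's own BOS string rather than a hard-coded "<|begin_of_text|>", would look roughly like the sketch below. The helper name is hypothetical and this is not the PR's actual code.

```python
# Illustrative sketch only: strip whatever BOS string the tokenizer itself defines,
# so models/tokenizers whose chat template adds no BOS (e.g. the base-model
# "concatenation" template case) pass through unchanged.
def strip_bos_prefix(text: str, tokenizer) -> str:
    bos = getattr(tokenizer, "bos_token", None)
    if bos and text.startswith(bos):
        return text[len(bos):]
    return text
```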
@SahilJain314 updated with a custom chat template for multi-turn conversation. I think it is a super-hacky solution, but at least it works for both single-turn and multi-turn. Please let me know if you have better ideas.
Why wouldn't this include the BOS token?
I updated the chat template so that it does not add a BOS by default; the BOS is added only in the first message. This way, I don't break the single-turn logic and also make sure the concatenated tokens for multi-turn are correct.
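A minimal sketch of that idea, assuming a Hugging Face-style Jinja chat template: the template itself never emits a BOS, and the BOS is rendered only with the first message, so per-turn renderings can be concatenated without duplicating it. The role markers below are Llama-3-style placeholders for illustration, not necessarily what this PR uses.

```python
# Illustrative only: a chat template that emits the BOS solely for the first message.
CUSTOM_CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{% if loop.first %}{{ bos_token }}{% endif %}"
    "<|start_header_id|>{{ message['role'] }}<|end_header_id|>\n\n"
    "{{ message['content'] }}<|eot_id|>"
    "{% endfor %}"
)
# tokenizer.chat_template = CUSTOM_CHAT_TEMPLATE  # set before calling apply_chat_template
```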
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
@SahilJain314 & @terrykong would you take another look and let me know if it is good to merge?
@SahilJain314 & @terrykong -- gentle ping
@terrykong according to our discussion offline, I feel this PR is good and we just need @SahilJain314 to sign off?
Yea, I'd like @SahilJain314 to bless this change for the multi-turn rollout.
@jialei777 As such, this is what we've got: only in the multi-turn case, do `x = apply_chat_template(string, add_special_String=False)`. Here, we'd apply the chat_template as per the tokenizer's default, but then tokenize the BOS string (or find the BOS token id) and remove it from the front in a 'soft' way: that is, we check for its existence before removing it and don't fail if it isn't there. With this, we should hit all of the following:
With this change, we should be good to merge.
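A small sketch of the 'soft' removal described above, assuming a Hugging Face tokenizer; the helper name and call site are illustrative, not this PR's actual code.

```python
# Sketch of the proposed approach: render the chat template as usual, then drop a
# leading BOS token id only if it is actually present, so templates that never add
# a BOS (e.g. base-model "concatenation" templates) pass through unchanged.
def soft_strip_bos(token_ids: list[int], tokenizer) -> list[int]:
    bos_id = getattr(tokenizer, "bos_token_id", None)
    if bos_id is not None and token_ids and token_ids[0] == bos_id:
        return token_ids[1:]
    return token_ids

# Applied only to the later turns of a multi-turn rollout, e.g.:
# turn_ids = tokenizer.apply_chat_template(turn_messages, tokenize=True)
# turn_ids = soft_strip_bos(turn_ids, tokenizer)
```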
Signed-off-by: Jialei Chen <[email protected]>
Done. @SahilJain314 would you take another look?
Signed-off-by: Jialei Chen <[email protected]>
@terrykong / @SahilJain314 would you take a look at this? It has been weeks... If it is not something you are interested in, I will close the PR.
Hi @jialei777. Sorry for the delay on our side. I plan to take a look at your PR later this week to shepherd it in.
@jialei777 sorry for the delay. @ahmadki will help review and merge this.
Thank you for the PR @jialei777, and sorry for the delay from our side. Once done, I'll confirm the convergence graph you attached and we should be good to go.
Hey @ahmadki, thank you so much for picking this up. Your change looks good, please feel free to merge.
@ahmadki what is the status here? If still relevant, can we try to get this PR merged?
Closing in favor of #1412; it should be merged as soon as I confirm the convergence graph. @jialei777 Do you mind adding copyright headers to the new files:
Closing again in favor of #1412
Recreating the old PR #606, since I messed up the DCO check by merging main.
What does this PR do?
Add a simple example of multi-turn GRPO using ADK.
Issues
List issues that this PR closes (syntax):
Usage
Training reward:
[training reward plot attached]
Validation acc:
[validation accuracy plot attached]
Before your PR is "Ready for review"
Pre checks:
Additional Information