Fix crash in greedy assisted generation with different tokenizers#46936
Open
Sunt-ing wants to merge 1 commit into
Open
Fix crash in greedy assisted generation with different tokenizers#46936Sunt-ing wants to merge 1 commit into
Sunt-ing wants to merge 1 commit into
Conversation
Greedy Universal Assisted Generation (do_sample=False with the main and assistant models on different tokenizers) crashed with a tensor-size RuntimeError. The assistant re-encodes the prompt to a different length, but the main model position_ids were inherited unchanged into the assistant first draft round, mismatching its input length. Drop the inherited position_ids before the assistant generates, next to the existing attention_mask pop, so the assistant rebuilds them from its own input. Greedy assisted decoding stays lossless versus plain greedy.
Contributor
|
CI Dashboard: View test results in Grafana |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Universal Assisted Generation with a greedy main model (
do_sample=False, main and assistant on different tokenizers) crashes with a shape mismatch:generate(..., do_sample=False)with different tokenizers routes toAssistedCandidateGeneratorDifferentTokenizers. Itsassistant_kwargsinherit the main model's kwargs, including aposition_idssized to the main tokenizer's length. The assistant re-encodes the prompt with its own tokenizer, usually to a different length, but that inheritedposition_idsis passed straight into the assistant's draft round unchanged. An absolute-position assistant (GPT-2) then crashes for any length mismatch, and a rotary assistant (Llama) crashes whenever its re-encoding is longer than the main sequence. The generator already pops the inheritedattention_maskfor this exact reason, but never theposition_ids.The fix drops the inherited
position_idsbefore the assistant generates, right next to the existingattention_maskpop that handles the identical main-length-mismatch problem. The assistant then rebuildsposition_idsfrom its own input. Assisted decoding stays lossless: greedy UAG output is now token-identical to plain greedy.Reproduction (CPU, real tiny checkpoints) and before/after
Before this PR:
After this PR: all of the above run, and greedy UAG output is token-identical to plain greedy across every case. Reverting the fix brings the crash back.
ruff checkandruff formatare clean.Who can review?
@gante