
llama-server : implement universal assisted decoding #12635


Open
g2mt wants to merge 14 commits into master

Conversation

@g2mt g2mt commented Mar 28, 2025

This pull request implements universal assisted decoding in llama-server. This is a method for performing speculative decoding with a draft model whose tokenizer is incompatible with the main model's, by decoding and re-encoding the generated text between the two models' token spaces.
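Conceptually, the bridge between the two token spaces goes through plain text. The following is a minimal, hypothetical Python sketch of that idea; the tokenizer objects and the greedy_next call are stand-ins for illustration and are not the llama.cpp API or the code in this PR:

# Hypothetical sketch of the re-tokenization step in universal assisted
# decoding. `draft_tok`, `target_tok` and `target_model.greedy_next` are
# stand-ins, not llama.cpp functions.

def translate_draft(draft_tok, target_tok, draft_tokens):
    """Map tokens drafted in the draft model's vocabulary into the target
    model's vocabulary by round-tripping through plain text."""
    text = draft_tok.decode(draft_tokens)   # draft token ids -> text
    return target_tok.encode(text)          # text -> target token ids

def accept_prefix(target_model, prefix_ids, candidate_ids):
    """Simplified greedy acceptance rule: keep candidate tokens for as long
    as they match what the target model would have produced itself."""
    accepted = []
    for tok in candidate_ids:
        if target_model.greedy_next(prefix_ids + accepted) != tok:
            break
        accepted.append(tok)
    return accepted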

It currently works, but some improvements can be made.

  • Token healing could be added to fix any weirdness that may occur when the draft model generates tokens that don't end on a word boundary (I'm not sure how much this affects performance); see the sketch after this list.
  • The translation process could be cached to improve sampling time, but that might require substantial refactoring.
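As a hypothetical illustration of the token-healing idea from the first bullet (plain Python with a stand-in tokenizer, not code from this PR): drop the last re-encoded token so the target model can regenerate it from a clean boundary.

# Hypothetical token-healing sketch: since the drafted text may end mid-word,
# drop the final target-side token and let the target model re-sample it, so
# no accepted token straddles a word boundary. `target_tok` is a stand-in.

def heal_tail(target_tok, target_ids):
    """Return (healed_ids, dropped_text). The dropped text could be used to
    constrain the target model's next token to continue the same word."""
    if not target_ids:
        return target_ids, ""
    dropped_text = target_tok.decode(target_ids[-1:])
    return target_ids[:-1], dropped_text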

@jukofyork
Collaborator

This looks really interesting! It's surprising how much crossover there is between many models' tokenisers.

@github-actions bot added the documentation, build, script, testing, android, Nvidia GPU, Vulkan, python, devops, ggml, SYCL, and Apple Metal labels Jun 28, 2025
@github-actions bot added the Ascend NPU label Jun 28, 2025
@g2mt g2mt closed this Jun 28, 2025
@Mushoz

Mushoz commented Jun 28, 2025

Why was this closed? Was really looking forward to this

@g2mt
Author

g2mt commented Jun 28, 2025

Why was this closed? Was really looking forward to this

Sorry, GitHub was acting weird. For some reason, when I merged my fork with upstream it added past upstream commits to the PR. I'll reopen it.

@g2mt g2mt reopened this Jun 28, 2025
@g2mt g2mt marked this pull request as draft June 28, 2025 19:09
@g2mt
Author

g2mt commented Jun 28, 2025

Anyway, I merged the changes from upstream and updated the PR. The automatically assigned labels are wrong and I don't know how to fix them.

Here are some sample outputs:

./build/bin/llama-speculative-simple  -md ./pythia-160m.Q8_0.gguf  -m  ./Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf -p "The quick brown fox jumped over the lazy dog.\nThe quick brown fox jumped over the lazy dog.\nThe"

The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the

encoded   21 tokens in    0.081 seconds, speed:  260.130 t/s
decoded   16 tokens in    0.318 seconds, speed:   50.335 t/s

n_draft   = 16
n_predict = 16
n_drafted = 15
n_accept  = 15
accept    = 100.000%

draft:

llama_perf_context_print:        load time =     172.40 ms
llama_perf_context_print: prompt eval time =      30.86 ms /    23 tokens (    1.34 ms per token,   745.20 tokens per second)
llama_perf_context_print:        eval time =     206.02 ms /    15 runs   (   13.73 ms per token,    72.81 tokens per second)
llama_perf_context_print:       total time =     399.07 ms /    38 tokens

target:

llama_perf_sampler_print:    sampling time =       0.97 ms /    16 runs   (    0.06 ms per token, 16511.87 tokens per second)
llama_perf_context_print:        load time =     475.29 ms
llama_perf_context_print: prompt eval time =     394.66 ms /    36 tokens (   10.96 ms per token,    91.22 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =     571.55 ms /    37 tokens

I also added a --spec-replace flag to translate the raw chat-template tags between the main and the draft models. It only does a simple string replacement for each --spec-replace pair, so it doesn't distinguish between special tokens and normal tokens.
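As a hypothetical Python illustration of that replacement (assuming each --spec-replace pair maps a main-model tag to the corresponding draft-model tag, as the command below suggests; this is not the PR's C++ implementation):

# Minimal sketch of the --spec-replace idea: plain text substitution of chat
# template tags when building the draft model's prompt. The pairs mirror the
# example command below; special vs. normal tokens are not distinguished.

SPEC_REPLACE = [
    ("<bos><start_of_turn>user\n",
     "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"),
    ("<end_of_turn>\n<start_of_turn>model\n",
     "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"),
]

def to_draft_prompt(text: str) -> str:
    """Rewrite main-model template tags into draft-model tags with plain
    string replacement."""
    for main_tag, draft_tag in SPEC_REPLACE:
        text = text.replace(main_tag, draft_tag)
    return text

The command below exercises the flag end to end: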

./build/bin/llama-speculative-simple -m gemma-3-1b-it-Q4_0.gguf -md  Llama-3.2-1B-Instruct-Q4_0.gguf -p "<start_of_turn>user\nRepeat the following sentence, as is:\nThe quick brown fox jumped over the lazy dog.<end_of_turn>\n<start_of_turn>model\nThe" -n 10 --spec-replace "<bos><start_of_turn>user\n" "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n" --spec-replace "<end_of_turn>\n<start_of_turn>model\n" "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
<bos><start_of_turn>user
Repeat the following sentence, as is:
The quick brown fox jumped over the lazy dog.<end_of_turn>
<start_of_turn>model
The quick brown fox jumped over the lazy dog.


encoded   29 tokens in    0.110 seconds, speed:  264.042 t/s
decoded   11 tokens in    0.850 seconds, speed:   12.947 t/s

n_draft   = 16
n_predict = 11
n_drafted = 26
n_accept  = 9
accept    = 34.615%

draft:

llama_perf_context_print:        load time =    2210.29 ms
llama_perf_context_print: prompt eval time =     113.77 ms /    29 tokens (    3.92 ms per token,   254.89 tokens per second)
llama_perf_context_print:        eval time =     572.07 ms /    25 runs   (   22.88 ms per token,    43.70 tokens per second)
llama_perf_context_print:       total time =     960.16 ms /    54 tokens

target:

llama_perf_sampler_print:    sampling time =       0.83 ms /    11 runs   (    0.08 ms per token, 13269.00 tokens per second)
llama_perf_context_print:        load time =     903.03 ms
llama_perf_context_print: prompt eval time =     724.68 ms /    56 tokens (   12.94 ms per token,    77.28 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    3170.51 ms /    57 tokens

@g2mt g2mt marked this pull request as ready for review June 28, 2025 19:23