
llama-server : implement universal assisted decoding #12635


Open
g2mt wants to merge 14 commits into master

Conversation

@g2mt g2mt commented Mar 28, 2025

This pull request implements universal assisted decoding in llama-server. This is a method for performing speculative decoding with a draft model whose tokenizer is incompatible with the main model's, by decoding and re-encoding the generated text between the two models' token spaces.
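Conceptually, the bridge between the two token spaces goes through plain text. The following is a minimal, hypothetical Python sketch of that idea; the tokenizer objects and the greedy_next call are stand-ins for illustration and are not the llama.cpp API or the code in this PR:

# Hypothetical sketch of the re-tokenization step in universal assisted
# decoding. `draft_tok`, `target_tok` and `target_model.greedy_next` are
# stand-ins, not llama.cpp functions.

def translate_draft(draft_tok, target_tok, draft_tokens):
    """Map tokens drafted in the draft model's vocabulary into the target
    model's vocabulary by round-tripping through plain text."""
    text = draft_tok.decode(draft_tokens)   # draft token ids -> text
    return target_tok.encode(text)          # text -> target token ids

def accept_prefix(target_model, prefix_ids, candidate_ids):
    """Simplified greedy acceptance rule: keep candidate tokens for as long
    as they match what the target model would have produced itself."""
    accepted = []
    for tok in candidate_ids:
        if target_model.greedy_next(prefix_ids + accepted) != tok:
            break
        accepted.append(tok)
    return accepted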

It currently works, but some improvements can be made.

  • Token healing could be added to fix any weirdness that may occur when the draft model generates tokens that don't end on a word boundary (I'm not sure how much this affects performance); see the sketch after this list.
  • The translation process could be cached to improve sampling time, but that might require substantial refactoring.
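As a hypothetical illustration of the token-healing idea from the first bullet (plain Python with a stand-in tokenizer, not code from this PR): drop the last re-encoded token so the target model can regenerate it from a clean boundary.

# Hypothetical token-healing sketch: since the drafted text may end mid-word,
# drop the final target-side token and let the target model re-sample it, so
# no accepted token straddles a word boundary. `target_tok` is a stand-in.

def heal_tail(target_tok, target_ids):
    """Return (healed_ids, dropped_text). The dropped text could be used to
    constrain the target model's next token to continue the same word."""
    if not target_ids:
        return target_ids, ""
    dropped_text = target_tok.decode(target_ids[-1:])
    return target_ids[:-1], dropped_text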

@jukofyork
Collaborator

This looks really interesting! It's surprising how much crossover there is between many models' tokenisers.

@github-actions bot added the documentation, build, script, testing, android, Nvidia GPU, Vulkan, python, devops, ggml, SYCL, and Apple Metal labels Jun 28, 2025
@github-actions bot added the Ascend NPU label Jun 28, 2025
@g2mt g2mt closed this Jun 28, 2025
@Mushoz

Mushoz commented Jun 28, 2025

Why was this closed? Was really looking forward to this

@g2mt
Author

g2mt commented Jun 28, 2025

Why was this closed? Was really looking forward to this

Sorry, GitHub was acting weird. For some reason, when I merged my fork with upstream it added past upstream commits to the PR. I'll reopen it.

@g2mt g2mt reopened this Jun 28, 2025
@g2mt g2mt marked this pull request as draft June 28, 2025 19:09
@g2mt
Author

g2mt commented Jun 28, 2025

Anyway, I merged the changes from upstream and updated the PR. The automatically assigned labels are wrong and I don't know how to fix them.

Here are some sample outputs:

./build/bin/llama-speculative-simple  -md ./pythia-160m.Q8_0.gguf  -m  ./Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf -p "The quick brown fox jumped over the lazy dog.\nThe quick brown fox jumped over the lazy dog.\nThe"

The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the

encoded   21 tokens in    0.081 seconds, speed:  260.130 t/s
decoded   16 tokens in    0.318 seconds, speed:   50.335 t/s

n_draft   = 16
n_predict = 16
n_drafted = 15
n_accept  = 15
accept    = 100.000%

draft:

llama_perf_context_print:        load time =     172.40 ms
llama_perf_context_print: prompt eval time =      30.86 ms /    23 tokens (    1.34 ms per token,   745.20 tokens per second)
llama_perf_context_print:        eval time =     206.02 ms /    15 runs   (   13.73 ms per token,    72.81 tokens per second)
llama_perf_context_print:       total time =     399.07 ms /    38 tokens

target:

llama_perf_sampler_print:    sampling time =       0.97 ms /    16 runs   (    0.06 ms per token, 16511.87 tokens per second)
llama_perf_context_print:        load time =     475.29 ms
llama_perf_context_print: prompt eval time =     394.66 ms /    36 tokens (   10.96 ms per token,    91.22 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =     571.55 ms /    37 tokens

I also added a --spec-replace flag to translate the raw chat-template tags between the main and the draft models. It only does a simple string replacement for each --spec-replace pair, so it doesn't distinguish between special tokens and normal tokens.
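As a hypothetical Python illustration of that replacement (assuming each --spec-replace pair maps a main-model tag to the corresponding draft-model tag, as the command below suggests; this is not the PR's C++ implementation):

# Minimal sketch of the --spec-replace idea: plain text substitution of chat
# template tags when building the draft model's prompt. The pairs mirror the
# example command below; special vs. normal tokens are not distinguished.

SPEC_REPLACE = [
    ("<bos><start_of_turn>user\n",
     "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"),
    ("<end_of_turn>\n<start_of_turn>model\n",
     "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"),
]

def to_draft_prompt(text: str) -> str:
    """Rewrite main-model template tags into draft-model tags with plain
    string replacement."""
    for main_tag, draft_tag in SPEC_REPLACE:
        text = text.replace(main_tag, draft_tag)
    return text

The command below exercises the flag end to end: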

./build/bin/llama-speculative-simple -m gemma-3-1b-it-Q4_0.gguf -md  Llama-3.2-1B-Instruct-Q4_0.gguf -p "<start_of_turn>user\nRepeat the following sentence, as is:\nThe quick brown fox jumped over the lazy dog.<end_of_turn>\n<start_of_turn>model\nThe" -n 10 --spec-replace "<bos><start_of_turn>user\n" "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n" --spec-replace "<end_of_turn>\n<start_of_turn>model\n" "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
<bos><start_of_turn>user
Repeat the following sentence, as is:
The quick brown fox jumped over the lazy dog.<end_of_turn>
<start_of_turn>model
The quick brown fox jumped over the lazy dog.


encoded   29 tokens in    0.110 seconds, speed:  264.042 t/s
decoded   11 tokens in    0.850 seconds, speed:   12.947 t/s

n_draft   = 16
n_predict = 11
n_drafted = 26
n_accept  = 9
accept    = 34.615%

draft:

llama_perf_context_print:        load time =    2210.29 ms
llama_perf_context_print: prompt eval time =     113.77 ms /    29 tokens (    3.92 ms per token,   254.89 tokens per second)
llama_perf_context_print:        eval time =     572.07 ms /    25 runs   (   22.88 ms per token,    43.70 tokens per second)
llama_perf_context_print:       total time =     960.16 ms /    54 tokens

target:

llama_perf_sampler_print:    sampling time =       0.83 ms /    11 runs   (    0.08 ms per token, 13269.00 tokens per second)
llama_perf_context_print:        load time =     903.03 ms
llama_perf_context_print: prompt eval time =     724.68 ms /    56 tokens (   12.94 ms per token,    77.28 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    3170.51 ms /    57 tokens

@g2mt g2mt marked this pull request as ready for review June 28, 2025 19:23