The current implementation of speculative decoding only supports the Hexagon NPU via the Qualcomm QNN backend. To make debugging easier and produce a baseline output, I am trying to develop the speculative decoding approach on CPU. However, my attempt to implement support for the CPU backend produces buggy output: the token sequences generated via speculative decoding on CPU are not meaningful.
Update: Based on your implementation, the CPU and NPU have quite different KV cache management mechanisms (the CPU backend only supports batched, sequential updates to the KV cache). Speculative decoding on the CPU backend cannot be implemented correctly without deep modifications to the CPU side of the code, especially the KV cache parts.
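To illustrate the mismatch concretely, here is a minimal sketch of the two update models (the types and method names are hypothetical, not PowerServe's actual interfaces): a sequential cache can only append a contiguous batch at the frontier and roll back by truncation, while tree-based speculation needs position-indexed writes so sibling branches can occupy disjoint slots.

```cpp
// Hypothetical contrast between the two KV-cache update models.
// None of these types exist in PowerServe; they only illustrate the mismatch.
#include <cstddef>
#include <vector>

struct KVEntry {
    std::vector<float> key, value;
};

// CPU-style cache: only batched, sequential appends at the frontier.
struct SequentialKVCache {
    std::vector<KVEntry> entries;
    // Append a contiguous batch; positions are implicit (entries.size()).
    void append_batch(const std::vector<KVEntry> &batch) {
        entries.insert(entries.end(), batch.begin(), batch.end());
    }
    // Rolling back is only possible by truncating the tail.
    void truncate(std::size_t new_len) {
        entries.resize(new_len);
    }
};

// What tree speculation needs: position-indexed writes, so sibling branches
// of the token tree occupy disjoint slots and rejected branches can be
// dropped without disturbing the accepted path.
struct PositionalKVCache {
    std::vector<KVEntry> slots;
    void write_at(std::size_t pos, KVEntry kv) {
        if (pos >= slots.size()) slots.resize(pos + 1);
        slots[pos] = std::move(kv);
    }
};
```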
Steps to Reproduce
Refer to the temp-debug branch of my fork (https://github.com/shuojiangliu/PowerServe/tree/temp-debug), compile on a Linux x86-64 machine (Ubuntu 22.04), and run the run executable with these program arguments:
--work-folder <work folder of models> --model smallthinker-3b --draft-model smallthinker-0.5b --prompt "Please introduce the company NIO." --no-qnn
(Note: My current branch still contains some redundant debugging print statements and a few manually changed code snippets that force speculative decoding to execute; I will remove them and clean up the code later.)
Expected Behavior
Speculative decoding on CPU should produce token sequences consistent with those produced by other backends that generate correct output.
Actual Behavior
My current configuration is in speculative_config.hpp on the temp-debug branch.
Below is a recording of my debugging output (sorry for the low resolution of the video, but I had to compress it to under 10 MB to upload it to GitHub):
Screen_Debugging_Results.mp4
Here is a sample token tree output (Note: I have made some modifications to the print_tree function to make the printing clearer):
Final Output:
Resources Resources Resources Resources Resources Resources Resources Resources Co Resources Co
Resources Co
Speculative token tree statistics:
- 10 iterations, 16 generated tokens
- 1.600 tokens/iteration
- 7.600 draft-forwards/iteration
- Accept ratio: 4.580%
- Draft effective ratio: 7.895%
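Reading these statistics together (assuming, as my interpretation rather than a documented definition, that each iteration contributes one target-model token plus any accepted draft tokens): 16 tokens over 10 iterations gives the reported 1.600 tokens/iteration, so 16 − 10 = 6 draft tokens were accepted in total; an accept ratio of 6/131 ≈ 4.580% would imply roughly 131 drafted candidate tokens, and 6/76 = 7.895% matches the 7.6 draft-forwards/iteration × 10 iterations = 76 draft forward passes.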
For a more detailed output, please refer to the debug logs.
It appears that either the token tree is not being constructed correctly (for example, issues with the priority queues or candidate selection) or the verification process does not correctly align the candidate tokens with the target model's outputs. In addition, the accept ratio of the draft model is very low.
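For reference, this is the verification behavior I would expect the tree to implement, written as a minimal sketch under my own assumptions (greedy acceptance, one bonus token at the first mismatch) rather than PowerServe's actual code:

```cpp
// Hypothetical tree-verification sketch: walk the draft token tree from the
// root and, at each accepted node, keep the child whose token matches the
// target model's prediction at that position (greedy acceptance).
#include <cstdint>
#include <vector>

using Token = std::int32_t;

struct TreeNode {
    Token token;                     // draft token proposed at this node
    Token target_prediction;         // target model's token given this prefix
    std::vector<TreeNode> children;  // sibling candidates for the next position
};

// Returns the accepted token sequence: accepted draft tokens followed by one
// "bonus" token from the target model at the first mismatch.
std::vector<Token> verify(const TreeNode &root) {
    std::vector<Token> accepted;
    const TreeNode *node = &root;
    for (;;) {
        const TreeNode *match = nullptr;
        for (const TreeNode &child : node->children) {
            if (child.token == node->target_prediction) {
                match = &child;
                break;
            }
        }
        if (!match) {
            accepted.push_back(node->target_prediction); // bonus token
            return accepted;
        }
        accepted.push_back(match->token);
        node = match;
    }
}
```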
Additional Information
My current branch makes only negligible modifications to the speculative decoding algorithm in token_tree.cpp, limited to minor adjustments such as when the speculative decoding operations are executed and how the token tree is configured. The aim is to run the hardware-agnostic code logic with minimal modifications.
The implementation of the speculative decoding algorithm in PowerServe seems to be a blend of SpecInfer and EAGLE-2. Could you elaborate on the algorithm you are adopting?
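To make the question concrete, the expansion I think I am seeing looks roughly like this hypothetical sketch of EAGLE-2-style growth, where a global priority queue ranks all frontier candidates by cumulative draft probability instead of using SpecInfer's fixed per-level fanout (the names and structure here are my assumptions):

```cpp
// Hypothetical EAGLE-2-style tree expansion: grow the tree from whichever
// frontier node currently has the highest cumulative probability, up to a
// fixed node budget, rather than expanding a fixed number of children per
// level as SpecInfer does.
#include <functional>
#include <queue>
#include <vector>

struct Candidate {
    int parent;       // index of the parent node within the expanded tree
    int token;        // proposed token id
    double log_prob;  // cumulative log-probability along the root-to-node path
    bool operator<(const Candidate &o) const { return log_prob < o.log_prob; }
};

// `draft_topk(node_index)` stands in for one draft-model step returning the
// top-k children of that node (with cumulative log_prob already filled in);
// it is a placeholder for this sketch.
std::vector<Candidate> expand_tree(
    Candidate root,
    const std::function<std::vector<Candidate>(int)> &draft_topk,
    int budget) {
    std::priority_queue<Candidate> frontier;
    frontier.push(root);
    std::vector<Candidate> tree;
    while (!frontier.empty() && (int)tree.size() < budget) {
        Candidate best = frontier.top();
        frontier.pop();
        tree.push_back(best);
        int node_index = (int)tree.size() - 1;
        for (Candidate child : draft_topk(node_index)) {
            child.parent = node_index; // record where the child attaches
            frontier.push(child);
        }
    }
    return tree;
}
```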
Any insights into how attention masks or KV caches are managed differently between the backends would be appreciated.
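For context on the attention-mask part of the question: once a token tree is packed into a single batch, each node should attend only to itself and its ancestors, which I would expect to be encoded as a tree mask along the lines of this illustration (mine, not the project's code):

```cpp
// Hypothetical tree-attention mask: mask[i][j] is true iff node j is node i
// itself or one of its ancestors, so each tree position attends only along
// its own root-to-node path.
#include <vector>

std::vector<std::vector<bool>> build_tree_mask(const std::vector<int> &parent) {
    // parent[i] is the index of node i's parent; -1 marks the root.
    const int n = (int)parent.size();
    std::vector<std::vector<bool>> mask(n, std::vector<bool>(n, false));
    for (int i = 0; i < n; ++i) {
        for (int j = i; j != -1; j = parent[j]) {
            mask[i][j] = true; // attend to self and every ancestor
        }
    }
    return mask;
}
```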
Could someone help diagnose and resolve the discrepancies observed in speculative decoding on the CPU backend? Any suggestions for debugging or further testing would be helpful.