Regarding Speculative Decoding Implementation on CPU Backend #4

shuojiangliu commented Feb 5, 2025

The current implementation of speculative decoding only supports the Hexagon NPU via the Qualcomm QNN backend. To make debugging easier and to produce baseline output, I am trying to get speculative decoding running on the CPU. However, when I implement support for the CPU backend, the generated output is wrong: the token sequences produced via speculative decoding on CPU are not meaningful.

Update: Based on your implementation, the CPU and NPU backends have quite different KV cache management mechanisms (the CPU backend only supports batched sequential updates to the KV cache). Speculative decoding on the CPU backend cannot be implemented correctly without deep modifications to the CPU side of the code, especially the KV cache handling.
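
To illustrate the mismatch, here is a minimal sketch of the two cache disciplines (the types and method names are hypothetical, not PowerServe's actual API): a sequential cache only ever appends at its end, while tree-style speculative decoding also needs per-slot writes for sibling branches and rollback of rejected entries.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not PowerServe's actual API.
// Sequential KV cache (CPU-backend style): entries only grow at the end.
struct SequentialKVCache {
    std::vector<float> k, v;
    size_t dim = 0;

    void append(const float *key, const float *val) {
        k.insert(k.end(), key, key + dim);
        v.insert(v.end(), val, val + dim);
    }
};

// What tree-based speculative decoding additionally needs:
// writes at arbitrary slots (one per tree node) and branch rollback.
// Buffers are assumed pre-allocated for the maximum tree size.
struct TreeKVCache {
    std::vector<float> k, v;
    size_t dim = 0;

    void write_at(size_t slot, const float *key, const float *val) {
        std::copy(key, key + dim, k.begin() + slot * dim);
        std::copy(val, val + dim, v.begin() + slot * dim);
    }

    void truncate(size_t n_accepted) { // drop rejected speculative entries
        k.resize(n_accepted * dim);
        v.resize(n_accepted * dim);
    }
};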

Steps to Reproduce

Check out the temp-debug branch of my fork ( https://github.com/shuojiangliu/PowerServe/tree/temp-debug ), compile on a Linux x86-64 machine (Ubuntu 22.04), and run the run executable with these program arguments:

--work-folder <work folder of models> --model smallthinker-3b --draft-model smallthinker-0.5b --prompt "Please introduce the company NIO." --no-qnn

(Note: my current branch still contains some redundant debug printouts and a few manually changed code snippets that force the speculative decoding path; I will remove them later and clean up the code.)

Expected Behavior

Speculative decoding on the CPU should produce token sequences consistent with those produced by the other backends, which generate correct output.

Actual Behavior

My current configuration inside speculative_config.hpp is:

struct SpeculativeConfig {
    size_t draft_batch_size = 18; // 12

    struct {
        size_t top_k      = 2; // 15
        float temperature = 1.5f;
        float p_base      = 0.9f;
    } draft_sampler;

    struct {
        size_t max_fan_out = 3;
        float min_prob     = 0.2f;
        bool early_stop    = true;
        bool debug         = true;
    } token_tree;
};
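
For reference, here is my reading of how these fields drive tree expansion (a hedged sketch of a SpecInfer/EAGLE-2-style loop, assuming a max-heap over cumulative path probability; Node, Frontier, and expand_tree are my own illustrative names, and the SpeculativeConfig struct above is assumed in scope):

#include <cstddef>
#include <queue>
#include <vector>

// Illustrative sketch only; not PowerServe's actual code.
struct Node {
    int token;
    float cum_prob;       // product of draft probabilities along the path
    size_t n_children = 0;
};

struct ByCumProb {
    bool operator()(const Node *a, const Node *b) const {
        return a->cum_prob < b->cum_prob; // max-heap: expand best path first
    }
};

using Frontier = std::priority_queue<Node *, std::vector<Node *>, ByCumProb>;

void expand_tree(Frontier &frontier, const SpeculativeConfig &cfg) {
    size_t forwards = 0;
    // draft_batch_size bounds the number of draft forwards per iteration.
    while (!frontier.empty() && forwards < cfg.draft_batch_size) {
        Node *best = frontier.top();
        frontier.pop();
        (void)best;
        // One draft forward at `best` would yield top_k candidate children
        // (cfg.draft_sampler.top_k, softened by cfg.draft_sampler.temperature).
        // A child is kept, and pushed back onto the frontier, only if
        //   best->n_children < cfg.token_tree.max_fan_out and
        //   child.cum_prob  >= cfg.token_tree.min_prob.
        ++forwards;
    }
}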

Below is a recording of my debugging output (sorry for the low resolution of the video; I had to compress it to under 10 MB so it could be uploaded to GitHub):

Screen_Debugging_Results.mp4

Here is a sample token tree output (Note: I have made some modifications to the print_tree function to make the printing clearer):

[ACC] " Co" 1.00
├── [REJ] "." 0.76
│   ├── [REJ] " N" 0.51
│   │   ├── [REJ] "IO" 0.97
│   │   │   ├── [REJ] " is" 0.66
│   │   │   │   ├── [REJ] " for" 0.69
│   │   │   │   └── [REJ] " the" 0.31
│   │   │   └── [REJ] "," 0.34
│   │   └── [REJ] "io" 0.03
│   └── [REJ] " (" 0.49
│       ├── [REJ] "N" 0.81
│       │   ├── [REJ] "IO" 0.84
│       │   │   ├── [REJ] ")" 0.76
│       │   │   └── [REJ] ")," 0.24
│       │   └── [REJ] "OR" 0.16
│       └── [REJ] "formerly" 0.19
└── [REJ] " Ltd" 0.24

And the final output is like this:

Final Output: 
 Resources Resources Resources Resources Resources Resources Resources Resources Co Resources Co

 Resources Co

Speculative token tree statistics:
- 10 iterations, 16 generated tokens
- 1.600 tokens/iteration
- 7.600 draft-forwards/iteration
- Accept ratio: 4.580%
- Draft effective ratio: 7.895%

For a more detailed output, please refer to the debug logs.

It appears that either the token tree is not being constructed correctly (for example, issues with the priority queues or candidate selection), or the verification process does not correctly align the candidate tokens with the target model's outputs. In addition, the draft model's accept ratio is very low.
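
For context, my mental model of the verification step is roughly the following (a hedged sketch assuming greedy SpecInfer-style verification; TreeNode and verify_path are my own names): after one batched target forward over the whole tree, each node knows the target model's next token, and the walk continues down only through a draft child that matches it.

#include <vector>

// Illustrative sketch of greedy tree verification; not PowerServe's code.
struct TreeNode {
    int token;       // draft token at this node
    int target_next; // target model's argmax after this node
                     // (filled in by one batched target forward over the tree)
    std::vector<TreeNode *> children;
};

std::vector<int> verify_path(TreeNode *root) {
    std::vector<int> accepted;
    TreeNode *node = root;
    while (node != nullptr) {
        int want = node->target_next;   // token the target model commits to
        accepted.push_back(want);       // the target's own token is always emitted
        TreeNode *match = nullptr;
        for (TreeNode *child : node->children) {
            if (child->token == want) { // draft guessed the target's token
                match = child;
                break;
            }
        }
        if (match == nullptr) break;    // mismatch: reject the remaining subtree
        node = match;                   // accept this child, continue down the path
    }
    return accepted;
}

If this is roughly what PowerServe does, the statistics above are consistent: 16 tokens over 10 iterations is 1.600 tokens/iteration, i.e. only about 0.6 draft tokens accepted per round on top of the target's own token, and the [ACC]/[REJ] pattern in the sample tree would mean the target's first prediction almost never matches any drafted child.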

Additional Information

  • My current branch makes only negligible modifications to the implementation of the speculative decoding algorithm in token_tree.cpp; the adjustments are minor, covering the execution of the speculative decoding operations, the token tree configuration, etc. The aim is to run hardware-agnostic logic with minimal modifications.
  • The speculative decoding algorithm in PowerServe appears to be a blend of SpecInfer and EAGLE-2. Could you elaborate on the algorithm you are adopting?
  • Any insights into how attention masks or KV caches are managed differently between the backends would be appreciated.

Could someone help diagnose and resolve the discrepancies observed with speculative decoding on the CPU backend? Any suggestions for debugging or further testing would be appreciated.
