In this directory, you will find examples of how IPEX-LLM accelerates inference with speculative sampling using EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative sampling method that improves text generation speed, on Intel CPUs. See here to view the paper and here for more information on the EAGLE code.
To run these examples with IPEX-LLM, certain machine specifications are recommended; please refer to here for more information. Make sure you have installed ipex-llm before proceeding.
In this example, we run inference with a Llama2 model to showcase the speed of EAGLE with IPEX-LLM on MT-bench data on Intel CPUs.
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to here.
After installing conda, create a Python environment for IPEX-LLM:
conda create -n llm python=3.11 # Python 3.11 is recommended
conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install intel_extension_for_pytorch==2.1.0
pip install -r requirements.txt
pip install transformers==4.36.2
pip install gradio==3.50.2
pip install eagle-llm
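After installation, you can optionally run a quick sanity check to confirm that the pinned packages import cleanly (the exact version printed may differ on your machine):
# optional sanity check: confirm ipex-llm and transformers are importable
python -c "import ipex_llm; import transformers; print('transformers', transformers.__version__)"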
Note
Skip this step if you are running on Windows.
# set IPEX-LLM env variables
source ipex-llm-init
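The ipex-llm-init script exports performance-related environment variables. As a quick check that it took effect, you can inspect the environment afterwards; OMP_NUM_THREADS and LD_PRELOAD are variables it typically sets, though the exact set may vary by version:
# optional: list the variables the init script typically exports
env | grep -E 'OMP|LD_PRELOAD'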
You can test the speed of EAGLE speculative sampling with IPEX-LLM on MT-bench using the following command.
python -m evaluation.gen_ea_answer_llama2chat \
    --ea-model-path [path of EAGLE weight] \
    --base-model-path [path of the original model] \
    --enable-ipex-llm
Please refer to here for the complete list of available EAGLE weights.
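For example, assuming the EAGLE draft weights and the base model have been downloaded to local folders (both paths below are hypothetical placeholders), the full command would look like:
python -m evaluation.gen_ea_answer_llama2chat \
    --ea-model-path ./EAGLE-llama2-chat-7B \
    --base-model-path ./Llama-2-7b-chat-hf \
    --enable-ipex-llm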
The above command will generate a .jsonl file that records the generation results and wall time. Then, you can use evaluation/speed.py to calculate the speed.
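If you first want to inspect what was generated, you can pretty-print the first record of the .jsonl file (the pathname below is a placeholder; the exact fields in each record depend on the evaluation script):
# peek at the first generation record
head -n 1 [pathname of the .jsonl file] | python -m json.tool
The speed calculation itself is run as follows: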
python -m evaluation.speed \
    --base-model-path [path of the original model] \
    --jsonl-file [pathname of the .jsonl file]
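Continuing the hypothetical paths from the example above (the .jsonl file name is also a placeholder; use whatever file the previous step produced), a complete invocation might be:
python -m evaluation.speed \
    --base-model-path ./Llama-2-7b-chat-hf \
    --jsonl-file ./llama2-chat-ea-answers.jsonl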