
Commit 9e5518e

Merge branch 'main' of https://github.com/intel-analytics/ipex-llm into test_transformers_41

2 parents: 6a6549f + 79978e6

File tree: 33 files changed, +1148 −1454 lines


docker/llm/serving/xpu/docker/vllm_online_benchmark.py

Lines changed: 1 addition & 1 deletion
@@ -270,7 +270,7 @@ def benchmark(llm_urls, model, prompt, num_requests, max_concurrent_requests, ma
 LLM_URLS = [f"http://localhost:{PORT}/v1/completions" for PORT in [8000]]


-MODEL = "llm/models/" + model_name
+MODEL = "/llm/models/" + model_name
 MAX_TOKENS = 512

 PROMPT = PROMPT_1024
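This one-character fix makes `MODEL` an absolute path, so the benchmark resolves the model regardless of the working directory it is launched from. A minimal pre-flight sketch (the `model_name` value here is a hypothetical placeholder) to confirm the path inside the serving container:

```bash
# Hedged sketch: verify the absolute model path exists before benchmarking.
model_name="Meta-Llama-3-8B-Instruct"   # assumption: replace with your model directory
if [ -d "/llm/models/${model_name}" ]; then
    echo "model path OK: /llm/models/${model_name}"
else
    echo "missing: /llm/models/${model_name}" >&2
fi
```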

docs/mddocs/Quickstart/continue_quickstart.md

Lines changed: 2 additions & 1 deletion
@@ -33,11 +33,12 @@ Visit [Run Ollama with IPEX-LLM on Intel GPU](./ollama_quickstart.md), and follo
 > If the `Continue` plugin is not installed on the same machine where Ollama is running (which means `Continue` needs to connect to a remote Ollama service), you must configure the Ollama service to accept connections from any IP address. To achieve this, set or export the environment variable `OLLAMA_HOST=0.0.0.0` before executing the command `ollama serve`.

 > [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
+> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), setting the following environment variable before starting the service may potentially improve performance.
 >
 > ```bash
 > export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 > ```
+> The environment variable `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS` determines the usage of immediate command lists for task submission to the GPU. While this mode typically enhances performance, exceptions may occur. Please consider experimenting with and without this environment variable for best performance. For more details, you can refer to [this article](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html).

 ### 2. Pull and Prepare the Model
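Since the new note recommends experimenting with and without the variable, a minimal A/B sketch (assuming `ollama` is on `PATH` and your client reports tokens/s) could look like:

```bash
# Run 1: immediate command lists enabled
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
ollama serve

# Run 2 (in a fresh shell): default submission mode
unset SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS
ollama serve

# Compare the throughput reported by your client and keep the faster setting.
```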

docs/mddocs/Quickstart/fastchat_quickstart.md

Lines changed: 6 additions & 0 deletions
@@ -60,6 +60,7 @@ python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOU
 # Available low_bit format including sym_int4, sym_int8, fp16 etc.
 source /opt/intel/oneapi/setvars.sh
 export USE_XETLA=OFF
+# [optional] under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu"

@@ -87,6 +88,7 @@ python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7
 source /opt/intel/oneapi/setvars.sh
 export ENABLE_SDP_FUSION=1
 export SYCL_CACHE_PERSISTENT=1
+# [optional] under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "fp16" --trust-remote-code --device "xpu" --speculative
 ```

@@ -117,10 +119,14 @@ python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MO
 # On GPU
 source /opt/intel/oneapi/setvars.sh
 export USE_XETLA=OFF
+# [optional] under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu --load-in-low-bit "sym_int4" --enforce-eager
 ```

+> [!NOTE]
+> The environment variable `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS` determines the usage of immediate command lists for task submission to the GPU. While this mode typically enhances performance, exceptions may occur. Please consider experimenting with and without this environment variable for best performance. For more details, you can refer to [this article](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html).
+
 #### Launch multiple workers

 Sometimes we may want to start multiple workers for the best performance. When running on CPU, you may want to separate multiple workers in different sockets. Assuming each socket has 48 physical cores, then you may want to start two workers using the following example:
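The same optional toggle now precedes each worker launch. A hedged wrapper sketch (the `USE_ICL` shell variable is a hypothetical convenience; the worker command is taken verbatim from the quickstart above) makes the with/without comparison explicit:

```bash
USE_ICL=${USE_ICL:-1}   # hypothetical switch: set USE_ICL=0 to test the default mode
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
if [ "$USE_ICL" = "1" ]; then
    export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker \
    --model-path REPO_ID_OR_YOUR_MODEL_PATH \
    --low-bit "sym_int4" --trust-remote-code --device "xpu"
```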

docs/mddocs/Quickstart/install_linux_gpu.md

Lines changed: 6 additions & 2 deletions
@@ -242,8 +242,9 @@ To use GPU acceleration on Linux, several environment variables are required or

 # Recommended Environment Variables for optimal performance
 export USE_XETLA=OFF
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 export SYCL_CACHE_PERSISTENT=1
+# [optional] under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```

 - For **Intel Data Center GPU Max**:

@@ -257,16 +258,19 @@ To use GPU acceleration on Linux, several environment variables are required or

 # Recommended Environment Variables for optimal performance
 export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 export SYCL_CACHE_PERSISTENT=1
 export ENABLE_SDP_FUSION=1
+# [optional] under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```

 Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`

 > [!NOTE]
 > Please refer to [this guide](../Overview/install_gpu.md#runtime-configuration-1) for more details regarding runtime configuration.

+> [!NOTE]
+> The environment variable `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS` determines the usage of immediate command lists for task submission to the GPU. While this mode typically enhances performance, exceptions may occur. Please consider experimenting with and without this environment variable for best performance. For more details, you can refer to [this article](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html).

 ## A Quick Example
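To avoid re-exporting these variables in every new shell, one option is a conda activation hook. A hedged sketch (the hook filename is an arbitrary choice; assumes an active conda environment, following conda's standard `activate.d` convention):

```bash
# Persist the recommended Arc A-Series variables from the guide above.
mkdir -p "$CONDA_PREFIX/etc/conda/activate.d"
cat > "$CONDA_PREFIX/etc/conda/activate.d/ipex-llm-env.sh" <<'EOF'
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
# [optional] benchmark with and without this one
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
EOF
```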

docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md

Lines changed: 13 additions & 4 deletions
@@ -51,6 +51,7 @@ To use GPU acceleration, several environment variables are required or recommend
 ```bash
 source /opt/intel/oneapi/setvars.sh
 export SYCL_CACHE_PERSISTENT=1
+# [optional] under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 # [optional] if you want to run on single GPU, use below command to limit GPU may improve performance
 export ONEAPI_DEVICE_SELECTOR=level_zero:0

@@ -62,44 +63,48 @@ To use GPU acceleration, several environment variables are required or recommend

 ```cmd
 set SYCL_CACHE_PERSISTENT=1
+rem under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation
 set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```

 > [!TIP]
 > When your machine has multiple GPUs and you want to run on one of them, you need to set `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]`, here `[gpu_id]` varies based on your requirement. For more details, you can refer to [this section](../Overview/KeyFeatures/multi_gpus_selection.md#2-oneapi-device-selector).

+> [!NOTE]
+> The environment variable `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS` determines the usage of immediate command lists for task submission to the GPU. While this mode typically enhances performance, exceptions may occur. Please consider experimenting with and without this environment variable for best performance. For more details, you can refer to [this article](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html).
+
 ##### Run llama3

 Under your current directory, execute the command below to run inference with Llama3:

 - For **Linux users**:

   ```bash
-  ./main -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -t 8 -e -ngl 33 --color --no-mmap
+  ./llama-cli -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -c 1024 -t 8 -e -ngl 33 --color --no-mmap
   ```

 - For **Windows users**:

   Please run the following command in Miniforge Prompt.

   ```cmd
-  main -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -e -ngl 33 --color --no-mmap
+  llama-cli -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -c 1024 -e -ngl 33 --color --no-mmap
   ```

 Under your current directory, you can also execute the command below to have an interactive chat with Llama3:

 - For **Linux users**:

   ```bash
-  ./main -ngl 33 --interactive-first --color -e --in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -r '<|eot_id|>' -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
+  ./llama-cli -ngl 33 --interactive-first --color -e --in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -r '<|eot_id|>' -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -c 1024
   ```

 - For **Windows users**:

   Please run the following command in Miniforge Prompt.

   ```cmd
-  main -ngl 33 --interactive-first --color -e --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -r "<|eot_id|>" -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
+  llama-cli -ngl 33 --interactive-first --color -e --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -r "<|eot_id|>" -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -c 1024
   ```

 Below is a sample output on Intel Arc GPU:

@@ -131,6 +136,7 @@ Launch the Ollama service:
 export OLLAMA_NUM_GPU=999
 source /opt/intel/oneapi/setvars.sh
 export SYCL_CACHE_PERSISTENT=1
+# [optional] under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 # [optional] if you want to run on single GPU, use below command to limit GPU may improve performance
 export ONEAPI_DEVICE_SELECTOR=level_zero:0

@@ -147,6 +153,7 @@ Launch the Ollama service:
 set ZES_ENABLE_SYSMAN=1
 set OLLAMA_NUM_GPU=999
 set SYCL_CACHE_PERSISTENT=1
+rem under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation
 set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

 ollama serve

@@ -160,6 +167,8 @@ Launch the Ollama service:
 > [!TIP]
 > When your machine has multiple GPUs and you want to run on one of them, you need to set `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]`, here `[gpu_id]` varies based on your requirement. For more details, you can refer to [this section](../Overview/KeyFeatures/multi_gpus_selection.md#2-oneapi-device-selector).

+> [!NOTE]
+> The environment variable `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS` determines the usage of immediate command lists for task submission to the GPU. While this mode typically enhances performance, exceptions may occur. Please consider experimenting with and without this environment variable for best performance. For more details, you can refer to [this article](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html).

 ##### 2.2.2 Using Ollama Run Llama3
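In the llama.cpp hunks above, the example binary changes from `main` to `llama-cli` (matching the upstream rename of llama.cpp's example binaries) and every invocation now pins the context window with `-c 1024`. A hedged follow-up sketch (assuming `<model_dir>` is substituted and `llama-cli` sits in the current directory) showing how you might enlarge the context when prompt plus generated tokens exceed 1024:

```bash
# Raise -c if prompt + generated tokens would overflow 1024; a larger context
# consumes more GPU memory, so benchmark before settling on a value.
./llama-cli -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  -c 2048 -n 64 -ngl 33 --color --no-mmap \
  --prompt "Once upon a time, there existed a little girl who liked to have adventures."
```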
