diff --git a/docker/llm/README.md b/docker/llm/README.md
index 1691e6c66b8..2af11953439 100644
--- a/docker/llm/README.md
+++ b/docker/llm/README.md
@@ -25,7 +25,7 @@ Available images in hub are:
 | intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT | CPU Serving|
 | intelanalytics/ipex-llm-serving-xpu:2.1.0-SNAPSHOT | GPU Serving|
 | intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.1.0-SNAPSHOT | CPU Finetuning via Docker|
-|intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:2.1.0-SNAPSHOT|CPU Finetuning via Kubernetes|
+| intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:2.1.0-SNAPSHOT | CPU Finetuning via Kubernetes|
 | intelanalytics/ipex-llm-finetune-qlora-xpu:2.1.0-SNAPSHOT| GPU Finetuning|
 
 #### Run a Container
diff --git a/docker/llm/finetune/xpu/Dockerfile b/docker/llm/finetune/xpu/Dockerfile
index 3bd607062f1..928378d7021 100644
--- a/docker/llm/finetune/xpu/Dockerfile
+++ b/docker/llm/finetune/xpu/Dockerfile
@@ -4,6 +4,9 @@ ARG https_proxy
 ENV TZ=Asia/Shanghai
 ARG PIP_NO_CACHE_DIR=false
 
+# When the cache is enabled, the SYCL runtime will try to cache and reuse JIT-compiled binaries.
+ENV SYCL_CACHE_PERSISTENT=1
+
 # retrive oneapi repo public key
 RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \
   echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main " | tee /etc/apt/sources.list.d/oneAPI.list && \
diff --git a/docker/llm/inference-cpp/Dockerfile b/docker/llm/inference-cpp/Dockerfile
index 32a3b6f95d3..da9c24dfbbc 100644
--- a/docker/llm/inference-cpp/Dockerfile
+++ b/docker/llm/inference-cpp/Dockerfile
@@ -6,6 +6,9 @@ ARG https_proxy
 ENV TZ=Asia/Shanghai
 ENV PYTHONUNBUFFERED=1
 
+# When the cache is enabled, the SYCL runtime will try to cache and reuse JIT-compiled binaries.
+ENV SYCL_CACHE_PERSISTENT=1
+
 # Disable pip's cache behavior
 ARG PIP_NO_CACHE_DIR=false
 
diff --git a/docker/llm/inference/xpu/docker/Dockerfile b/docker/llm/inference/xpu/docker/Dockerfile
index ab9fbd0cff8..89064cb0a2e 100644
--- a/docker/llm/inference/xpu/docker/Dockerfile
+++ b/docker/llm/inference/xpu/docker/Dockerfile
@@ -5,8 +5,9 @@ ARG https_proxy
 ENV TZ=Asia/Shanghai
 ENV PYTHONUNBUFFERED=1
-ENV USE_XETLA=OFF
-ENV SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+
+# When the cache is enabled, the SYCL runtime will try to cache and reuse JIT-compiled binaries.
+ENV SYCL_CACHE_PERSISTENT=1
 
 COPY chat.py /llm/chat.py
 COPY benchmark.sh /llm/benchmark.sh
diff --git a/docs/mddocs/DockerGuides/docker_pytorch_inference_gpu.md b/docs/mddocs/DockerGuides/docker_pytorch_inference_gpu.md
index 19fde15f6b6..4d0ff0fa3a2 100644
--- a/docs/mddocs/DockerGuides/docker_pytorch_inference_gpu.md
+++ b/docs/mddocs/DockerGuides/docker_pytorch_inference_gpu.md
@@ -78,6 +78,32 @@ root@arda-arc12:/# sycl-ls
 > bash env-check.sh
 > ```
 
+> [!NOTE]
+> For optimal performance, it is recommended to set several environment variables according to your hardware environment.
+>
+> ```bash
+> # Disable XETLA code paths; only the Intel Data Center GPU Max Series supports XETLA, so non-Max machines should set this to OFF.
+> # Recommended on Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series.
+> export USE_XETLA=OFF
+>
+> # Enable immediate command list mode for the Level Zero plugin, which can improve performance on Intel Arc™ A-Series Graphics and Intel Data Center GPU Max Series.
+> # The benefit depends on the Linux kernel; non-i915 kernel drivers may cause performance regressions.
+> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+>
+> # Controls the persistent device compiled code cache. Set to '1' to turn it on and '0' to turn it off.
+> # Recommended for all hardware environments. This environment variable is already set by default in the Docker images.
+> export SYCL_CACHE_PERSISTENT=1
+>
+> # Reduce memory accesses by fusing SDP (scaled dot-product) ops.
+> # Recommended on Intel Data Center GPU Max Series.
+> export ENABLE_SDP_FUSION=1
+>
+> # Disable XMX computation.
+> # Recommended on integrated GPUs.
+> export BIGDL_LLM_XMX_DISABLED=1
+> ```
+
+
 ## Run Inference Benchmark
 
 Navigate to benchmark directory, and modify the `config.yaml` under the `all-in-one` folder for benchmark configurations.
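The environment variables that this patch moves out of the Dockerfile and into the docs can be applied and sanity-checked in a shell session before launching a workload. This is a minimal sketch, not part of the patch; which variables you export depends on your hardware, as described in the note above:

```shell
# Apply the recommended runtime settings (adjust per hardware; see the note above).
export USE_XETLA=OFF                                    # XETLA is Max Series-only
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1  # immediate command lists
export SYCL_CACHE_PERSISTENT=1                          # persist JIT-compiled binaries

# Confirm the variables are visible in the current environment.
env | grep -E '^(USE_XETLA|SYCL_)'
```

Because the Docker images now bake in only `SYCL_CACHE_PERSISTENT=1`, hardware-specific variables such as `USE_XETLA` must be set explicitly inside the container.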