diff --git a/docs/articles_en/about-openvino/key-features.rst b/docs/articles_en/about-openvino/key-features.rst index c751a5bc65d3cf..7e4ffab3cbb2ec 100644 --- a/docs/articles_en/about-openvino/key-features.rst +++ b/docs/articles_en/about-openvino/key-features.rst @@ -14,7 +14,7 @@ Easy Integration OpenVINO optimizations to your PyTorch models directly with a single line of code. | :doc:`GenAI Out Of The Box <../openvino-workflow-generative/inference-with-genai>` -| With the genAI flavor of OpenVINO, you can run generative AI with just a couple lines of code. +| With the OpenVINO GenAI, you can run generative models with just a few lines of code. Check out the GenAI guide for instructions on how to do it. | `Python / C++ / C / NodeJS APIs `__ diff --git a/docs/articles_en/about-openvino/release-notes-openvino.rst b/docs/articles_en/about-openvino/release-notes-openvino.rst index 0134ed15215541..0009a1fe172fb0 100644 --- a/docs/articles_en/about-openvino/release-notes-openvino.rst +++ b/docs/articles_en/about-openvino/release-notes-openvino.rst @@ -26,10 +26,9 @@ OpenVINO Release Notes What's new +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -* OpenVINO 2024.6 release includes updates for enhanced stability and improved LLM performance. -* Introduced support for Intel® Arc™ B-Series Graphics (formerly known as Battlemage). -* Implemented optimizations to improve the inference time and LLM performance on NPUs. -* Improved LLM performance with GenAI API optimizations and bug fixes. +* . + + @@ -47,11 +46,8 @@ CPU Device Plugin GPU Device Plugin ----------------------------------------------------------------------------------------------- -* Device memory copy optimizations have been introduced for inference with **Intel® Arc™ B-Series - Graphics** (formerly known as Battlemage). Since it does not utilize L2 cache for copying memory - between the device and host, a dedicated `copy` operation is used, if inputs or results are - not expected in the device memory. -* ChatGLM4 inference on GPU has been optimized. +* . + NPU Device Plugin ----------------------------------------------------------------------------------------------- @@ -76,10 +72,6 @@ Other Changes and Known Issues Jupyter Notebooks ----------------------------- -* `Visual-language assistant with GLM-Edge-V and OpenVINO `__ -* `Local AI and OpenVINO `__ -* `Multimodal understanding and generation with Janus and OpenVINO `__ - @@ -131,39 +123,19 @@ Discontinued in 2024 * Runtime components: - * Intel® Gaussian & Neural Accelerator (Intel® GNA). Consider using the Neural Processing - Unit (NPU) for low-powered systems like Intel® Core™ Ultra or 14th generation and beyond. - * OpenVINO C++/C/Python 1.0 APIs (see - `2023.3 API transition guide `__ - for reference). - * All ONNX Frontend legacy API (known as ONNX_IMPORTER_API). - * ``PerfomanceMode.UNDEFINED`` property as part of the OpenVINO Python API. + * The OpenVINO property of Affinity API will is no longer available. It has been replaced with CPU + binding configurations (``ov::hint::enable_cpu_pinning``). * Tools: - * Deployment Manager. See :doc:`installation <../get-started/install-openvino>` and - :doc:`deployment <../get-started/install-openvino>` guides for current distribution - options. - * `Accuracy Checker `__. - * `Post-Training Optimization Tool `__ - (POT). Neural Network Compression Framework (NNCF) should be used instead. - * A `Git patch `__ - for NNCF integration with `huggingface/transformers `__. - The recommended approach is to use `huggingface/optimum-intel `__ - for applying NNCF optimization on top of models from Hugging Face. - * Support for Apache MXNet, Caffe, and Kaldi model formats. Conversion to ONNX may be used - as a solution. - * The macOS x86_64 debug bins are no longer provided with the OpenVINO toolkit, starting - with OpenVINO 2024.5. - * Python 3.8 is no longer supported, starting with OpenVINO 2024.5. - - * As MxNet doesn't support Python version higher than 3.8, according to the - `MxNet PyPI project `__, - it is no longer supported by OpenVINO, either. - - * Discrete Keem Bay support is no longer supported, starting with OpenVINO 2024.5. - * Support for discrete devices (formerly codenamed Raptor Lake) is no longer available for - NPU. + * The OpenVINO™ Development Tools package (pip install openvino-dev) is no longer available + for OpenVINO releases in 2025. + * Model Optimizer is no longer available. Consider using the + :doc:`new conversion methods <../openvino-workflow/model-preparation/convert-model-to-ir>` + instead. For more details, see the + `model conversion transition guide `__. + * Intel® Streaming SIMD Extensions (Intel® SSE) are currently not enabled in the binary + package by default. They are still supported in the source code form. Deprecated and to be removed in the future @@ -175,26 +147,18 @@ Deprecated and to be removed in the future standard support. * The openvino-nightly PyPI module will soon be discontinued. End-users should proceed with the Simple PyPI nightly repo instead. More information in - `Release Policy `__. -* The OpenVINO™ Development Tools package (pip install openvino-dev) will be removed from - installation options and distribution channels beginning with OpenVINO 2025.0. -* Model Optimizer will be discontinued with OpenVINO 2025.0. Consider using the - :doc:`new conversion methods <../openvino-workflow/model-preparation/convert-model-to-ir>` - instead. For more details, see the - `model conversion transition guide `__. -* OpenVINO property Affinity API will be discontinued with OpenVINO 2025.0. - It will be replaced with CPU binding configurations (``ov::hint::enable_cpu_pinning``). - + `Release Policy `__. +* “auto shape” and “auto batch size” (reshaping a model in runtime) will be removed in the + future. OpenVINO's dynamic shape models are recommended instead. +* MacOS x86 is no longer recommended for use due to the discontinuation of validation. + Full support will be removed later in 2025. +* The `openvino` namespace of the OpenVINO Python API has been redesigned, removing the nested + `openvino.runtime` module. The old namespace is now considered deprecated and will be + discontinued in 2026.0. - * “auto shape” and “auto batch size” (reshaping a model in runtime) will be removed in the - future. OpenVINO's dynamic shape models are recommended instead. - -* Starting with 2025.0 MacOS x86 is no longer recommended for use due to the discontinuation - of validation. Full support will be removed later in 2025. - @@ -203,17 +167,13 @@ Legal Information +++++++++++++++++++++++++++++++++++++++++++++ You may not use or facilitate the use of this document in connection with any infringement -or other legal analysis concerning Intel products described herein. - -You agree to grant Intel a non-exclusive, royalty-free license to any patent claim -thereafter drafted which includes subject matter disclosed herein. +or other legal analysis concerning Intel products described herein. All information provided +here is subject to change without notice. Contact your Intel representative to obtain the +latest Intel product specifications and roadmaps. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. -All information provided here is subject to change without notice. Contact your Intel -representative to obtain the latest Intel product specifications and roadmaps. - The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. @@ -225,10 +185,9 @@ or from the OEM or retailer. No computer system can be absolutely secure. -Intel, Atom, Core, Xeon, OpenVINO, and the Intel logo are trademarks -of Intel Corporation in the U.S. and/or other countries. - -Other names and brands may be claimed as the property of others. +Intel, Atom, Core, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in +the U.S. and/or other countries. Other names and brands may be claimed as the property of +others. Copyright © 2025, Intel Corporation. All rights reserved. diff --git a/docs/articles_en/documentation/openvino-ecosystem.rst b/docs/articles_en/documentation/openvino-ecosystem.rst index cb62672c032412..fbd4b6e53240a3 100644 --- a/docs/articles_en/documentation/openvino-ecosystem.rst +++ b/docs/articles_en/documentation/openvino-ecosystem.rst @@ -24,7 +24,7 @@ you an overview of a whole ecosystem of tools and solutions under the OpenVINO u | **GenAI** | :bdg-link-dark:`Github ` - :bdg-link-success:`User Guide ` + :bdg-link-success:`User Guide ` OpenVINO™ GenAI Library aims to simplify running inference of generative AI models. Check the LLM-powered Chatbot Jupyter notebook to see how GenAI works. @@ -113,7 +113,7 @@ generative AI and vision models directly on your computer or edge device using O | **Tokenizers** | :bdg-link-dark:`Github ` - :bdg-link-success:`User Guide ` + :bdg-link-success:`User Guide ` OpenVINO Tokenizers add text processing operations to OpenVINO. diff --git a/docs/articles_en/get-started/configurations.rst b/docs/articles_en/get-started/configurations.rst index 3e471c33445292..c0e885dd956c78 100644 --- a/docs/articles_en/get-started/configurations.rst +++ b/docs/articles_en/get-started/configurations.rst @@ -32,8 +32,9 @@ potential of OpenVINO™. Check the following list for components used in your w for details. | **OpenVINO GenAI Dependencies** -| OpenVINO GenAI is a flavor of OpenVINO, aiming to simplify running generative - AI models. For information on the dependencies required to use OpenVINO GenAI, see the +| OpenVINO GenAI is a tool based on the OpenVNO Runtime but simplifying the process of + running generative AI models. For information on the dependencies required to use + OpenVINO GenAI, see the :doc:`guide on OpenVINO GenAI Dependencies `. | **Open Computer Vision Library** diff --git a/docs/articles_en/get-started/install-openvino.rst b/docs/articles_en/get-started/install-openvino.rst index 401aa79213e6d7..1b658b2da9d49d 100644 --- a/docs/articles_en/get-started/install-openvino.rst +++ b/docs/articles_en/get-started/install-openvino.rst @@ -11,11 +11,11 @@ Install OpenVINO™ 2024.6 :maxdepth: 3 :hidden: + OpenVINO GenAI OpenVINO Runtime on Linux OpenVINO Runtime on Windows OpenVINO Runtime on macOS Create an OpenVINO Yocto Image - OpenVINO GenAI Flavor .. raw:: html @@ -30,13 +30,13 @@ All currently supported versions are: * 2023.3 (LTS) -.. dropdown:: Effortless GenAI integration with OpenVINO GenAI Flavor +.. dropdown:: Effortless GenAI integration with OpenVINO GenAI - A new OpenVINO GenAI Flavor streamlines application development by providing - LLM-specific interfaces for easy integration of language models, handling tokenization and - text generation. For installation and usage instructions, proceed to - :doc:`Install OpenVINO GenAI Flavor <../openvino-workflow-generative>` and - :doc:`Run LLMs with OpenVINO GenAI Flavor <../openvino-workflow-generative/inference-with-genai>`. + OpenVINO GenAI streamlines application development by providing LLM-specific interfaces for + easy integration of language models, handling tokenization and text generation. + For installation and usage instructions, check + :doc:`OpenVINO GenAI installation <../openvino-workflow-generative>` and + :doc:`inference with OpenVINO GenAI <../openvino-workflow-generative/inference-with-genai>`. .. dropdown:: Building OpenVINO from Source diff --git a/docs/articles_en/get-started/install-openvino/install-openvino-genai.rst b/docs/articles_en/get-started/install-openvino/install-openvino-genai.rst index b548353b36977e..026a76f2ee86d7 100644 --- a/docs/articles_en/get-started/install-openvino/install-openvino-genai.rst +++ b/docs/articles_en/get-started/install-openvino/install-openvino-genai.rst @@ -1,24 +1,26 @@ Install OpenVINO™ GenAI ==================================== -OpenVINO GenAI is a new flavor of OpenVINO, aiming to simplify running inference of generative AI models. -It hides the complexity of the generation process and minimizes the amount of code required. -You can now provide a model and input context directly to OpenVINO, which performs tokenization of the -input text, executes the generation loop on the selected device, and returns the generated text. -For a quickstart guide, refer to the :doc:`GenAI API Guide <../../openvino-workflow-generative/inference-with-genai>`. - -To see GenAI in action, check the Jupyter notebooks: -`LLM-powered Chatbot `__ and +OpenVINO GenAI is a tool, simplifying generative AI model inference. It is based on the +OpenVINO Runtime, hiding the complexity of the generation process and minimizing the amount of +code required. You provide a model and the input context directly to the tool, while it +performs tokenization of the input text, executes the generation loop on the selected device, +and returns the generated content. For a quickstart guide, refer to the +:doc:`GenAI API Guide <../../openvino-workflow-generative/inference-with-genai>`. + +To see OpenVINO GenAI in action, check these Jupyter notebooks: +`LLM-powered Chatbot `__ +and `LLM Instruction-following pipeline `__. -The OpenVINO GenAI flavor is available for installation via PyPI and Archive distributions. +OpenVINO GenAI is available for installation via PyPI and Archive distributions. A `detailed guide `__ on how to build OpenVINO GenAI is available in the OpenVINO GenAI repository. PyPI Installation ############################### -To install the GenAI flavor of OpenVINO via PyPI, follow the standard :doc:`installation steps `, +To install the GenAI package via PyPI, follow the standard :doc:`installation steps `, but use the *openvino-genai* package instead of *openvino*: .. code-block:: python @@ -28,9 +30,9 @@ but use the *openvino-genai* package instead of *openvino*: Archive Installation ############################### -The OpenVINO GenAI archive package includes the OpenVINO™ Runtime and :doc:`Tokenizers <../../openvino-workflow-generative/ov-tokenizers>`. -To install the GenAI flavor of OpenVINO from an archive file, follow the standard installation steps for your system -but instead of using the vanilla package file, download the one with OpenVINO GenAI: +The OpenVINO GenAI archive package includes the OpenVINO™ Runtime, as well as :doc:`Tokenizers <../../openvino-workflow-generative/ov-tokenizers>`. +It installs the same way as the standard OpenVINO Runtime, so follow its installation steps, +just use the OpenVINO GenAI package instead: Linux ++++++++++++++++++++++++++ diff --git a/docs/articles_en/openvino-workflow-generative.rst b/docs/articles_en/openvino-workflow-generative.rst index a4fa53335988ae..e67f52a0ecab12 100644 --- a/docs/articles_en/openvino-workflow-generative.rst +++ b/docs/articles_en/openvino-workflow-generative.rst @@ -90,8 +90,8 @@ The advantages of using OpenVINO for generative model deployment: Proceed to guides on: -* :doc:`OpenVINO GenAI Flavor <./openvino-workflow-generative/inference-with-genai>` +* :doc:`OpenVINO GenAI <./openvino-workflow-generative/inference-with-genai>` * :doc:`Hugging Face and Optimum Intel <./openvino-workflow-generative/inference-with-optimum-intel>` -* `Generative AI with Base OpenVINO `__ +* `Generative AI with Base OpenVINO `__ diff --git a/docs/articles_en/openvino-workflow-generative/inference-with-genai.rst b/docs/articles_en/openvino-workflow-generative/inference-with-genai.rst index 1f19c3eed7da8f..7e26f0891f779a 100644 --- a/docs/articles_en/openvino-workflow-generative/inference-with-genai.rst +++ b/docs/articles_en/openvino-workflow-generative/inference-with-genai.rst @@ -2,13 +2,13 @@ Inference with OpenVINO GenAI =============================================================================================== .. meta:: - :description: Learn how to use the OpenVINO GenAI flavor to execute LLM models. + :description: Learn how to use OpenVINO GenAI to execute LLM models. .. toctree:: :maxdepth: 1 :hidden: - NPU inference of LLMs + NPU inference of LLMs OpenVINO™ GenAI is a library of pipelines and methods, extending the OpenVINO runtime to work diff --git a/docs/articles_en/openvino-workflow-generative/inference-with-genai-on-npu.rst b/docs/articles_en/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.rst similarity index 97% rename from docs/articles_en/openvino-workflow-generative/inference-with-genai-on-npu.rst rename to docs/articles_en/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.rst index 8fb6ad27c4232f..540d13894c7d02 100644 --- a/docs/articles_en/openvino-workflow-generative/inference-with-genai-on-npu.rst +++ b/docs/articles_en/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.rst @@ -2,9 +2,10 @@ Inference with OpenVINO GenAI ========================================== .. meta:: - :description: Learn how to use the OpenVINO GenAI flavor to execute LLM models on NPU. + :description: Learn how to use OpenVINO GenAI to execute LLM models on NPU. -This guide will give you extra details on how to utilize NPU with the GenAI flavor. + +This guide will give you extra details on how to utilize NPU with OpenVINO GenAI. :doc:`See the installation guide <../../get-started/install-openvino/install-openvino-genai>` for information on how to start. @@ -24,6 +25,10 @@ Note that for systems based on Intel® Core™ Ultra Processors Series 2, more t may be required to run prompts over 1024 tokens on models exceeding 7B parameters, such as Llama-2-7B, Mistral-0.2-7B, and Qwen-2-7B. +Make sure your model works with NPU. Some models may not be supported, for example, +**the FLUX.1 pipeline is currently not supported by the device**. + + Export an LLM model via Hugging Face Optimum-Intel ################################################## diff --git a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/npu-device.rst b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/npu-device.rst index a3bdbfc7c2b7d1..ed28633f1a9198 100644 --- a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/npu-device.rst +++ b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/npu-device.rst @@ -22,7 +22,7 @@ for more streamlined resource management. NPU Plugin is now available through all relevant OpenVINO distribution channels. | **Supported Platforms:** -| Host: Intel® Core™ Ultra (former Meteor Lake) +| Host: Intel® Core™ Ultra series | NPU device: NPU 3720 | OS: Ubuntu* 22.04 64-bit (with Linux kernel 6.6+), MS Windows* 11 64-bit (22H2, 23H2) @@ -33,10 +33,10 @@ Follow the instructions below to install the latest NPU drivers: * `Linux driver `__ -The plugin uses the graph extension API exposed by the driver to convert the OpenVINO specific representation -of the model into a proprietary format. The compiler included in the user mode driver (UMD) performs -platform specific optimizations in order to efficiently schedule the execution of network layers and -memory transactions on various NPU hardware submodules. +The plugin uses the graph extension API exposed by the driver to convert the OpenVINO specific +representation of the model into a proprietary format. The compiler included in the user mode +driver (UMD) performs platform specific optimizations in order to efficiently schedule the +execution of network layers and memory transactions on various NPU hardware submodules. To use NPU for inference, pass the device name to the ``ov::Core::compile_model()`` method: diff --git a/docs/articles_en/openvino-workflow/running-inference/optimize-inference/optimizing-latency.rst b/docs/articles_en/openvino-workflow/running-inference/optimize-inference/optimizing-latency.rst index 7d6df9166f163e..febba3134cad40 100644 --- a/docs/articles_en/openvino-workflow/running-inference/optimize-inference/optimizing-latency.rst +++ b/docs/articles_en/openvino-workflow/running-inference/optimize-inference/optimizing-latency.rst @@ -14,34 +14,62 @@ Optimizing for Latency improve throughput without degrading latency. -A significant portion of deep learning use cases involve applications loading a single model and using a single input at a time, which is the of typical "consumer" scenario. -While an application can create more than one request if needed, for example to support :ref:`asynchronous inputs population `, its **inference performance depends on how many requests are being inferenced in parallel** on a device. - -Similarly, when multiple models are served on the same device, it is important whether the models are executed simultaneously or in a chain, for example, in the inference pipeline. -As expected, the easiest way to achieve **low latency is by running only one inference at a time** on one device. Accordingly, any additional concurrency usually results in latency rising fast. - -However, some conventional "root" devices (i.e., CPU or GPU) can be in fact internally composed of several "sub-devices". In many cases, letting OpenVINO leverage the "sub-devices" transparently helps to improve application's throughput (e.g., serve multiple clients simultaneously) without degrading latency. For example, multi-socket CPUs can deliver as many requests at the same minimal latency as there are NUMA nodes in the system. Similarly, a multi-tile GPU, which is essentially multiple GPUs in a single package, can deliver a multi-tile scalability with the number of inference requests, while preserving the single-tile latency. - -Typically, human expertise is required to get more "throughput" out of the device, even in the inherently latency-oriented cases. OpenVINO can take this configuration burden via :doc:`high-level performance hints `, the `ov::hint::PerformanceMode::LATENCY `__ specified for the ``ov::hint::performance_mode`` property for the ``compile_model``. +An application that loads a single model and uses a single input at a time is +a widespread use case in deep learning. Surely, more requests can be created if +needed, for example to support :ref:`asynchronous input population `. +However, **the number of parallel requests affects inference performance** +of the application. + +Also, running inference of multiple models on the same device relies on whether the models +are executed simultaneously or in a chain: the more inference tasks at once, the higher the +latency. + +However, devices such as CPUs and GPUs may be composed of several "sub-devices". OpeVINO can +handle them transparently, when serving multiple clients, improving application's throughput +without impacting latency. What is more, multi-socket CPUs can deliver as many requests at the +same minimal latency as there are NUMA nodes in the system. Similarly, a multi-tile GPU, +which is essentially multiple GPUs in a single package, can deliver a multi-tile +scalability with the number of inference requests, while preserving the +single-tile latency. .. note:: - :doc:`OpenVINO performance hints ` is a recommended way for performance configuration, which is both device-agnostic and future-proof. + Balancing throughput and latency by manual configuration requires strong expertise + in this area. Instead, you should specify :doc:`performance hints ` + for ``compile_model``, which is a device-agnostic and future-proof option. -**When multiple models are to be used simultaneously**, consider running inference on separate devices for each of them. Finally, when multiple models are executed in parallel on a device, using additional ``ov::hint::model_priority`` may help to define relative priorities of the models. Refer to the documentation on the :doc:`OpenVINO feature support for devices <../../../../about-openvino/compatibility-and-support/supported-devices>` to check if your device supports the feature. +**For running multiple models simultaneously**, consider using separate devices for each of +them. When multiple models are executed in parallel on a device, use ``ov::hint::model_priority`` +to define relative priorities of the models. Note that this feature may not be available for +some devices. **First-Inference Latency and Model Load/Compile Time** -In some cases, model loading and compilation contribute to the "end-to-end" latency more than usual. -For example, when the model is used exactly once, or when it is unloaded and reloaded in a cycle, to free the memory for another inference due to on-device memory limitations. - -Such a "first-inference latency" scenario may pose an additional limitation on the model load\compilation time, as inference accelerators (other than the CPU) usually require a certain level of model compilation upon loading. -The :doc:`model caching ` option is a way to lessen the impact over multiple application runs. If model caching is not possible, for example, it may require write permissions for the application, the CPU offers the fastest model load time almost every time. +First-inference latency is the longest time the application requires to finish inference. +This means it includes the time to load and compile the model, which happens at the first +execution only. For some scenarios it may be a significant factor, for example, if the model is +always used just once or is unloaded after each run to free up the memory. + +In such cases the device choice is especially important. The CPU offers the fastest model load +time nearly every time. Other accelerators usually take longer to compile a model but may be +better for inference. In such cases, :doc:`Model caching ` +may reduce latency, as long as there are no additional limitations in write permissions +for the application. + +To improve "first-inference latency", you may choose between mapping the model into memory +(the default option) and reading it (the older solution). While mapping is better in most cases, +sometimes it may increase latency, especially when the model is located on a removable or a +network drive. To switch between the two, specify the +`ov::enable_mmap() <../../../api/ie_python_api/_autosummary/openvino.frontend.FrontEnd.html#openvino.frontend.FrontEnd.load>` +property for the ``ov::Core`` as either ``True`` or ``False``. + +You can also use :doc:`AUTO device selection inference mode <../inference-devices-and-modes/auto-device-selection>` +to deal with first-inference latency. +It starts inference on the CPU, while waiting for the proper accelerator to load +the model. At that point, it shifts to the new device seamlessly. -To improve common "first-inference latency" scenario, model reading was replaced with model mapping (using `mmap`) into a memory. But in some use cases (first of all, if model is located on removable or network drive) mapping may lead to latency increase. To switch mapping to reading, specify ``ov::enable_mmap(false)`` property for the ``ov::Core``. - -Another way of dealing with first-inference latency is using the :doc:`AUTO device selection inference mode <../inference-devices-and-modes/auto-device-selection>`. It starts inference on the CPU, while waiting for the actual accelerator to load the model. At that point, it shifts to the new device seamlessly. - -Finally, note that any :doc:`throughput-oriented options ` may significantly increase the model uptime. +.. note:: + Keep in mind that any :doc:`throughput-oriented options ` + may significantly increase inference time. diff --git a/docs/articles_en/openvino-workflow/running-inference/optimize-inference/optimizing-latency/model-caching-overview.rst b/docs/articles_en/openvino-workflow/running-inference/optimize-inference/optimizing-latency/model-caching-overview.rst index b3253f775bdb02..b1b6da190a0192 100644 --- a/docs/articles_en/openvino-workflow/running-inference/optimize-inference/optimizing-latency/model-caching-overview.rst +++ b/docs/articles_en/openvino-workflow/running-inference/optimize-inference/optimizing-latency/model-caching-overview.rst @@ -9,16 +9,16 @@ Model Caching Overview As described in :doc:`Integrate OpenVINO™ with Your Application <../../integrate-openvino-with-your-application>`, -a common application flow consists of the following steps: +a common workflow consists of the following steps: 1. | **Create a Core object**: | First step to manage available devices and read model objects 2. | **Read the Intermediate Representation**: - | Read an Intermediate Representation file into an object of the `ov::Model `__ + | Read an Intermediate Representation file into the `ov::Model `__ object 3. | **Prepare inputs and outputs**: | If needed, manipulate precision, memory layout, size or color format 4. | **Set configuration**: - | Pass device-specific loading configurations to the device + | Add device-specific loading configurations to the device 5. | **Compile and Load Network to device**: | Use the `ov::Core::compile_model() `__ method with a specific device 6. | **Set input data**: @@ -32,14 +32,15 @@ automatically and reuses it to significantly reduce the model compilation time. .. important:: - Not all devices support the network import/export feature. They will perform normally but will not + Not all devices support import/export of models. They will perform normally but will not enable the compilation stage speed-up. -Set "cache_dir" config option to enable model caching +Set configuration options +++++++++++++++++++++++++++++++++++++++++++++++++++++ -To enable model caching, the application must specify a folder to store the cached blobs: +| Use the ``device_name`` option to specify the inference device. +| Specify ``cache_dir`` to enable model caching. .. tab-set:: @@ -58,23 +59,25 @@ To enable model caching, the application must specify a folder to store the cach :fragment: [ov:caching:part0] -With this code, if the device specified by ``device_name`` supports import/export model capability, -a cached blob (the ``.cl_cache`` and ``.blob`` file for GPU and CPU respectively) is automatically +If the specified device supports import/export of models, +a cached blob file: ``.cl_cache`` (GPU) or ``.blob`` (CPU) is automatically created inside the ``/path/to/cache/dir`` folder. -If the device does not support the import/export capability, cache is not created and no error is thrown. +If the device does not support import/export of models, the cache is not +created and no error is thrown. -Note that the first ``compile_model`` operation takes slightly longer, as the cache needs to be created - -the compiled blob is saved into a cache file: +Note that the first ``compile_model`` operation takes slightly more time, +as the cache needs to be created - the compiled blob is saved into a file: .. image:: ../../../../assets/images/caching_enabled.svg -Make it even faster: use compile_model(modelPath) +Use optimized methods +++++++++++++++++++++++++++++++++++++++++++++++++++ -In some cases, applications do not need to customize inputs and outputs every time. Such application always -call ``model = core.read_model(...)``, then ``core.compile_model(model, ..)``, which can be further optimized. -For these cases, there is a more convenient API to compile the model in a single call, skipping the read step: +Applications do not always require an initial customization of inputs and +outputs, as they can call ``model = core.read_model(...)``, then ``core.compile_model(model, ..)``, +which can be further optimized. Thus, the model can be compiled conveniently in a single call, +skipping the read step: .. tab-set:: @@ -93,7 +96,7 @@ For these cases, there is a more convenient API to compile the model in a single :fragment: [ov:caching:part1] -With model caching enabled, the total load time is even shorter, if ``read_model`` is optimized as well. +The total load time is even shorter, when model caching is enabled and ``read_model`` is optimized as well. .. tab-set:: @@ -117,8 +120,9 @@ With model caching enabled, the total load time is even shorter, if ``read_model Advanced Examples ++++++++++++++++++++ -Not every device supports the network import/export capability. For those that don't, enabling caching has no effect. -To check in advance if a particular device supports model caching, your application can use the following code: +Enabling model caching has no effect when the specified device does not support +import/export of models. To check in advance if a particular device supports +model caching, use the following code in your application: .. tab-set:: @@ -136,10 +140,12 @@ To check in advance if a particular device supports model caching, your applicat :language: cpp :fragment: [ov:caching:part3] -Set "cache_encryption_callbacks" config option to enable cache encryption +Enable cache encryption +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -If model caching is enabled in the CPU Plugin, the model topology can be encrypted while it is saved to the cache and decrypted when it is loaded from the cache. Currently, this property can be set only in ``compile_model``. +If model caching is enabled in the CPU Plugin, set the "cache_encryption_callbacks" +config option to encrypt the model while caching it and decrypt it when +loading it from the cache. Currently, this property can be set only in ``compile_model``. .. tab-set:: @@ -157,7 +163,7 @@ If model caching is enabled in the CPU Plugin, the model topology can be encrypt :language: cpp :fragment: [ov:caching:part4] -If model caching is enabled in the GPU Plugin, the model topology can be encrypted while it is saved to the cache and decrypted when it is loaded from the cache. Full encryption only works when the ``CacheMode`` property is set to ``OPTIMIZE_SIZE``. +Full encryption only works when the ``CacheMode`` property is set to ``OPTIMIZE_SIZE``. .. tab-set:: @@ -177,4 +183,6 @@ If model caching is enabled in the GPU Plugin, the model topology can be encrypt .. important:: - Currently, this property is supported only by the CPU and GPU plugins. For other HW plugins, setting this property will not encrypt/decrypt the model topology in cache and will not affect performance. + Currently, encryption is supported only by the CPU and GPU plugins. Enabling this + feature for other HW plugins will not encrypt/decrypt model topology in the + cache and will not affect performance. diff --git a/docs/notebooks/llm-agent-functioncall-qwen-with-output.rst b/docs/notebooks/llm-agent-functioncall-qwen-with-output.rst index 051e83eff184bb..19b3f849a0f102 100644 --- a/docs/notebooks/llm-agent-functioncall-qwen-with-output.rst +++ b/docs/notebooks/llm-agent-functioncall-qwen-with-output.rst @@ -258,7 +258,7 @@ pipeline. You can get additional inference speed improvement with `Dynamic Quantization of activations and KV-cache quantization on -CPU `__. +CPU `__. These options can be enabled with ``ov_config`` as follows: .. code:: ipython3 diff --git a/docs/notebooks/llm-agent-react-langchain-with-output.rst b/docs/notebooks/llm-agent-react-langchain-with-output.rst index 7313d4c454c42a..34c81ef6e11e75 100644 --- a/docs/notebooks/llm-agent-react-langchain-with-output.rst +++ b/docs/notebooks/llm-agent-react-langchain-with-output.rst @@ -438,7 +438,7 @@ information `__. You can get additional inference speed improvement with `Dynamic Quantization of activations and KV-cache quantization on -CPU `__. +CPU `__. These options can be enabled with ``ov_config`` as follows: .. code:: ipython3 diff --git a/docs/notebooks/multilora-image-generation-with-output.rst b/docs/notebooks/multilora-image-generation-with-output.rst index f6445e5a2ec1f2..e2da1edafdd8f6 100644 --- a/docs/notebooks/multilora-image-generation-with-output.rst +++ b/docs/notebooks/multilora-image-generation-with-output.rst @@ -144,7 +144,7 @@ saved on disk before export. For avoiding this, we will use ``export_from_model`` function that accepts initialized model. Additionally, for using model with OpenVINO GenAI, we need to export tokenizers to OpenVINO format using `OpenVINO -Tokenizers `__ +Tokenizers `__ library. In this tutorial we will use `Stable Diffusion diff --git a/docs/notebooks/speculative-sampling-with-output.rst b/docs/notebooks/speculative-sampling-with-output.rst index 8ca9ca5bc7002c..8dd300fa4bbaff 100644 --- a/docs/notebooks/speculative-sampling-with-output.rst +++ b/docs/notebooks/speculative-sampling-with-output.rst @@ -136,7 +136,7 @@ In case, if you want run own models, you should convert them using Optimum `__ library accelerated by OpenVINO integration. More details about model preparation can be found in `OpenVINO LLM inference -guide `__ +guide `__ .. code:: ipython3 diff --git a/docs/notebooks/text-to-image-genai-with-output.rst b/docs/notebooks/text-to-image-genai-with-output.rst index a0f0af9ef41538..d43b900d9133db 100644 --- a/docs/notebooks/text-to-image-genai-with-output.rst +++ b/docs/notebooks/text-to-image-genai-with-output.rst @@ -23,7 +23,7 @@ the Hugging Face Transformers library to the OpenVINO™ IR format. For more details, refer to the `Hugging Face Optimum Intel documentation `__. 2. Run inference using the `Text-to-Image Generation -pipeline `__ +pipeline `__ from OpenVINO GenAI. diff --git a/docs/sphinx_setup/_static/download/GenAI_Quick_Start_Guide.pdf b/docs/sphinx_setup/_static/download/GenAI_Quick_Start_Guide.pdf index c5632a7e3f9627..2046f7d9427421 100644 Binary files a/docs/sphinx_setup/_static/download/GenAI_Quick_Start_Guide.pdf and b/docs/sphinx_setup/_static/download/GenAI_Quick_Start_Guide.pdf differ diff --git a/docs/sphinx_setup/api/nodejs_api/openvino-node/interfaces/Tensor.rst b/docs/sphinx_setup/api/nodejs_api/openvino-node/interfaces/Tensor.rst index 9b0e19b559cdf8..8e51702aa1baca 100644 --- a/docs/sphinx_setup/api/nodejs_api/openvino-node/interfaces/Tensor.rst +++ b/docs/sphinx_setup/api/nodejs_api/openvino-node/interfaces/Tensor.rst @@ -9,6 +9,7 @@ Interface Tensor getData(): SupportedTypedArray; getShape(): number[]; getSize(): number; + isContinuous(): boolean; } @@ -116,3 +117,19 @@ Methods * **Defined in:** `addon.ts:421 `__ + +.. rubric:: isContinuous + +* + + .. code-block:: ts + + isContinuous(): boolean; + + Reports whether the tensor is continuous or not. + + * **Returns:** boolean + + * **Defined in:** + `addon.ts:425 `__ + \ No newline at end of file