
v1.0.0

@GregoryComer released this 18 Oct 01:03 · 8c84780

We're excited to announce the release of ExecuTorch 1.0! This release marks the official transition out of beta status and focuses on stability, ease of use, and polish, as well as expanded support for multi-modal language models.

Highlights

  • Expanded platform support for ARM64 Linux and experimental support for native x86 Windows, both for Python model preparation and runtime builds.
  • New APIs for running multimodal models on Android, iOS, and desktop.
    • Supported models include Voxtral, Gemma3, and more.
  • LoRA inference capabilities: export multiple LoRA .pte files that share a single set of foundation weights.
  • 4-bit HQQ quantization support for improved model accuracy.
  • Python and Maven (Android) package variants for Vulkan and QNN backends.
  • Experimental JavaScript support for the ExecuTorch runtime via WASM.

API Changes

  • Expanded LLM inference APIs in C++, Java, Objective-C, and Swift, including support for vision and audio modalities.
  • New ahead-of-time and runtime APIs to support shared weights between .pte files.
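
As a rough illustration of the ahead-of-time side, here is a minimal sketch that exports two LoRA variants of one base model. The tiny model is an illustrative stand-in, and the mechanism that externalizes foundation weights into a shared data file is only gestured at here, not the confirmed API:

```python
# Minimal AoT sketch: export two LoRA-variant .pte files that would share one
# set of base weights. TinyBase is an illustrative stand-in; the program/data
# separation that externalizes foundation weights is assumed, not shown.
import torch
from torch.export import export
from executorch.exir import to_edge_transform_and_lower


class TinyBase(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.proj(x)


for adapter_name in ("adapter_a", "adapter_b"):
    # In practice: base weights plus this adapter's LoRA deltas.
    ep = export(TinyBase().eval(), (torch.randn(1, 16),))
    et_program = to_edge_transform_and_lower(ep).to_executorch()
    # Each .pte would then carry only adapter-specific data once the shared
    # foundation weights live in a common data file loaded once at runtime.
    with open(f"{adapter_name}.pte", "wb") as f:
        f.write(et_program.buffer)
```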

Build

  • Add support for top-level CMake targets for backends, kernels, and extensions, including kernel selective build via CMake options.
  • Remove Buck as a dependency for CMake builds.
  • Link Vulkan and QNN backends in Python runtime builds when enabled.
  • Expose CMake presets for common platforms and use cases.
  • Consolidate and clean up CMake options and build structure.

Backend Delegates

Arm

  • Ethos-U backend
    • The Ethos-U backend has reached production quality in this release, with stable operator coverage, CI validation, and verified integration for embedded deployments.
  • VGF / VKML backend
    • ExecuTorch’s Arm VGF backend is now documented and aligned with Arm’s ML SDK for Vulkan (VKML).
    • Includes end-to-end lowering to VGF, plus extracting and running VGF payloads from a .pte file.
    • Correct handling for rank-0 tensors and improved input/output handling in the portable executor runner.
    • Runtime delegate added for Linux and macOS targets with the VKML emulator.
    • Introduced comprehensive testing of the VGF AoT flow, ensuring functional and performance validation across representative models.
  • Partitioner & API polish
    • Partitioner/compile interfaces simplified and refactored for easier extension and maintainability.
  • Expanded TOSA coverage
    • Added and verified mappings for arithmetic, pooling, and reduction operators, improving model compatibility and TOSA lowering fidelity.
    • Ahead-of-time (AoT) path extended significantly to support the full set of Arm backend operators (now exceeding 80% Edge dialect coverage).
  • Quantization pipeline
    • Broader data type support, refined logging, and clearer documentation.
    • Experimental 16A8W (16-bit activation, 8-bit weight) quantization support added, with tests for linear, add, and related INT16 ops.
  • Portable runner enhancements
    • Added support for models with non-tensor inputs, broadening applicability to control-flow and hybrid workloads.
  • Testing & CI
    • Expanded unit tests, out-of-the-box tests, and model verification scripts.
    • Broader CI coverage for quantized and general workloads.

Cadence

  • Extend support for audio operators to 50+, covering most audio models.
  • Support the first vision operator (softmax).
  • Various improvements to the AoT compiler.
  • Optimized operators deliver 20-50x speedups over the unoptimized reference implementations on 7 OSS models.

Core ML

  • Support block-wise, channel-wise, and palette/codebook quantization of linear and embedding layers with torchao APIs.
  • Support for enumerated shapes, providing limited shape dynamism on the ANE.
  • Improve support for dim order.
  • Fix crashes for several operators with mixed dtypes.
  • Support models and partitions with no user inputs.
  • Log when operator nodes are not partitioned and why.
  • Add a lower_full_graph partitioner option to assert that all model operators run on Core ML.
  • Add a take_over_constant_data option to tell the Core ML delegate not to consume weight data (see the sketch after this list).
  • Various small bug fixes and improvements.
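
A hedged sketch of wiring up the two partitioner options above; the import path and constructor follow current ExecuTorch conventions but should be verified against your installed version:

```python
# Sketch of the new Core ML partitioner options. The option names come from
# these notes; the import path is assumed from the current repo layout.
import torch
from torch.export import export
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.apple.coreml.partition import CoreMLPartitioner


class SmallNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, 3)

    def forward(self, x):
        return torch.relu(self.conv(x))


ep = export(SmallNet().eval(), (torch.randn(1, 3, 32, 32),))
partitioner = CoreMLPartitioner(
    lower_full_graph=True,          # assert every operator lands on Core ML
    take_over_constant_data=False,  # keep weight data outside the delegate payload
)
et_program = to_edge_transform_and_lower(ep, partitioner=[partitioner]).to_executorch()
```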

MediaTek

  • Support weight sharing for Llama.
  • Support Qwen, Phi, and Gemma models.
  • Update MTK tool version.

MPS

  • No major updates since 0.7.

NXP

  • Support for additional operators: abs, _adaptive_avg_pool2d, addmm, add.Tensor, avg_pool2d, cat, clone, constant_pad_nd, convolution, hardtanh, max_pool2d, max_pool2d_with_indices, mean.dim, mm, relu, tanh, view_copy, and sigmoid.
  • Support i.MX RT700 platform with MCUXpresso SDK 25.06.
  • Add support for CNN models: MobileNetV2, CifarNet, and MLPerf Tiny.
  • Update documentation to fix links and add a tutorial for on-device deployment.

OpenVINO

  • New backend delegate for inference acceleration on Intel platforms, supporting Intel CPUs, integrated GPUs, and NPUs; validated on Intel Core Ultra Series 2 AI PCs.
  • Run Llama, Stable Diffusion LCM, and YOLO models.
  • Expanded aten operator coverage for broader model compatibility.
  • Introduce 4-bit (INT4) weight compression for LLMs and 8-bit (INT8) quantization for vision models via OpenVINOQuantizer (see the sketch after this list).
  • Add new end-to-end examples for YOLO and Stable Diffusion, instructions for exporting Llama models, and updated build documentation that uses the official OpenVINO release package.
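
A hedged sketch of the INT8 vision-model flow with OpenVINOQuantizer, using the stock PT2E entry points; the quantizer's import path and defaults are assumptions:

```python
# INT8 post-training quantization via the standard PT2E pipeline.
# prepare_pt2e/convert_pt2e are stock PyTorch APIs; the OpenVINOQuantizer
# import path and its default configuration are assumed here.
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.backends.openvino.quantizer import OpenVINOQuantizer  # assumed path

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

# Capture a graph, insert observers, calibrate, then convert to a quantized graph.
gm = torch.export.export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(gm, OpenVINOQuantizer())
prepared(*example_inputs)  # calibration with representative data
quantized = convert_pt2e(prepared)
```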

Qualcomm

  • Support 26 new models (Albert, Bert, Distilbert, Eurobert, Roberta, Gemma3-1B, Llama-3.2-1B-instruct, Llama-3.2-3B-instruct, Qwen2.5-0.5B, Qwen2.5-1.5B, Qwen3-0.6B, Qwen3-1.7B, Phi-4-mini-instruct, SmolLM2-135M, SmolLM3-3B, Google's T5, CvT (Convolutional Vision Transformer), DeiT (Distilled Data-efficient Image Transformer), DiT (Document Image Transformer), EfficientNet, FocalNet, Apple's MobileViT and MobileViTv2, PVT (Pyramid Vision Transformer), Swin Transformer, and Whisper).
    1. Optimizations including MaskedSoftmax.
    2. SeqMSE.
    3. TorchAO SpinQuant.
    4. Mixed precision, including linear layers and KV cache.
    5. Support for lookahead decoding.
  • Support for custom operators.
  • Support for 92/116 HTP operators.
  • Support for automatic installation of model preparation dependencies with the executorch pip package.
  • Add runtime options to adjust log level, performance mode, and profiling level.
  • Upgrade to QNN SDK 2.37.

Samsung

  • New backend delegate for Samsung Exynos (#13677), which enables inference of ExecuTorch models via the DSP/NPU of Samsung SoCs.
    • Currently supported chipsets:
      • Exynos 2500 (E9955)
    • Includes support for 44 operators.
  • Enable support for statically 8-bit quantized models (#14464).
    • Quantization support included for 34 operators.
  • Currently validated models:
    • DeepLab V3, EDSR, Inception V3, Inception V4, MobileNet V2, MobileNet V3, ResNet 18, ResNet 50, ViT, Wav2Letter

Vulkan

  • We have partnered with Samsung's GPU compute team to deliver both general and Samsung-GPU-specific optimizations to the ExecuTorch Vulkan backend. The following improvements were introduced through this partnership:
    1. Introduce operator fusions for fusing clamp to convolution, binary, and other clamp operators (#14960).
    2. Compute shader optimizations for convolution (#14724).
    3. Add support for 2D reduction operators (#12860).
  • Add support for 8da4w (8-bit dynamically per-channel quantized activations, 4-bit per-group quantized weights) quantized linear layers, implemented using the accelerated integer dot product Vulkan extension.
  • Introduce an automatic half-precision mode: setting the force_fp16 compile option causes the Vulkan backend to internally convert fp32 tensors to fp16 (see the sketch after this list).
    1. Inputs are still expected to be fp32 (they will be automatically converted to fp16 within the delegate).
    2. Outputs will be converted back to fp32 as they are returned.
  • Add support for high-dimensional tensors: the Vulkan backend now supports tensors with up to 8 dimensions, up from the previous limit of 4.
  • Reduce peak memory usage during model loading by roughly 2-3x by allocating GPU resources lazily - see #13474 for more details.
  • Improved operator support:
    1. Add support for aten.expand_copy.
    2. Add support for aten.max_pool2d; previously only aten.max_pool2d_with_indices was supported.
    3. aten.cat can now handle any number of input tensors; previously, only up to 3 tensors were allowed.
    4. Add support for grouped convolutions.
  • Operator-level optimizations:
    1. Re-implemented SDPA with optimized compute shaders (#14130).
    2. Improve local work group size selection for matrix multiplication and linear (#13378).
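
A hedged sketch of enabling the half-precision mode described above; the partitioner import path and its dict-of-options constructor are assumptions, while force_fp16 is the option named in these notes:

```python
# Enable the automatic fp16 mode on the Vulkan backend. The constructor that
# accepts a compile-options dict is assumed; 'force_fp16' is quoted above.
import torch
from torch.export import export
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.vulkan.partitioner.vulkan_partitioner import (
    VulkanPartitioner,  # assumed import path
)

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU()).eval()
ep = export(model, (torch.randn(1, 128),))
partitioner = VulkanPartitioner({"force_fp16": True})  # fp32 held as fp16 internally
et_program = to_edge_transform_and_lower(ep, partitioner=[partitioner]).to_executorch()
# Callers still pass fp32 inputs and receive fp32 outputs; conversion happens
# inside the delegate.
```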

XNNPACK

  • Support batch norm fusion with linear operators.
  • Use a new group-based partitioning algorithm to resolve occasional errors with partitioning of quantized graphs.
  • Support channels-last dim order tensors.
  • Optimize transpose pairs.
  • Use KleidiAI kernels on ARM platforms for improved GEMM performance.
  • Support runtime control of the workspace sharing option for memory reduction via the backend options interface.
  • Update the XNNPACK library version.
  • Various bug fixes and optimizations.

Android

  • Support image and audio inputs to multi-modal models.
  • Add an API for loading separate data files (PTD).
  • Add an API for inspecting method metadata.
  • Update the Java API to throw exceptions in more cases, instead of returning empty results or hitting native crashes.
  • Provide Vulkan and QNN backend AAR packages in the Maven repository.

iOS

  • Various Swift and Objective-C Runtime API improvements and stabilization.
  • Add the ExecuTorchLLM framework, providing experimental Swift and Objective-C APIs for text and multimodal LLM generation.

Developer Tools

  • Intermediate numerical debugging via the calculate_numeric_gap API (see the sketch after this list):
    • Validate intermediate outputs on delegated models.
    • Use the module exported via torch.export() as the source of reference labels.
    • Support partial comparison when two model graphs are not exactly the same.
  • ETRecord generation support with to_edge, to_edge_transform_and_lower, and executorch.export() lowering APIs.
  • ETDump generation support in the executorch.runtime Python APIs when compiled in.
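
A hedged sketch of the numeric-gap workflow with the devtools Inspector; the file paths are illustrative, and the calculate_numeric_gap signature is an assumption based on the description above:

```python
# Compare runtime intermediate outputs against the torch.export() reference.
# Paths are illustrative; the calculate_numeric_gap parameters are assumed.
from executorch.devtools import Inspector

inspector = Inspector(
    etdump_path="model.etdump",  # runtime dump collected with ETDump enabled
    etrecord="model.etrecord",   # AoT record generated during lowering
)
# Partial comparison is supported when the two graphs are not identical.
gaps = inspector.calculate_numeric_gap(distance="MSE")  # assumed parameter
```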

Model Support

  • Add examples for Voxtral, Gemma3, LFM2, YOLO, and SmolLM.

Ops and kernels

  • Support for DType-selective build with CMake.
  • Half/BFloat16 parity with Float32 in the portable and optimized kernel libraries.
  • Non-fatal error handling for unsupported dtypes.
  • Add the _upsample_bilinear2d_aa operator kernel.