Releases: intel/intel-extension-for-transformers

Intel® Extension for Transformers v1.1 Release

14 Jul 10:26
4269f96
  • Highlights
  • Features
  • Productivity
  • Examples
  • Bug Fixing
  • Documentation

Highlights

  • Created NeuralChat, the first commercially friendly 7B chat model, ranked at the top of the LLM leaderboard
  • Supported efficient fine-tuning and inference on Xeon SPR and Habana Gaudi
  • Enabled 4-bit LLM inference in a plain C++ implementation, outperforming llama.cpp
  • Supported quantization for a broad set of LLMs with the improved lm-evaluation-harness for multiple frameworks and data types (see the evaluation sketch below)
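
As a point of reference for the evaluation flow mentioned above, here is a minimal sketch using the upstream EleutherAI lm-evaluation-harness (v0.3-era) Python API. The improved harness shipped with this release may expose different entry points; the model and task below are illustrative assumptions.

    # Sketch only: upstream lm-evaluation-harness API; the improved harness
    # in this release may differ. Model/task names are assumptions.
    from lm_eval import evaluator

    results = evaluator.simple_evaluate(
        model="hf-causal",                          # Hugging Face causal-LM backend
        model_args="pretrained=facebook/opt-2.7b",  # any HF checkpoint
        tasks=["lambada_openai"],
        batch_size=8,
    )
    print(results["results"])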

Features

  • Model Optimization
    • Language modeling quantization for OPT-2.7B, OPT-6.7B, LLaMA-7B (commit 6a9608), MPT-7B, and Falcon-7B (commit f6ca74)
    • Text2text-generation quantization for T5, Flan-T5 (commit a9b69b)
    • Text-generation quantization for Bloom (commit e44270), MPT (commit 469ac6)
    • Enable QAT for Stable Diffusion (commit 2e2efd)
    • Replace PyTorch Pruner with INC Pruner (commit 9ea1e3)
  • Transformers-accelerated Neural Engine
  • Transformers-accelerated Libraries
    • MHA kernels for static quantization, dynamic quantization, and BF16 (commit 0d0932, e61e4b)
    • Support dynamic quantization matmul and post-ops (commit 4cb9e4, cf0400, 9acfe1)
    • INT4 weight-only kernels (commit 3b7665) and fusion (commit f00d87); see the quantization sketch after this list
    • Support dynamic quantization op (commit 6fcc15)
    • Add AVX2 kernels for Windows (commit bc313c)
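
To make the INT4 weight-only scheme concrete, here is a minimal NumPy sketch of group-wise symmetric 4-bit weight quantization. It illustrates the idea only and is not the kernel implementation referenced above; the group size and the symmetric [-7, 7] range are assumptions.

    import numpy as np

    def int4_quantize(w, group_size=32):
        # Group-wise symmetric INT4: each group of `group_size` weights
        # shares one FP32 scale; values are rounded into [-7, 7].
        groups = w.reshape(-1, group_size)
        scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
        q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
        return q, scales

    def int4_dequantize(q, scales, shape):
        return (q.astype(np.float32) * scales).reshape(shape)

    w = np.random.randn(768, 768).astype(np.float32)
    q, s = int4_quantize(w)
    w_hat = int4_dequantize(q, s, w.shape)  # w_hat ≈ w at ~4 bits per weight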

Productivity

  • Enable LoRA fine-tuning (commit 664f4b), multi-node fine-tuning (commit 6288fd), and Xeon and Habana inference (commit 8ea55b) for Chatbot
  • Enable docker for Chatbot (commit 6b9522, 37b455)
  • Support Parameter-Efficient Fine-Tuning (PEFT) (commit 27bd7f); see the LoRA sketch after this list
  • Update Torch and TensorFlow (commit f54817)
  • Add Harness evaluation for PyTorch text-generation/language modeling (commit 736921, c7c557, b492f5) and ONNX (commit a944fa)
  • Add summarization evaluation for PyTorch (commit 062e62)
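
As a hedged illustration of the PEFT/LoRA support above, the sketch below wires LoRA adapters into a causal LM with the Hugging Face peft library; whether the Chatbot recipe uses exactly these calls, this base model, or these hyperparameters is an assumption.

    # Sketch: LoRA adapters via Hugging Face peft; model choice and
    # hyperparameters are illustrative assumptions.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b")
    lora = LoraConfig(
        r=8,                    # adapter rank
        lora_alpha=16,          # scaling factor
        lora_dropout=0.05,
        task_type="CAUSAL_LM",  # peft targets OPT's q_proj/v_proj by default
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only the low-rank adapters train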

Examples

  • Early Exit: TangoBERT, Separating Weights for Early-Exit Transformers (SWEET) (commit dfbdc5, c0eaa5); see the early-exit sketch after this list
  • Electra fp32 & bf16 inference (commit e09c96)
  • GPT-NeoX and Dolly-v2-7B text-generation inference (commit 402bb9)
  • Stable Diffusion v2.1 inference (commit 5affab), image-to-image (commit a13e11), and inference with dynamic quantization (commit bfcb2e)
  • ONNX Whisper-large quantization (commit 038be0)
  • 8-layer MiniLM inference (commit 0dd104)
  • Add compression-aware training (commit dfb53f), sparsity-aware training (commit 7b28ef), and fine-tuning and inference workflows (commit bf666c)
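
A minimal PyTorch sketch of the early-exit idea behind TangoBERT/SWEET-style examples: each encoder layer gets a small classifier head, and inference stops as soon as a head's confidence clears a threshold. The mean-pooling, threshold, and batch-size-1 control flow are simplifying assumptions, not the papers' exact methods.

    import torch

    def early_exit_forward(layers, exit_heads, x, threshold=0.9):
        # After each encoder layer, a small head predicts the label;
        # stop as soon as its softmax confidence clears the threshold.
        for layer, head in zip(layers, exit_heads):
            x = layer(x)
            probs = torch.softmax(head(x.mean(dim=1)), dim=-1)  # mean-pool tokens
            conf, pred = probs.max(dim=-1)
            if conf.item() >= threshold:
                return pred, probs
        return pred, probs  # no early exit: use the final layer's prediction

    layers = torch.nn.ModuleList(
        torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
        for _ in range(4))
    heads = torch.nn.ModuleList(torch.nn.Linear(64, 2) for _ in range(4))
    pred, probs = early_exit_forward(layers, heads, torch.randn(1, 16, 64))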

Bug Fixing

  • Fix Neural Engine error with GCC 13 (commit 37a4a3) and GPU compilation error (commit 0f38eb)
  • Fix quantization for transfor...

Intel® Extension for Transformers v1.0.1 Release

02 Jun 09:31
  • Bug Fixing
  • Improvement

Bug Fixing

  • Fix BERT Large accuracy issue (commit ddc4a5)
  • Fix Dynamic Quantization UnitTest (commit d83040)

Improvement

  • Enable new fusion patterns for GPT-J (commit c73605)
  • Refine Chatbot data loading and data cleaning (commit f70205, 0997ac)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Windows 10
  • Python 3.8, 3.9
  • TensorFlow 2.10.1
  • PyTorch 1.13.1+cpu
  • Intel® Extension for PyTorch 1.13.1+cpu

Intel® Extension for Transformers v1.0.0 Release

04 Apr 17:03
c9ec6a4
  • Highlights
  • Features
  • Productivity
  • Examples
  • Bug Fixing
  • Documentation

Highlights

  • Provide optimal model packages for large language models (LLMs) such as GPT-J, GPT-NeoX, T5-large/base, Flan-T5, and Stable Diffusion
  • Provide end-to-end optimized workflows such as SetFit-based sentiment analysis, Document Level Sentiment Analysis (DLSA), and Length Adaptive Transformer for inference
  • Support NeuralChat, a custom chatbot based on domain-knowledge fine-tuning, and demonstrate fine-tuning in under one hour with PEFT on 4 SPR nodes
  • Demonstrate the industry-leading sparse model inference solution in the MLPerf v3.0 open submission, with up to 1.6x speedup over other submissions

Features

  • Model Optimization
  • Transformers-accelerated Neural Engine
    • Support runtime dynamic quantization (commit 46fa, 41c4); see the sketch after this list
    • Enable GPT-J FP32/BF16/INT8 text generation inference (commit ac2c)
    • Enable Stable Diffusion BF16/FP32 text-to-image inference (commit 56cf)
    • Support OpenNMT FP32-to-ONNX conversion with good accuracy (commit 34d8)
  • Transformers-accelerated Libraries
    • CPU backend: MHA fusion for LLMs to improve performance (commit 7c3d)
    • GPU backend: support OpenCL infrastructure and provide a matmul implementation (commit 5a60)
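
To clarify what runtime dynamic quantization means, here is a minimal NumPy sketch: the INT8 scale is derived from each activation tensor at inference time instead of from offline calibration. This is the concept only, not the Neural Engine implementation.

    import numpy as np

    def dynamic_quantize(x):
        # Per-tensor symmetric INT8: scale derived from the live activation range.
        scale = np.abs(x).max() / 127.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    x = np.random.randn(4, 768).astype(np.float32)  # activations seen at runtime
    q, scale = dynamic_quantize(x)
    x_hat = q.astype(np.float32) * scale            # dequantized, x_hat ≈ x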

Productivity

  • Support native PyTorch model as input of Neural Engine (commit bc38)
  • Refine the Benchmark API to provide apples-to-apples benchmarking (commit e135)
  • Simplify end-to-end example usage (commit 6b9c)
  • Enhance the N-in-M / NxM PyTorch pruning API (commit da4d)
  • Deliver an engine-only wheel with 60% size reduction (commit 02ac)

Examples

  • End-to-end solution for Length Adaptive Transformer with Neural Engine, achieving over 11x speedup compared with BERT Base on SPR (commit 95c6)
  • End-to-end Document Level Sentiment Analysis (DLSA) workflow (commit 154a)
  • N-in-M / NxM BERT Large and BERT Base pruning in PyTorch (commit da4d)
  • Sparse pruning example for Longformer with 80% sparsity (commit 5c5a)
  • Distillation for quantization for BERT and Stable Diffusion (commit 8856, 4457)
  • Smooth quantization with BLOOM (commit edc9); see the smoothing sketch after this list
  • Longformer quantization with question-answering task (commit 8805)
  • Provide SetFit workflow notebook (commit 6b9c, 2851)
  • Support Text Generation task (commit c593)
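
The smooth quantization example above follows the SmoothQuant idea: rescale each input channel so activation outliers migrate into the weights while the matrix product is unchanged. A minimal NumPy sketch follows; the alpha value is an assumption.

    import numpy as np

    def smooth(X, W, alpha=0.5):
        # Per-channel scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha):
        # (X / s) @ (diag(s) @ W) == X @ W, but X / s has milder outliers
        # and quantizes to INT8 with less error.
        act_max = np.abs(X).max(axis=0)   # activation range per input channel
        w_max = np.abs(W).max(axis=1)     # weight range per input channel
        s = act_max**alpha / w_max**(1 - alpha)
        return X / s, W * s[:, None]

    X = np.random.randn(16, 768).astype(np.float32)
    W = np.random.randn(768, 768).astype(np.float32)
    X_s, W_s = smooth(X, W)
    assert np.allclose(X_s @ W_s, X @ W, atol=1e-2)  # product preserved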

Bug Fixing

  • Reduce BERT QAT tuning duration (commit 6b9c)
  • Fix Length Adaptive Transformer regression (commit 5473)
  • Fix accelerated lib compile error when enabling VTune (commit b5cd)

Documentation

  • Refine contents of all readme files
  • API helper based on the GitHub.io page (commit e107)
  • DevCatalog for Mt. Whitney (commit acb6)

Validated Configurations

  • CentOS 8.4 & Ubuntu 20.04 & Windows 10
  • Python 3.7, 3.8, 3.9, 3.10
  • Intel® Extension for TensorFlow 2.10.1, 2.11.0
  • PyTorch 1.12.0+cpu, 1.13.0+cpu
  • Intel® Extension for PyTorch 1.12.0+cpu, 1.13.0+cpu

Intel® Extension for Transformers v1.0b Release

12 Dec 02:56
  • Highlights
  • Features
  • Productivity
  • Examples
  • Bug Fixing
  • Documentation

Highlights

  • Intel® Extension for Transformers provides more compression examples for popular applications like Stable Diffusion. For Stable Diffusion, we support INT8 quantization with PyTorch and BF16 fine-tuning with Intel® Extension for PyTorch.

Features

  • Pruning/Sparsity
    • Support structured sparsity pattern N:M on PyTorch (25d5e4b); see the pruning sketch after this list
    • Support structured sparsity pattern NxM on PyTorch (25d5e4b)
  • Transformers-accelerated Neural Engine
    • Support inference on Windows (fc580d5)
  • Transformers-accelerated Libraries
    • Support INT8 Softmax operator (fece837)
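
A minimal NumPy sketch of the N:M structured sparsity pattern named above, shown for 2:4: in every contiguous group of M weights, only the N largest magnitudes survive, which is the regularity that sparsity-aware kernels exploit. Grouping along the flattened weight matrix is a simplifying assumption.

    import numpy as np

    def prune_n_of_m(w, n=2, m=4):
        # Keep the n largest-magnitude weights in each contiguous group of m,
        # zeroing the rest (the 2:4 pattern by default).
        groups = w.reshape(-1, m)
        drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]  # smallest m-n
        mask = np.ones_like(groups)
        np.put_along_axis(mask, drop, 0.0, axis=1)
        return (groups * mask).reshape(w.shape)

    w = np.random.randn(64, 64).astype(np.float32)
    w_sparse = prune_n_of_m(w)  # exactly half of every 4-weight group is zero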

Productivity

  • Simplify the integration with Alibaba BladeDISC

Examples

Bug Fixing

  • Fix Protobuf and ONNX version dependency issue
  • Fix memory leak in Neural Engine

Documentation

  • Create Notebook for Pruning/Compression Orchestration/IPEX Quantization
  • Refine the user guide and compression example

Validated Configurations

  • CentOS 8.4 & Ubuntu 20.04 & Windows 10
  • Python 3.7, 3.8, 3.9
  • Intel® Extension for TensorFlow 2.9.1, 2.10.0
  • PyTorch 1.11.0+cpu, 1.12.0+cpu, 1.13.0+cpu, Intel® Extension for PyTorch 1.12.0+cpu, 1.13.0+cpu

Intel® Extension for Transformers v1.0a Release

23 Nov 16:23
59544b0
  • Highlights
  • Features
  • Productivity
  • Examples

Highlights

  • Intel® Extension for Transformers provides a rich set of model compression techniques and leading sparsity-aware libraries and a neural engine to accelerate the inference of Transformer-based models on Intel platforms. We published two papers at NeurIPS 2022 with the source code released:
    • Fast DistilBERT on CPUs: outperforms the existing state-of-the-art Neural Magic DeepSparse runtime by up to 50%, and delivers 7x better performance on c6i.12xlarge (Ice Lake) than on c6a.12xlarge (AMD Milan)
    • QuaLA-MiniLM: outperforms BERT-base at ~3x smaller size and demonstrates up to 8.8x speedup with <1% accuracy loss on the SQuAD1.1 task

Features

  • Pruning/Sparsity
    • Support Distributed Pruning on PyTorch
    • Support Distributed Pruning on TensorFlow
  • Quantization
    • Support Distributed Quantization on PyTorch
    • Support Distributed Quantization on TensorFlow
  • Distillation (see the distillation-loss sketch after this list)
    • Support Distributed Distillation on PyTorch
    • Support Distributed Distillation on TensorFlow
  • Compression Orchestration
    • Support Distributed Orchestration on PyTorch
  • Neural Architecture Search (NAS)
    • Support auto distillation with NAS and flash distillation on PyTorch
  • Length Adaptive Transformer (LAT)
    • Support Dynamic Transformer on SQuAD1.1 on PyTorch
  • Transformers-accelerated Neural Engine
    • Support inference with sparse GEMM fusion patterns
    • Support automatic benchmarking of mixed sparse-and-dense models
  • Transformers-accelerated Libraries
    • Support 1x4 block-wise sparse VNNI-INT8 GEMM kernels with post-ops
    • Support 1x16 block-wise sparse AMX-BF16 GEMM kernels with post-ops
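
For context on the distillation features above, here is a minimal PyTorch sketch of the standard temperature-scaled distillation loss; the distributed variants presumably parallelize training around this kind of objective, and the temperature value is an assumption.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # KL divergence between temperature-softened teacher and student
        # distributions; the T*T factor keeps gradients comparable across T.
        return F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)

    student = torch.randn(8, 10, requires_grad=True)  # toy logits
    teacher = torch.randn(8, 10)
    distillation_loss(student, teacher).backward()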

Productivity

  • Support seamless Transformers-extended APIs
  • Support experimental model conversion from PyTorch INT8 models to ONNX INT8; see the sketch after this list
  • Support VTune performance tracing for sparse GEMM kernels
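
As a reference point for that experimental conversion, the sketch below produces a PyTorch INT8 model with the stock torch.quantization dynamic-quantization API; this is the kind of model such a conversion consumes, and the exporter's own entry point is not shown because it is not documented here.

    import torch

    # Stock PyTorch dynamic quantization: Linear weights are stored as INT8
    # and activations are quantized on the fly at inference time.
    model = torch.nn.Sequential(torch.nn.Linear(768, 768), torch.nn.ReLU())
    qmodel = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8)
    print(qmodel)  # Linear layers replaced by dynamically quantized versions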

Examples

Validated Configurations

  • CentOS 8.4 & Ubuntu 20.04
  • Python 3.7, 3.8, 3.9, 3.10
  • TensorFlow 2.9.1, 2.10.0, Intel® Extension for TensorFlow 2.9.1, 2.10.0
  • PyTorch 1.12.0+cpu, 1.13.0+cpu, Intel® Extension for PyTorch 1.12.0