Release Intel® Extension for Transformers v1.0.0 Release · intel/intel-extension-for-transformers

Highlights

Provide the optimal model packages for large language model (LLM) such as GPT-J, GPT-NEOX, T5-large/base, Flan-T5, and Stable Diffusion
Provide the end-to-end optimized workflows such as SetFit-based sentiment analysis, Document Level Sentiment Analysis (DLSA), and Length Adaptive Transformer for inference
Support NeuralChat, a custom Chatbot based on domain knowledge fine-tuning and demonstrate less than one hour fine-tuning with PEFT on 4 SPR nodes
Demonstrate the industry-leading sparse model inference solution in MLPerf v3.0 open submission with up to 1.6x over other submissions

Features

Model Optimization
- LLM quantization including GPT-J (6B), GPT-NEOX (2.7B), T5-large, T5-base, Flan-T5, BLOOM-176B
- Enable basic Neural Architecture Search (commit 6cae)
Transformers-accelerated Neural Engine
- Support runtime dynamic quantization (commit 46fa 41c4)
- Enable GPT-J FP32/BF16/INT8 text generation inference (commit ac2c)
- Enable Stable Diffusion BF16/FP32 text-to-image inference (commit 56cf)
- Support OpenNMT FP32 to ONNX with good accuracy (commit 34d8)
Transformers-accelerated Libraries
- CPU Backend: MHA fusion for LLM to improve performance (commit 7c3d)
- GPU Backend: Supports OpenCL infrastructure, and provides matmul implementation (commit 5a60)

Productivity

Support native PyTorch model as input of Neural Engine (commit bc38)
Refine the Benchmark API to provide apple-to-apple benchmark ability. (commit e135)
Simplify end-to-end example usage (commit 6b9c)
N in M/ N x M PyTorch Pruning API enhancement (commit da4d)
Deliver engine-only wheel with size reduce 60% (commit 02ac)

Examples

End-to-end solution for Length Adaptive with Neural Engine, achieves over 11x speed up compared with BERT Base on SPR (commit 95c6)
End-to-end Documentation Level Sentiment Analysis(DLSA) workflow (commit 154a)
N in M/ N x M BERT Large and BERT Base pruning in PyTorch (commit da4d)
Sparse pruning example for Longformer with 80% sparsity (commit 5c5a)
Distillation for quantization for BERT and Stable Diffusion (commit 8856 4457)
Smooth quantization with BLOOM (commit edc9)
Longformer quantization with question-answering task (commit 8805)
Provide SETFIT workflow notebook (commit 6b9c 2851)
Support Text Generation task (commit c593)

Bug Fixing

Documentation

Validated Configurations

Provide feedback