Release Intel® Extension for Transformers v1.2 Release · intel/intel-extension-for-transformers

Highlights
Features
Productivity
Examples
Bug Fixing
API Modification
Documentation

Highlights

NeuralChat has been showcased in Intel Innovation’23 Keynote and Google Cloud Next '23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors. The chatbot solution has been integrated into LLM as a service (LLMaaS), providing the smooth user experience to build GenAI/LLM applications by leveraging Intel latest Xeon Scalable Processors and Gaudi2 from Intel Developer Cloud.
NeuralChat offers a comprehensive pipeline to build an end-to-end chatbot applications with a rich set of pluggable features such as speech cloning & interaction (EN/CN), knowledge retrieval, query caching, and security guardrail. These features allow you to create a custom chatbot from scratch within minutes, therefore significantly improving the chatbot development productivity.
LLM runtime extends Transformers API to provide seamless weight-only low precision inference for Hugging Face transformer-based models including LLMs. We improve LLM runtime with more comprehensive kernel support on low precisions (INT8/FP8/INT4/FP4/NF4), while keeping full compatibility with GGML. LLM runtime delivers 25ms/token with int4 llama, 22ms/token with int4 GPT-J on Intel Xeon Scalable Processors. Therefore, providing a complementary and extremely optimized LLM runtime solution for Intel architectures.

Features

Neural Chat
- Support ASR/TTS on CPU and HPU (fb619e5 56685a )
- Added docker for chatbot on Xeon SPR and Habana Gaudi (59fc92e ad2ee1)
- Refine Chatbot workflow and use NeuralChat API (53bed4 e95fc32 )
- Implement API python sdk, weight only quantization and AMP for Neural-Chat. (08ba5d85 )
Model Optimization
- Add GPTQ/TEQ/WOQ quantization with plenty examples (b4b2fcc 1bcab14 )
- Enhance the ITREX quantization API as well as LLMRuntime, users can now obtain a quantized model using AutoModelForCausalLM.from_pretrained. (be651b f4dc78 )
- Support GPT-J pruning (802ec0d2 )
LLM Runtime
- Enable FFN fusion LLMs (277108 )
- Enabled Tensor Parallelism for 4bits GPT-J on 2 sockets (fe0d65c )
- Implement AMX INT8/BF16 MHA (c314d6c )
- Support asymmetric models in LLM Runtime (93ca55 )
- Jblas and Qbits support nf4 and s4-fullrange weight compression. (ff7af86 )
- Enhance Beam-search early-stopping mechanisms (cd4c33d )

Productivity

ITREX moved to fully public development, welcome to contribute to us. (90ca31 )
Support streaming mode for neuralchat (f5892ec )
Support Direct Preference Optimization to improve accuracy. (50b5b9 )
Support query cache for chatbot (1b4463 )
Weight-only support for Pytorch Framework (3a064fa )
Provide mix INT8 & BF16 inference mode for stable diffusion. (bd2973 )
Supported Stable Diffusion v1.4/v1.5/v2.1 and QAT inference on Linux/windows (02cc59 )
Update Onednn to v3.3 (e6d8a4 )
Weight-only kernel support INT8 quantization (6ce8b13 )
Enable flash attention like kernel in weight-only (0ef3942)
Weight-only kernel ISA based dispatcher (ff7af86 )
Support 4bits per-channel quantization (4e164a8 )

Examples

Add Falcon, ChatGLM CLM examples (c3b196 )
Enabled code-generation example with Docker and integrated bigcode/lm-eval (c569fd5 0b3450 )
Weight-only ChatGLM-V1/V2, BLOOM-7B, MPT-30B, Llama2-7B, Llama2-70B, Falcon-40B, Dolly-V2, Starcoder-15B and OPT series examples (9a2cfa 793629 ac5744f 96d424 d4fb27 2a82ee0 e4eb09f d4fb27 578162 f5df02 )
Support Intel/neural-chat-7b-v1-1 model in ChatBot (126d07b )
Add fine-tuning for Text-to-Speech(TTS) task in NeuralChat (1dac9c6 e39fec90 )
Support GPT-J NeuralChat in Habana (9ef6ad8 )
Enable MPT peft LORA finetune in Gaudi (3dc184e )
Add code-generation finetuning pipeline (c070a8 )
E2E Talking Bot example on Windows PC (be2a267 )

Bug Fixing

Fixed issues from Cobalt, the 3-party company is hired by Intel to do Penetration testing. (51a1b88 )
Fix windows compile issues (bffa1b0 )
Fix ordinals and conjunctions in tts normalizer (0892f8a )
Fix Habana finetuning issues (2bbcf51 )
Fix bugs in RAG code for converting the prompt (bfad5c )
Fix normalizer: year, punctuation after number, end token (775a12 )
Fix Graph Model quantization on AVX2-only Platforms (3c84ec6 )

API Modification

Update the input of the 'device' parameter in the NeuralChat fine-tuning API, changing it from 'habana' to 'hpu' (96dabb0)
Change default values of do_lm_eval, lora_all_linear and use_fast_tokenizer in ModelArguments from False to True. (52f9f74)

Documentation

Add notebooks for optimization of NeuralChat on SPR, HPU, and A100. (52f9f74 7218806 d156e9a )
Add Readme and UT for NeuralChat. (daff796 49336d3 9b81f05 )

Validated Configurations

Provide feedback