Releases: quic/efficient-transformers
release/v1.20.0
Newly Onboarded Models
• Added support for Llama-4-Scout-17B-16E-Instruct
o Sample script for Text only (Recommended class for testing is QEFFAutoModelForImageTextToText).
o Sample script for Image + Text (Recommended class for testing is QEFFAutoModelForImageTextToText).
o Single QPC and dual QPC support (please check the comments in the example script for running single QPC).
o Added support for chunk attention in Llama4.
o Continuous batching and multi-batch execution are planned for rel#1.21.0.
o With the redefined interface between QEFF and vLLM, multiple images can be run in a single prompt; please follow the (example) and see the sample completion below.
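Chunk attention restricts each query position to keys within its own fixed-size chunk, applied causally. The pattern can be sketched in pure Python (a conceptual illustration only, not QEfficient's implementation; the function name and signature are hypothetical):

```python
# Conceptual sketch of chunk attention: the sequence is split into
# fixed-size chunks, and each query position attends causally only to
# key positions inside its own chunk.
def chunked_causal_mask(seq_len: int, chunk_size: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        chunk_start = (q // chunk_size) * chunk_size
        for k in range(chunk_start, q + 1):  # causal, limited to the current chunk
            mask[q][k] = True
    return mask
```

Note that attention restarts at each chunk boundary, which bounds the attention span (and the KV-cache reach) per chunk.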
• Added support for Grok-1
o Since the architecture for this model is Grok1ModelForCausalLM, it can be executed using QEffAutoModelForCausalLM.
• Added support for Gemma3
o Sample script for Text only (Recommended class for testing is QEFFAutoModelForImageTextToText).
o Sample script for Image + Text.
o Added support for sliding window.
o Continuous batching and multi-batch execution are planned for rel#1.21.0.
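Sliding-window attention differs from chunk attention in that each query attends to a fixed number of most-recent positions relative to itself, rather than to a fixed chunk. A minimal pure-Python sketch (conceptual only; the function name is hypothetical):

```python
# Conceptual sketch of sliding-window attention: each query position q
# may attend causally to at most the last `window` positions (including itself).
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[q][k] is True when query q may attend to key k."""
    return [[q - window < k <= q for k in range(seq_len)]
            for q in range(seq_len)]
```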
• Added support for Granite Vision models
o Sample script
• Added support for Granite MOE models
New Features
• Upgraded the Transformers version to 4.51.3.
• SpD, multiprojection heads
o Implemented post-attention hidden size projections to speculate tokens ahead of the base model.
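In speculative decoding, the drafted tokens are then verified by the base model, which accepts the longest matching prefix and substitutes its own token at the first mismatch. A simplified greedy-verification sketch (illustration only; names and the simplified acceptance rule are assumptions, not the library's implementation):

```python
# Simplified greedy speculative-decoding acceptance: keep drafted tokens
# while they match the base model's own (greedy) predictions, then take
# the base model's token at the first mismatch.
def verify_draft(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    accepted = []
    for drafted, target in zip(draft_tokens, target_tokens):
        if drafted == target:
            accepted.append(drafted)
        else:
            accepted.append(target)  # base model's correction ends the window
            break
    return accepted
```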
• Added compilation support for the io_encrypt flag
o Added support for the Model-IP I/O encryption feature using qaic-exec (compile only).
o Users can now directly pass the --io-encrypt flag in both high-level APIs (compile) and command-line APIs (infer and compile).
• Support for separate prefill and decode compilation
o Added support for separate prefill and decode compilation for encoder (vision) and language models. This feature will be utilized for disaggregated serving.
• New features for Embedding Models –
o Flexible Pooling configuration:
Users can specify popular pooling strategies via string identifiers or provide custom pooling methods.
It enables seamless integration of pooling at the end of the embedding model, offering flexibility for various use cases. Pooling will also run on AI 100 for improved performance.
Sample script
Added support for sentence embedding: with pooling added, Efficient-Transformers now enables direct sentence-embedding generation on AI 100, improving efficiency and semantic quality for downstream tasks.
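The flexible pooling interface described above — a string identifier or a custom callable — can be sketched in pure Python (conceptual only; the function names and dispatch scheme are hypothetical, not the library's actual API):

```python
# Mean pooling over valid (unmasked) token embeddings.
def mean_pool(token_embeddings, attention_mask):
    dims = len(token_embeddings[0])
    total = [0.0] * dims
    count = 0
    for emb, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + e for t, e in zip(total, emb)]
            count += 1
    return [t / count for t in total]

# Dispatch on a string identifier, or accept a user-supplied callable.
def pool(token_embeddings, attention_mask, strategy="mean"):
    if callable(strategy):
        return strategy(token_embeddings, attention_mask)
    if strategy == "mean":
        return mean_pool(token_embeddings, attention_mask)
    if strategy == "cls":
        return token_embeddings[0]  # first token's embedding
    raise ValueError(f"unknown pooling strategy: {strategy}")
```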
o Support for compilation with multiple sequence lengths.
Users can specify a single seq_len value or a list of values during compilation (example).
At generation time, the closest greater-or-equal seq_len graph in the QPC is auto-selected for optimal execution.
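The graph-selection rule above can be sketched as follows (a minimal illustration; the function name is hypothetical):

```python
# Pick the smallest compiled seq_len graph that still fits the prompt.
def select_seq_len(prompt_len: int, compiled_seq_lens: list[int]) -> int:
    candidates = [s for s in compiled_seq_lens if s >= prompt_len]
    if not candidates:
        raise ValueError(f"prompt length {prompt_len} exceeds all compiled graphs")
    return min(candidates)
```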
• Added support for On Device Sampling for CausalLM models.
o Sampling now runs directly on the QAIC device, reducing host-device communication and boosting inference throughput and scalability.
o Documentation and Usage guide.
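The kind of sampling moved onto the device (e.g. temperature-scaled top-k) looks conceptually like this host-side sketch (illustration only; the function name and signature are assumptions, and the real implementation runs as compiled device ops):

```python
import math
import random

# Temperature-scaled top-k sampling over a plain list of logits.
def sample_top_k(logits, k, temperature=1.0, rng=None):
    rng = rng or random.Random()
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # stable softmax weights
    return rng.choices(top, weights=weights, k=1)[0]
```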
• Added support for SwiftKV model (Snowflake/Llama-3.1-SwiftKV-8B-Instruct)
o Added support for both continuous and non-continuous batching execution in SwiftKV.
o Since the architecture for this model is LlamaSwiftKVForCausalLM, it can be executed using QEffAutoModelForCausalLM.
• Added support for execution of GGUF models (without quantized weights).
o Sample script.
• Added support for the compressed quantization format for FP8 models.
o Infermatic/Llama-3.3-70B-Instruct-FP8-Dynamic · Hugging Face
• QNN updates –
o Updated the QNN custom IO generation method to adhere to compiler changes.
o Added --target_backend AIC as a default parameter in the QNN Converter.
Fine Tuning
• Added BERT fine-tuning support (doc).
• Documentation and a code template to run fine-tuning on a custom dataset.
• Added a --help option describing the training parameters.
• Added support for gradient checkpointing in the fine-tuning script.
• Added support for passing the device type to torch GradScaler.
• Detailed documentation is available here.
Upcoming models
• Qwen3
• Mistral 3.1
Upcoming features
• Compute context length support (planned for 1.21.0).
• Support for passing an MDP file to the compiler during compilation (planned as a bug fix in 1.20.0).
• Upgrading the ONNX dependency is required to address a security vulnerability identified in the current version of ONNX.
o (onnx==1.18.0, onnxruntime==1.22, onnxscript==0.2.5, protobuf==6.31.0) (planned for 1.21.0)
• Support for -inf pad tokens for optimized softmax handling in the compiler (planned for 1.21.0).
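Setting padded positions to -inf before softmax gives them exactly zero probability, since exp(-inf) = 0. A minimal sketch of the numerics (conceptual only; the actual handling would be done by the compiler):

```python
import math

# Softmax with -inf masking: padded positions contribute exp(-inf) = 0,
# so they receive exactly zero probability mass.
def masked_softmax(scores, pad_mask):
    masked = [s if keep else float("-inf") for s, keep in zip(scores, pad_mask)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]
```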
release/v1.19.3
Added Features
- Vision Language Model
- Speech Sequence to Sequence Model
- Support for FP8 Execution
- Prompt-Lookup Decoding sample script.