Releases: huggingface/text-embeddings-inference
v1.8.2
🔧 Fixed Intel MKL Support
Since Text Embeddings Inference (TEI) v1.7.0, Intel MKL support had been broken due to changes in the `candle` dependency. Neither `static-linking` nor `dynamic-linking` worked correctly, which caused models using Intel MKL on CPU to fail with errors such as: `Intel oneMKL ERROR: Parameter 13 was incorrect on entry to SGEMM`.
Starting with v1.8.2, this issue has been resolved by fixing how the `intel-mkl-src` dependency is defined. Both features, `static-linking` and `dynamic-linking` (the default), now work correctly, ensuring that the Intel MKL libraries are properly linked.
This issue occurred in the following scenarios:
- Users installing `text-embeddings-router` via `cargo` with the `--features mkl` flag (see the sketch after this list). Although `dynamic-linking` should have been used, it was not working as intended.
- Users relying on the CPU `Dockerfile` when running models without ONNX weights. In these cases, Safetensors weights were used with `candle` as the backend (with MKL optimizations) instead of `ort`.
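For reference, this is the affected install path; a minimal sketch assuming a local clone of the repository (the feature flag comes from the first scenario above):

```shell
# Install the router from a local clone with Intel MKL optimizations enabled;
# dynamic-linking is the default, use static-linking explicitly if preferred.
cargo install --path router --features mkl
```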
The following table shows the affected versions and containers:
| Version | Image |
|---|---|
| 1.7.0 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.0 |
| 1.7.1 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.1 |
| 1.7.2 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.2 |
| 1.7.3 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.3 |
| 1.7.4 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.4 |
| 1.8.0 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.0 |
| 1.8.1 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 |
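To pick up the fix, upgrade to the v1.8.2 container; the tag below is inferred from the `cpu-<version>` pattern in the table above:

```shell
# Pull the patched CPU image
docker pull ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.2
```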
More details: PR #715
Full Changelog: v1.8.1...v1.8.2
v1.8.1

Today, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.
- CPU:

```shell
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
    --model-id google/embeddinggemma-300m --dtype float32
```

- CPU with ONNX Runtime:

```shell
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
    --model-id onnx-community/embeddinggemma-300m-ONNX --dtype float32 --pooling mean
```

- NVIDIA CUDA:

```shell
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.1 \
    --model-id google/embeddinggemma-300m --dtype float32
```
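Once one of the containers above is running, embeddings can be requested over HTTP; a minimal sketch against the `/embed` route (the example sentence is ours):

```shell
# Query the running TEI container for embeddings
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```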
Notable Changes
- Add support for Gemma3 (text-only) architecture
- Intel updates to Synapse 1.21.3 and IPEX 2.8
- Extend ONNX Runtime support in `OrtBackend`:
  - Support `position_ids` and `past_key_values` as inputs
  - Handle `padding_side` and `pad_token_id`
What's Changed
- Adjust HPU warmup: use dummy inputs with shape more close to real scenario by @kaixuanliu in #689
- Add `extra_args` to `trufflehog` to exclude unverified results by @alvarobartt in #696
- Update GitHub templates & fix mentions to Text Embeddings Inference by @alvarobartt in #697
- Disable Flash Attention with `USE_FLASH_ATTENTION` by @alvarobartt in #692
- Add support for `position_ids` and `past_key_values` in `OrtBackend` by @alvarobartt in #700
- HPU upgrade to Synapse 1.21.3 by @kaixuanliu in #703
- Upgrade to IPEX 2.8 by @kaixuanliu in #702
- Parse `modules.json` to identify default `Dense` modules by @alvarobartt in #701
- Add `padding_side` and `pad_token_id` in `OrtBackend` by @alvarobartt in #705
- Update `docs/openapi.json` for v1.8.0 by @alvarobartt in #708
- Add Gemma3 architecture (text-only) by @alvarobartt in #711
- Update `version` to 1.8.1 by @alvarobartt in #712
Full Changelog: v1.8.0...v1.8.1
v1.8.0

Notable Changes
- Qwen3 support for 0.6B, 4B, and 8B on CPU and MPS, and FlashQwen3 on CUDA and Intel HPUs
- NomicBert MoE support
- JinaAI Re-Rankers V1 support
- Matryoshka Representation Learning (MRL)
- Dense layer module support (after pooling)
Note
Some of the aforementioned changes were already released within the patch versions on top of v1.7.0, while Matryoshka Representation Learning (MRL) and Dense layer module support are new and had not been released before.
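With MRL, embedding dimensionality can be reduced at request time; a minimal sketch, assuming a `dimensions` request parameter added alongside MRL support (the exact field name is our assumption, not confirmed by these notes):

```shell
# Request embeddings truncated to 256 dimensions via MRL
# (the `dimensions` field is an assumed parameter name)
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?", "dimensions": 256}' \
    -H 'Content-Type: application/json'
```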
What's Changed
- [Docs] Update quick tour by @NielsRogge in #574
- Update `README.md` and `supported_models.md` by @alvarobartt in #572
- Back with linting. by @Narsil in #577
- [Docs] Add cloud run example by @NielsRogge in #573
- Fixup by @Narsil in #578
- Fixing the tokenization routes token (offsets are in bytes, not in by @Narsil in #576
- Removing requirements file. by @Narsil in #585
- Removing candle-extensions to live on crates.io by @Narsil in #583
- Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in #586
- optimize the performance of FlashBert Path for HPU by @kaixuanliu in #575
- Revert "Removing requirements file. (#585)" by @Narsil in #588
- Get opentelemetry trace id from request headers by @kozistr in #425
- Add argument for configuring Prometheus port by @kozistr in #589
- Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in #591
- Fixing the CI (grpc path). by @Narsil in #593
- fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in #595
- enable flash mistral model for HPU device by @kaixuanliu in #594
- remove optimum-habana dependency by @kaixuanliu in #599
- Support NomicBert MoE by @kozistr in #596
- Remove duplicate short option '-p' to fix router executable by @cebtenzzre in #602
- Update `text-embeddings-router --help` output by @alvarobartt in #603
- Warmup padded models too. by @Narsil in #592
- Add support for JinaAI Re-Rankers V1 by @alvarobartt in #582
- Gte diffs by @Narsil in #604
- Fix the weight name in GTEClassificationHead by @kozistr in #606
- upgrade pytorch and ipex to 2.7 version by @kaixuanliu in #607
- upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in #608
- Patch DistilBERT variants with different weight keys by @alvarobartt in #614
- add offline modeling for model `jinaai/jina-embeddings-v2-base-code` to avoid `auto_map` to other repository by @kaixuanliu in #612
- Add mean pooling strategy for Modernbert classifier by @kwnath in #616
- Using serde for pool validation. by @Narsil in #620
- Preparing the update to 1.7.1 by @Narsil in #623
- Adding suggestions to fixing missing ONNX files. by @Narsil in #624
- Add `Qwen3Model` by @alvarobartt in #627
- Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in #631
- Add CPU support for Qwen3-Embedding models by @randomm in #632
- refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in #625
- Support Qwen3 w/ fp32 on GPU by @kozistr in #634
- Preparing the release. by @Narsil in #639
- Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in #641
- Fix Qwen3 by @kozistr in #646
- Add integration tests for Gaudi by @baptistecolle in #598
- Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in #648
- Fix FlashQwen3 by @kozistr in #650
- Make flake work on metal by @Narsil in #654
- Fixing metal backend. by @Narsil in #655
- Qwen3 hpu support by @kaixuanliu in #656
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #659
- Update `version` to 1.7.3 by @alvarobartt in #666
- Add last token pooling support for ORT. by @tpendragon in #664
- Fix Qwen3 Embedding Float16 DType by @tpendragon in #663
- Fix `fmt` by re-running `pre-commit` by @alvarobartt in #671
- Update `version` to 1.7.4 by @alvarobartt in #677
- Support MRL (Matryoshka Representation Learning) by @kozistr in #676
- Add `Dense` layer for `2_Dense/` modules by @alvarobartt in #660
- Update `version` to 1.8.0 by @alvarobartt in #686
New Contributors
- @NielsRogge made their first contribution in #574
- @cebtenzzre made their first contribution in #602
- @kwnath made their first contribution in #616
- @randomm made their first contribution in #632
- @lance-miles made their first contribution in #648
- @tpendragon made their first contribution in #664
Full Changelog: v1.7.0...v1.8.0
v1.7.4
Notable Changes
Qwen3 was not working correctly on CPU / MPS when sending batched requests with FP16 precision, due to the FP32 minimum value being downcast to FP16 (it is now manually set to the FP16 minimum value instead, since the FP32 minimum lies far outside the FP16 range), leading to `null` values, as well as a missing `to_dtype` call on the `attention_bias` when working with batches.
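Since `/embed` accepts a batch of inputs, the regression could be observed by comparing a batched request against single-sentence requests; a sketch, assuming an affected Qwen3 model served locally in FP16:

```shell
# On affected versions, the batched response could contain null values
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": ["What is Deep Learning?", "What is Machine Learning?"]}' \
    -H 'Content-Type: application/json'
```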
What's Changed
- Fix Qwen3 Embedding Float16 DType by @tpendragon in #663
- Fix `fmt` by re-running `pre-commit` by @alvarobartt in #671
- Update `version` to 1.7.4 by @alvarobartt in #677
Full Changelog: v1.7.3...v1.7.4
v1.7.3
Notable Changes
Qwen3 support was added for Intel HPU, and fixed for CPU / Metal / CUDA.
What's Changed
- Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in #641
- Fix Qwen3 by @kozistr in #646
- Add integration tests for Gaudi by @baptistecolle in #598
- Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in #648
- Fix FlashQwen3 by @kozistr in #650
- Make flake work on metal by @Narsil in #654
- Fixing metal backend. by @Narsil in #655
- Qwen3 hpu support by @kaixuanliu in #656
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #659
- Update `version` to 1.7.3 by @alvarobartt in #666
- Add last token pooling support for ORT. by @tpendragon in #664
New Contributors
- @lance-miles made their first contribution in #648
- @tpendragon made their first contribution in #664
Full Changelog: v1.7.2...v1.7.3
v1.7.2
Notable Changes
- Added support for Qwen3 embeddings
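For example, a Qwen3 embedding model can be served with the matching CPU image; a minimal sketch (the model id `Qwen/Qwen3-Embedding-0.6B` is an illustrative assumption, not taken from these notes):

```shell
# Serve a Qwen3 embedding model on CPU with the v1.7.2 image
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.2 \
    --model-id Qwen/Qwen3-Embedding-0.6B
```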
What's Changed
- Adding suggestions to fixing missing ONNX files. by @Narsil in #624
- Add `Qwen3Model` by @alvarobartt in #627
- Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in #631
- Add CPU support for Qwen3-Embedding models by @randomm in #632
- refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in #625
- Support Qwen3 w/ fp32 on GPU by @kozistr in #634
- Preparing the release. by @Narsil in #639
New Contributors
- @randomm made their first contribution in #632
Full Changelog: v1.7.1...v1.7.2
v1.7.1
What's Changed
- [Docs] Update quick tour by @NielsRogge in #574
- Update `README.md` and `supported_models.md` by @alvarobartt in #572
- Back with linting. by @Narsil in #577
- [Docs] Add cloud run example by @NielsRogge in #573
- Fixup by @Narsil in #578
- Fixing the tokenization routes token (offsets are in bytes, not in by @Narsil in #576
- Removing requirements file. by @Narsil in #585
- Removing candle-extensions to live on crates.io by @Narsil in #583
- Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in #586
- optimize the performance of FlashBert Path for HPU by @kaixuanliu in #575
- Revert "Removing requirements file. (#585)" by @Narsil in #588
- Get opentelemetry trace id from request headers by @kozistr in #425
- Add argument for configuring Prometheus port by @kozistr in #589
- Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in #591
- Fixing the CI (grpc path). by @Narsil in #593
- fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in #595
- enable flash mistral model for HPU device by @kaixuanliu in #594
- remove optimum-habana dependency by @kaixuanliu in #599
- Support NomicBert MoE by @kozistr in #596
- Remove duplicate short option '-p' to fix router executable by @cebtenzzre in #602
- Update `text-embeddings-router --help` output by @alvarobartt in #603
- Warmup padded models too. by @Narsil in #592
- Add support for JinaAI Re-Rankers V1 by @alvarobartt in #582
- Gte diffs by @Narsil in #604
- Fix the weight name in GTEClassificationHead by @kozistr in #606
- upgrade pytorch and ipex to 2.7 version by @kaixuanliu in #607
- upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in #608
- Patch DistilBERT variants with different weight keys by @alvarobartt in #614
- add offline modeling for model `jinaai/jina-embeddings-v2-base-code` to avoid `auto_map` to other repository by @kaixuanliu in #612
- Add mean pooling strategy for Modernbert classifier by @kwnath in #616
- Using serde for pool validation. by @Narsil in #620
- Preparing the update to 1.7.1 by @Narsil in #623
New Contributors
- @NielsRogge made their first contribution in #574
- @cebtenzzre made their first contribution in #602
- @kwnath made their first contribution in #616
Full Changelog: v1.7.0...v1.7.1
v1.7.0
Notable Changes
- Upgrade dependencies heavily (candle 0.5 -> 0.8 and related)
- Added ModernBert support by @kozistr!
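For instance, a ModernBERT-based embedding model can be served with this release; a minimal sketch (the model id `nomic-ai/modernbert-embed-base` is an illustrative assumption, not taken from these notes):

```shell
# Serve a ModernBERT-based embedding model on CPU with the v1.7.0 image
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.0 \
    --model-id nomic-ai/modernbert-embed-base
```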
What's Changed
- Moving cublaslt into TEI extension for easier upgrade of candle globally by @Narsil in #542
- Upgrade candle2 by @Narsil in #543
- Upgrade candle3 by @Narsil in #545
- Fixing the static-linking. by @Narsil in #547
- Fix linking bis by @Narsil in #549
- Make `sliding_window` for `Qwen2` optional by @alvarobartt in #546
- Optimize the performance of FlashBert on HPU by using fast mode softmax by @kaixuanliu in #555
- Fixing cudarc to the latest unified bindings. by @Narsil in #558
- Fix typos / formatting in CLI args in Markdown files by @alvarobartt in #552
- Use custom `serde` deserializer for JinaBERT models by @alvarobartt in #559
- Implement the `ModernBert` model by @kozistr in #459
- Fixing FlashAttention ModernBert. by @Narsil in #560
- Enable ModernBert on metal by @ivarflakstad in #562
- Fix `{Bert,DistilBert}SpladeHead` when loading from Safetensors by @alvarobartt in #564
- add related docs for intel cpu/xpu/hpu container by @kaixuanliu in #550
- Update the doc for submodule. by @Narsil in #567
- Update `docs/source/en/custom_container.md` by @alvarobartt in #568
- Preparing for release 1.7.0 (candle update + modernbert). by @Narsil in #570
New Contributors
- @ivarflakstad made their first contribution in #562
Full Changelog: v1.6.1...v1.7.0
v1.6.1
What's Changed
- Enable intel devices CPU/XPU/HPU for python backend by @yuanwu2017 in #245
- add reranker model support for python backend by @kaixuanliu in #386
- (FIX): CI Security Fix - branchname injection by @glegendre01 in #479
- Upgrade TEI. by @Narsil in #501
- Pin `cargo-chef` installation to 0.1.62 by @alvarobartt in #469
- add `TRUST_REMOTE_CODE` param to python backend. by @kaixuanliu in #485
- Enable splade embeddings for Python backend by @pi314ever in #493
- Hpu bucketing by @kaixuanliu in #489
- Optimize flash bert path for hpu device by @kaixuanliu in #509
- upgrade ipex to 2.6 version for cpu/xpu by @kaixuanliu in #510
- fix bug for `MaskedLanguageModel` class by @kaixuanliu in #513
- Fix double incrementing `te_request_count` metric by @kozistr in #486
- Add intel based images to the CI by @baptistecolle in #518
- Fix typo on intel docker image by @baptistecolle in #529
- chore: Upgrade to tokenizers 0.21.0 by @lightsofapollo in #512
- feat: add support for "model_type": "gte" by @anton-pt in #519
- Update `README.md` to include ONNX by @alvarobartt in #507
- Fusing both Gte Configs. by @Narsil in #530
- Add `HF_HUB_USER_AGENT_ORIGIN` by @alvarobartt in #534
- Use `--hf-token` instead of `--hf-api-token` by @alvarobartt in #535
- Fixing the tests. by @Narsil in #531
- Support classification head for DistilBERT by @kozistr in #487
- add CLI flag `disable-spans` to toggle span trace logging by @obloomfield in #481
- feat: support HF_ENDPOINT environment when downloading model by @StrayDragon in #505
- Small fixup. by @Narsil in #537
- Fix `VarBuilder` handling in GTE e.g. `gte-multilingual-reranker-base` by @Narsil in #538
- make a WA in case Bert model do not have `safetensor` file by @kaixuanliu in #515
- Add missing `match` on `onnx/model.onnx` download by @alvarobartt in #472
- Fixing the impure flake devShell to be able to run python code. by @Narsil in #539
- Prepare for release. by @Narsil in #540
New Contributors
- @yuanwu2017 made their first contribution in #245
- @kaixuanliu made their first contribution in #386
- @Narsil made their first contribution in #501
- @pi314ever made their first contribution in #493
- @baptistecolle made their first contribution in #518
- @lightsofapollo made their first contribution in #512
- @anton-pt made their first contribution in #519
- @obloomfield made their first contribution in #481
- @StrayDragon made their first contribution in #505
Full Changelog: v1.6.0...v1.6.1
v1.6.0
What's Changed
- feat: support multiple backends at the same time by @OlivierDehaene in #440
- feat: GTE classification head by @kozistr in #441
- feat: Implement GTE model to support the non-flash-attn version by @kozistr in #446
- feat: Implement MPNet model (#363) by @kozistr in #447
Full Changelog: v1.5.1...v1.6.0