Releases: huggingface/text-embeddings-inference
v1.8.2
🔧 Fixed Intel MKL Support
Since Text Embeddings Inference (TEI) v1.7.0, Intel MKL support had been broken due to changes in the `candle` dependency. Neither `static-linking` nor `dynamic-linking` worked correctly, which caused models using Intel MKL on CPU to fail with errors such as: `Intel oneMKL ERROR: Parameter 13 was incorrect on entry to SGEMM`.
Starting with v1.8.2, this issue has been resolved by fixing how the `intel-mkl-src` dependency is defined. Both features, `static-linking` and `dynamic-linking` (the default), now work correctly, ensuring that the Intel MKL libraries are properly linked.
This issue occurred in the following scenarios:
- Users installing `text-embeddings-router` via `cargo` with the `--features mkl` flag (see the sketch after this list). Although `dynamic-linking` should have been used, it was not working as intended.
- Users relying on the CPU `Dockerfile` when running models without ONNX weights. In these cases, Safetensors weights were used with `candle` as the backend (with MKL optimizations) instead of `ort`.
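For reference, this is the affected install path; a minimal sketch assuming a local clone of the repository (the feature flag comes from the first scenario above):

```shell
# Install the router from a local clone with Intel MKL optimizations enabled;
# dynamic-linking is the default, use static-linking explicitly if preferred.
cargo install --path router --features mkl
```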
The following table shows the affected versions and containers:
| Version | Image |
|---|---|
| 1.7.0 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.0 |
| 1.7.1 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.1 |
| 1.7.2 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.2 |
| 1.7.3 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.3 |
| 1.7.4 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.4 |
| 1.8.0 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.0 |
| 1.8.1 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 |
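To pick up the fix, upgrade to the v1.8.2 container; the tag below is inferred from the `cpu-<version>` pattern in the table above:

```shell
# Pull the patched CPU image
docker pull ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.2
```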
More details: PR #715
Full Changelog: v1.8.1...v1.8.2
v1.8.1

Today, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.
- CPU:

```shell
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
    --model-id google/embeddinggemma-300m --dtype float32
```

- CPU with ONNX Runtime:

```shell
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
    --model-id onnx-community/embeddinggemma-300m-ONNX --dtype float32 --pooling mean
```

- NVIDIA CUDA:

```shell
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.1 \
    --model-id google/embeddinggemma-300m --dtype float32
```
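Once one of the containers above is running, embeddings can be requested over HTTP; a minimal sketch against the `/embed` route (the example sentence is ours):

```shell
# Query the running TEI container for embeddings
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```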
Notable Changes
- Add support for Gemma3 (text-only) architecture
- Intel updates to Synapse 1.21.3 and IPEX 2.8
- Extend ONNX Runtime support in `OrtBackend`:
  - Support `position_ids` and `past_key_values` as inputs
  - Handle `padding_side` and `pad_token_id`
What's Changed
- Adjust HPU warmup: use dummy inputs with shape more close to real scenario by @kaixuanliu in #689
- Add `extra_args` to `trufflehog` to exclude unverified results by @alvarobartt in #696
- Update GitHub templates & fix mentions to Text Embeddings Inference by @alvarobartt in #697
- Disable Flash Attention with `USE_FLASH_ATTENTION` by @alvarobartt in #692
- Add support for `position_ids` and `past_key_values` in `OrtBackend` by @alvarobartt in #700
- HPU upgrade to Synapse 1.21.3 by @kaixuanliu in #703
- Upgrade to IPEX 2.8 by @kaixuanliu in #702
- Parse `modules.json` to identify default `Dense` modules by @alvarobartt in #701
- Add `padding_side` and `pad_token_id` in `OrtBackend` by @alvarobartt in #705
- Update `docs/openapi.json` for v1.8.0 by @alvarobartt in #708
- Add Gemma3 architecture (text-only) by @alvarobartt in #711
- Update `version` to 1.8.1 by @alvarobartt in #712
Full Changelog: v1.8.0...v1.8.1
v1.8.0

Notable Changes
- Qwen3 support for 0.6B, 4B, and 8B on CPU and MPS, and FlashQwen3 on CUDA and Intel HPUs
- NomicBert MoE support
- JinaAI Re-Rankers V1 support
- Matryoshka Representation Learning (MRL)
- Dense layer module support (after pooling)
Note
Some of the aforementioned changes were already released within the patch versions on top of v1.7.0, while Matryoshka Representation Learning (MRL) and Dense layer module support are new and had not been released before.
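With MRL, embedding dimensionality can be reduced at request time; a minimal sketch, assuming a `dimensions` request parameter added alongside MRL support (the exact field name is our assumption, not confirmed by these notes):

```shell
# Request embeddings truncated to 256 dimensions via MRL
# (the `dimensions` field is an assumed parameter name)
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?", "dimensions": 256}' \
    -H 'Content-Type: application/json'
```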
What's Changed
- [Docs] Update quick tour by @NielsRogge in #574
- Update `README.md` and `supported_models.md` by @alvarobartt in #572
- Back with linting. by @Narsil in #577
- [Docs] Add cloud run example by @NielsRogge in #573
- Fixup by @Narsil in #578
- Fixing the tokenization routes token (offsets are in bytes, not in by @Narsil in #576
- Removing requirements file. by @Narsil in #585
- Removing candle-extensions to live on crates.io by @Narsil in #583
- Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in #586
- optimize the performance of FlashBert Path for HPU by @kaixuanliu in #575
- Revert "Removing requirements file. (#585)" by @Narsil in #588
- Get opentelemetry trace id from request headers by @kozistr in #425
- Add argument for configuring Prometheus port by @kozistr in #589
- Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in #591
- Fixing the CI (grpc path). by @Narsil in #593
- fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in #595
- enable flash mistral model for HPU device by @kaixuanliu in #594
- remove optimum-habana dependency by @kaixuanliu in #599
- Support NomicBert MoE by @kozistr in #596
- Remove duplicate short option '-p' to fix router executable by @cebtenzzre in #602
- Update `text-embeddings-router --help` output by @alvarobartt in #603
- Warmup padded models too. by @Narsil in #592
- Add support for JinaAI Re-Rankers V1 by @alvarobartt in #582
- Gte diffs by @Narsil in #604
- Fix the weight name in GTEClassificationHead by @kozistr in #606
- upgrade pytorch and ipex to 2.7 version by @kaixuanliu in #607
- upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in #608
- Patch DistilBERT variants with different weight keys by @alvarobartt in #614
- add offline modeling for model `jinaai/jina-embeddings-v2-base-code` to avoid `auto_map` to other repository by @kaixuanliu in #612
- Add mean pooling strategy for Modernbert classifier by @kwnath in #616
- Using serde for pool validation. by @Narsil in #620
- Preparing the update to 1.7.1 by @Narsil in #623
- Adding suggestions to fixing missing ONNX files. by @Narsil in #624
- Add `Qwen3Model` by @alvarobartt in #627
- Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in #631
- Add CPU support for Qwen3-Embedding models by @randomm in #632
- refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in #625
- Support Qwen3 w/ fp32 on GPU by @kozistr in #634
- Preparing the release. by @Narsil in #639
- Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in #641
- Fix Qwen3 by @kozistr in #646
- Add integration tests for Gaudi by @baptistecolle in #598
- Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in #648
- Fix FlashQwen3 by @kozistr in #650
- Make flake work on metal by @Narsil in #654
- Fixing metal backend. by @Narsil in #655
- Qwen3 hpu support by @kaixuanliu in #656
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #659
- Update `version` to 1.7.3 by @alvarobartt in #666
- Add last token pooling support for ORT. by @tpendragon in #664
- Fix Qwen3 Embedding Float16 DType by @tpendragon in #663
- Fix `fmt` by re-running `pre-commit` by @alvarobartt in #671
- Update `version` to 1.7.4 by @alvarobartt in #677
- Support MRL (Matryoshka Representation Learning) by @kozistr in #676
- Add `Dense` layer for `2_Dense/` modules by @alvarobartt in #660
- Update `version` to 1.8.0 by @alvarobartt in #686
New Contributors
- @NielsRogge made their first contribution in #574
- @cebtenzzre made their first contribution in #602
- @kwnath made their first contribution in #616
- @randomm made their first contribution in #632
- @lance-miles made their first contribution in #648
- @tpendragon made their first contribution in #664
Full Changelog: v1.7.0...v1.8.0
v1.7.4
Notable Changes
Qwen3 was not working correctly on CPU / MPS when sending batched requests with FP16 precision, due to the FP32 minimum value being downcast to FP16 (it is now manually set to the FP16 minimum value instead, since the FP32 minimum lies far outside the FP16 range), leading to `null` values, as well as a missing `to_dtype` call on the `attention_bias` when working with batches.
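Since `/embed` accepts a batch of inputs, the regression could be observed by comparing a batched request against single-sentence requests; a sketch, assuming an affected Qwen3 model served locally in FP16:

```shell
# On affected versions, the batched response could contain null values
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": ["What is Deep Learning?", "What is Machine Learning?"]}' \
    -H 'Content-Type: application/json'
```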
What's Changed
- Fix Qwen3 Embedding Float16 DType by @tpendragon in #663
- Fix `fmt` by re-running `pre-commit` by @alvarobartt in #671
- Update `version` to 1.7.4 by @alvarobartt in #677
Full Changelog: v1.7.3...v1.7.4
v1.7.3
Notable Changes
Qwen3 support was added for Intel HPU, and fixed for CPU / Metal / CUDA.
What's Changed
- Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in #641
- Fix Qwen3 by @kozistr in #646
- Add integration tests for Gaudi by @baptistecolle in #598
- Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in #648
- Fix FlashQwen3 by @kozistr in #650
- Make flake work on metal by @Narsil in #654
- Fixing metal backend. by @Narsil in #655
- Qwen3 hpu support by @kaixuanliu in #656
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #659
- Update `version` to 1.7.3 by @alvarobartt in #666
- Add last token pooling support for ORT. by @tpendragon in #664
New Contributors
- @lance-miles made their first contribution in #648
- @tpendragon made their first contribution in #664
Full Changelog: v1.7.2...v1.7.3
v1.7.2
Notable Changes
- Added support for Qwen3 embeddings
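For example, a Qwen3 embedding model can be served with the matching CPU image; a minimal sketch (the model id `Qwen/Qwen3-Embedding-0.6B` is an illustrative assumption, not taken from these notes):

```shell
# Serve a Qwen3 embedding model on CPU with the v1.7.2 image
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.2 \
    --model-id Qwen/Qwen3-Embedding-0.6B
```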
What's Changed
- Adding suggestions to fixing missing ONNX files. by @Narsil in #624
- Add `Qwen3Model` by @alvarobartt in #627
- Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in #631
- Add CPU support for Qwen3-Embedding models by @randomm in #632
- refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in #625
- Support Qwen3 w/ fp32 on GPU by @kozistr in #634
- Preparing the release. by @Narsil in #639
New Contributors
- @randomm made their first contribution in #632
Full Changelog: v1.7.1...v1.7.2
v1.7.1
What's Changed
- [Docs] Update quick tour by @NielsRogge in #574
- Update `README.md` and `supported_models.md` by @alvarobartt in #572
- Back with linting. by @Narsil in #577
- [Docs] Add cloud run example by @NielsRogge in #573
- Fixup by @Narsil in #578
- Fixing the tokenization routes token (offsets are in bytes, not in by @Narsil in #576
- Removing requirements file. by @Narsil in #585
- Removing candle-extensions to live on crates.io by @Narsil in #583
- Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in #586
- optimize the performance of FlashBert Path for HPU by @kaixuanliu in #575
- Revert "Removing requirements file. (#585)" by @Narsil in #588
- Get opentelemetry trace id from request headers by @kozistr in #425
- Add argument for configuring Prometheus port by @kozistr in #589
- Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in #591
- Fixing the CI (grpc path). by @Narsil in #593
- fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in #595
- enable flash mistral model for HPU device by @kaixuanliu in #594
- remove optimum-habana dependency by @kaixuanliu in #599
- Support NomicBert MoE by @kozistr in #596
- Remove duplicate short option '-p' to fix router executable by @cebtenzzre in #602
- Update `text-embeddings-router --help` output by @alvarobartt in #603
- Warmup padded models too. by @Narsil in #592
- Add support for JinaAI Re-Rankers V1 by @alvarobartt in #582
- Gte diffs by @Narsil in #604
- Fix the weight name in GTEClassificationHead by @kozistr in #606
- upgrade pytorch and ipex to 2.7 version by @kaixuanliu in #607
- upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in #608
- Patch DistilBERT variants with different weight keys by @alvarobartt in #614
- add offline modeling for model `jinaai/jina-embeddings-v2-base-code` to avoid `auto_map` to other repository by @kaixuanliu in #612
- Add mean pooling strategy for Modernbert classifier by @kwnath in #616
- Using serde for pool validation. by @Narsil in #620
- Preparing the update to 1.7.1 by @Narsil in #623
New Contributors
- @NielsRogge made their first contribution in #574
- @cebtenzzre made their first contribution in #602
- @kwnath made their first contribution in #616
Full Changelog: v1.7.0...v1.7.1
v1.7.0
Notable Changes
- Upgrade dependencies heavily (candle 0.5 -> 0.8 and related)
- Added ModernBert support by @kozistr!
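For instance, a ModernBERT-based embedding model can be served with this release; a minimal sketch (the model id `nomic-ai/modernbert-embed-base` is an illustrative assumption, not taken from these notes):

```shell
# Serve a ModernBERT-based embedding model on CPU with the v1.7.0 image
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.0 \
    --model-id nomic-ai/modernbert-embed-base
```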
What's Changed
- Moving cublaslt into TEI extension for easier upgrade of candle globally by @Narsil in #542
- Upgrade candle2 by @Narsil in #543
- Upgrade candle3 by @Narsil in #545
- Fixing the static-linking. by @Narsil in #547
- Fix linking bis by @Narsil in #549
- Make `sliding_window` for `Qwen2` optional by @alvarobartt in #546
- Optimize the performance of FlashBert on HPU by using fast mode softmax by @kaixuanliu in #555
- Fixing cudarc to the latest unified bindings. by @Narsil in #558
- Fix typos / formatting in CLI args in Markdown files by @alvarobartt in #552
- Use custom `serde` deserializer for JinaBERT models by @alvarobartt in #559
- Implement the `ModernBert` model by @kozistr in #459
- Fixing FlashAttention ModernBert. by @Narsil in #560
- Enable ModernBert on metal by @ivarflakstad in #562
- Fix `{Bert,DistilBert}SpladeHead` when loading from Safetensors by @alvarobartt in #564
- add related docs for intel cpu/xpu/hpu container by @kaixuanliu in #550
- Update the doc for submodule. by @Narsil in #567
- Update `docs/source/en/custom_container.md` by @alvarobartt in #568
- Preparing for release 1.7.0 (candle update + modernbert). by @Narsil in #570
New Contributors
- @ivarflakstad made their first contribution in #562
Full Changelog: v1.6.1...v1.7.0
v1.6.1
What's Changed
- Enable intel devices CPU/XPU/HPU for python backend by @yuanwu2017 in #245
- add reranker model support for python backend by @kaixuanliu in #386
- (FIX): CI Security Fix - branchname injection by @glegendre01 in #479
- Upgrade TEI. by @Narsil in #501
- Pin `cargo-chef` installation to 0.1.62 by @alvarobartt in #469
- add `TRUST_REMOTE_CODE` param to python backend. by @kaixuanliu in #485
- Enable splade embeddings for Python backend by @pi314ever in #493
- Hpu bucketing by @kaixuanliu in #489
- Optimize flash bert path for hpu device by @kaixuanliu in #509
- upgrade ipex to 2.6 version for cpu/xpu by @kaixuanliu in #510
- fix bug for `MaskedLanguageModel` class by @kaixuanliu in #513
- Fix double incrementing `te_request_count` metric by @kozistr in #486
- Add intel based images to the CI by @baptistecolle in #518
- Fix typo on intel docker image by @baptistecolle in #529
- chore: Upgrade to tokenizers 0.21.0 by @lightsofapollo in #512
- feat: add support for "model_type": "gte" by @anton-pt in #519
- Update `README.md` to include ONNX by @alvarobartt in #507
- Fusing both Gte Configs. by @Narsil in #530
- Add `HF_HUB_USER_AGENT_ORIGIN` by @alvarobartt in #534
- Use `--hf-token` instead of `--hf-api-token` by @alvarobartt in #535
- Fixing the tests. by @Narsil in #531
- Support classification head for DistilBERT by @kozistr in #487
- add CLI flag `disable-spans` to toggle span trace logging by @obloomfield in #481
- feat: support HF_ENDPOINT environment when downloading model by @StrayDragon in #505
- Small fixup. by @Narsil in #537
- Fix `VarBuilder` handling in GTE e.g. `gte-multilingual-reranker-base` by @Narsil in #538
- make a WA in case Bert model do not have `safetensor` file by @kaixuanliu in #515
- Add missing `match` on `onnx/model.onnx` download by @alvarobartt in #472
- Fixing the impure flake devShell to be able to run python code. by @Narsil in #539
- Prepare for release. by @Narsil in #540
New Contributors
- @yuanwu2017 made their first contribution in #245
- @kaixuanliu made their first contribution in #386
- @Narsil made their first contribution in #501
- @pi314ever made their first contribution in #493
- @baptistecolle made their first contribution in #518
- @lightsofapollo made their first contribution in #512
- @anton-pt made their first contribution in #519
- @obloomfield made their first contribution in #481
- @StrayDragon made their first contribution in #505
Full Changelog: v1.6.0...v1.6.1
v1.6.0
What's Changed
- feat: support multiple backends at the same time by @OlivierDehaene in #440
- feat: GTE classification head by @kozistr in #441
- feat: Implement GTE model to support the non-flash-attn version by @kozistr in #446
- feat: Implement MPNet model (#363) by @kozistr in #447
Full Changelog: v1.5.1...v1.6.0