This semantic search application combines BM25 with SPLADE for re-ranking, demonstrating the splade-embedder. The sample application uses the Apache 2.0 licensed SPLADE model checkpoint prithivida/Splade_PP_en_v1.
The original SPLADE repo and model checkpoints have restrictive licenses:
- HF model checkpoint naver/splade-v3
- GitHub naver splade repo naver/splade
There is a growing number of independent open-source sparse encoder checkpoints that are compatible with the Vespa splade embedder implementation:
- opensearch-project/opensearch-neural-sparse-encoding-v1
- Neural-Cherche is a library designed to fine-tune neural search models such as Splade
See exporting fill-mask language models to ONNX format.
Requires at least Vespa 8.320.68
Follow Vespa getting started through the vespa deploy step, cloning splade instead of album-recommendation.
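For reference, a minimal sketch of those steps (assuming the sample app is published under the name splade and a local Vespa instance is already running):

$ vespa clone splade my-splade-app && cd my-splade-app
$ vespa deploy --wait 300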
$ vespa feed ext/*.json
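To spot-check that the feed succeeded, you can fetch one of the documents by id (id:doc:doc::3 is one of the documents used in the examples below):

$ vespa document get id:doc:doc::3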
We demonstrate queries using the vespa-cli tool; use -v to see the curl equivalent using the HTTP query API.
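For example, the -v flag prints the equivalent curl command before running the query:

$ vespa query -v 'query=stars' \
    'input.query(q)=embed(splade,@query)'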
$ vespa query 'query=stars' \
    'input.query(q)=embed(splade,@query)' \
    'presentation.format.tensors=short-value'
This produces the following hit output:
{
  "id": "id:doc:doc::3",
  "relevance": 32.3258056640625,
  "source": "text",
  "fields": {
    "matchfeatures": {
      "bm25(chunk)": 1.041708310095213,
      "bm25(title)": 0.9808292530117263,
      "query(q)": {
        "who": 1.1171875,
        "star": 2.828125,
        "stars": 2.875,
        "sky": 0.9375,
        "planet": 0.828125
      },
      "chunk_token_scores": {
        "star": 7.291259765625,
        "stars": 7.771484375,
        "planet": 0.7310791015625
      },
      "title_token_scores": {
        "star": 8.086669921875,
        "stars": 8.4453125
      }
    },
    "sddocname": "doc",
    "documentid": "id:doc:doc::3",
    "splade_chunk_embedding": {
      "the": 0.84375,
      "with": 1.3671875,
      "star": 2.578125,
      "stars": 2.703125,
      "filled": 2.171875,
      "planet": 0.8828125,
      "universe": 1.4296875,
      "fill": 2.03125,
      "filling": 1.5546875,
      "galaxy": 2.765625,
      "galaxies": 1.7265625
    },
    "splade_title_embedding": {
      "about": 1.984375,
      "star": 2.859375,
      "stars": 2.9375,
      "documents": 1.8671875,
      "starred": 0.81640625,
      "document": 2.671875,
      "concerning": 0.8671875
    },
    "title": "A document about stars",
    "chunk": "The galaxy is filled with stars"
  }
}
The rank-profile used here is default, specified in the schemas/doc.sd file.
It includes a match-features configuration specifying tensor and rank-features we want to return with each hit. We have:
- bm25(title) - the bm25 score of the (query, title) pair
- bm25(chunk) - the bm25 score of the (query, chunk) pair
- query(q) - the splade query tensor produced by the embedder, with all the tokens and their corresponding weights
- splade_chunk_embedding - the mapped tensor produced by the embedder at indexing time (chunk)
- splade_title_embedding - the mapped tensor produced by the embedder at indexing time (title)
- chunk_token_scores - the non-zero overlap between the mapped query tensor and the mapped chunk tensor
- title_token_scores - same as above, but for the title
The last two outputs allow us to highlight the terms of the source text for explainability.
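Note how, in the output above, each entry in chunk_token_scores and title_token_scores is the product of the query token weight and the corresponding document token weight; for example, star in chunk_token_scores is 2.828125 * 2.578125 = 7.291259765625. The relevance score (32.3258056640625) matches the sum of these per-token products over both fields, i.e. the two sparse dot products.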
Note that this application sets a high term-score-threshold to reduce the output verbosity. This setting controls which tokens are retained and used in the dot product calculation(s). A higher threshold increases sparseness, which reduces computational complexity, but can also reduce accuracy.
$ vespa query 'query=boats' \
    'input.query(q)=embed(splade,@query)' \
    'presentation.format.tensors=short-value'

$ vespa query 'query=humans talk a lot' \
    'input.query(q)=embed(splade,@query)' \
    'presentation.format.tensors=short-value'
Note that in this sample application, Vespa does not use the expanded sparse learned weights for retrieval (matching).
Instead, SPLADE is used in a phased ranking pipeline where we retrieve efficiently using Vespa's weakAnd algorithm with BM25.
This phased ranking pipeline considerably speeds up retrieval compared to matching on the lexical expansion.
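For illustration, the retrieval step can be written out explicitly with YQL (a sketch only; the exact query produced by the simpler form above depends on the default query handling):

$ vespa query 'yql=select * from doc where userQuery()' \
    'query=stars' \
    'type=weakAnd' \
    'input.query(q)=embed(splade,@query)' \
    'presentation.format.tensors=short-value'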
It's also possible to retrieve/query using the wand vespa query operator. See an example in the documentation about using the wand.
We can also brute-force score and rank all documents that match a filter; this can be accelerated by using multiple search threads per query.
$ vespa query 'yql=select * from doc where true' \
    'input.query(q)=embed(night sky of stars)' \
    'presentation.format.tensors=short-value'
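As a sketch, the number of search threads used for such a brute-force query can be set with the ranking.matching.numThreadsPerSearch query parameter; note that the effective value is typically capped by the rank profile's num-threads-per-search setting:

$ vespa query 'yql=select * from doc where true' \
    'input.query(q)=embed(night sky of stars)' \
    'ranking.matching.numThreadsPerSearch=4' \
    'presentation.format.tensors=short-value'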
For longer contexts using array inputs, see the tensor playground example for scoring options: playground splade tensors in ranking.
To export a model trained with fill-mask (compatible with the splade-embedder):
$ pip3 install optimum onnx
Export the model using the optimum-cli with task fill-mask:
$ optimum-cli export onnx --task fill-mask --model the-splade-model-id models
Remove the exported model files that are not needed by Vespa:
$ find models/ -type f ! -name 'model.onnx' ! -name 'tokenizer.json' | xargs rm
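After this, only the files the splade-embedder needs should remain (expected listing, assuming the export directory above):

$ ls models/
model.onnx  tokenizer.json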
This is only relevant when running this sample application locally. Remove the container after use:
$ docker rm -f vespa