
[FEAT]: (text splitter & chunking) (vector db) provide options to add prefixes for chunking and retrieval embeddings #3198

Open
alber70g opened this issue Feb 13, 2025 · 2 comments
Labels
enhancement, feature request

Comments


alber70g commented Feb 13, 2025

What would you like to see?

When using nomic-embed-text, the model requires a prefix on each input in order to produce correct embeddings, and the prefix differs depending on the purpose of the embedding.
For example, chunks can be prefixed with search_document: <chunk>, while the query used for retrieval from the vector database needs to be prefixed with search_query: <query>.
It would also help to separate the query that is embedded and sent to the vector database from the prompt we want to send to the LLM.

E.g.:
- an embedding query template that wraps the question, e.g. search_query: {{question}}
- a prompt template that receives the result of the embedding lookup, e.g.:

You're a helpful assistant that uses this context and only this context and no previous knowledge to answer the question mentioned after the context.

<context>
{{query_result}}
</context>

<question>
{{question}}
</question>
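A minimal sketch of how these two templates could fit together. The embed_query, search_vector_db, and call_llm helpers here are hypothetical placeholders, not existing AnythingLLM APIs:

```python
# Sketch only: embed_query, search_vector_db, and call_llm are hypothetical placeholders.
EMBED_QUERY_TEMPLATE = "search_query: {question}"

PROMPT_TEMPLATE = (
    "You're a helpful assistant that uses this context and only this context and no "
    "previous knowledge to answer the question mentioned after the context.\n\n"
    "<context>\n{query_result}\n</context>\n\n"
    "<question>\n{question}\n</question>"
)

def answer(question: str, embed_query, search_vector_db, call_llm) -> str:
    # 1. Wrap the question in the retrieval prefix before embedding it.
    query_embedding = embed_query(EMBED_QUERY_TEMPLATE.format(question=question))
    # 2. Fetch the most similar (already prefixed) chunks from the vector DB.
    chunks = search_vector_db(query_embedding, top_k=4)
    # 3. Build the LLM prompt from the retrieved context; no embedding prefix here.
    prompt = PROMPT_TEMPLATE.format(query_result="\n".join(chunks), question=question)
    return call_llm(prompt)
```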

As of now we can add the prefixes manually, by prepending the correct prefix to each chunk and to the prompt (assuming the prompt isn't already prefixed with something else), but it would be useful to have an input field that wraps the query with it automatically.

See also: https://huggingface.co/nomic-ai/nomic-embed-text-v1#usage
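For reference, the manual workaround looks roughly like this with sentence-transformers, following the usage shown in the linked model card (the example texts are made up):

```python
from sentence_transformers import SentenceTransformer

# nomic-embed-text-v1 expects a task prefix on every input (see the model card linked above).
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Chunks destined for the vector database get the search_document prefix...
doc_embeddings = model.encode([
    "search_document: AnythingLLM splits uploaded documents into chunks before embedding them.",
])

# ...while the retrieval query gets the search_query prefix.
query_embedding = model.encode("search_query: How are documents chunked?")
```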

@alber70g added the enhancement and feature request labels on Feb 13, 2025
@timothycarambat (Member)

Is this behavior unique to nomic-embed-text? I have not seen this on other embedding models before.

It is certainly something we can add to both querying and chunking/splitting, but I worry that exposing these details will confuse 99% of people into thinking they need to fill it out, resulting in worse embeddings.

@alber70g (Author)

No, it's not unique to nomic-embed-text. There are various embedding models that use a prefix to steer the embedding creation in a certain direction.

snowflake-arctic-embed-m and mixedbread-ai/mxbai-embed-large-v1 use "Represent this sentence for searching relevant passages: " for the query, but no prefix for the documents.

Nomic has others as well (see the aforementioned link)
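For comparison, a sketch of the convention used by the two models above, where only the query carries the instruction prefix and the documents are embedded as-is (the example texts are placeholders):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Documents are embedded without any prefix...
doc_embeddings = model.encode(["A chunked passage from the knowledge base."])

# ...and only the retrieval query is prepended with the instruction.
query_prefix = "Represent this sentence for searching relevant passages: "
query_embedding = model.encode(query_prefix + "How are documents chunked?")
```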

Someone put together a mini test and presented the following diagram; having the prefix really does seem to make a difference.

[Image: diagram from the mini test comparing retrieval with and without the prefix]
