
[FEAT]: (text splitter & chunking) (vector db) provide options to add prefixes for chunking and retrieval embeddings #3198

Open
alber70g opened this issue Feb 13, 2025 · 2 comments
Labels
enhancement, feature request

Comments


alber70g commented Feb 13, 2025

What would you like to see?

When using nomic-embed-text, the model requires a prefix on each input in order to produce correct embeddings, and the prefix differs depending on the purpose of the embedding.
For example, chunks can be prefixed with search_document: <chunk>, while the query used for retrieval from the vector database needs to be prefixed with search_query: <query>.
It would also help to separate the query that is embedded and sent to the vector database from the prompt we want to send to the LLM.

E.g.:
- an embedding query template that wraps the question, e.g. search_query: {{question}}
- a prompt template that receives the result of the embedding lookup, e.g.:

You're a helpful assistant that uses this context and only this context and no previous knowledge to answer the question mentioned after the context.

<context>
{{query_result}}
</context>

<question>
{{question}}
</question>
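A minimal sketch of how these two templates could fit together. The embed_query, search_vector_db, and call_llm helpers here are hypothetical placeholders, not existing AnythingLLM APIs:

```python
# Sketch only: embed_query, search_vector_db, and call_llm are hypothetical placeholders.
EMBED_QUERY_TEMPLATE = "search_query: {question}"

PROMPT_TEMPLATE = (
    "You're a helpful assistant that uses this context and only this context and no "
    "previous knowledge to answer the question mentioned after the context.\n\n"
    "<context>\n{query_result}\n</context>\n\n"
    "<question>\n{question}\n</question>"
)

def answer(question: str, embed_query, search_vector_db, call_llm) -> str:
    # 1. Wrap the question in the retrieval prefix before embedding it.
    query_embedding = embed_query(EMBED_QUERY_TEMPLATE.format(question=question))
    # 2. Fetch the most similar (already prefixed) chunks from the vector DB.
    chunks = search_vector_db(query_embedding, top_k=4)
    # 3. Build the LLM prompt from the retrieved context; no embedding prefix here.
    prompt = PROMPT_TEMPLATE.format(query_result="\n".join(chunks), question=question)
    return call_llm(prompt)
```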

As of now we can add the prefixes manually, by prepending the correct prefix to each chunk and to the prompt (assuming the prompt isn't already prefixed with something else), but it would be useful to have an input field that wraps the query with it automatically.

See also: https://huggingface.co/nomic-ai/nomic-embed-text-v1#usage
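For reference, the manual workaround looks roughly like this with sentence-transformers, following the usage shown in the linked model card (the example texts are made up):

```python
from sentence_transformers import SentenceTransformer

# nomic-embed-text-v1 expects a task prefix on every input (see the model card linked above).
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Chunks destined for the vector database get the search_document prefix...
doc_embeddings = model.encode([
    "search_document: AnythingLLM splits uploaded documents into chunks before embedding them.",
])

# ...while the retrieval query gets the search_query prefix.
query_embedding = model.encode("search_query: How are documents chunked?")
```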

@alber70g added the enhancement and feature request labels on Feb 13, 2025
@timothycarambat (Member)

Is this behavior unique to nomic-embed-text? I have not seen this on other embedding models before.

It is certainly something we can add to both querying and chunking/splitting, but I worry that exposing these details will confuse 99% of people into thinking they need to fill it out, resulting in worse embeddings.

@alber70g (Author)

No, it's not unique to nomic-embed-text. There are various embedding models that use a prefix to steer the embedding creation in a certain direction.

snowflake-arctic-embed-m and mixedbread-ai/mxbai-embed-large-v1 use "Represent this sentence for searching relevant passages: " for the query, but no prefix for the documents.

Nomic has others as well (see the aforementioned link)
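For comparison, a sketch of the convention used by the two models above, where only the query carries the instruction prefix and the documents are embedded as-is (the example texts are placeholders):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Documents are embedded without any prefix...
doc_embeddings = model.encode(["A chunked passage from the knowledge base."])

# ...and only the retrieval query is prepended with the instruction.
query_prefix = "Represent this sentence for searching relevant passages: "
query_embedding = model.encode(query_prefix + "How are documents chunked?")
```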

Someone put together a mini test and presented the following diagram; having the prefix really does seem to make a difference.

[Image: diagram from the mini test comparing retrieval with and without the prefix]
