Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filtering based on expressions #152

Open
patelprateek opened this issue Aug 27, 2022 · 5 comments
Open

Filtering based on expressions #152

patelprateek opened this issue Aug 27, 2022 · 5 comments

Comments

@patelprateek
Copy link

New user here so please bear with me some naive questions.

Is it possible to have some user code , math expression based filtering while doing graph search (hnsw)

I read documentation , can you provide some guidance or pointer on why filtering query language was based on mongodb query operators , any merits here as opposed to supporting sql operators supported by different databases ?

Looking at the code , i observe filtering done by binary_fuse_filter , could you please provide some details on if the filtering is pre-filtering or post retrieval filtering or during graph traversal filtering ? are their any engineering blog or documentation or the filtering strategy that you can point me to , wondering how canwe extend for other expressions based filtering

@numb3r3
Copy link
Member

numb3r3 commented Aug 29, 2022

Thanks for your interest.

why filtering query language was based on mongodb query operators , any merits here as opposed to supporting sql operators supported by different databases ?

For the query, we use MongoDB query as the query language. At the runtime, the mongo query will be transferred to SQL clauses (we implemented a simple converting function in annlite) which will be applied to filter the whole data. Here, we want to provide an easy way to construct the query. SQL-based query needs the user to know all of the information about the table name (it is an internal variable, not visible to the general user) and column names. And what's more, mongodb-style query is a Dict, which is widely used by modern KV database (ES, MongoDB, ...).

could you please provide some details on if the filtering is pre-filtering or post retrieval filtering or during graph traversal filtering ?

It is graph traversal filtering indeed. Each node is marked with bool value to indicate whether it matches the query conditions. At the hnsw graph search time, the same traversal search algorithm is performed to find the closed-neighors, and only the valid node is considered as the candidate result to be returned (here, it seems like a pre-filtering strategy).

are their any engineering blog or documentation or the filtering strategy that you can point me to , wondering how canwe extend for other expressions based filtering

For the technical design documentation, we are working on them. And what's more, which expression do you want to use for filtering in your case?

@patelprateek
Copy link
Author

Thanks for the info. the use case i am thinking is evaluating some expression like distance < 0.xyz and metadata_color = "red" . I was under the impressions that the filter is constructed first by running the operators over some indexed data generating a small fuse filter of allowed/blocklisted candidates , and then during traversal we are just doing a probabilistic check ? we are not evaluating operators on neighbours during graph search but rather checking membership ?

@numb3r3
Copy link
Member

numb3r3 commented Sep 2, 2022

we are using allowed/blocklisted candidates, and check membership during graph search.

@bandhakavi
Copy link

Hi, great discussion! During searching with filters are there any chances of a disconnected graph cropping up - say when all the neighbors of a node are invalidated by a filter?

It is graph traversal filtering indeed. Each node is marked with bool value to indicate whether it matches the query conditions. At the hnsw graph search time, the same traversal search algorithm is performed to find the closed-neighors, and only the valid node is considered as the candidate result to be returned (here, it seems like a pre-filtering strategy).

@numb3r3
Copy link
Member

numb3r3 commented Nov 4, 2022

For your question, actually, all of the graph nodes are considered valid during graph traversal. Only the nodes which match with the filter will be considered as the result candidates. Thus, there is no disconnected graph problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants