HNSW Graph: Design Specification #277
* `ep` is **temporarily updated** at each layer to point to the best candidate from the previous layer; the global entry point remains unchanged.
* The algorithm mimics **INSERT's first phase** but without actually inserting a new element. You're just finding nearest neighbors.
* `ef` in K-NN-SEARCH could be treated as optional, defaulting to `efConstruction` if the user doesn't provide a value. The search would then automatically match the recall used during insertion, while still allowing users to override it for custom precision or speed.
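The layered descent described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `Graph` layout, `search_layer`, and the `ef_construction` default value are all assumptions made for the example.

```python
import heapq


class Graph:
    """Toy layered graph: vectors[node] is the point, layers[l][node] -> neighbors."""
    def __init__(self, vectors, layers):
        self.vectors = vectors
        self.layers = layers            # layers[0] is the base layer
        self.top_layer = len(layers) - 1


def dist(a, b):
    # Squared Euclidean distance; the ordering is the same as for Euclidean.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_layer(g, ep, q, ef, layer):
    """Best-first search within one layer, keeping at most ef candidates."""
    visited = {ep}
    d0 = dist(g.vectors[ep], q)
    candidates = [(d0, ep)]             # min-heap of nodes to expand
    best = [(-d0, ep)]                  # max-heap (negated) of current results
    while candidates:
        d, c = heapq.heappop(candidates)
        if d > -best[0][0]:
            break                       # nearest candidate is worse than the worst result
        for n in g.layers[layer].get(c, ()):
            if n in visited:
                continue
            visited.add(n)
            dn = dist(g.vectors[n], q)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, n))
                heapq.heappush(best, (-dn, n))
                if len(best) > ef:
                    heapq.heappop(best)
    return [n for _, n in sorted((-d, n) for d, n in best)]


def knn_search(g, entry_point, q, k, ef=None, ef_construction=16):
    # ef defaults to efConstruction, matching the recall used during insertion.
    if ef is None:
        ef = ef_construction
    ep = entry_point
    # Phase 1 (mirrors INSERT's first phase): greedy descent with ef=1 from the
    # top layer down to layer 1; ep is updated per layer, but the global entry
    # point itself is left unchanged.
    for layer in range(g.top_layer, 0, -1):
        ep = search_layer(g, ep, q, ef=1, layer=layer)[0]
    # Phase 2: ef-wide search at the base layer, then trim to k.
    return search_layer(g, ep, q, ef=ef, layer=0)[:k]
```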
We didn't seem to cover deletion of a node
Regarding deletion: the original HNSW paper does not describe any true node-deletion procedure, which leads me to assume full deletion isn’t supported. Given how interconnected the graph becomes, removing a node would effectively require rebuilding significant parts of the structure. The only related mechanism discussed is neighbor pruning through search-layer heuristics, not actual removal. We can mark full deletion as a TODO / requires research.
Not having deletion isn't particularly tenable, though; I agree we can mark it as a TODO.
To merge this in, we'd need to assure users that, after removing embeddings, searching the index won't turn up stale vectors.
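One common interim way to give that assurance is soft deletion: keep a tombstone set of removed ids and filter them out of query results, while still allowing tombstoned nodes to be traversed internally. This is a hedged sketch of that idea, not the project's actual plan; `raw_search` stands in for whatever underlying index search exists.

```python
def search_without_stale(raw_search, query, k, tombstones):
    """Over-fetch from the underlying index, then drop tombstoned ids
    so deleted embeddings never reach the caller."""
    # Fetch extra results so that filtering can still return k live nodes.
    results = raw_search(query, k + len(tombstones))
    return [node for node in results if node not in tombstones][:k]
```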
There's now an algorithm for delete; check it out.
If you're curious, this paper from Microsoft outlines a pretty solid deletion strategy that ensures search won't degrade after multiple rounds of insertions/deletions.
Thank you for this reference! I’ll check it out.
@nnethercott Thanks again for the reference! I went through it and it’s a solid approach for deletion.
For my implementation, I’m leaning toward a slightly different strategy: maintaining back-links for each node. This makes deletion straightforward, as we only need to remove references from incoming neighbors (back-links), keeping the operation localized. There’s a memory overhead, but for the speed and simplicity of deletions, it’s a trade-off I’m willing to accept for now.
I understand that this approach means insertions will involve more updates to other nodes, and I’m curious to see how costly that will be in practice. For now, I plan to stick with this approach and will consider reverting only if the drawbacks outweigh the benefits.
We’ll also test recall by potentially replicating the deletion experiment from the paper: randomly remove a percentage of nodes and reinsert them over multiple cycles, then observe whether search performance (recall) remains stable. This will allow us to compare how our back-link deletion strategy performs relative to the Delete Policies described in the article.
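A rough sketch of the back-link idea, under the assumption that each node keeps both an outgoing neighbor set and an incoming back-link set (the data layout and names here are illustrative, not the actual implementation):

```python
def delete_node(neighbors, backlinks, node):
    """Remove `node` using back-links: only the nodes that point at it are touched.

    `neighbors[x]` is x's set of outgoing edges; `backlinks[x]` is the set of
    nodes whose neighbor lists contain x (the memory overhead mentioned above).
    """
    # Every node referencing us is known via back-links, so the deletion
    # stays localized: no graph-wide scan is needed.
    for src in backlinks.pop(node, set()):
        neighbors[src].discard(node)
    # Symmetrically, drop our entries from our neighbors' back-link sets
    # to keep the reverse index consistent.
    for dst in neighbors.pop(node, set()):
        backlinks[dst].discard(node)
```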
This PR begins the work on #184