introduce optimized nndescent to build graph index #272

wxyucs · 2024-12-27T07:39:53Z

No description provided.

inabao · 2024-12-31T08:43:17Z

Algorithm Design and Related Issues

Building a proximal graph for search involves addressing two key issues: finding approximate K-nearest neighbors (KNN) and ensuring graph pruning while maintaining connectivity. For the former, NSG and NNDescent primarily obtain an approximate KNN graph through iterative methods, while HNSW and Vamana derive an optimal KNN graph mainly through search methods. For the latter, HNSW and Vamana ensure the reachability of each point by inserting reverse edges during the search process, whereas NSG ensures reachability through searches by constructing a tree and connecting each search node to the tree.

Despite this, the aforementioned methods still have some issues:

There are no constraints on the in-degree of nodes, which can result in some points having an in-degree of zero.
The edge pruning during the addition of reverse edges involves redundant calculations.
Concurrency performance is suboptimal.
Therefore, we aim to propose a new graph construction algorithm called Optimized NNDescent (ODescent) to explicitly address the above issues.

Algorithm Implementation

The algorithm mainly consists of three parts:

Finding the nearest neighbors through iterative methods.
The method for finding nearest neighbors relies on NNDescent, which iteratively refines the results.
In each iteration, checking the in-degree to ensure that no point has an in-degree below a specified threshold.
If a point A's in-degree falls below the threshold, point A replaces the furthest neighbor of its nearest neighbor B.
Pruning and adding reverse edges.
Edges are pruned based on the given parameter α, and reverse edges are connected to further ensure connectivity.
Experimental Results
GIST

#./build-release/tools/test_performance     '/tmp/dataset/gist-1m-960-euclidean.hdf5'     'build' 'diskann' '{"dim": 960, "dtype": "float32", "metric_type": "l2", "hnsw": {"max_degree": 12, "ef_construction": 100}, "diskann": {"max_degree": 24, "ef_construction": 100, "graph_type":"vamana", "alpha":1.2, "turn": 20, "pq_dims": 64, "pq_sample_rate": 0.1}}' '{"hnsw":{"ef_search":100},"diskann":{"ef_search":100,"beam_search":4,"io_limit":200,"use_reorder":true}}'
[2024-12-31 13:18:02.090] [info] diskann build full (graph) cost 2141.208s
[2024-12-31 13:20:05.831] [info] diskann build full (pq) cost 123.741s
[2024-12-31 13:20:11.570] [info] diskann build full (disk layout) cost 5.740s
[2024-12-31 13:20:19.164] [info] diskann serialize cost 2.466s
{
    "build_parameters": "{\"dim\": 960, \"dtype\": \"float32\", \"metric_type\": \"l2\", \"hnsw\": {\"max_degree\": 12, \"ef_construction\": 100}, \"diskann\": {\"max_degree\": 24, \"ef_construction\": 100, \"graph_type\":\"vamana\", \"alpha\":1.2, \"turn\": 20, \"pq_dims\": 64, \"pq_sample_rate\": 0.1}}",
    "build_time_in_second": 2275.817019665,
    "dataset": "/tmp/dataset/gist-1m-960-euclidean.hdf5",
    "index_name": "diskann",
    "num_base": 1000000,
    "tps": 439.40263710093876
}





#./build-release/tools/test_performance     '/tmp/dataset/gist-1m-960-euclidean.hdf5'     'build' 'diskann' '{"dim": 960, "dtype": "float32", "metric_type": "l2", "hnsw": {"max_degree": 12, "ef_construction": 100}, "diskann": {"max_degree": 24, "ef_construction": 100, "graph_type":"odescent", "alpha":1.2, "turn": 20, "pq_dims": 64, "pq_sample_rate": 0.1}}' '{"hnsw":{"ef_search":100},"diskann":{"ef_search":100,"beam_search":4,"io_limit":200,"use_reorder":true}}'
[2024-12-31 14:45:13.505] [info] odescent build full (graph) cost 2170.382s
[2024-12-31 14:46:57.184] [info] diskann build full (pq) cost 103.679s
[2024-12-31 14:47:00.827] [info] diskann build full (disk layout) cost 3.643s
[2024-12-31 14:47:04.318] [info] diskann serialize cost 921.343ms
{
    "build_parameters": "{\"dim\": 960, \"dtype\": \"float32\", \"metric_type\": \"l2\", \"hnsw\": {\"max_degree\": 12, \"ef_construction\": 100}, \"diskann\": {\"max_degree\": 24, \"ef_construction\": 100, \"graph_type\":\"odescent\", \"alpha\":1.2, \"turn\": 20, \"pq_dims\": 64, \"pq_sample_rate\": 0.1}}",
    "build_time_in_second": 2280.273783652,
    "dataset": "/tmp/dataset/gist-1m-960-euclidean.hdf5",
    "index_name": "diskann",
    "num_base": 1000000,
    "tps": 438.543830644072
}

#./build-release/tools/test_performance     '/tmp/dataset/gist-1m-960-euclidean.hdf5'     'search' 'diskann' '{"dim": 960, "dtype": "float32", "metric_type": "l2", "hnsw": {"max_degree": 12, "ef_construction": 100}, "diskann": {"max_degree": 24, "ef_construction": 100, "graph_type":"vamana", "alpha":1.2, "turn": 20, "pq_dims": 64, "pq_sample_rate": 0.1}}' '{"hnsw":{"ef_search":100},"diskann":{"ef_search":100,"beam_search":4,"io_limit":200,"use_reorder":true}}'
[2024-12-31 14:48:47.351] [info] diskann deserialize cost 174.976ms
{
    "correct": 903,
    "dataset": "/tmp/dataset/gist-1m-960-euclidean.hdf5",
    "estimate_used_memory": 74952960,
    "index_name": "diskann",
    "memory": 125366272,
    "num_query": 1000,
    "qps": 722.1118474664544,
    "recall": 0.902999997138977,
    "search_parameters": "{\"hnsw\":{\"ef_search\":100},\"diskann\":{\"ef_search\":100,\"beam_search\":4,\"io_limit\":200,\"use_reorder\":true}}",
    "search_time_in_second": 1.384827023,
    "top_k": 1
}

SIFT

#./build-release/tools/test_performance     '/tmp/dataset/sift-1m-128-euclidean.hdf5'     'build' 'diskann' '{"dim": 128, "dtype": "float32", "metric_type": "l2", "hnsw": {"max_degree": 12, "ef_construction": 100}, "diskann": {"max_degree": 24, "ef_construction": 200, "graph_type":"vamana", "alpha":1.2, "turn": 20, "pq_dims": 64, "pq_sample_rate": 0.1}}' '{"hnsw":{"ef_search":100},"diskann":{"ef_search":100,"beam_search":4,"io_limit":200,"use_reorder":true}}'
[2024-12-31 15:22:07.361] [info] diskann build full (graph) cost 821.301s
[2024-12-31 15:24:25.050] [info] diskann build full (pq) cost 137.689s
[2024-12-31 15:24:26.217] [info] diskann build full (disk layout) cost 1.167s
[2024-12-31 15:24:27.774] [info] diskann serialize cost 444.411ms
{
    "build_parameters": "{\"dim\": 128, \"dtype\": \"float32\", \"metric_type\": \"l2\", \"hnsw\": {\"max_degree\": 12, \"ef_construction\": 100}, \"diskann\": {\"max_degree\": 24, \"ef_construction\": 200, \"graph_type\":\"vamana\", \"alpha\":1.2, \"turn\": 20, \"pq_dims\": 64, \"pq_sample_rate\": 0.1}}",
    "build_time_in_second": 961.269516893,
    "dataset": "/tmp/dataset/sift-1m-128-euclidean.hdf5",
    "index_name": "diskann",
    "num_base": 1000000,
    "tps": 1040.290971914083
}


#./build-release/tools/test_performance     '/tmp/dataset/sift-1m-128-euclidean.hdf5'     'search' 'diskann' '{"dim": 128, "dtype": "float32", "metric_type": "l2", "hnsw": {"max_degree": 12, "ef_construction": 100}, "diskann": {"max_degree": 24, "ef_construction": 200, "graph_type":"vamana", "alpha":1.2, "turn": 20, "pq_dims": 64, "pq_sample_rate": 0.1}}' '{"hnsw":{"ef_search":100},"diskann":{"ef_search":100,"beam_search":4,"io_limit":200,"use_reorder":true}}'
[2024-12-31 15:25:05.247] [info] diskann deserialize cost 173.148ms
{
    "correct": 9851,
    "dataset": "/tmp/dataset/sift-1m-128-euclidean.hdf5",
    "estimate_used_memory": 72393728,
    "index_name": "diskann",
    "memory": 117366784,
    "num_query": 10000,
    "qps": 897.6470347211786,
    "recall": 0.9850999712944031,
    "search_parameters": "{\"hnsw\":{\"ef_search\":100},\"diskann\":{\"ef_search\":100,\"beam_search\":4,\"io_limit\":200,\"use_reorder\":true}}",
    "search_time_in_second": 11.14023621,
    "top_k": 1
}

wxyucs added kind/feature New feature or request version/0.13 labels Dec 27, 2024

wxyucs changed the title ~~introduce nndecent to build graph index~~ introduce nndescent to build graph index Dec 27, 2024

wxyucs assigned inabao Dec 27, 2024

inabao changed the title ~~introduce nndescent to build graph index~~ introduce optimized nndescent to build graph index Dec 31, 2024

inabao mentioned this issue Dec 31, 2024

implement optimized nndescent in diskann #287

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

introduce optimized nndescent to build graph index #272

introduce optimized nndescent to build graph index #272

wxyucs commented Dec 27, 2024

inabao commented Dec 31, 2024

introduce optimized nndescent to build graph index #272

introduce optimized nndescent to build graph index #272

Comments

wxyucs commented Dec 27, 2024

inabao commented Dec 31, 2024