Graph-Based Vector Similarity Search: An Experimental Evaluation of the State-of-the-Art

Introduction

Collections of vector data often expand to billions of vectors with thousands of dimensions, thereby escalating the complexity of analysis. Similarity search serves as the linchpin for numerous critical analytical tasks. Graph-based similarity search has recently emerged as the preferred method for analytical tasks that do not necessitate theoretical guarantees on the quality of answers. In this survey, we undertake a thorough evaluation of twelve state-of-the-art in-memory graph-based similarity search methods :

SOTA Alg.	PAPER	CODE
KGraph	WWW'2011	C++/Python
NSG	VLDB'2019	C++
DPG	TKDE'2019	C++
Vamana	NeurIPS'2019	C++
EFANNA	arXiv'2016	C++
HNSW	TPAMI'2018	C++/Python
SPTAG-KDT	ACM MM'2012; CVPR'2012; TPAMI'2014	C++
SPTAG-BKT	ACM MM'2012; CVPR'2012; TPAMI'2014	C++
ELPIS	VLDB'2023	C++
NGT	SISAP'22	C++
LSHAPG	VLDB'23	C++
HVS	VLDB'21	C++
HCNNG	PR'2019	C++

Furthermore, we present a survey delineating the chronological evolution of these methods and evaluate their key design decisions, encompassing Seed Selection (SS) and Neighborhood Diversification (ND). This repository contains the code utilized to evaluate different design choices for SS and ND, and the indexing and search performance of the twelve SotA approaches. For each method, we include parametrization details and any changes done to the original code bases to ensure a fair comparison.

Datasets

We use the following four real datasets covering a variety of domains from deep network embeddings, computer vision, neuroscience and seismology: (i) Deep contains 1 billion vectors of 96 dimensions extracted from the last layers of a convolutional neural network; (ii) Sift consists of 1 billion SIFT vectors of size 128 representing image feature descriptions; (iii) SALD contains neuroscience MRI data and includes 200 million data series of size 128; (iv) Seismic contains 100 million data series of size 256 representing earthquake recordings at seismic stations worldwide.

ND/SS experiments

ND and SS experiments code is based on nmslib/hnswlib code for constructing insertion-based graphs with various ND approaches and executing searches using multiple SS techniques.

Usage

Prerequisites

GCC 4.9+ with OpenMP
CMake 3.5+

Compilation on Linux

mkdir Release
cd Release
cmake -DCMAKE_BUILD_TYPE=Release ..
make

Building

./Release/WTSS --dataset path/dataset.bin --dataset-size n --index-path path/indexdirname/ --timeseries-size dim  --K maxoutdegree  --L beamwidth --mode 0  --nd nd_type --prune prune_value --ep ep_type

Where:

path/dataset.bin is the absolute path to the dataset binary file.
n is the dataset size.
path/indexdirname/ is the absolute path where the index will be stored (the index folder should not already exist).
dim is the dimension.
maxoutdegree is the maximum outdegree for nodes during graph construction.
beamwidth is the beamwidth during candidate neighbor search.
nd_type is the type of ND method to use: 0 for RND, 1 for RRND, 2 for MOND and 3 for NoND.
prune_value is the value used during ND. For RRND, a value between 1.3-1.5 is recommended; for MOND, 60 yields the best results.
ep is the SS method to use during construction, with 0 for StackedNSW and 3 for KSREP.

ND Pruning Ratio

To output the ND pruning ratio during graph construction, uncomment the definition STATSND in ./include/PTK.h lines 12, 13, 14.

Construction NDC

To output the number of distance calculations during indexing, uncomment the definition DC_IDX in ./include/PTK.h lines 8, 9, 10.

Search

./Release/WTSS --queries path/queries.bin --queries-size n --index-path path/indexdirname/ --timeseries-size dim  --K k  --L beamwidth --mode 1 --ep ep_type

Where:

path/queries.bin is the absolute path to the query set binary file.
n is the query set size.
k is the number of NN results desired.
beamwidth is the size of the priority queue used during beam search, with beamwidth >= k.
ep_type is the type of SS method to use during search, with 0 for StackedNSW, 1 for medoid, 2 for SFREP, 3 for KSREP, and 4 for KDTrees.

Name		Name	Last commit message	Last commit date
Latest commit History 135 Commits
Evaluate_Hardness		Evaluate_Hardness
NeighborhoodDiversification		NeighborhoodDiversification
SeedSelections		SeedSelections
Supplementary_Material		Supplementary_Material
code		code
data		data
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Graph-Based Vector Similarity Search: An Experimental Evaluation of the State-of-the-Art

Introduction

Datasets

ND/SS experiments

Usage

Prerequisites

Compilation on Linux

Building

ND Pruning Ratio

Construction NDC

Search

About

Releases

Packages

Languages

zeraph6/GASS_Repo

Folders and files

Latest commit

History

Repository files navigation

Graph-Based Vector Similarity Search: An Experimental Evaluation of the State-of-the-Art

Introduction

Datasets

ND/SS experiments

Usage

Prerequisites

Compilation on Linux

Building

ND Pruning Ratio

Construction NDC

Search

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages