
Commit 5343849

Authored by KShivendu, joein, pre-commit-ci[bot], and generall
feat: Add sparse vectors benchmark support for Qdrant (#114)
* feat: Add sparse vectors benchmark support in Qdrant
* fix: Self review
* feat: Add sparse dataset for CI benchmarks
* feat: Introduce SparseVector class
* feat: Disallow sparse vector datasets from being run with non-sparse engine configs
* feat: Use a different engine config to run sparse vector benchmarks
* fix: Use a different engine config to run sparse vector benchmarks
* feat: Optimize CI benchmarks workflow
* feat: Add 1M sparse dataset
* fix: Remove scipy, read the CSR matrix manually (#117)
  * fix: Dataset query reader should have sparse_vector=None by default
  * refactor: Changes based on feedback
* refactor: Refactor sparse vector support (#118)
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* feat: Use pydantic construct
* refactor: Update all engines to use Query and Record dataclasses (#116)
  * feat: Add ruff in pre-commit hooks
  * fix: Type mismatches
  * fix: Redis search client types and var names
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
  * fix: Type issues detected by linter
  * fix: iter_batches func type
  * refactor: knn_conditions should be a class-level constant
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* fix: Type issue
* fix: Allow Python 3.8 since scipy is now removed
* fix: Add missing redis-m-16-ef-128 config
* fix: Redis container port
* fix: Linter

Co-authored-by: George <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: generall <[email protected]>

1 parent: b7ec57e


49 files changed: +556 -234 lines

.github/workflows/continuous-benchmark.yaml

Lines changed: 22 additions & 11 deletions
@@ -15,18 +15,29 @@ jobs:
       - uses: webfactory/[email protected]
         with:
           ssh-private-key: ${{ secrets.SSH_PRIVATE_KEY }}
+      - name: Setup CI
+        run: bash -x tools/setup_ci.sh
       - name: Benches
         run: |
-          export HCLOUD_TOKEN=${{ secrets.HCLOUD_TOKEN }}
-          export GCS_KEY=${{ secrets.GCS_KEY }}
-          export GCS_SECRET=${{ secrets.GCS_SECRET }}
-          export POSTGRES_PASSWORD=${{ secrets.POSTGRES_PASSWORD }}
-          export POSTGRES_HOST=${{ secrets.POSTGRES_HOST }}
+          export HCLOUD_TOKEN=${{ secrets.HCLOUD_TOKEN }}
+          export GCS_KEY=${{ secrets.GCS_KEY }}
+          export GCS_SECRET=${{ secrets.GCS_SECRET }}
+          export POSTGRES_PASSWORD=${{ secrets.POSTGRES_PASSWORD }}
+          export POSTGRES_HOST=${{ secrets.POSTGRES_HOST }}
 
-          # Benchmark the dev branch:
-          export QDRANT_VERSION=ghcr/dev
-          bash -x tools/run_ci.sh
+          declare -A DATASET_TO_ENGINE
+          DATASET_TO_ENGINE["laion-small-clip"]="qdrant-continuous-benchmark"
+          DATASET_TO_ENGINE["msmarco-sparse-1M"]="qdrant-sparse-vector"
 
-          # Benchmark the master branch:
-          export QDRANT_VERSION=docker/master
-          bash -x tools/run_ci.sh
+          for dataset in "${!DATASET_TO_ENGINE[@]}"; do
+            export ENGINE_NAME=${DATASET_TO_ENGINE[$dataset]}
+            export DATASETS=$dataset
+
+            # Benchmark the dev branch:
+            export QDRANT_VERSION=ghcr/dev
+            bash -x tools/run_ci.sh
+
+            # Benchmark the master branch:
+            export QDRANT_VERSION=docker/master
+            bash -x tools/run_ci.sh
+          done

.pre-commit-config.yaml

Lines changed: 7 additions & 0 deletions
@@ -28,3 +28,10 @@ repos:
       - id: isort
         name: "Sort Imports"
         args: ["--profile", "black"]
+
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.3.5
+    hooks:
+      # Run the linter.
+      - id: ruff
+        args: [ --fix ]

benchmark/dataset.py

Lines changed: 11 additions & 3 deletions
@@ -10,20 +10,28 @@
 from dataset_reader.ann_h5_reader import AnnH5Reader
 from dataset_reader.base_reader import BaseReader
 from dataset_reader.json_reader import JSONReader
+from dataset_reader.sparse_reader import SparseReader
 
 
 @dataclass
 class DatasetConfig:
-    vector_size: int
-    distance: str
     name: str
     type: str
     path: str
+
     link: Optional[str] = None
     schema: Optional[Dict[str, str]] = field(default_factory=dict)
+    # None in case of sparse vectors:
+    vector_size: Optional[int] = None
+    distance: Optional[str] = None
 
 
-READER_TYPE = {"h5": AnnH5Reader, "jsonl": JSONReader, "tar": AnnCompoundReader}
+READER_TYPE = {
+    "h5": AnnH5Reader,
+    "jsonl": JSONReader,
+    "tar": AnnCompoundReader,
+    "sparse": SparseReader,
+}
 
 
 class Dataset:
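
For illustration, a minimal sketch of how the extended READER_TYPE mapping could dispatch a sparse dataset to the new SparseReader. The config values mirror the msmarco entries added to datasets/datasets.json below; the dataset root path and the direct use of READER_TYPE are assumptions for the sketch, not the benchmark's actual loading code.

from pathlib import Path

from benchmark.dataset import READER_TYPE

# Hypothetical sparse dataset entry (see datasets/datasets.json below);
# sparse datasets carry no vector_size or distance.
config = {"name": "msmarco-sparse-1M", "type": "sparse", "path": "msmarco-sparse/1M"}

# Assumes the dataset archive has already been downloaded and unpacked.
reader = READER_TYPE[config["type"]](Path("datasets") / config["path"])
for record in reader.read_data():
    print(record.id, len(record.sparse_vector.indices))
    break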

dataset_reader/ann_compound_reader.py

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ def read_queries(self) -> Iterator[Query]:
             vector /= np.linalg.norm(vector)
             yield Query(
                 vector=vector.tolist(),
+                sparse_vector=None,
                 meta_conditions=row_json["conditions"],
                 expected_result=row_json["closest_ids"],
                 expected_scores=row_json["closest_scores"],

dataset_reader/ann_h5_reader.py

Lines changed: 4 additions & 1 deletion
@@ -22,6 +22,7 @@ def read_queries(self) -> Iterator[Query]:
             vector /= np.linalg.norm(vector)
             yield Query(
                 vector=vector.tolist(),
+                sparse_vector=None,
                 meta_conditions=None,
                 expected_result=expected_result.tolist(),
                 expected_scores=expected_scores.tolist(),
@@ -33,7 +34,9 @@ def read_data(self) -> Iterator[Record]:
         for idx, vector in enumerate(data["train"]):
             if self.normalize:
                 vector /= np.linalg.norm(vector)
-            yield Record(id=idx, vector=vector.tolist(), metadata=None)
+            yield Record(
+                id=idx, vector=vector.tolist(), sparse_vector=None, metadata=None
+            )
 
 
 if __name__ == "__main__":

dataset_reader/base_reader.py

Lines changed: 10 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,16 +2,24 @@
22
from typing import Iterator, List, Optional
33

44

5+
@dataclass
6+
class SparseVector:
7+
indices: List[int]
8+
values: List[float]
9+
10+
511
@dataclass
612
class Record:
713
id: int
8-
vector: List[float]
14+
vector: Optional[List[float]]
15+
sparse_vector: Optional[SparseVector]
916
metadata: Optional[dict]
1017

1118

1219
@dataclass
1320
class Query:
14-
vector: List[float]
21+
vector: Optional[List[float]]
22+
sparse_vector: Optional[SparseVector]
1523
meta_conditions: Optional[dict]
1624
expected_result: Optional[List[int]]
1725
expected_scores: Optional[List[float]] = None
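
A small sketch of how the extended Record and Query dataclasses are populated for dense versus sparse data; the values are made up, but the field usage matches the readers in this commit (dense readers pass sparse_vector=None, the sparse reader passes vector=None).

from dataset_reader.base_reader import Query, Record, SparseVector

# Dense record: sparse_vector stays None, as the dense readers above now do.
dense = Record(id=0, vector=[0.1, 0.2, 0.3], sparse_vector=None, metadata=None)

# Sparse record: vector stays None, the payload lives in a SparseVector.
sparse = Record(
    id=1,
    vector=None,
    sparse_vector=SparseVector(indices=[4, 17, 42], values=[0.5, 1.25, 0.75]),
    metadata=None,
)

# Sparse query with a known expected result, as SparseReader yields them.
query = Query(
    vector=None,
    sparse_vector=SparseVector(indices=[4, 42], values=[1.0, 2.0]),
    meta_conditions=None,
    expected_result=[1],
)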

dataset_reader/json_reader.py

Lines changed: 7 additions & 2 deletions
@@ -58,13 +58,18 @@ def read_queries(self) -> Iterator[Query]:
         ):
             # ToDo: add meta_conditions
 
-            yield Query(vector=vector, meta_conditions=None, expected_result=neighbours)
+            yield Query(
+                vector=vector,
+                sparse_vector=None,
+                meta_conditions=None,
+                expected_result=neighbours,
+            )
 
     def read_data(self) -> Iterator[Record]:
         for idx, (vector, payload) in enumerate(
             zip(self.read_vectors(), self.read_payloads())
         ):
-            yield Record(id=idx, vector=vector, metadata=payload)
+            yield Record(id=idx, vector=vector, sparse_vector=None, metadata=payload)
 
 
 if __name__ == "__main__":

dataset_reader/sparse_reader.py

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
+import os
+from pathlib import Path
+from typing import Iterator, List, Tuple, Union
+
+import numpy as np
+
+from dataset_reader.base_reader import BaseReader, Query, Record, SparseVector
+
+
+def read_sparse_matrix_fields(
+    filename: Union[Path, str]
+) -> Tuple[np.array, np.array, np.array]:
+    """Read the fields of a CSR matrix without instantiating it"""
+
+    with open(filename, "rb") as f:
+        sizes = np.fromfile(f, dtype="int64", count=3)
+        n_row, n_col, n_non_zero = sizes
+        index_pointer = np.fromfile(f, dtype="int64", count=n_row + 1)
+        assert n_non_zero == index_pointer[-1]
+        columns = np.fromfile(f, dtype="int32", count=n_non_zero)
+        assert np.all(columns >= 0) and np.all(columns < n_col)
+        values = np.fromfile(f, dtype="float32", count=n_non_zero)
+        return values, columns, index_pointer
+
+
+def csr_to_sparse_vectors(
+    values: List[float], columns: List[int], index_pointer: List[int]
+) -> Iterator[SparseVector]:
+    num_rows = len(index_pointer) - 1
+
+    for i in range(num_rows):
+        start = index_pointer[i]
+        end = index_pointer[i + 1]
+        row_values, row_indices = [], []
+        for j in range(start, end):
+            row_values.append(values[j])
+            row_indices.append(columns[j])
+        yield SparseVector(indices=row_indices, values=row_values)
+
+
+def read_csr_matrix(filename: Union[Path, str]) -> Iterator[SparseVector]:
+    """Read a CSR matrix in spmat format"""
+    values, columns, index_pointer = read_sparse_matrix_fields(filename)
+    values = values.tolist()
+    columns = columns.tolist()
+    index_pointer = index_pointer.tolist()
+
+    yield from csr_to_sparse_vectors(values, columns, index_pointer)
+
+
+def knn_result_read(
+    filename: Union[Path, str]
+) -> Tuple[List[List[int]], List[List[float]]]:
+    n, d = map(int, np.fromfile(filename, dtype="uint32", count=2))
+    assert os.stat(filename).st_size == 8 + n * d * (4 + 4)
+    with open(filename, "rb") as f:
+        f.seek(4 + 4)
+        ids = np.fromfile(f, dtype="int32", count=n * d).reshape(n, d).tolist()
+        scores = np.fromfile(f, dtype="float32", count=n * d).reshape(n, d).tolist()
+    return ids, scores
+
+
+class SparseReader(BaseReader):
+    def __init__(self, path, normalize=False):
+        self.path = path
+        self.normalize = normalize
+
+    def read_queries(self) -> Iterator[Query]:
+        queries_path = self.path / "queries.csr"
+        X = read_csr_matrix(queries_path)
+
+        gt_path = self.path / "results.gt"
+        gt_indices, _ = knn_result_read(gt_path)
+
+        for i, sparse_vector in enumerate(X):
+            yield Query(
+                vector=None,
+                sparse_vector=sparse_vector,
+                meta_conditions=None,
+                expected_result=gt_indices[i],
+            )
+
+    def read_data(self) -> Iterator[Record]:
+        data_path = self.path / "data.csr"
+        X = read_csr_matrix(data_path)
+
+        for i, sparse_vector in enumerate(X):
+            yield Record(id=i, vector=None, sparse_vector=sparse_vector, metadata=None)
+
+
+if __name__ == "__main__":
+    vals = [1, 3, 2, 3, 6, 4, 5]
+    cols = [0, 2, 2, 1, 3, 0, 2]
+    pointers = [0, 2, 3, 5, 7]
+    vecs = [vec for vec in csr_to_sparse_vectors(vals, cols, pointers)]
+
+    assert vecs[0] == SparseVector(indices=[0, 2], values=[1, 3])
+    assert vecs[1] == SparseVector(indices=[2], values=[2])
+    assert vecs[2] == SparseVector(indices=[1, 3], values=[3, 6])
+    assert vecs[3] == SparseVector(indices=[0, 2], values=[4, 5])
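
To make the binary .csr layout concrete, here is a small round-trip sketch: it writes a 2x4 CSR matrix with three non-zeros in the header/indptr/columns/values order that read_sparse_matrix_fields expects, then reads it back. The matrix and the /tmp file name are invented for the demo.

import numpy as np

from dataset_reader.sparse_reader import read_csr_matrix

# Row 0 -> {0: 1.0, 2: 3.0}, row 1 -> {1: 2.0}
index_pointer = np.array([0, 2, 3], dtype="int64")
columns = np.array([0, 2, 1], dtype="int32")
values = np.array([1.0, 3.0, 2.0], dtype="float32")

with open("/tmp/tiny.csr", "wb") as f:  # throwaway demo file
    np.array([2, 4, 3], dtype="int64").tofile(f)  # n_row, n_col, n_non_zero
    index_pointer.tofile(f)
    columns.tofile(f)
    values.tofile(f)

for vec in read_csr_matrix("/tmp/tiny.csr"):
    print(vec.indices, vec.values)
# [0, 2] [1.0, 3.0]
# [1] [2.0]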

datasets/datasets.json

Lines changed: 12 additions & 0 deletions
@@ -66,6 +66,18 @@
     "path": "dbpedia-openai-1M-1536-angular/dbpedia_openai_1M",
     "link": "https://storage.googleapis.com/ann-filtered-benchmark/datasets/dbpedia_openai_1M.tgz"
   },
+  {
+    "name": "msmarco-sparse-100K",
+    "type": "sparse",
+    "path": "msmarco-sparse/100K",
+    "link": "https://storage.googleapis.com/ann-filtered-benchmark/datasets/msmacro-sparse-100K.tar.gz"
+  },
+  {
+    "name": "msmarco-sparse-1M",
+    "type": "sparse",
+    "path": "msmarco-sparse/1M",
+    "link": "https://storage.googleapis.com/ann-filtered-benchmark/datasets/msmacro-sparse-1M.tar.gz"
+  },
   {
     "name": "h-and-m-2048-angular-filters",
     "vector_size": 2048,

engine/base_client/__init__.py

Lines changed: 9 additions & 0 deletions
@@ -6,3 +6,12 @@
 
 class IncompatibilityError(Exception):
     pass
+
+
+__all__ = [
+    "BaseClient",
+    "BaseConfigurator",
+    "BaseSearcher",
+    "BaseUploader",
+    "IncompatibilityError",
+]

engine/base_client/client.py

Lines changed: 4 additions & 1 deletion
@@ -1,7 +1,6 @@
 import json
 import os
 from datetime import datetime
-from pathlib import Path
 from typing import List
 
 from benchmark import ROOT_DIR
@@ -31,6 +30,10 @@ def __init__(
         self.searchers = searchers
         self.engine = engine
 
+    @property
+    def sparse_vector_support(self):
+        return self.configurator.SPARSE_VECTOR_SUPPORT
+
     def save_search_results(
         self, dataset_name: str, results: dict, search_id: int, search_params: dict
     ):
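
The commit message notes that sparse datasets are disallowed on engines without sparse support; a hedged sketch of how the new sparse_vector_support property could back such a guard. The actual check may live elsewhere in the runner, and the client and dataset objects here are assumptions.

from engine.base_client import IncompatibilityError

def ensure_sparse_compatibility(client, dataset) -> None:
    # Hypothetical guard: refuse to run a sparse dataset against an engine
    # whose configurator does not declare SPARSE_VECTOR_SUPPORT.
    if dataset.config.type == "sparse" and not client.sparse_vector_support:
        raise IncompatibilityError(
            f"{client.engine} does not support sparse dataset {dataset.config.name}"
        )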

engine/base_client/configure.py

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 
 
 class BaseConfigurator:
+    SPARSE_VECTOR_SUPPORT: bool = False
     DISTANCE_MAPPING = {}
 
     def __init__(self, host, collection_params: dict, connection_params: dict):
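
An engine opts in by overriding the new class attribute in its configurator; an illustrative subclass, where the class name is a placeholder rather than the repository's Qdrant configurator:

from engine.base_client.configure import BaseConfigurator

class SparseCapableConfigurator(BaseConfigurator):
    # Opt in to sparse datasets; BaseConfigurator defaults this to False.
    SPARSE_VECTOR_SUPPORT: bool = True

The sparse_vector_support property on the client above then reports True for engines configured this way.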

engine/base_client/search.py

Lines changed: 3 additions & 5 deletions
@@ -30,13 +30,11 @@ def get_mp_start_method(cls):
         return None
 
     @classmethod
-    def search_one(
-        cls, vector: List[float], meta_conditions, top: Optional[int]
-    ) -> List[Tuple[int, float]]:
+    def search_one(cls, query: Query, top: Optional[int]) -> List[Tuple[int, float]]:
         raise NotImplementedError()
 
     @classmethod
-    def _search_one(cls, query, top: Optional[int] = None):
+    def _search_one(cls, query: Query, top: Optional[int] = None):
         if top is None:
             top = (
                 len(query.expected_result)
@@ -45,7 +43,7 @@ def _search_one(cls, query, top: Optional[int] = None):
             )
 
         start = time.perf_counter()
-        search_res = cls.search_one(query.vector, query.meta_conditions, top)
+        search_res = cls.search_one(query, top)
         end = time.perf_counter()
 
         precision = 1.0
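
Under the new signature each engine receives the whole Query and can pick the dense or sparse payload itself. A sketch of a subclass, assuming BaseSearcher is exported per the package's __all__ and using a made-up cls.client API rather than a real engine call:

from typing import List, Optional, Tuple

from dataset_reader.base_reader import Query
from engine.base_client import BaseSearcher

class ExampleSearcher(BaseSearcher):
    @classmethod
    def search_one(cls, query: Query, top: Optional[int]) -> List[Tuple[int, float]]:
        if query.sparse_vector is not None:
            hits = cls.client.search_sparse(  # hypothetical engine call
                indices=query.sparse_vector.indices,
                values=query.sparse_vector.values,
                limit=top,
            )
        else:
            hits = cls.client.search_dense(  # hypothetical engine call
                vector=query.vector,
                query_filter=query.meta_conditions,
                limit=top,
            )
        return [(hit.id, hit.score) for hit in hits]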

engine/base_client/upload.py

Lines changed: 4 additions & 9 deletions
@@ -1,6 +1,6 @@
 import time
 from multiprocessing import get_context
-from typing import Iterable, List, Optional, Tuple
+from typing import Iterable, List
 
 import tqdm
 
@@ -80,22 +80,17 @@ def upload(
         }
 
     @classmethod
-    def _upload_batch(
-        cls, batch: Tuple[List[int], List[list], List[Optional[dict]]]
-    ) -> float:
-        ids, vectors, metadata = batch
+    def _upload_batch(cls, batch: List[Record]) -> float:
         start = time.perf_counter()
-        cls.upload_batch(ids, vectors, metadata)
+        cls.upload_batch(batch)
         return time.perf_counter() - start
 
     @classmethod
     def post_upload(cls, distance):
         return {}
 
     @classmethod
-    def upload_batch(
-        cls, ids: List[int], vectors: List[list], metadata: List[Optional[dict]]
-    ):
+    def upload_batch(cls, batch: List[Record]):
         raise NotImplementedError()
 
     @classmethod
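
Uploaders now receive the batch as a list of Record objects instead of parallel id/vector/metadata lists. A sketch of a subclass under the new signature, assuming BaseUploader is exported per the package's __all__ and with cls.client.upsert standing in for a real engine call:

from typing import List

from dataset_reader.base_reader import Record
from engine.base_client import BaseUploader

class ExampleUploader(BaseUploader):
    @classmethod
    def upload_batch(cls, batch: List[Record]):
        cls.client.upsert(  # hypothetical engine call
            ids=[record.id for record in batch],
            dense_vectors=[record.vector for record in batch],
            sparse_vectors=[record.sparse_vector for record in batch],
            payloads=[record.metadata for record in batch],
        )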
