feat: Add sparse vectors benchmark support for Qdrant (#114)
* feat: Add sparse vectors benchmark support in Qdrant
* fix: Self review
* feat: Add sparse dataset for CI benchmarks
* feat: Introduce SparseVector class
* feat: Disallow sparse vector dataset being run with non sparse vector engine configs
* feat: use different engine config to run sparse vector benchmarks
* fix: use different engine config to run sparse vector benchmarks
* feat: Optimize CI benchmarks workflow
* feat: Add 1M sparse dataset
* fix: remove scipy, read csr matrix manually (#117)
* fix: remove scipy, read csr matrix manually
* fix: Dataset query reader should have sparse_vector=None by default
* refactor: Changes based on feedback
* refactoring: refactor sparse vector support (#118)
* refactoring: refactor sparse vector support
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat: Use pydantic construct
* refactor: Update all engines to use Query and Record dataclasses (#116)
* refactor: Update all engines to use Query and Record dataclasses
* feat: Add ruff in pre-commit hooks
* fix: Type mismatches
* fix: Redis search client types and var names
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* fix: Type issues detected by linter
* fix: iter_batches func type
* refactor: knn_conditions should be class level constant
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* fix: Type issue
* fix: Allow python 3.8 since scipy is now removed
* fix: Add missing redis-m-16-ef-128 config
* fix: redis container port
* fix linter
---------
Co-authored-by: George <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: generall <[email protected]>
1 parent b7ec57e, commit 5343849
Showing 49 changed files with 556 additions and 234 deletions.
@@ -15,18 +15,29 @@ jobs:
       - uses: webfactory/[email protected]
         with:
           ssh-private-key: ${{ secrets.SSH_PRIVATE_KEY }}
       - name: Setup CI
         run: bash -x tools/setup_ci.sh
       - name: Benches
         run: |
-          export HCLOUD_TOKEN=${{ secrets.HCLOUD_TOKEN }}
-          export GCS_KEY=${{ secrets.GCS_KEY }}
-          export GCS_SECRET=${{ secrets.GCS_SECRET }}
-          export POSTGRES_PASSWORD=${{ secrets.POSTGRES_PASSWORD }}
-          export POSTGRES_HOST=${{ secrets.POSTGRES_HOST }}
+          export HCLOUD_TOKEN=${{ secrets.HCLOUD_TOKEN }}
+          export GCS_KEY=${{ secrets.GCS_KEY }}
+          export GCS_SECRET=${{ secrets.GCS_SECRET }}
+          export POSTGRES_PASSWORD=${{ secrets.POSTGRES_PASSWORD }}
+          export POSTGRES_HOST=${{ secrets.POSTGRES_HOST }}
-          # Benchmark the dev branch:
-          export QDRANT_VERSION=ghcr/dev
-          bash -x tools/run_ci.sh
+          declare -A DATASET_TO_ENGINE
+          DATASET_TO_ENGINE["laion-small-clip"]="qdrant-continuous-benchmark"
+          DATASET_TO_ENGINE["msmarco-sparse-1M"]="qdrant-sparse-vector"
-          # Benchmark the master branch:
-          export QDRANT_VERSION=docker/master
-          bash -x tools/run_ci.sh
+          for dataset in "${!DATASET_TO_ENGINE[@]}"; do
+            export ENGINE_NAME=${DATASET_TO_ENGINE[$dataset]}
+            export DATASETS=$dataset
+            # Benchmark the dev branch:
+            export QDRANT_VERSION=ghcr/dev
+            bash -x tools/run_ci.sh
+            # Benchmark the master branch:
+            export QDRANT_VERSION=docker/master
+            bash -x tools/run_ci.sh
+          done
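The loop in the new workflow step calls `tools/run_ci.sh` once per dataset and Qdrant version. The Python sketch below lists the environment each call would see. It only mirrors the shell logic for illustration: `planned_runs` is a hypothetical helper, not part of the repository, and bash does not guarantee the iteration order of associative-array keys, whereas a Python dict keeps insertion order.

```python
from typing import Dict, Iterator

# Mapping copied from the workflow step above.
DATASET_TO_ENGINE: Dict[str, str] = {
    "laion-small-clip": "qdrant-continuous-benchmark",
    "msmarco-sparse-1M": "qdrant-sparse-vector",
}

# Each dataset is benchmarked against both versions, dev first.
QDRANT_VERSIONS = ("ghcr/dev", "docker/master")


def planned_runs() -> Iterator[Dict[str, str]]:
    """Yield the environment variables set before each run_ci.sh call."""
    for dataset, engine in DATASET_TO_ENGINE.items():
        for version in QDRANT_VERSIONS:
            yield {
                "ENGINE_NAME": engine,
                "DATASETS": dataset,
                "QDRANT_VERSION": version,
            }


if __name__ == "__main__":
    for env in planned_runs():
        print(env)
```

Keeping the dataset-to-engine mapping in one place means a sparse dataset is always paired with the sparse engine config, which matches the commit's stated goal of not running sparse datasets against non-sparse configs.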
@@ -0,0 +1,100 @@
import os
from pathlib import Path
from typing import Iterator, List, Tuple, Union

import numpy as np

from dataset_reader.base_reader import BaseReader, Query, Record, SparseVector


def read_sparse_matrix_fields(
    filename: Union[Path, str]
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Read the fields of a CSR matrix without instantiating it"""

    with open(filename, "rb") as f:
        # Header: number of rows, number of columns, number of non-zero entries
        sizes = np.fromfile(f, dtype="int64", count=3)
        n_row, n_col, n_non_zero = sizes
        # Row i occupies positions index_pointer[i]:index_pointer[i + 1]
        index_pointer = np.fromfile(f, dtype="int64", count=n_row + 1)
        assert n_non_zero == index_pointer[-1]
        columns = np.fromfile(f, dtype="int32", count=n_non_zero)
        assert np.all(columns >= 0) and np.all(columns < n_col)
        values = np.fromfile(f, dtype="float32", count=n_non_zero)
    return values, columns, index_pointer
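The byte layout that `read_sparse_matrix_fields` expects can be spelled out with the standard `struct` module. This is a minimal sketch, not code from the commit: `pack_csr` and `unpack_csr` are hypothetical helpers, and little-endian byte order is an assumption (`np.fromfile` reads native byte order, which is little-endian on most common platforms).

```python
import struct
from typing import List, Tuple


def pack_csr(
    values: List[float], columns: List[int], index_pointer: List[int], n_col: int
) -> bytes:
    """Serialize CSR fields: int64 header, int64 pointers, int32 columns, float32 values."""
    n_row = len(index_pointer) - 1
    n_non_zero = len(values)
    return b"".join(
        [
            struct.pack("<3q", n_row, n_col, n_non_zero),
            struct.pack(f"<{n_row + 1}q", *index_pointer),
            struct.pack(f"<{n_non_zero}i", *columns),
            struct.pack(f"<{n_non_zero}f", *values),
        ]
    )


def unpack_csr(blob: bytes) -> Tuple[List[float], List[int], List[int]]:
    """Parse the same layout back, in the order the reader consumes it."""
    n_row, n_col, n_non_zero = struct.unpack_from("<3q", blob, 0)
    offset = 3 * 8
    index_pointer = list(struct.unpack_from(f"<{n_row + 1}q", blob, offset))
    offset += 8 * (n_row + 1)
    columns = list(struct.unpack_from(f"<{n_non_zero}i", blob, offset))
    offset += 4 * n_non_zero
    values = list(struct.unpack_from(f"<{n_non_zero}f", blob, offset))
    return values, columns, index_pointer


if __name__ == "__main__":
    blob = pack_csr([1.0, 3.0, 2.0], [0, 2, 2], [0, 2, 3], n_col=4)
    print(len(blob), unpack_csr(blob))
```

Under these assumptions, a file with `n_row` rows and `n_non_zero` entries occupies `24 + 8 * (n_row + 1) + 8 * n_non_zero` bytes.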


def csr_to_sparse_vectors(
    values: List[float], columns: List[int], index_pointer: List[int]
) -> Iterator[SparseVector]:
    """Yield one SparseVector per CSR row"""
    num_rows = len(index_pointer) - 1

    for i in range(num_rows):
        # Entries of row i live at positions [start, end) of values and columns
        start = index_pointer[i]
        end = index_pointer[i + 1]
        yield SparseVector(indices=columns[start:end], values=values[start:end])
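To see how the three CSR arrays relate to an ordinary matrix, the sketch below builds them from a small dense matrix and then splits them back into rows. `dense_to_csr` and `csr_rows` are hypothetical helpers written for this illustration, and plain tuples stand in for `SparseVector`, whose definition is not shown here. The matrix was chosen so that its CSR fields equal the values used in this file's `__main__` self-test.

```python
from typing import List, Tuple

# A 4x4 matrix with seven non-zero entries.
DENSE = [
    [1, 0, 3, 0],
    [0, 0, 2, 0],
    [0, 3, 0, 6],
    [4, 0, 5, 0],
]


def dense_to_csr(rows: List[List[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Collect non-zero values, their column indices, and per-row offsets."""
    values, columns, index_pointer = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                values.append(value)
                columns.append(col)
        # The running count of stored entries marks where the next row starts.
        index_pointer.append(len(values))
    return values, columns, index_pointer


def csr_rows(
    values: List[float], columns: List[int], index_pointer: List[int]
) -> List[Tuple[List[int], List[float]]]:
    """Split CSR fields into (indices, values) pairs, one per row."""
    return [
        (columns[start:end], values[start:end])
        for start, end in zip(index_pointer, index_pointer[1:])
    ]


if __name__ == "__main__":
    fields = dense_to_csr(DENSE)
    print(fields)
    print(csr_rows(*fields))
```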


def read_csr_matrix(filename: Union[Path, str]) -> Iterator[SparseVector]:
    """Read a CSR matrix in spmat format"""
    values, columns, index_pointer = read_sparse_matrix_fields(filename)
    # Convert to plain Python lists so each row holds native ints and floats
    values = values.tolist()
    columns = columns.tolist()
    index_pointer = index_pointer.tolist()

    yield from csr_to_sparse_vectors(values, columns, index_pointer)


def knn_result_read(
    filename: Union[Path, str]
) -> Tuple[List[List[int]], List[List[float]]]:
    """Read ground-truth neighbor ids and scores for n queries with d neighbors each"""
    # Header: two uint32 values (n, d)
    n, d = map(int, np.fromfile(filename, dtype="uint32", count=2))
    # Body: n * d int32 ids followed by n * d float32 scores
    assert os.stat(filename).st_size == 8 + n * d * (4 + 4)
    with open(filename, "rb") as f:
        f.seek(4 + 4)
        ids = np.fromfile(f, dtype="int32", count=n * d).reshape(n, d).tolist()
        scores = np.fromfile(f, dtype="float32", count=n * d).reshape(n, d).tolist()
    return ids, scores
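The ground-truth layout read by `knn_result_read` can be illustrated the same way: an 8-byte header followed by two equally sized blocks. The sketch below is an assumption-labeled illustration, not code from the commit: `pack_knn_result` and `unpack_knn_result` are hypothetical helpers, and little-endian byte order is assumed.

```python
import struct
from typing import List, Tuple


def pack_knn_result(ids: List[List[int]], scores: List[List[float]]) -> bytes:
    """Serialize n rows of d neighbors: uint32 (n, d), int32 ids, float32 scores."""
    n = len(ids)
    d = len(ids[0]) if ids else 0
    flat_ids = [i for row in ids for i in row]
    flat_scores = [s for row in scores for s in row]
    return (
        struct.pack("<2I", n, d)
        + struct.pack(f"<{n * d}i", *flat_ids)
        + struct.pack(f"<{n * d}f", *flat_scores)
    )


def unpack_knn_result(blob: bytes) -> Tuple[List[List[int]], List[List[float]]]:
    """Parse the layout back, with the same size check as the reader."""
    n, d = struct.unpack_from("<2I", blob, 0)
    assert len(blob) == 8 + n * d * (4 + 4)
    flat_ids = struct.unpack_from(f"<{n * d}i", blob, 8)
    flat_scores = struct.unpack_from(f"<{n * d}f", blob, 8 + 4 * n * d)
    ids = [list(flat_ids[r * d:(r + 1) * d]) for r in range(n)]
    scores = [list(flat_scores[r * d:(r + 1) * d]) for r in range(n)]
    return ids, scores


if __name__ == "__main__":
    blob = pack_knn_result([[3, 1], [0, 2]], [[0.5, 0.25], [1.0, 0.75]])
    print(len(blob), unpack_knn_result(blob))
```

The size assertion is a cheap guard against truncated files or a mismatched header, since the expected size follows directly from `n` and `d`.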


class SparseReader(BaseReader):
    def __init__(self, path, normalize=False):
        # Accept str as well as Path, since the readers below join paths with "/"
        self.path = Path(path)
        self.normalize = normalize

    def read_queries(self) -> Iterator[Query]:
        queries_path = self.path / "queries.csr"
        X = read_csr_matrix(queries_path)

        # Ground-truth neighbor ids, one row per query; scores are not used
        gt_path = self.path / "results.gt"
        gt_indices, _ = knn_result_read(gt_path)

        for i, sparse_vector in enumerate(X):
            yield Query(
                vector=None,
                sparse_vector=sparse_vector,
                meta_conditions=None,
                expected_result=gt_indices[i],
            )

    def read_data(self) -> Iterator[Record]:
        data_path = self.path / "data.csr"
        X = read_csr_matrix(data_path)

        # The row number in the CSR matrix serves as the record id
        for i, sparse_vector in enumerate(X):
            yield Record(id=i, vector=None, sparse_vector=sparse_vector, metadata=None)


if __name__ == "__main__":
    # Self-test: a 4-row matrix with 7 non-zero entries
    vals = [1, 3, 2, 3, 6, 4, 5]
    cols = [0, 2, 2, 1, 3, 0, 2]
    pointers = [0, 2, 3, 5, 7]
    vecs = list(csr_to_sparse_vectors(vals, cols, pointers))

    assert vecs[0] == SparseVector(indices=[0, 2], values=[1, 3])
    assert vecs[1] == SparseVector(indices=[2], values=[2])
    assert vecs[2] == SparseVector(indices=[1, 3], values=[3, 6])
    assert vecs[3] == SparseVector(indices=[0, 2], values=[4, 5])