
Commit 14943b9

Initial commit

File tree: 16 files changed (+1125 / -0 lines)

.gitignore

Lines changed: 3 additions & 0 deletions
.idea/
__pycache__
*.pyc

README.md

Lines changed: 137 additions & 0 deletions
# vector-db-benchmark

There are various vector search engines available, and each of them may offer
a different set of features and efficiency. But how do we measure the
performance? There is no clear definition, and in a specific case you may care
about one particular aspect while not paying much attention to the others. This
project is a general framework for benchmarking different engines under the
same hardware constraints, so you can choose what works best for you.

Running any benchmark requires choosing an engine, a dataset and the scenario
against which it should be tested.

## TL;DR

```shell
python main.py \
    --engine qdrant-0.8.4 \
    --scenario scenario.load.MeasureLoadTimeSingleClient \
    --dataset random-100
```

This will execute the benchmark scenario enclosed in the
`scenario.load.MeasureLoadTimeSingleClient` class and use the `random-100`
dataset. All the operations will be launched on a `qdrant-0.8.4` engine.

The expected output should look like the following:

```shell
mean(load::time) = 0.0015927800000000007
```

### Backend

A specific way of managing the containers. Right now only Docker is supported,
but it might be Docker Swarm or Kubernetes in the future, so the benchmark is
not executed on a single machine, but on several servers.

### Engine

There are various vector search projects available. Some of them are pure
libraries (like FAISS or Annoy) that offer great performance, but they do not
fit well into production systems. Those could also be benchmarked; however,
the primary focus is on vector databases using a client-server architecture.

All the engine configurations are kept in `./engine` subdirectories.

Each engine has its own configuration defined in a `config.json` file:

```json
{
  "server": {
    "image": "qdrant/qdrant:v0.8.4",
    "hostname": "qdrant_server",
    "environment": {
      "DEBUG": true
    }
  },
  "client": {
    "dockerfile": "client.Dockerfile",
    "main": "python cmd.py"
  }
}
```

- Either `image` or `dockerfile` has to be defined, similar to a
  `docker-compose.yaml` file. The `dockerfile` takes precedence over `image`.
- The `main` parameter points to the main client script which takes parameters.
  Those parameters define the operations to perform with the client library.

#### Server

The server is a process, or a bunch of processes, responsible for creating
vector indexes and handling all the user requests. It may be run on a single
machine, or, for engines that support it, in a distributed mode (**in the
future**).

#### Client

A client process performing all the operations, as it would typically be done
in any client-server based communication. There might be several clients
launched in parallel, and each of them might be using a part of the data. The
number of clients depends on the scenario.

Each client has to define a main script which takes some parameters and allows
performing typical CRUD-like operations. For now there is only one operation
supported:

- `load [path-to-file]`

If the scenario attempts to load the data from a given file, then it will call
the following command:

`python cmd.py load vectors.jsonl`

The main script has to handle the conversion and load operations.

By introducing a main script, we can allow using different client libraries, if
available, so there is no assumption about the language used, as long as it can
accept parameters.
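
The exact shape of that script is up to each engine. As a rough illustration
only, here is a minimal sketch of what a Python `cmd.py` could look like,
assuming a `load` subcommand and a JSON-lines input file; the `upload` helper
is a placeholder for whatever the engine's client library actually exposes:

```python
# cmd.py - hypothetical client entrypoint; the real per-engine scripts live
# next to each engine's config and may look completely different.
import argparse
import json
import time


def upload(record: dict) -> None:
    ...  # engine-specific: send the vector (and payload) to the server


def load(path: str) -> None:
    """Read a JSON-lines file, upload it and report the KPI on stdout."""
    start = time.perf_counter()
    with open(path, "r") as fp:
        for line in fp:
            upload(json.loads(line))
    print(f"load::time = {time.perf_counter() - start}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest="operation", required=True)
    load_parser = subparsers.add_parser("load")
    load_parser.add_argument("filename")
    args = parser.parse_args()
    if args.operation == "load":
        load(args.filename)
```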

### Dataset

A dataset consists of vectors and/or payloads. The scenario decides what to do
with the data.

## Metrics

Metrics are measured by the clients themselves and displayed on stdout. The
benchmark will collect all the metrics and display some statistics at the end
of each test.

All the displayed metrics should be printed in the following way:

```shell
phase::kpi_name = 0.242142
```

Where `0.242142` is a numerical value specific to the `kpi_name`. In the
simplest case that might be the time spent in a specific operation, like:

```
load::time = 0.0052424
```
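
The aggregation logic is not part of this commit, but as a hedged sketch, the
benchmark could collect those stdout lines and reduce them into the
`mean(load::time) = ...` summary shown in the TL;DR roughly like this (all
names below are assumptions):

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List

# Matches lines of the form `phase::kpi_name = 0.242142`
METRIC_LINE = re.compile(r"^(?P<name>\w+::\w+)\s*=\s*(?P<value>[-+0-9.eE]+)$")


def collect_metrics(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Group every metric line emitted by the clients by its KPI name."""
    metrics: Dict[str, List[float]] = defaultdict(list)
    for line in lines:
        match = METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group("name")].append(float(match.group("value")))
    return metrics


def summarize(metrics: Dict[str, List[float]]) -> None:
    """Print one aggregated statistic per KPI."""
    for name, values in metrics.items():
        print(f"mean({name}) = {mean(values)}")


summarize(collect_metrics(["load::time = 0.0052424", "load::time = 0.0048"]))
# -> mean(load::time) = 0.0050212
```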

## Open topics

1. The list of supported KPIs still has to be established and implemented by
   every single engine, so they can be tracked in all the benchmark scenarios.
2. What should be the format supported for the datasets? JSON lines are
   cross-language and cross-platform, which makes them easy to parse into
   whatever format a specific engine supports.
3. Should the scenario be tightly coupled with the dataset or allow using
   different datasets? For simpler cases that may work, but there might be some
   specific problems that won't be possible to express with every dataset.
4. How do we handle engine errors?
5. The dataset should also be represented by a class instance (see the sketch
   after this list):
   - that will give the possibility to not assume the filenames in a scenario
   - it will be easier to deal with paths

   The dataset should also have a file-based config, like the engine.
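
As a sketch of one possible shape for point 5, mirroring the per-engine
`config.json` approach (none of these names exist in this commit, and the
`./dataset/<name>/config.json` layout is an assumption):

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Dataset:
    """Hypothetical dataset descriptor driven by a file-based config."""

    name: str
    root_dir: Path
    vectors_file: str

    @classmethod
    def from_config(cls, root_dir: Path, name: str) -> "Dataset":
        # Assumed layout: ./dataset/random-100/config.json
        with open(root_dir / "dataset" / name / "config.json") as fp:
            conf = json.load(fp)
        return cls(name=name, root_dir=root_dir, **conf)

    def vectors_path(self) -> Path:
        # Scenarios no longer need to assume filenames or build paths by hand
        return self.root_dir / "dataset" / self.name / self.vectors_file
```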

benchmark/__init__.py

Whitespace-only changes.

benchmark/backend/__init__.py

Lines changed: 70 additions & 0 deletions
import abc
import tempfile
from pathlib import Path
from typing import Text, Union

PathLike = Union[Text, Path]


class Container(abc.ABC):
    """
    An abstraction over a container, which is a machine running either the
    server or the client of the engine.
    """

    def run(self):
        """
        Start the container using the backend.
        :return:
        """
        ...

    def is_ready(self) -> bool:
        """
        A healthcheck, making sure the container is properly set up.
        :return: True, if ready to proceed, False otherwise
        """
        ...


class Server(Container, abc.ABC):
    pass


class Client(Container, abc.ABC):
    """
    An abstract client of the selected engine.
    """

    def load_data(self, filename: Text):
        """
        Loads the data from a provided filename into the selected search
        engine. This is an engine-specific operation.
        :param filename: a relative path from the dataset directory
        :return:
        """
        ...


class Backend:
    """
    A base class for all the possible benchmark backends.
    """

    def __init__(self, root_dir: PathLike):
        self.root_dir = root_dir if isinstance(root_dir, Path) else Path(root_dir)
        self.temp_dir = None

    def __enter__(self):
        # Create a temporary working directory for the lifetime of the benchmark
        self.temp_dir = tempfile.TemporaryDirectory()
        self.temp_dir.__enter__()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.temp_dir.__exit__(exc_type, exc_val, exc_tb)

    def initialize_server(self, engine: Text) -> Server:
        ...

    def initialize_client(self, engine: Text) -> Client:
        ...
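
A hedged sketch of how a load scenario (such as the
`scenario.load.MeasureLoadTimeSingleClient` mentioned in the README) might
drive this interface; the scenario classes are not part of this commit, so the
function name and flow below are assumptions:

```python
from typing import Text

from benchmark.backend import Backend


def run_load_scenario(backend: Backend, engine: Text, filename: Text) -> None:
    """Spin up a server and a single client, then load one file."""
    # The backend is expected to be entered already (`with backend: ...`),
    # so containers and the temporary directory get cleaned up afterwards.
    server = backend.initialize_server(engine)
    server.run()
    assert server.is_ready()

    client = backend.initialize_client(engine)
    client.run()
    client.load_data(filename)  # the client reports its KPIs on stdout
```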

benchmark/backend/docker.py

Lines changed: 169 additions & 0 deletions
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Text, Union, Optional, Dict, List

from benchmark.backend import Backend, PathLike, Server, Client, Container
from docker.models import containers

import logging
import docker


logger = logging.getLogger(__name__)


@dataclass
class DockerContainerConf:
    engine: Text
    image: Optional[Text] = None
    dockerfile: Optional[Text] = None
    environment: Optional[Dict[Text, Union[Text, int, bool]]] = None
    main: Optional[Text] = None
    hostname: Optional[Text] = None

    @classmethod
    def from_file(
        cls, path: Text, engine: Text, container: Text = "server"
    ) -> "DockerContainerConf":
        with open(path, "r") as fp:
            conf = json.load(fp)
            return DockerContainerConf(engine=engine, **conf[container])

    def dockerfile_path(self, root_dir: Path) -> Path:
        """
        Calculates the absolute path to the directory containing the dockerfile,
        using the given root directory as a base.
        :param root_dir:
        :return:
        """
        return root_dir / "engine" / self.engine


class DockerContainer(Container):
    def __init__(
        self,
        container_conf: DockerContainerConf,
        docker_backend: "DockerBackend",
    ):
        self.container_conf = container_conf
        self.docker_backend = docker_backend
        self.container: Optional[containers.Container] = None
        self.volumes = []

    def mount(self, source: PathLike, target: PathLike):
        self.volumes.append(f"{source}:{target}")

    def run(self):
        # Build the image if a dockerfile was provided instead of an image
        # name. This is typically done for the clients, as they may require
        # some custom setup.
        if self.container_conf.dockerfile is not None:
            dockerfile_path = self.container_conf.dockerfile_path(
                self.docker_backend.root_dir
            )
            image, logs = self.docker_backend.docker_client.images.build(
                path=str(dockerfile_path),
                dockerfile=self.container_conf.dockerfile,
            )
            self.container_conf.image = image.id
            logger.info(
                "Built %s into a Docker image %s",
                self.container_conf.dockerfile,
                image.id,
            )

        # Create the container using the image: either the configured one or
        # the one just built, since the dockerfile takes precedence over the
        # image name.
        logger.debug("Running a container using image %s", self.container_conf.image)
        self.container = self.docker_backend.docker_client.containers.run(
            self.container_conf.image,
            detach=True,
            volumes=self.volumes,
            environment=self.container_conf.environment,
            hostname=self.container_conf.hostname,
            network=self.docker_backend.network.name,
        )

        # TODO: remove the image on exit

    def logs(self):
        for log_entry in self.container.logs(stream=True, follow=True):
            yield log_entry

    def is_ready(self) -> bool:
        # TODO: implement the healthcheck
        return True


class DockerServer(Server, DockerContainer):
    pass


class DockerClient(Client, DockerContainer):
    def load_data(self, filename: Text):
        command = f"{self.container_conf.main} load {filename}"
        _, generator = self.container.exec_run(command, stream=True)
        return generator


class DockerBackend(Backend):
    """
    A Docker-based backend for the benchmarks, using separate containers for
    the server and the client(s).
    """

    NETWORK_NAME = "vector-benchmark"

    def __init__(
        self,
        root_dir: PathLike,
        docker_client: Optional[docker.DockerClient] = None,
    ):
        super().__init__(root_dir)
        if docker_client is None:
            docker_client = docker.from_env()
        self.docker_client = docker_client
        self.containers: List[DockerContainer] = []

    def __enter__(self):
        super().__enter__()
        self.network = self.docker_client.networks.create(self.NETWORK_NAME)
        # self.data_volume = self.docker_client.volumes.create()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        super().__exit__(exc_type, exc_val, exc_tb)

        # Kill all the containers on the context manager exit, so there are no
        # orphaned containers once the benchmark is finished
        for container in self.containers:
            container.container.kill()

        # Remove the data volume as well, so there won't be any volume left
        # self.data_volume.remove()

        # Finally get rid of the network as well
        self.network.remove()

    def initialize_server(self, engine: Text) -> Server:
        server_conf = DockerContainerConf.from_file(
            self.root_dir / "engine" / engine / "config.json",
            engine=engine,
            container="server",
        )
        logger.info("Initializing %s server: %s", engine, server_conf)
        server = DockerServer(server_conf, self)
        self.containers.append(server)
        return server

    def initialize_client(self, engine: Text) -> Client:
        # TODO: Create a docker volume so the data is available on client instances
        client_conf = DockerContainerConf.from_file(
            self.root_dir / "engine" / engine / "config.json",
            engine=engine,
            container="client",
        )
        logger.info("Initializing %s client: %s", engine, client_conf)
        client = DockerClient(client_conf, self)
        self.containers.append(client)
        return client
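
A minimal sketch of how this backend might be wired together, assuming the
repository root and the engine name from the README; the mount paths and the
dataset location are illustrative only:

```python
from benchmark.backend.docker import DockerBackend

with DockerBackend(root_dir="/path/to/vector-db-benchmark") as backend:
    server = backend.initialize_server("qdrant-0.8.4")
    server.run()

    client = backend.initialize_client("qdrant-0.8.4")
    # Make the dataset visible inside the client container (illustrative path)
    client.mount("/path/to/vector-db-benchmark/dataset/random-100", "/dataset")
    client.run()

    # `load_data` streams the output of `python cmd.py load ...` as raw chunks;
    # each metric line can then be collected by the benchmark.
    for chunk in client.load_data("/dataset/vectors.jsonl"):
        print(chunk.decode(), end="")
```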
