Commit 14943b9 (0 parents): 16 changed files with 1,125 additions and 0 deletions.
Ignore rules for the repository:

```
.idea/
__pycache__
*.pyc
```
# vector-db-benchmark

There are various vector search engines available, and each of them may offer a different set of features and a different level of efficiency. But how do we measure the performance? There is no clear definition: in a specific case you may worry about one particular thing while not paying much attention to other aspects. This project is a general framework for benchmarking different engines under the same hardware constraints, so you can choose what works best for you.

Running any benchmark requires choosing an engine, a dataset, and the scenario against which it should be tested.

## TL;DR

```shell
python main.py \
    --engine qdrant-0.8.4 \
    --scenario scenario.load.MeasureLoadTimeSingleClient \
    --dataset random-100
```

This will execute the benchmark scenario enclosed in the `scenario.load.MeasureLoadTimeSingleClient` class and use a `random-100` dataset. All the operations will be launched on a `qdrant-0.8.4` engine.
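The `--scenario` argument looks like a dotted Python import path. One plausible way such a path could be resolved into a class (an illustrative sketch, not necessarily how `main.py` does it):

```python
import importlib


def resolve_class(path: str):
    """Resolve a dotted path like 'scenario.load.MeasureLoadTimeSingleClient'
    into a class object. Hypothetical helper, shown for illustration only."""
    module_name, class_name = path.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

With this helper, `resolve_class("scenario.load.MeasureLoadTimeSingleClient")` would import the `scenario.load` module and return the class, assuming such a module exists on the path.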
The expected output should look like the following:

```shell
mean(load::time) = 0.0015927800000000007
```

### Backend

A backend is a specific way of managing the containers. Right now only Docker is supported, but it might also be Docker Swarm or Kubernetes, so that the benchmark is not executed on a single machine but across several servers.
### Engine

There are various vector search projects available. Some of them are pure libraries (like FAISS or Annoy) that offer great performance but don't fit well into production systems. Those could also be benchmarked; however, the primary focus is on vector databases using a client-server architecture.

All the engine configurations are kept in `./engine` subdirectories.

Each engine has its own configuration defined in a `config.json` file:

```json
{
  "server": {
    "image": "qdrant/qdrant:v0.8.4",
    "hostname": "qdrant_server",
    "environment": {
      "DEBUG": true
    }
  },
  "client": {
    "dockerfile": "client.Dockerfile",
    "main": "python cmd.py"
  }
}
```

- Either `image` or `dockerfile` has to be defined, similar to a `docker-compose.yaml` file. The `dockerfile` takes precedence over `image`.
- The `main` parameter points to the main client script, which takes parameters. Those parameters define the operations to perform with a client library.
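The precedence rule above can be sketched as a small helper (an illustrative function, not code from the repository):

```python
def resolve_image_source(container_conf: dict) -> str:
    """Decide where a container comes from, mirroring the documented rule:
    `dockerfile` takes precedence over `image`."""
    if container_conf.get("dockerfile"):
        return f"build from {container_conf['dockerfile']}"
    if container_conf.get("image"):
        return f"pull {container_conf['image']}"
    raise ValueError("either 'image' or 'dockerfile' has to be defined")
```

With the example `config.json` above, the server section resolves to `pull qdrant/qdrant:v0.8.4` and the client section to `build from client.Dockerfile`.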
#### Server

The server is a process, or a group of processes, responsible for creating vector indexes and handling all the user requests. It may be run on a single machine, or, for engines supporting distributed mode, on several machines (**in the future**).

#### Client

A client process performs all the operations, as would typically be done in any client-server communication. There might be several clients launched in parallel, each of them using a part of the data. The number of clients depends on the scenario.

Each client has to define a main script which takes some parameters and allows performing typical CRUD-like operations. For now there is only one operation supported:

- `load [path-to-file]`

If the scenario attempts to load the data from a given file, then it will call the following command:

`python cmd.py load vectors.jsonl`

The main script has to handle the conversion and load operations.
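A minimal client entrypoint satisfying the contract above might look like this. It is a hypothetical `cmd.py`: the JSON-lines input and the `load::time` metric name come from this README, while everything else is an assumption:

```python
import json
import sys
import time


def load(path: str) -> int:
    """Read a JSON-lines vectors file and return the number of records.
    A real client would upsert each record into the engine here."""
    count = 0
    start = time.perf_counter()
    with open(path) as fp:
        for line in fp:
            record = json.loads(line)  # engine-specific insert would go here
            count += 1
    # Report the metric on stdout in the expected `phase::kpi_name = value` format
    print(f"load::time = {time.perf_counter() - start}")
    return count


if __name__ == "__main__":
    operation, filename = sys.argv[1], sys.argv[2]
    if operation == "load":
        load(filename)
```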
By introducing a main script, we can allow using different client libraries, if available, so there is no assumption about the language used, as long as the script can accept parameters.

### Dataset

A dataset consists of vectors and/or payloads. The scenario decides what to do with the data.

## Metrics

Metrics are measured by the clients themselves and displayed on stdout. The benchmark collects all the metrics and displays some statistics at the end of each test.
All the displayed metrics should be printed in the following way:

```shell
phase::kpi_name = 0.242142
```

Where `0.242142` is a numerical value specific to the `kpi_name`. In the simplest case that might be the time spent in a specific operation, like:

```
load::time = 0.0052424
```
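The statistics reported at the end (like `mean(load::time)` in the TL;DR) can be derived by parsing those stdout lines. A hedged sketch of such an aggregation, with made-up input values:

```python
from statistics import mean


def parse_metric(line: str):
    """Split a `phase::kpi_name = value` line into its key and float value."""
    key, value = line.split(" = ", 1)
    return key.strip(), float(value)


# Illustrative input, as it could be captured from a client's stdout:
stdout_lines = [
    "load::time = 0.0052424",
    "load::time = 0.0049000",
]

metrics: dict = {}
for line in stdout_lines:
    key, value = parse_metric(line)
    metrics.setdefault(key, []).append(value)

for key, values in metrics.items():
    print(f"mean({key}) = {mean(values)}")
```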
## Open topics

1. The list of supported KPIs still needs to be established and implemented by every single engine, so they can be tracked in all the benchmark scenarios.
2. What format should be supported for the datasets? JSON lines are cross-language and cross-platform, which makes them easy to parse into whatever format a specific engine supports.
3. Should the scenario be tightly coupled with the dataset, or should it allow using different datasets? For simpler cases that may work, but there might be some specific problems that won't be possible for every dataset.
4. How do we handle engine errors?
5. A dataset should also be represented by a class instance:
   - that will make it possible not to assume the filenames in a scenario
   - it will be easier to deal with paths

   The dataset should also have a file-based config, like an engine.
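Such a dataset config might look like the following sketch. It is purely hypothetical: neither the fields nor the filename are defined anywhere in the repository yet, and only the `random-100` name and the `vectors.jsonl` file appear elsewhere in this README:

```json
{
  "name": "random-100",
  "vector_size": 128,
  "files": {
    "vectors": "vectors.jsonl"
  }
}
```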
The backend abstractions (the `benchmark.backend` module):

```python
import abc
import tempfile
from pathlib import Path
from typing import Text, Union

PathLike = Union[Text, Path]


class Container(abc.ABC):
    """
    An abstraction over a container, which is a machine running either the
    server or the client of the engine.
    """

    def run(self):
        """
        Start the container using the backend.
        """
        ...

    def is_ready(self) -> bool:
        """
        A healthcheck, making sure the container is properly set up.
        :return: True if ready to proceed, False otherwise
        """
        ...


class Server(Container, abc.ABC):
    pass


class Client(Container, abc.ABC):
    """
    An abstract client of the selected engine.
    """

    def load_data(self, filename: Text):
        """
        Load the data from a provided file into the selected search engine.
        This is an engine-specific operation.
        :param filename: a relative path from the dataset directory
        """
        ...


class Backend:
    """
    A base class for all the possible benchmark backends.
    """

    def __init__(self, root_dir: PathLike):
        self.root_dir = root_dir if isinstance(root_dir, Path) else Path(root_dir)
        self.temp_dir = None

    def __enter__(self):
        self.temp_dir = tempfile.TemporaryDirectory()
        self.temp_dir.__enter__()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.temp_dir.__exit__(exc_type, exc_val, exc_tb)

    def initialize_server(self, engine: Text) -> Server:
        ...

    def initialize_client(self, engine: Text) -> Client:
        ...
```
The Docker-based backend implementation:

```python
import json
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import Text, Union, Optional, Dict, List

import docker
from docker.models import containers

from benchmark.backend import Backend, PathLike, Server, Client, Container

logger = logging.getLogger(__name__)


@dataclass
class DockerContainerConf:
    engine: Text
    image: Optional[Text] = None
    dockerfile: Optional[Text] = None
    environment: Optional[Dict[Text, Union[Text, int, bool]]] = None
    main: Optional[Text] = None
    hostname: Optional[Text] = None

    @classmethod
    def from_file(
        cls, path: Text, engine: Text, container: Text = "server"
    ) -> "DockerContainerConf":
        with open(path, "r") as fp:
            conf = json.load(fp)
        return DockerContainerConf(engine=engine, **conf[container])

    def dockerfile_path(self, root_dir: Path) -> Path:
        """
        Calculate the absolute path to the directory containing the Dockerfile,
        using the given root directory as a base.
        """
        return root_dir / "engine" / self.engine


class DockerContainer(Container):
    def __init__(
        self,
        container_conf: DockerContainerConf,
        docker_backend: "DockerBackend",
    ):
        self.container_conf = container_conf
        self.docker_backend = docker_backend
        self.container: Optional[containers.Container] = None
        self.volumes = []

    def mount(self, source: PathLike, target: PathLike):
        self.volumes.append(f"{source}:{target}")

    def run(self):
        # Build the image from the Dockerfile if one was provided instead of
        # an image name. This is typically done for the clients, as they may
        # require some custom setup.
        if self.container_conf.dockerfile is not None:
            dockerfile_path = self.container_conf.dockerfile_path(
                self.docker_backend.root_dir
            )
            image, logs = self.docker_backend.docker_client.images.build(
                path=str(dockerfile_path),
                dockerfile=self.container_conf.dockerfile,
            )
            self.container_conf.image = image.id
            logger.info(
                "Built %s into a Docker image %s",
                self.container_conf.dockerfile,
                image.id,
            )

        # Create the container using the image. At this point it may be the
        # image built from the Dockerfile, which takes precedence over the
        # image name.
        logger.debug("Running a container using image %s", self.container_conf.image)
        self.container = self.docker_backend.docker_client.containers.run(
            self.container_conf.image,
            detach=True,
            volumes=self.volumes,
            environment=self.container_conf.environment,
            hostname=self.container_conf.hostname,
            network=self.docker_backend.network.name,
        )

        # TODO: remove the image on exit

    def logs(self):
        for log_entry in self.container.logs(stream=True, follow=True):
            yield log_entry

    def is_ready(self) -> bool:
        # TODO: implement the healthcheck
        return True


class DockerServer(Server, DockerContainer):
    pass


class DockerClient(Client, DockerContainer):
    def load_data(self, filename: Text):
        command = f"{self.container_conf.main} load {filename}"
        _, generator = self.container.exec_run(command, stream=True)
        return generator


class DockerBackend(Backend):
    """
    A Docker-based backend for the benchmarks, using separate containers for
    the server and the client(s).
    """

    NETWORK_NAME = "vector-benchmark"

    def __init__(
        self,
        root_dir: PathLike,
        docker_client: Optional[docker.DockerClient] = None,
    ):
        super().__init__(root_dir)
        if docker_client is None:
            docker_client = docker.from_env()
        self.docker_client = docker_client
        self.containers: List[DockerContainer] = []

    def __enter__(self):
        super().__enter__()
        self.network = self.docker_client.networks.create(self.NETWORK_NAME)
        # self.data_volume = self.docker_client.volumes.create()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        super().__exit__(exc_type, exc_val, exc_tb)

        # Kill all the containers on the context manager exit, so there are no
        # orphaned containers once the benchmark is finished
        for container in self.containers:
            container.container.kill()

        # Remove the data volume as well, so there won't be any volume left
        # self.data_volume.remove()

        # Finally get rid of the network as well
        self.network.remove()

    def initialize_server(self, engine: Text) -> Server:
        server_conf = DockerContainerConf.from_file(
            self.root_dir / "engine" / engine / "config.json",
            engine=engine,
            container="server",
        )
        logger.info("Initializing %s server: %s", engine, server_conf)
        server = DockerServer(server_conf, self)
        self.containers.append(server)
        return server

    def initialize_client(self, engine: Text) -> Client:
        # TODO: Create a docker volume so the data is available on client instances
        client_conf = DockerContainerConf.from_file(
            self.root_dir / "engine" / engine / "config.json",
            engine=engine,
            container="client",
        )
        logger.info("Initializing %s client: %s", engine, client_conf)
        client = DockerClient(client_conf, self)
        self.containers.append(client)
        return client
```