Initial commit
kacperlukawski committed Jul 12, 2022
0 parents commit 14943b9
Showing 16 changed files with 1,125 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.idea/
__pycache__
*.pyc
137 changes: 137 additions & 0 deletions README.md
@@ -0,0 +1,137 @@
# vector-db-benchmark

There are various vector search engines available, and each of them may offer
a different set of features and a different level of efficiency. But how do we
measure the performance? There is no single definition: in a specific use case
you may care about one aspect while paying little attention to the others. This
project is a general framework for benchmarking different engines under the
same hardware constraints, so you can choose what works best for you.

Running any benchmark requires choosing an engine, a dataset and the scenario
against which it should be tested.

## TL;DR

```shell
python main.py \
    --engine qdrant-0.8.4 \
    --scenario scenario.load.MeasureLoadTimeSingleClient \
    --dataset random-100
```

This will execute the benchmark scenario implemented in the
`scenario.load.MeasureLoadTimeSingleClient` class and use the `random-100`
dataset. All the operations will be launched against the `qdrant-0.8.4` engine.

The expected output should look like the following:

```shell
mean(load::time) = 0.0015927800000000007
```

### Backend

A backend is a specific way of managing the containers. Right now only Docker
is supported, but it might be extended to Docker Swarm or Kubernetes, so the
benchmark would not be executed on a single machine, but across several servers.
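
For orientation, here is a sketch of how a scenario might drive the Docker
backend implemented in `benchmark/backend/docker.py` (part of this commit); the
exact flow used by real scenarios may differ:

```python
from benchmark.backend.docker import DockerBackend

# A rough sketch: start the engine's server, then a client that loads the data.
with DockerBackend(root_dir=".") as backend:
    server = backend.initialize_server(engine="qdrant-0.8.4")
    server.run()

    client = backend.initialize_client(engine="qdrant-0.8.4")
    client.run()

    # load_data returns a generator over the client's output (the metrics)
    for log_entry in client.load_data("vectors.jsonl"):
        print(log_entry)
```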

### Engine

There are various vector search projects available. Some of them are pure
libraries (like FAISS or Annoy) that offer great performance but do not fit
well into production systems. Those could also be benchmarked, however the
primary focus is on vector databases that use a client-server architecture.

All the engine configurations are kept in subdirectories of `./engine`.

Each engine has its own configuration defined in a `config.json` file:

```json
{
  "server": {
    "image": "qdrant/qdrant:v0.8.4",
    "hostname": "qdrant_server",
    "environment": {
      "DEBUG": true
    }
  },
  "client": {
    "dockerfile": "client.Dockerfile",
    "main": "python cmd.py"
  }
}
```

- Either `image` or `dockerfile` has to be defined, similarly to a
  `docker-compose.yaml` file. The `dockerfile` takes precedence over `image`
  (see the parsing sketch below).
- The `main` parameter points to the main client script, which takes parameters.
  Those parameters define the operations to perform with the client library.
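
For reference, the benchmark itself reads this file into the
`DockerContainerConf` dataclass defined in `benchmark/backend/docker.py` (part
of this commit); a small usage sketch:

```python
from benchmark.backend.docker import DockerContainerConf

# Parse the "server" section of an engine's config.json into a dataclass
server_conf = DockerContainerConf.from_file(
    "engine/qdrant-0.8.4/config.json",
    engine="qdrant-0.8.4",
    container="server",
)
print(server_conf.image)  # "qdrant/qdrant:v0.8.4"
```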

#### Server

The server is a process, or a group of processes, responsible for creating the
vector indexes and handling all the user requests. It may run on a single
machine or, for engines that support it, in a distributed mode (**in the future**).

#### Client

A client is a process performing all the operations, as would typically be done
in any client-server communication. There may be several clients launched in
parallel, each of them using a part of the data. The number of clients depends
on the scenario.

Each client has to define a main script which takes some parameters and allows
performing typical CRUD-like operations. For now there is only one operation
supported:

- `load [path-to-file]`

If the scenario needs to load the data from a given file, it will call the
following command:

`python cmd.py load vectors.jsonl`

The main script has to handle the conversion and load operations.

By introducing a main script, we allow different client libraries to be used,
if available, so there is no assumption about the language of the client, as
long as its main script can accept parameters.
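
As an illustration only, here is a minimal sketch of what such a `cmd.py` could
look like for a Python-based client; the `upload_vectors` helper is
hypothetical, and the real script is engine-specific:

```python
# cmd.py - hypothetical client entrypoint; real clients are engine-specific
import json
import sys
import time


def upload_vectors(vectors):
    """Placeholder for the engine-specific upload logic (hypothetical)."""


def load(path: str):
    # Convert the JSON Lines file into whatever format the engine expects
    with open(path, "r") as fp:
        vectors = [json.loads(line) for line in fp]
    started_at = time.perf_counter()
    upload_vectors(vectors)
    # Report the KPI in the "phase::kpi_name = value" format (see Metrics below)
    print(f"load::time = {time.perf_counter() - started_at}")


if __name__ == "__main__":
    operation, filename = sys.argv[1], sys.argv[2]
    if operation == "load":
        load(filename)
```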

### Dataset

A dataset consists of vectors and/or payloads. The scenario decides what to do
with the data.

## Metrics

Metrics are measured by the clients themselves and printed to stdout. The
benchmark collects all the metrics and displays some statistics at the end of
each test.

All the displayed metrics should be printed in the following way:

```shell
phase::kpi_name = 0.242142
```

Here `0.242142` is a numerical value specific to the `kpi_name`. In the
simplest case that might be the time spent in a specific operation, like:

```
load::time = 0.0052424
```
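
As an illustration, a minimal sketch of how such lines could be collected from
the client's stdout and aggregated; the `collect_metrics` function is an
assumption, not part of the repository:

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List

# Matches lines like "load::time = 0.0052424"
METRIC_LINE = re.compile(r"^(?P<phase>\w+)::(?P<kpi>\w+) = (?P<value>[-+.eE0-9]+)$")


def collect_metrics(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Group the reported values by their 'phase::kpi_name' key."""
    metrics = defaultdict(list)
    for line in lines:
        match = METRIC_LINE.match(line.strip())
        if match:
            key = f"{match.group('phase')}::{match.group('kpi')}"
            metrics[key].append(float(match.group("value")))
    return dict(metrics)


# Reproduces the "mean(load::time) = ..." summary shown in the TL;DR section
collected = collect_metrics(["load::time = 0.0052424", "load::time = 0.0048112"])
for key, values in collected.items():
    print(f"mean({key}) = {mean(values)}")
```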

## Open topics

1. The list of supported KPIs still needs to be established and implemented by
   every single engine, so they can be tracked in all the benchmark scenarios.
2. What format should the datasets support? JSON Lines is cross-language and
   cross-platform, which makes it easy to parse into whatever format a specific
   engine supports.
3. Should the scenario be tightly coupled with the dataset, or allow using
   different datasets? For simpler cases that may work, but there might be some
   specific problems that cannot be expressed with every dataset.
4. How do we handle engine errors?
5. A dataset should also be represented by a class instance (a possible sketch
   follows this list):
   - that will make it possible not to assume the filenames in a scenario
   - it will be easier to deal with paths

   The dataset should also have a file-based config, like the engines do.
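
Purely as a sketch of the idea from point 5; none of these names exist in the
repository yet:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Dataset:
    """Hypothetical file-backed dataset description, mirroring the engine config."""

    name: str
    root_dir: Path

    def data_path(self, filename: str) -> Path:
        # Resolve a data file without the scenario hard-coding any filenames
        return self.root_dir / "dataset" / self.name / filename
```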
Empty file added benchmark/__init__.py
Empty file.
70 changes: 70 additions & 0 deletions benchmark/backend/__init__.py
@@ -0,0 +1,70 @@
import abc
import tempfile
from pathlib import Path
from typing import Text, Union

PathLike = Union[Text, Path]


class Container(abc.ABC):
    """
    An abstraction over a container, which is a machine running either the
    server or the client of the engine.
    """

    def run(self):
        """
        Start the container using the backend.
        :return:
        """
        ...

    def is_ready(self) -> bool:
        """
        A healthcheck, making sure the container is properly set up.
        :return: True if ready to proceed, False otherwise
        """
        ...


class Server(Container, abc.ABC):
    pass


class Client(Container, abc.ABC):
    """
    An abstract client of the selected engine.
    """

    def load_data(self, filename: Text):
        """
        Load the data from the provided file into the selected search engine.
        This is an engine-specific operation.
        :param filename: a relative path from the dataset directory
        :return:
        """
        ...


class Backend:
    """
    A base class for all the possible benchmark backends.
    """

    def __init__(self, root_dir: PathLike):
        self.root_dir = root_dir if isinstance(root_dir, Path) else Path(root_dir)
        self.temp_dir = None

    def __enter__(self):
        self.temp_dir = tempfile.TemporaryDirectory()
        self.temp_dir.__enter__()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.temp_dir.__exit__(exc_type, exc_val, exc_tb)

    def initialize_server(self, engine: Text) -> Server:
        ...

    def initialize_client(self, engine: Text) -> Client:
        ...
169 changes: 169 additions & 0 deletions benchmark/backend/docker.py
@@ -0,0 +1,169 @@
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Text, Union, Optional, Dict, List

from benchmark.backend import Backend, PathLike, Server, Client, Container
from docker.models import containers

import logging
import docker


logger = logging.getLogger(__name__)


@dataclass
class DockerContainerConf:
    engine: Text
    image: Optional[Text] = None
    dockerfile: Optional[Text] = None
    environment: Optional[Dict[Text, Union[Text, int, bool]]] = None
    main: Optional[Text] = None
    hostname: Optional[Text] = None

    @classmethod
    def from_file(
        cls, path: Text, engine: Text, container: Text = "server"
    ) -> "DockerContainerConf":
        with open(path, "r") as fp:
            conf = json.load(fp)
            return DockerContainerConf(engine=engine, **conf[container])

    def dockerfile_path(self, root_dir: Path) -> Path:
        """
        Calculates the absolute path to the directory containing the Dockerfile,
        using the given root directory as a base.
        :param root_dir:
        :return:
        """
        return root_dir / "engine" / self.engine


class DockerContainer(Container):
    def __init__(
        self,
        container_conf: DockerContainerConf,
        docker_backend: "DockerBackend",
    ):
        self.container_conf = container_conf
        self.docker_backend = docker_backend
        self.container: containers.Container = None
        self.volumes = []

    def mount(self, source: PathLike, target: PathLike):
        self.volumes.append(f"{source}:{target}")

    def run(self):
        # Build the image from the Dockerfile, if one was provided instead of
        # an image name. This is typically done for the clients, as they may
        # require some custom setup.
        if self.container_conf.dockerfile is not None:
            dockerfile_path = self.container_conf.dockerfile_path(
                self.docker_backend.root_dir
            )
            image, logs = self.docker_backend.docker_client.images.build(
                path=str(dockerfile_path),
                dockerfile=self.container_conf.dockerfile,
            )
            self.container_conf.image = image.id
            logger.info(
                "Built %s into a Docker image %s",
                self.container_conf.dockerfile,
                image.id,
            )

        # Create the container using the image, either the configured one or
        # the one just built (the Dockerfile takes precedence over the image
        # name).
        logger.debug("Running a container using image %s", self.container_conf.image)
        self.container = self.docker_backend.docker_client.containers.run(
            self.container_conf.image,
            detach=True,
            volumes=self.volumes,
            environment=self.container_conf.environment,
            hostname=self.container_conf.hostname,
            network=self.docker_backend.network.name,
        )

        # TODO: remove the image on exit

    def logs(self):
        for log_entry in self.container.logs(stream=True, follow=True):
            yield log_entry

    def is_ready(self) -> bool:
        # TODO: implement the healthcheck
        return True


class DockerServer(Server, DockerContainer):
    pass


class DockerClient(Client, DockerContainer):
    def load_data(self, filename: Text):
        command = f"{self.container_conf.main} load {filename}"
        _, generator = self.container.exec_run(command, stream=True)
        return generator


class DockerBackend(Backend):
    """
    A Docker-based backend for the benchmarks, using separate containers for
    the server and the client(s).
    """

    NETWORK_NAME = "vector-benchmark"

    def __init__(
        self,
        root_dir: PathLike,
        docker_client: Optional[docker.DockerClient] = None,
    ):
        super().__init__(root_dir)
        if docker_client is None:
            docker_client = docker.from_env()
        self.docker_client = docker_client
        self.containers: List[DockerContainer] = []

    def __enter__(self):
        super().__enter__()
        self.network = self.docker_client.networks.create(self.NETWORK_NAME)
        # self.data_volume = self.docker_client.volumes.create()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        super().__exit__(exc_type, exc_val, exc_tb)

        # Kill all the containers on the context manager exit, so there are no
        # orphaned containers once the benchmark is finished
        for container in self.containers:
            container.container.kill()

        # Remove the data volume as well, so there won't be any volume left
        # self.data_volume.remove()

        # Finally, get rid of the network as well
        self.network.remove()

    def initialize_server(self, engine: Text) -> Server:
        server_conf = DockerContainerConf.from_file(
            self.root_dir / "engine" / engine / "config.json",
            engine=engine,
            container="server",
        )
        logger.info("Initializing %s server: %s", engine, server_conf)
        server = DockerServer(server_conf, self)
        self.containers.append(server)
        return server

    def initialize_client(self, engine: Text) -> Client:
        # TODO: Create a docker volume so the data is available on client instances
        client_conf = DockerContainerConf.from_file(
            self.root_dir / "engine" / engine / "config.json",
            engine=engine,
            container="client",
        )
        logger.info("Initializing %s client: %s", engine, client_conf)
        client = DockerClient(client_conf, self)
        self.containers.append(client)
        return client