Initial commit
kacperlukawski committed Jul 12, 2022
0 parents commit 14943b9
Showing 16 changed files with 1,125 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.idea/
__pycache__
*.pyc
137 changes: 137 additions & 0 deletions README.md
@@ -0,0 +1,137 @@
# vector-db-benchmark

There are various vector search engines available, and each of them may offer
a different set of features and a different level of efficiency. But how do we
measure the performance? There is no single definition: in a specific use case
you may care about one aspect while paying little attention to the others. This
project is a general framework for benchmarking different engines under the
same hardware constraints, so you can choose what works best for you.

Running any benchmark requires choosing an engine, a dataset and the scenario
against which it should be tested.

## TL;DR

```shell
python main.py \
    --engine qdrant-0.8.4 \
    --scenario scenario.load.MeasureLoadTimeSingleClient \
    --dataset random-100
```

This will execute the benchmark scenario implemented in the
`scenario.load.MeasureLoadTimeSingleClient` class and use the `random-100`
dataset. All the operations will be launched against the `qdrant-0.8.4` engine.

The expected output should look like the following:

```shell
mean(load::time) = 0.0015927800000000007
```

### Backend

A backend is a specific way of managing the containers. Right now only Docker
is supported, but it might be extended to Docker Swarm or Kubernetes, so the
benchmark would not be executed on a single machine, but across several servers.
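
For orientation, here is a sketch of how a scenario might drive the Docker
backend implemented in `benchmark/backend/docker.py` (part of this commit); the
exact flow used by real scenarios may differ:

```python
from benchmark.backend.docker import DockerBackend

# A rough sketch: start the engine's server, then a client that loads the data.
with DockerBackend(root_dir=".") as backend:
    server = backend.initialize_server(engine="qdrant-0.8.4")
    server.run()

    client = backend.initialize_client(engine="qdrant-0.8.4")
    client.run()

    # load_data returns a generator over the client's output (the metrics)
    for log_entry in client.load_data("vectors.jsonl"):
        print(log_entry)
```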

### Engine

There are various vector search projects available. Some of them are pure
libraries (like FAISS or Annoy) that offer great performance but do not fit
well into production systems. Those could also be benchmarked, however the
primary focus is on vector databases that use a client-server architecture.

All the engine configurations are kept in subdirectories of `./engine`.

Each engine has its own configuration defined in a `config.json` file:

```json
{
  "server": {
    "image": "qdrant/qdrant:v0.8.4",
    "hostname": "qdrant_server",
    "environment": {
      "DEBUG": true
    }
  },
  "client": {
    "dockerfile": "client.Dockerfile",
    "main": "python cmd.py"
  }
}
```

- Either `image` or `dockerfile` has to be defined, similarly to a
  `docker-compose.yaml` file. The `dockerfile` takes precedence over `image`
  (see the parsing sketch below).
- The `main` parameter points to the main client script, which takes parameters.
  Those parameters define the operations to perform with the client library.
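
For reference, the benchmark itself reads this file into the
`DockerContainerConf` dataclass defined in `benchmark/backend/docker.py` (part
of this commit); a small usage sketch:

```python
from benchmark.backend.docker import DockerContainerConf

# Parse the "server" section of an engine's config.json into a dataclass
server_conf = DockerContainerConf.from_file(
    "engine/qdrant-0.8.4/config.json",
    engine="qdrant-0.8.4",
    container="server",
)
print(server_conf.image)  # "qdrant/qdrant:v0.8.4"
```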

#### Server

The server is a process, or a group of processes, responsible for creating the
vector indexes and handling all the user requests. It may run on a single
machine or, for engines that support it, in a distributed mode (**in the future**).

#### Client

A client is a process performing all the operations, as would typically be done
in any client-server communication. There may be several clients launched in
parallel, each of them using a part of the data. The number of clients depends
on the scenario.

Each client has to define a main script which takes some parameters and allows
performing typical CRUD-like operations. For now there is only one operation
supported:

- `load [path-to-file]`

If the scenario needs to load the data from a given file, it will call the
following command:

`python cmd.py load vectors.jsonl`

The main script has to handle the conversion and load operations.

By introducing a main script, we allow different client libraries to be used,
if available, so there is no assumption about the language of the client, as
long as its main script can accept parameters.
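
As an illustration only, here is a minimal sketch of what such a `cmd.py` could
look like for a Python-based client; the `upload_vectors` helper is
hypothetical, and the real script is engine-specific:

```python
# cmd.py - hypothetical client entrypoint; real clients are engine-specific
import json
import sys
import time


def upload_vectors(vectors):
    """Placeholder for the engine-specific upload logic (hypothetical)."""


def load(path: str):
    # Convert the JSON Lines file into whatever format the engine expects
    with open(path, "r") as fp:
        vectors = [json.loads(line) for line in fp]
    started_at = time.perf_counter()
    upload_vectors(vectors)
    # Report the KPI in the "phase::kpi_name = value" format (see Metrics below)
    print(f"load::time = {time.perf_counter() - started_at}")


if __name__ == "__main__":
    operation, filename = sys.argv[1], sys.argv[2]
    if operation == "load":
        load(filename)
```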

### Dataset

A dataset consists of vectors and/or payloads. The scenario decides what to do
with the data.

## Metrics

Metrics are measured by the clients themselves and printed to stdout. The
benchmark collects all the metrics and displays some statistics at the end of
each test.

All the displayed metrics should be printed in the following way:

```shell
phase::kpi_name = 0.242142
```

Here `0.242142` is a numerical value specific to the `kpi_name`. In the
simplest case that might be the time spent in a specific operation, like:

```
load::time = 0.0052424
```
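
As an illustration, a minimal sketch of how such lines could be collected from
the client's stdout and aggregated; the `collect_metrics` function is an
assumption, not part of the repository:

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List

# Matches lines like "load::time = 0.0052424"
METRIC_LINE = re.compile(r"^(?P<phase>\w+)::(?P<kpi>\w+) = (?P<value>[-+.eE0-9]+)$")


def collect_metrics(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Group the reported values by their 'phase::kpi_name' key."""
    metrics = defaultdict(list)
    for line in lines:
        match = METRIC_LINE.match(line.strip())
        if match:
            key = f"{match.group('phase')}::{match.group('kpi')}"
            metrics[key].append(float(match.group("value")))
    return dict(metrics)


# Reproduces the "mean(load::time) = ..." summary shown in the TL;DR section
collected = collect_metrics(["load::time = 0.0052424", "load::time = 0.0048112"])
for key, values in collected.items():
    print(f"mean({key}) = {mean(values)}")
```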

## Open topics

1. The list of supported KPIs still needs to be established and implemented by
   every single engine, so they can be tracked in all the benchmark scenarios.
2. What format should the datasets support? JSON Lines is cross-language and
   cross-platform, which makes it easy to parse into whatever format a specific
   engine supports.
3. Should the scenario be tightly coupled with the dataset, or allow using
   different datasets? For simpler cases that may work, but there might be some
   specific problems that cannot be expressed with every dataset.
4. How do we handle engine errors?
5. A dataset should also be represented by a class instance (a possible sketch
   follows this list):
   - that will make it possible not to assume the filenames in a scenario
   - it will be easier to deal with paths

   The dataset should also have a file-based config, like the engines do.
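
Purely as a sketch of the idea from point 5; none of these names exist in the
repository yet:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Dataset:
    """Hypothetical file-backed dataset description, mirroring the engine config."""

    name: str
    root_dir: Path

    def data_path(self, filename: str) -> Path:
        # Resolve a data file without the scenario hard-coding any filenames
        return self.root_dir / "dataset" / self.name / filename
```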
Empty file added benchmark/__init__.py
Empty file.
70 changes: 70 additions & 0 deletions benchmark/backend/__init__.py
@@ -0,0 +1,70 @@
import abc
import tempfile
from pathlib import Path
from typing import Text, Union

PathLike = Union[Text, Path]


class Container(abc.ABC):
    """
    An abstraction over a container, which is a machine running either the
    server or the client of the engine.
    """

    def run(self):
        """
        Start the container using the backend.
        :return:
        """
        ...

    def is_ready(self) -> bool:
        """
        A healthcheck, making sure the container is properly set up.
        :return: True if ready to proceed, False otherwise
        """
        ...


class Server(Container, abc.ABC):
    pass


class Client(Container, abc.ABC):
    """
    An abstract client of the selected engine.
    """

    def load_data(self, filename: Text):
        """
        Load the data from the provided file into the selected search engine.
        This is an engine-specific operation.
        :param filename: a relative path from the dataset directory
        :return:
        """
        ...


class Backend:
    """
    A base class for all the possible benchmark backends.
    """

    def __init__(self, root_dir: PathLike):
        self.root_dir = root_dir if isinstance(root_dir, Path) else Path(root_dir)
        self.temp_dir = None

    def __enter__(self):
        self.temp_dir = tempfile.TemporaryDirectory()
        self.temp_dir.__enter__()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.temp_dir.__exit__(exc_type, exc_val, exc_tb)

    def initialize_server(self, engine: Text) -> Server:
        ...

    def initialize_client(self, engine: Text) -> Client:
        ...
169 changes: 169 additions & 0 deletions benchmark/backend/docker.py
@@ -0,0 +1,169 @@
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Text, Union, Optional, Dict, List

from benchmark.backend import Backend, PathLike, Server, Client, Container
from docker.models import containers

import logging
import docker


logger = logging.getLogger(__name__)


@dataclass
class DockerContainerConf:
    engine: Text
    image: Optional[Text] = None
    dockerfile: Optional[Text] = None
    environment: Optional[Dict[Text, Union[Text, int, bool]]] = None
    main: Optional[Text] = None
    hostname: Optional[Text] = None

    @classmethod
    def from_file(
        cls, path: Text, engine: Text, container: Text = "server"
    ) -> "DockerContainerConf":
        with open(path, "r") as fp:
            conf = json.load(fp)
            return DockerContainerConf(engine=engine, **conf[container])

    def dockerfile_path(self, root_dir: Path) -> Path:
        """
        Calculates the absolute path to the directory containing the Dockerfile,
        using the given root directory as a base.
        :param root_dir:
        :return:
        """
        return root_dir / "engine" / self.engine


class DockerContainer(Container):
    def __init__(
        self,
        container_conf: DockerContainerConf,
        docker_backend: "DockerBackend",
    ):
        self.container_conf = container_conf
        self.docker_backend = docker_backend
        self.container: containers.Container = None
        self.volumes = []

    def mount(self, source: PathLike, target: PathLike):
        self.volumes.append(f"{source}:{target}")

    def run(self):
        # Build the image from the Dockerfile, if one was provided instead of
        # an image name. This is typically done for the clients, as they may
        # require some custom setup.
        if self.container_conf.dockerfile is not None:
            dockerfile_path = self.container_conf.dockerfile_path(
                self.docker_backend.root_dir
            )
            image, logs = self.docker_backend.docker_client.images.build(
                path=str(dockerfile_path),
                dockerfile=self.container_conf.dockerfile,
            )
            self.container_conf.image = image.id
            logger.info(
                "Built %s into a Docker image %s",
                self.container_conf.dockerfile,
                image.id,
            )

        # Create the container using the image, either the configured one or
        # the one just built (the Dockerfile takes precedence over the image
        # name).
        logger.debug("Running a container using image %s", self.container_conf.image)
        self.container = self.docker_backend.docker_client.containers.run(
            self.container_conf.image,
            detach=True,
            volumes=self.volumes,
            environment=self.container_conf.environment,
            hostname=self.container_conf.hostname,
            network=self.docker_backend.network.name,
        )

        # TODO: remove the image on exit

    def logs(self):
        for log_entry in self.container.logs(stream=True, follow=True):
            yield log_entry

    def is_ready(self) -> bool:
        # TODO: implement the healthcheck
        return True


class DockerServer(Server, DockerContainer):
    pass


class DockerClient(Client, DockerContainer):
    def load_data(self, filename: Text):
        command = f"{self.container_conf.main} load {filename}"
        _, generator = self.container.exec_run(command, stream=True)
        return generator


class DockerBackend(Backend):
    """
    A Docker-based backend for the benchmarks, using separate containers for
    the server and the client(s).
    """

    NETWORK_NAME = "vector-benchmark"

    def __init__(
        self,
        root_dir: PathLike,
        docker_client: Optional[docker.DockerClient] = None,
    ):
        super().__init__(root_dir)
        if docker_client is None:
            docker_client = docker.from_env()
        self.docker_client = docker_client
        self.containers: List[DockerContainer] = []

    def __enter__(self):
        super().__enter__()
        self.network = self.docker_client.networks.create(self.NETWORK_NAME)
        # self.data_volume = self.docker_client.volumes.create()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        super().__exit__(exc_type, exc_val, exc_tb)

        # Kill all the containers on the context manager exit, so there are no
        # orphaned containers once the benchmark is finished
        for container in self.containers:
            container.container.kill()

        # Remove the data volume as well, so there won't be any volume left
        # self.data_volume.remove()

        # Finally, get rid of the network as well
        self.network.remove()

    def initialize_server(self, engine: Text) -> Server:
        server_conf = DockerContainerConf.from_file(
            self.root_dir / "engine" / engine / "config.json",
            engine=engine,
            container="server",
        )
        logger.info("Initializing %s server: %s", engine, server_conf)
        server = DockerServer(server_conf, self)
        self.containers.append(server)
        return server

    def initialize_client(self, engine: Text) -> Client:
        # TODO: Create a docker volume so the data is available on client instances
        client_conf = DockerContainerConf.from_file(
            self.root_dir / "engine" / engine / "config.json",
            engine=engine,
            container="client",
        )
        logger.info("Initializing %s client: %s", engine, client_conf)
        client = DockerClient(client_conf, self)
        self.containers.append(client)
        return client