This repository contains ESPBench, the enterprise streaming benchmark. It allows comparing data stream processing systems (DSPSs) and architectures in an enterprise context, i.e., in an environment where streaming data is integrated with existing, historical business data. Concretely, the repository provides the ESPBench toolkit, example configurations, and an example query implementation using Apache Beam.
For further details, see: Hesse, Guenter, et al. "ESPBench: The Enterprise Stream Processing Benchmark", ACM/SPEC International Conference on Performance Engineering (ICPE), 2021 and the ESPBench example implementation results.
1. ESPBench Architecture
2. ESPBench Process
3. Structure of the Project
4. ESPBench Setup and Execution
5. ESPBench Results
The overall architecture is visualized in the image below:
Input data is sent to Apache Kafka by the data sender tool, which is part of this repository. The DSPS runs the queries. It gets the data from Apache Kafka as input as well as from the enterprise database management system (DBMS) when required by the query. After the configured benchmark runtime has elapsed, the validator and result calculator, which are also part of this repository, can check the correctness of the query results and compute the benchmark results, i.e., the latencies.
A brief process overview of ESPBench is visualized in the activity diagram below:
The entire process is automated using Ansible scripts.
ESPBench
│ README.md
│ .gitignore
│ .gitlab-ci.yml
│ build.sbt
│ scalastyle-config.xml
│
└───ci
│ │ Dockerfile
│
└───implementation
│ └───beam
│ └───...
│
└───project
│ │ build.properties
│ │ Dependencies.scala
│
└───tools
└───commons
│ └───src
│ └───...
│ │ commons.conf
│
└───configuration
│ └───group_vars
│ └───plays
│ └───roles
│ └───...
│ │ ansible.cfg
│ │ hosts
│
└───datasender
│ └───src
│ └───...
│ │ datasender.conf
│
└───tpc-c_gen
│ └───src
│ └───...
│ │ tpc-c.properties
│
└───util
│ └───...
│
└───validator
└───...
Brief instructions follow below. If you are looking for a more detailed setup and execution description, please have a look at this file.
All steps are tested on Ubuntu servers.
- Create a user `benchmarker` on all involved machines that has sudo access (see the shell sketch after this list)
- Ensure that this user can connect to all machines via ssh without a password, e.g., by adding the ssh public key to the `authorized_keys` files
- Install `Apache Kafka` and the DSPSs to be tested under `/opt/` - you can find example configurations in the `tools/configuration` directory
- Make Apache Kafka a service:
  - Run `sudo apt install policykit-1`
  - Copy `tools/configuration/kafka/etc/init.d/kafka` to `/etc/init.d/` on the Apache Kafka servers
  - Run `update-rc.d kafka defaults`
- Install `PostgreSQL` on one server (you can use another DBMS; however, that requires some adaptations in the tools)
- Create the directory `Benchmarks` in the home directory of the user `benchmarker` and clone the repository into this folder
- Build the project using `sbt assembly`
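On a shell, the preparation steps above might look roughly like the following sketch. The host names (`node1`..`node3`) and the repository URL are placeholders; adapt them to your environment:

```sh
# On every involved machine: create the benchmarker user with sudo access.
sudo adduser benchmarker
sudo usermod -aG sudo benchmarker

# As benchmarker on the control machine: enable passwordless ssh to all hosts.
ssh-keygen -t ed25519                 # skip if a key already exists
for host in node1 node2 node3; do
  ssh-copy-id benchmarker@"$host"     # appends the public key to authorized_keys
done

# Clone the repository into ~/Benchmarks and build the project.
mkdir -p ~/Benchmarks && cd ~/Benchmarks
git clone <repository-url> ESPBench && cd ESPBench
sbt assembly
```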
The file `tools/commons/commons.conf` holds the central benchmark settings (a sketch follows after this list):
- Define the Apache Kafka topic prefix and the benchmark run number, both of which have an impact on the Apache Kafka topic names that are going to be created by the Ansible scripts
- Define the query you want to execute (the configuration for each query is given in a comment)
- Define the sending interval, which determines the pause between sending two records - a pause of, e.g., 1,000,000 ns results in an input rate of 1,000 records/s (10^9 ns per second divided by 10^6 ns per record)
- Define the benchmark duration
- Define the Apache Kafka bootstrap servers and ZooKeeper servers
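To illustrate the settings above, a minimal sketch of such a configuration. All key names below are assumptions made up for illustration; the shipped `tools/commons/commons.conf` documents the real keys and the per-query configurations:

```sh
# Hypothetical sketch of tools/commons/commons.conf -- the key names are
# assumptions; consult the shipped file for the actual keys.
cat <<'EOF' > tools/commons/commons.conf
kafkaTopicPrefix      = "espbench"     # prefix of the Kafka topics to be created
benchmarkRunNumber    = 1              # also becomes part of the topic names
query                 = "<query-name>" # pick a query (see the comments in the shipped file)
sendingIntervalNanos  = 1000000        # 1,000,000 ns pause => 1,000 records/s
benchmarkDurationMin  = 30             # duration of the benchmark run
kafkaBootstrapServers = "10.0.0.2:9092"
zookeeperServers      = "10.0.0.2:2181"
EOF
```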
- After cloning the repository and setting up the systems, change to the `tools/configuration` directory and start the benchmark from there (as shown in the activity diagram above), e.g., via `ansible-playbook -vvvv plays/benchmark-runner-beam.yml` for the example implementation of this repository. The number of `v`s defines the level of verbosity.
- Adapt the `group_vars/all` files if needed.
- The directories `plays` and `roles` contain several Ansible files, which can be adapted if needed. The starting point that represents the entire process is `plays/benchmark-runner-beam.yml` for the example implementation. These scripts also contain information about, e.g., how to start the data sender or the data generation.
- The `hosts` file needs to be edited, i.e., the servers' IP addresses need to be entered (see the inventory sketch after this list).
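The `hosts` file is a standard Ansible inventory. A sketch with assumed group names and example IP addresses; check the shipped `hosts` file for the group names the plays actually expect:

```sh
# Hypothetical inventory sketch -- the group names are assumptions; the
# shipped tools/configuration/hosts file shows the groups the plays expect.
cat <<'EOF' > tools/configuration/hosts
[kafka]
10.0.0.2

[dsps]
10.0.0.3

[database]
10.0.0.4
EOF
```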
- Input data is taken from DEBS 2012 Grand Challenge, which can be downloaded from ftp://ftp.mi.fu-berlin.de/pub/debs2012/
- This data file needs to be converted using the `dos2unix` command and duplicated, so that there are two input files.
- The two files need to be extended by a machine ID using the following commands (adapt the file names; see the combined shell sketch after this list):
  - First file: `awk 'BEGIN { FS = OFS = "\t" } { $(NF+1) = 1; print $0 }' input1.csv > output1.csv`
  - Second file: `awk 'BEGIN { FS = OFS = "\t" } { $(NF+1) = 2; print $0 }' input2.csv > output2.csv`
- A third input file is `production_times.csv`, which is generated by the TPC-C data generator that is part of this project.
- The file `datasender.conf` contains the Apache Kafka producer configs and the location of the input data files.
- The data sender's `src/main/resources/application.conf` needs the correct DBMS configuration.
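Taken together, the data preparation might look like the following sketch. The concrete file names are assumptions; adapt them to the file downloaded from the DEBS 2012 FTP directory:

```sh
# allData.csv stands for the file downloaded from the DEBS 2012 Grand
# Challenge FTP directory; adapt the name to the actual download.
dos2unix allData.csv                  # normalize DOS line endings
cp allData.csv input1.csv             # duplicate, so that there are two input files
cp allData.csv input2.csv

# Append the machine ID (1 and 2, respectively) as an additional column.
awk 'BEGIN { FS = OFS = "\t" } { $(NF+1) = 1; print $0 }' input1.csv > output1.csv
awk 'BEGIN { FS = OFS = "\t" } { $(NF+1) = 2; print $0 }' input2.csv > output2.csv
```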
The file `tpc-c.properties` contains the default settings with respect to the number of warehouses and the data output directory. Changing the output directory requires corresponding adaptations in the Ansible scripts.
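For illustration, a hypothetical `tpc-c.properties` along these lines; both key names are assumptions, the shipped file contains the real ones:

```sh
# Hypothetical sketch -- the key names are assumptions; see the shipped
# tools/tpc-c_gen/tpc-c.properties for the actual ones.
cat <<'EOF' > tools/tpc-c_gen/tpc-c.properties
warehouseCount=1
outputDirectory=/home/benchmarker/Benchmarks/data
EOF
```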
The validator's `src/main/resources/application.conf` also needs the correct DBMS configuration.
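Both the data sender and the validator read such an `application.conf`. A rough sketch of the DBMS part, assuming PostgreSQL; the key names are assumptions, only the JDBC URL format is standard:

```sh
# Hypothetical sketch -- the key names are assumptions; check the shipped
# src/main/resources/application.conf for the actual structure.
cat <<'EOF' > src/main/resources/application.conf
database {
  jdbcUrl  = "jdbc:postgresql://10.0.0.4:5432/espbench"
  user     = "benchmarker"
  password = "changeme"
}
EOF
```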
If you want to use the example implementation, you need to adapt at least the following two files:
- `implementation/beam/src/main/resources/beam.properties`
- The `beamRunner` variable in `build.sbt`
The validator will create a `logs` directory that contains information about, e.g., the query result correctness and the latencies.

