This repository takes different C++ implementations of the convolution algorithm and provides fully automated benchmarks visualized as charts. It might also be used as a starting point or template for other C++ benchmark projects.
- Insights on how a classical convolution algorithm can be implemented and optimized: BenchmarkCharts.md
- A simple (6 lines of code), easy to read (no intrinsics), yet fast convolution implementation: inputPerKernelValueTransposed
- Fully automated (CLI) visualization of Catch2 benchmark results using vega-lite charts: chart/README.md
- GitHub Actions workflow for fully automated benchmarks on Linux, MacOS and Windows: continuous-integration.yml
- Renovate configuration to update CPM.cmake managed C++ dependencies: renovate.json
- A different approach to Convolution
- Visualize Catch2 benchmarks with Vega-Lite
- Keep your C++ dependencies up-to-date with Renovate & CPM
BenchmarkCharts.md contains the benchmark results visualized as bar charts.
- Needs CMake (e.g. with the Visual Studio Installer) to build the project.
- Needs node-canvas dependencies to create SVG vector graphics files.
- Needs Node.js to build the JavaScript-based charts.
- Recommends Ninja as "a small build system with a focus on speed".
- Uses CPM.cmake as a "Setup-free CMake dependency management" for C++.
- Uses Catch2 as unit test and benchmark framework for C++.
- Uses vega-lite to visualize the benchmark results as a bar chart.
- Uses GitHub Actions to fully automate build, benchmarks and visualization.
- Uses Renovate to update the dependencies automatically.
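To illustrate the vega-lite step, a minimal bar-chart specification might look like the following. The data values and field names here are made-up placeholders, not the repository's actual chart definition (see chart/README.md for that):

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "description": "Sketch: mean benchmark time per implementation (placeholder data).",
  "data": {
    "values": [
      {"implementation": "kernelInner", "meanNanoseconds": 52000},
      {"implementation": "kernelOuter", "meanNanoseconds": 17000}
    ]
  },
  "mark": "bar",
  "encoding": {
    "x": {"field": "implementation", "type": "nominal"},
    "y": {"field": "meanNanoseconds", "type": "quantitative", "title": "mean time (ns)"}
  }
}
```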
- Run script/run-all.sh or script\run-all.bat to build and test the project, run the benchmarks and create the charts.
- Run script/benchmark-with-charts.sh or script\benchmark-with-charts.bat to run the benchmarks and generate the charts without rebuilding the project.
- Further commands and a detailed description can be found in COMMANDS.md.
- chart/COMMANDS.md describes the commands to create the charts.
The following compiler options are used to get detailed vectorization reports. They are defined in CompilerOptions.cmake.
These compile options are used with Clang as described in Auto-Vectorization in LLVM and the Clang command line argument reference:
- `-Rpass=loop-vectorize` identifies loops that were successfully vectorized.
- `-Rpass-missed=loop-vectorize` identifies loops that failed vectorization and indicates if vectorization was specified.
- `-Rpass-analysis=loop-vectorize` identifies the statements that caused vectorization to fail. If in addition `-fsave-optimization-record` is provided, multiple causes of vectorization failure may be listed (this behavior might change in the future).
- `-fsave-optimization-record` generates a YAML optimization record file.
These compile options are used with GCC as described in GCC Developer Options:
- `-fopt-info-vec-optimized` prints information when an optimization from the vectorization passes is successful.
- `-fopt-info-vec-missed` prints information about missed optimization opportunities from the vectorization passes.
These compile options are used with MSVC as described in Auto-Vectorizer Reporting Level and Auto-Parallelization and Auto-Vectorization:
- `/Qvec-report:2` outputs an informational message for loops that are vectorized and for loops that are not vectorized, together with a reason code.
- `/arch:AVX` specifies the architecture for code generation (here AVX).
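Taken together, the per-compiler options above could be wired up in a single CMake file. This is only a sketch that assumes the flags are applied globally; the repository's actual CompilerOptions.cmake may differ:

```cmake
# Sketch: enable vectorization reports per compiler
# (option names taken from the lists above).
if (CMAKE_CXX_COMPILER_ID MATCHES "Clang")
    add_compile_options(
        -Rpass=loop-vectorize
        -Rpass-missed=loop-vectorize
        -Rpass-analysis=loop-vectorize
        -fsave-optimization-record)
elseif (CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
    add_compile_options(
        -fopt-info-vec-optimized
        -fopt-info-vec-missed)
elseif (MSVC)
    add_compile_options(/Qvec-report:2 /arch:AVX)
endif ()
```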
Godbolt Compiler Explorer is a great tool for analyzing small blocks of code and gaining insight into what the compiler does. It is ideal for trying things out. The following links show a simple yet fast convolution implementation that works well with compiler optimization and vectorization.
- Convolution Implementation "inputPerKernelValueTransposed" with Clang 13.0.0
- Convolution Implementation "inputPerKernelValueTransposed" with Clang 10.0.0 and AVX
- Convolution Implementation "inputPerKernelValueTransposed" with MSVC
Quick C++ Benchmark is another great tool; it compares the performance of two blocks of code against each other.
Copy and paste the following code snippet into Quick C++ Benchmark. It shows the difference between iterating over the kernel values in the inner loop and in the outer loop with a "direct form transposed" convolution implementation.
Code snippet comparing iteration over the kernel values in the inner vs. the outer loop
#include <vector>
#include <span>
#include <random>

template<typename T>
std::vector<T> randomNumbers(int n, T min, T max)
{
    std::vector<T> vectorWithRandomNumbers(n, T());
    std::random_device randomDevice;
    std::mt19937 gen(randomDevice());
    std::uniform_real_distribution<T> distribution(min, max);
    for (T &value : vectorWithRandomNumbers)
    {
        value = distribution(gen);
    }
    return vectorWithRandomNumbers;
}

template<typename ValueType>
void convolutionKernelInner(const std::span<const ValueType> &input, const std::span<const ValueType> &kernel, const std::span<ValueType> &output)
{
    // Iterates over the kernel values in the inner loop.
    // Both variants use int indices so that only the loop order differs.
    const auto inputLength = static_cast<int>(input.size());
    const auto kernelLength = static_cast<int>(kernel.size());
    for (auto inputIndex = 0; inputIndex < inputLength; ++inputIndex)
    {
        const auto inputValue = input[inputIndex];
        for (auto kernelIndex = 0; kernelIndex < kernelLength; ++kernelIndex)
        {
            output[inputIndex + kernelIndex] += inputValue * kernel[kernelIndex];
        }
    }
}

template<typename ValueType>
void convolutionKernelOuter(const std::span<const ValueType> &input, const std::span<const ValueType> &kernel, const std::span<ValueType> &output)
{
    // Iterates over the kernel values in the outer loop ("direct form transposed").
    const auto inputLength = static_cast<int>(input.size());
    const auto kernelLength = static_cast<int>(kernel.size());
    for (auto kernelIndex = 0; kernelIndex < kernelLength; ++kernelIndex)
    {
        const auto kernelValue = kernel[kernelIndex];
        for (auto inputIndex = 0; inputIndex < inputLength; ++inputIndex)
        {
            output[inputIndex + kernelIndex] += input[inputIndex] * kernelValue;
        }
    }
}

static void kernelInner(benchmark::State &state) {
    // Code before the loop is not measured
    const auto &input = randomNumbers(16384, -1.0F, 1.0F);
    const auto &kernel = randomNumbers(16, 0.0F, 1.0F);
    const auto convolutionLength = input.size() + kernel.size() - 1;
    auto output = std::vector<float>(convolutionLength, 0.0F);
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        convolutionKernelInner(std::span(input), std::span(kernel), std::span(output));
        // Make sure the variable is not optimized away by the compiler
        benchmark::DoNotOptimize(output);
    }
}
// Register the function as a benchmark
BENCHMARK(kernelInner);

static void kernelOuter(benchmark::State &state) {
    // Code before the loop is not measured
    const auto &input = randomNumbers(16384, -1.0F, 1.0F);
    const auto &kernel = randomNumbers(16, 0.0F, 1.0F);
    const auto convolutionLength = input.size() + kernel.size() - 1;
    auto output = std::vector<float>(convolutionLength, 0.0F);
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        convolutionKernelOuter(std::span(input), std::span(kernel), std::span(output));
        // Make sure the variable is not optimized away by the compiler
        benchmark::DoNotOptimize(output);
    }
}
// Register the function as a benchmark
BENCHMARK(kernelOuter);
This repository was initially intended to explore Single Instruction Multiple Data (SIMD) in C++. Since convolution is such an essential part of filtering in digital signal processing and a central part of convolutional neural networks (CNNs), it seemed obvious to try that first.
A complex topic like this benefits from an "experiment and evaluate" approach. Catch2 is used to ensure that the implementations are correct, and also to benchmark their performance. Compiler options for vectorization reports give insight into what the compiler does. Online tools are used for exploratory experiments with implementations. Finally, to get visual feedback, vega-lite is used to create bar charts of the benchmark results.
- Efficient FIR Filter Implementation with SIMD (Jan Wilczek)
- Vectorization part1. Intro. (Denis Bakhvalov)
- Microsoft® Visual Studio Compiler - Vectorizer and parallelizer messages
- Intel® Programming Guidelines for Vectorization
- FIR Structures
- Permission denied for build.sh file (StackTrace)
- FIRBenchmarks (GitHub)