The benchmarks in this repository don't aim to cover every topic entirely, but they help form a mindset and intuition for performance-oriented software design. They also provide an example of using some non-STL but de facto standard libraries in C++, importing them via CMake, and compiling from source. For higher-level abstractions and languages, check out `less_slow.rs` and `less_slow.py`.
Much modern code suffers from common pitfalls, such as bugs, security vulnerabilities, and performance bottlenecks. University curricula often teach outdated concepts, while bootcamps oversimplify crucial software development principles.
This repository offers practical examples of writing efficient C and C++ code. It leverages C++20 features and is designed primarily for GCC and Clang compilers on Linux, though it may work on other platforms. The topics range from basic micro-kernels executing in a few nanoseconds to more complex constructs involving parallel algorithms, coroutines, and polymorphism. Some of the highlights include:
- 100x cheaper random inputs?! Discover how input generation sometimes costs more than the algorithm.
- 40x faster trigonometry: Speed up standard library functions like `std::sin` in just 3 lines of code.
- 4x faster lazy-logic with custom `std::ranges` and iterators!
- Compiler optimizations beyond `-O3`: Learn about less obvious flags and techniques for another 2x speedup.
- Multiplying matrices? Check how a 3x3x3 GEMM can be 70% slower than 4x4x4, despite 60% fewer ops.
- Scaling AI? Measure the gap between theoretical ALU throughput and your BLAS.
- How many if conditions are too many? Test your CPU's branch predictor with just 10 lines of code.
- Prefer recursion to iteration? Measure the depth at which your algorithm will `SEGFAULT`.
- How to choose between exceptions, `std::error_code`, and `std::variant`-like wrappers?
- Scaling to many cores? Learn how to use OpenMP, Intel's oneTBB, or your custom thread pool.
- How to handle JSON while avoiding memory allocations? Is it easier with C++20 or old-school C99 tools?
- How to properly use STL's associative containers with custom keys and transparent comparators?
- How to beat a hand-written parser with `consteval` RegEx engines?
- Is the pointer size really 64 bits and how to exploit pointer-tagging?
- How many packets is UDP dropping and how to serve web requests in `io_uring` from user-space?
- Scatter and Gather for 50% faster vectorized disjoint memory operations.
- How to choose between intrinsics, inline Assembly, and separate Assembly files for your performance-critical code?
- What are Encrypted Enclaves and what's the latency of Intel SGX, AMD SEV, and ARM Realm? 🔜
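To give a flavor of the trigonometry item above, here is a minimal sketch of one common way to trade accuracy for speed: a truncated Maclaurin series. The function name `sin_maclaurin` is hypothetical and this is not necessarily the code used in `less_slow.cpp`. The approximation is only reasonable for small `|x|`, and any speedup depends on your compiler and hardware.

```cpp
#include <cmath>

// Truncated Maclaurin series: sin(x) ~ x - x^3/3! + x^5/5!
// Illustrative only: the error grows quickly as |x| moves away from zero.
inline double sin_maclaurin(double x) {
    double const x3 = x * x * x;  // x^3
    double const x5 = x3 * x * x; // x^5
    return x - x3 / 6.0 + x5 / 120.0;
}
```

Comparing `sin_maclaurin(0.5)` against `std::sin(0.5)` is a quick way to see how many digits survive before you decide whether the trade-off suits your workload.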
To read, jump to the `less_slow.cpp` source file and read the code snippets and comments.
Follow the instructions below to run the code in your environment and compare it to the comments as you read through the source.
- If you are on Windows, it's recommended that you set up a Linux environment using WSL.
- If you are on macOS, consider using the non-native distribution of Clang from Homebrew or MacPorts.
- If you are on Linux, make sure to install CMake and a recent version of GCC or Clang compilers to support C++20 features.
If you are familiar with C++ and want to review code and measurements as you read, you can clone the repository and execute the following commands.
git clone https://github.com/ashvardanian/less_slow.cpp.git # Clone the repository
cd less_slow.cpp # Change the directory
cmake -B build_release -D CMAKE_BUILD_TYPE=Release # Generate the build files
cmake --build build_release --config Release # Build the project
build_release/less_slow # Run the benchmarks
The build will pull several third-party dependencies and compile them from source:
- Google's Benchmark is used for profiling.
- Intel's oneTBB is used as the Parallel STL backend.
- Meta's libunifex is used for senders & executors.
- Eric Niebler's range-v3 replaces `std::ranges`.
- Victor Zverovich's fmt replaces `std::format`.
- Ash Vardanian's StringZilla replaces `std::string`.
- Hana Dusíková's CTRE replaces `std::regex`.
- Niels Lohmann's json is used for JSON deserialization.
- Yaoyuan Guo's yyjson is used for faster JSON processing.
- Google's Abseil replaces STL's associative containers.
- Lewis Baker's cppcoro implements C++20 coroutines.
To control the output or run specific benchmarks, use the following flags:
build_release/less_slow --benchmark_format=json # Output in JSON format
build_release/less_slow --benchmark_out=results.json # Save the results to a file instead of `stdout`
build_release/less_slow --benchmark_filter=std_sort # Run only benchmarks containing `std_sort` in their name
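The filter is a regular expression rather than a plain substring. The sketch below approximates the selection rule, assuming the pattern is searched anywhere in the benchmark name. The helper `matches_filter` and the benchmark names used with it are hypothetical, for illustration only.

```cpp
#include <regex>
#include <string>

// Approximates how a `--benchmark_filter` pattern selects benchmarks:
// a benchmark is kept if the regular expression matches any part of its name.
inline bool matches_filter(std::string const &benchmark_name, std::string const &filter) {
    return std::regex_search(benchmark_name, std::regex(filter));
}
```

Under that assumption, `matches_filter("std_sort/1024", "std_sort")` is true, and so is `matches_filter("std_sort/1024", "sort|sin")`, which makes alternations a handy way to run a few related benchmarks at once.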
To enhance stability and reproducibility, disable Simultaneous Multi-Threading (SMT) on your CPU and use the `--benchmark_enable_random_interleaving=true` flag, which shuffles and interleaves benchmarks as described in the Google Benchmark documentation.
build_release/less_slow --benchmark_enable_random_interleaving=true
Google Benchmark supports User-Requested Performance Counters through `libpfm`. Note that collecting these may require `sudo` privileges.
sudo build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_format=json --benchmark_perf_counters="CYCLES,INSTRUCTIONS"
Alternatively, use the Linux `perf` tool for performance counter collection:
sudo perf stat taskset 0xEFFFEFFFEFFFEFFFEFFFEFFFEFFFEFFF build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_filter=super_sort
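The hexadecimal mask passed to `taskset` above sets one bit per allowed CPU, with the least significant bit standing for CPU 0. As a rough sketch, the hypothetical helper `excluded_cpus` below decodes such a mask and lists the CPUs it leaves out. It assumes the digits are passed without the `0x` prefix and are all valid hexadecimal characters.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the indices of CPUs whose bit is cleared in a hexadecimal affinity mask.
// The rightmost hex digit covers CPUs 0-3, the next one CPUs 4-7, and so on.
// Assumes `hex_mask` holds only valid hex digits and no "0x" prefix.
inline std::vector<std::size_t> excluded_cpus(std::string const &hex_mask) {
    std::vector<std::size_t> excluded;
    std::size_t const digits = hex_mask.size();
    for (std::size_t i = 0; i < digits; ++i) {
        char const c = hex_mask[digits - 1 - i]; // walk from the least significant digit
        unsigned nibble = 0;
        if (c >= '0' && c <= '9') nibble = static_cast<unsigned>(c - '0');
        else if (c >= 'a' && c <= 'f') nibble = static_cast<unsigned>(c - 'a' + 10);
        else nibble = static_cast<unsigned>(c - 'A' + 10);
        for (std::size_t bit = 0; bit < 4; ++bit)
            if (!(nibble & (1u << bit))) excluded.push_back(i * 4 + bit);
    }
    return excluded;
}
```

Each `EFFF` group clears bit 12 of its 16-bit slice, so a mask built from repeated `EFFF` groups keeps the benchmark off every 16th CPU starting with CPU 12. Adjust the mask to match the core layout of your own machine.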
Educational content without memes?! Come on!