Releases: ashvardanian/less_slow.cpp
Release v0.5.1
Release v0.5.0
Release v0.4.2
v0.4: Allocators and Sparse Graphs 🕸️
This release explores high-performance graph implementations optimized for different access patterns, focusing on real-world use cases like recommendation systems and social networks. To implement those, various STL and Abseil-based containers are used to implement sparse Graph structures.
It shows:
- How to use STL's polymorphic allocators?
- How do we construct a hybrid nested container that propagates stateful allocators to the inner structures?
- Up to 300x performance difference between good implementations in different workload patterns!
Of other neat tricks, shows:
- How can the three-way comparison operator be used with
std::tie
? - What's the difference between
std::weak_ordering
and the strong one ? - Where can the
[[no_unique_address]]
attribute be used?
Implementation
It extends the Graph API to:
upsert_edge(from, to, weight)
: Inserts or updates an existing edge between two vertices.get_edge(from, to)
: Retrieves thestd::optional
weight of the edge between two vertices.remove_edge(from, to)
: If present, remove the edge between two vertices.for_edges(from, visitor)
: Applies a callback to all edges starting from a vertex.size()
: Returns the graph's number of vertices and edges.reserve(capacity)
: Reserves memory for the given number of vertices.compact()
: Compacts the memory layout of the graph, preparing for read-intensive workloads.
Results
On Intel Sapphire Rapids CPUs in AWS c7i
instances:
------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------
graph_make<std::unordered_maps>/min_time:10.000 57329425 ns 57327619 ns 245
graph_make<std::map>/min_time:10.000 109704078 ns 109697310 ns 100
graph_make<absl::flat_set>/min_time:10.000 80598043 ns 80595813 ns 174
graph_rank<std::unordered_maps>/min_time:10.000 35763632 ns 35762406 ns 392
graph_rank<std::map>/min_time:10.000 51658552 ns 51657290 ns 271
graph_rank<absl::flat_set>/min_time:10.000 236938 ns 236933 ns 59137
On AWS Graviton 4 CPUs in AWS r8g
instances:
------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------
graph_make<std::unordered_maps>/min_time:10.000 163664945 ns 163660572 ns 86
graph_make<std::map>/min_time:10.000 382543113 ns 382534380 ns 45
graph_make<absl::flat_set>/min_time:10.000 213277341 ns 213272284 ns 64
graph_rank<std::unordered_maps>/min_time:10.000 59579530 ns 59578435 ns 240
graph_rank<std::map>/min_time:10.000 69177429 ns 69175965 ns 191
graph_rank<absl::flat_set>/min_time:10.000 186428 ns 186430 ns 74929
v0.4: Pointer tagging 🏷️
This release provides a prototype of a memory-tagging allocator arena, documenting the complexity of using memory tagging techniques even on Linux:
- Intel's Linear Address Masking
- AMD's Upper Address Ignore
- ARM's Top Byte Ignore
- ARM's Memory Tagging Extension
Minor
Patch
v0.3: Gather 🔄 Scatter
This release introduces benchmarks for gather & scatter SIMD rarely-used instructions that can be used to accelerate lookups by ~30% on current x86 and Arm machines.
- Serial
- AVX-512 for x86
- SVE for Arm
Minor
Patch
v0.2: Pushing FLOPS in Assembly 🏋️♂️
Release: v0.2.0 [skip ci]
Minor
- Add: Latency Hiding & Port Interleaving (086f8d7)
- Add: AMX kernels (0cb024d)
- Add: Inline Assembly kernels (89095a6)
- Add: BLAS & Eigen TOPs benchmarks (28ca39b)
- Add: AVX2 & low-precision AVX-512 TOPS (0a48108)
- Add:
i8
,f16
, andbf16
kernels (3f54200) - Add: Arm NEON FMAs (d0e521e)
- Add:
vfmadd231ps
kernels (7ca3161) - Add: Assembly micro-kernels (2e71e76)
Patch
- Docs: Zen4 matmul-benchmarks (2476310)
- Docs: H100 Tensor Cores vs Intel (fa86663)
- Fix:
Illegal instruction
for AMX (a7243dd) - Fix: Duplicate
.global
symbols (c732234) - Docs: Recommended Eigen macros (7be2d58)
- Fix: Missing
tops_u8_neon
(d97bbfc) - Fix: Missing
tops_f64_neon
(4afa7e3) - Improve: Shorter TOPS names (be0c94b)