Releases · ashvardanian/less_slow.cpp

This release explores high-performance graph implementations optimized for different access patterns, focusing on real-world use cases like recommendation systems and social networks. To implement those, various STL and Abseil-based containers are used to implement sparse Graph structures.
It shows:

How to use STL's polymorphic allocators?
How do we construct a hybrid nested container that propagates stateful allocators to the inner structures?
Up to 300x performance difference between good implementations in different workload patterns!

Of other neat tricks, shows:

How can the three-way comparison operator be used with std::tie?
What's the difference between std::weak_ordering and the strong one ?
Where can the [[no_unique_address]] attribute be used?

Implementation

It extends the Graph API to:

upsert_edge(from, to, weight): Inserts or updates an existing edge between two vertices.
get_edge(from, to): Retrieves the std::optional weight of the edge between two vertices.
remove_edge(from, to): If present, remove the edge between two vertices.
for_edges(from, visitor): Applies a callback to all edges starting from a vertex.
size(): Returns the graph's number of vertices and edges.
reserve(capacity): Reserves memory for the given number of vertices.
compact(): Compacts the memory layout of the graph, preparing for read-intensive workloads.

Results

On Intel Sapphire Rapids CPUs in AWS c7i instances:

------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
graph_make<std::unordered_maps>/min_time:10.000   57329425 ns     57327619 ns          245
graph_make<std::map>/min_time:10.000             109704078 ns    109697310 ns          100
graph_make<absl::flat_set>/min_time:10.000        80598043 ns     80595813 ns          174
graph_rank<std::unordered_maps>/min_time:10.000   35763632 ns     35762406 ns          392
graph_rank<std::map>/min_time:10.000              51658552 ns     51657290 ns          271
graph_rank<absl::flat_set>/min_time:10.000          236938 ns       236933 ns        59137

On AWS Graviton 4 CPUs in AWS r8g instances:

------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
graph_make<std::unordered_maps>/min_time:10.000  163664945 ns    163660572 ns           86
graph_make<std::map>/min_time:10.000             382543113 ns    382534380 ns           45
graph_make<absl::flat_set>/min_time:10.000       213277341 ns    213272284 ns           64
graph_rank<std::unordered_maps>/min_time:10.000   59579530 ns     59578435 ns          240
graph_rank<std::map>/min_time:10.000              69177429 ns     69175965 ns          191
graph_rank<absl::flat_set>/min_time:10.000          186428 ns       186430 ns        74929

Assets 2

20 Jan 09:59

ashvardanian

v0.4.0

f5786ae

v0.4: Pointer tagging 🏷️

This release provides a prototype of a memory-tagging allocator arena, documenting the complexity of using memory tagging techniques even on Linux:

Intel's Linear Address Masking
AMD's Upper Address Ignore
ARM's Top Byte Ignore
ARM's Memory Tagging Extension

Minor

Add: Intel's Linear Address Masking (ead0cd2)
Add: Pointer tagging draft (72dfc31)

Patch

Docs: Arm & AMD pointer tagging (3a4b1d7)
Fix: Avoid tagging if Arm or LA57 (5ac64d1)
Improve: Log mean allocation size (83ba346)

Assets 2

20 Jan 09:58

ashvardanian

v0.3.0

431d5f8

v0.3: Gather 🔄 Scatter

This release introduces benchmarks for gather & scatter SIMD rarely-used instructions that can be used to accelerate lookups by ~30% on current x86 and Arm machines.

Serial
AVX-512 for x86
SVE for Arm

Minor

Add: SVE gather/scatter (107b359)
Add: Serial & AVX-512 scatter/gather (089cfa0)

Patch

Improve: Timing SVE (daa55f5)
Improve: Stabilize gather timings (3fca991)

Assets 2

20 Jan 09:57

ashvardanian

v0.2.0

2d97782

v0.2: Pushing FLOPS in Assembly 🏋️‍♂️

Release: v0.2.0 [skip ci]

Minor

Add: Latency Hiding & Port Interleaving (086f8d7)
Add: AMX kernels (0cb024d)
Add: Inline Assembly kernels (89095a6)
Add: BLAS & Eigen TOPs benchmarks (28ca39b)
Add: AVX2 & low-precision AVX-512 TOPS (0a48108)
Add: i8, f16, and bf16 kernels (3f54200)
Add: Arm NEON FMAs (d0e521e)
Add: vfmadd231ps kernels (7ca3161)
Add: Assembly micro-kernels (2e71e76)

Patch

Docs: Zen4 matmul-benchmarks (2476310)
Docs: H100 Tensor Cores vs Intel (fa86663)
Fix: Illegal instruction for AMX (a7243dd)
Fix: Duplicate .global symbols (c732234)
Docs: Recommended Eigen macros (7be2d58)
Fix: Missing tops_u8_neon (d97bbfc)
Fix: Missing tops_f64_neon (4afa7e3)
Improve: Shorter TOPS names (be0c94b)

Assets 2

17 Jan 20:08

ashvardanian

v0.1.1

a4f7cda

Release v0.1.1

Release: v0.1.1 [skip ci]

Patch

Docs: Renaming small_string benchmark (1ad50d1)
Improve: Validate sorting result (77808a6)
Improve: Log bytes/sec for trigonometry (ebc67d6)
Docs: Placeholders (84e7710)
Improve: BENCHMARK_CAPTURE w/out template (d8fb261)

Assets 2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Patch

Minor

Patch

Patch

Implementation

Results

Minor

Patch

Minor

Patch

Minor

Patch

Patch

Releases: ashvardanian/less_slow.cpp

Release v0.5.1

Patch

Release v0.5.0

Minor

Patch

Release v0.4.2

Patch

v0.4: Allocators and Sparse Graphs 🕸️

Implementation

Results

v0.4: Pointer tagging 🏷️

Minor

Patch

v0.3: Gather 🔄 Scatter

Minor

Patch

v0.2: Pushing FLOPS in Assembly 🏋️‍♂️

Minor

Patch

Release v0.1.1

Patch