Skip to content

Releases: ashvardanian/less_slow.cpp

Release v0.5.1

22 Jan 14:18
Compare
Choose a tag to compare

Release: v0.5.1 [skip ci]

Patch

  • Improve: MinWarmUpTime before spread ops (f3473c3)
  • Docs: Listing Google Benchmark APIs (a19fef3)
  • Improve: Use aligned_array in more places (1539624)
  • Improve: _k suffix for constants (53f9e71)

Release v0.5.0

20 Jan 16:30
Compare
Choose a tag to compare

Release: v0.5.0 [skip ci]

Minor

  • Add: Unbundled FMA versions (5d206f1)

Patch

  • Improve: Reorder MA kernels (c3f3ce9)
  • Fix: Handling denormal values in FMA (f9b989a)
  • Docs: Graviton 4 GEMM benchmarks (108ae72)
  • Docs: Note on __bf16 on Arm (651d7e6)
  • Fix: Call tops_f64_neon_asm_kernel (44f8086)

Release v0.4.2

20 Jan 13:45
Compare
Choose a tag to compare

Release: v0.4.2 [skip ci]

Patch

  • Docs: Better highlights (c662a6c)

v0.4: Allocators and Sparse Graphs 🕸️

20 Jan 10:27
Compare
Choose a tag to compare

This release explores high-performance graph implementations optimized for different access patterns, focusing on real-world use cases like recommendation systems and social networks. To implement those, various STL and Abseil-based containers are used to implement sparse Graph structures.
It shows:

  • How to use STL's polymorphic allocators?
  • How do we construct a hybrid nested container that propagates stateful allocators to the inner structures?
  • Up to 300x performance difference between good implementations in different workload patterns!

Of other neat tricks, shows:

  • How can the three-way comparison operator be used with std::tie?
  • What's the difference between std::weak_ordering and the strong one ?
  • Where can the [[no_unique_address]] attribute be used?

Implementation

It extends the Graph API to:

  • upsert_edge(from, to, weight): Inserts or updates an existing edge between two vertices.
  • get_edge(from, to): Retrieves the std::optional weight of the edge between two vertices.
  • remove_edge(from, to): If present, remove the edge between two vertices.
  • for_edges(from, visitor): Applies a callback to all edges starting from a vertex.
  • size(): Returns the graph's number of vertices and edges.
  • reserve(capacity): Reserves memory for the given number of vertices.
  • compact(): Compacts the memory layout of the graph, preparing for read-intensive workloads.

Results

On Intel Sapphire Rapids CPUs in AWS c7i instances:

------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
graph_make<std::unordered_maps>/min_time:10.000   57329425 ns     57327619 ns          245
graph_make<std::map>/min_time:10.000             109704078 ns    109697310 ns          100
graph_make<absl::flat_set>/min_time:10.000        80598043 ns     80595813 ns          174
graph_rank<std::unordered_maps>/min_time:10.000   35763632 ns     35762406 ns          392
graph_rank<std::map>/min_time:10.000              51658552 ns     51657290 ns          271
graph_rank<absl::flat_set>/min_time:10.000          236938 ns       236933 ns        59137

On AWS Graviton 4 CPUs in AWS r8g instances:

------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
graph_make<std::unordered_maps>/min_time:10.000  163664945 ns    163660572 ns           86
graph_make<std::map>/min_time:10.000             382543113 ns    382534380 ns           45
graph_make<absl::flat_set>/min_time:10.000       213277341 ns    213272284 ns           64
graph_rank<std::unordered_maps>/min_time:10.000   59579530 ns     59578435 ns          240
graph_rank<std::map>/min_time:10.000              69177429 ns     69175965 ns          191
graph_rank<absl::flat_set>/min_time:10.000          186428 ns       186430 ns        74929

v0.4: Pointer tagging 🏷️

20 Jan 09:59
Compare
Choose a tag to compare

This release provides a prototype of a memory-tagging allocator arena, documenting the complexity of using memory tagging techniques even on Linux:

  • Intel's Linear Address Masking
  • AMD's Upper Address Ignore
  • ARM's Top Byte Ignore
  • ARM's Memory Tagging Extension

Minor

  • Add: Intel's Linear Address Masking (ead0cd2)
  • Add: Pointer tagging draft (72dfc31)

Patch

  • Docs: Arm & AMD pointer tagging (3a4b1d7)
  • Fix: Avoid tagging if Arm or LA57 (5ac64d1)
  • Improve: Log mean allocation size (83ba346)

v0.3: Gather 🔄 Scatter

20 Jan 09:58
Compare
Choose a tag to compare

This release introduces benchmarks for gather & scatter SIMD rarely-used instructions that can be used to accelerate lookups by ~30% on current x86 and Arm machines.

  • Serial
  • AVX-512 for x86
  • SVE for Arm

Minor

  • Add: SVE gather/scatter (107b359)
  • Add: Serial & AVX-512 scatter/gather (089cfa0)

Patch

  • Improve: Timing SVE (daa55f5)
  • Improve: Stabilize gather timings (3fca991)

v0.2: Pushing FLOPS in Assembly 🏋️‍♂️

20 Jan 09:57
Compare
Choose a tag to compare

Release: v0.2.0 [skip ci]

Minor

  • Add: Latency Hiding & Port Interleaving (086f8d7)
  • Add: AMX kernels (0cb024d)
  • Add: Inline Assembly kernels (89095a6)
  • Add: BLAS & Eigen TOPs benchmarks (28ca39b)
  • Add: AVX2 & low-precision AVX-512 TOPS (0a48108)
  • Add: i8, f16, and bf16 kernels (3f54200)
  • Add: Arm NEON FMAs (d0e521e)
  • Add: vfmadd231ps kernels (7ca3161)
  • Add: Assembly micro-kernels (2e71e76)

Patch

  • Docs: Zen4 matmul-benchmarks (2476310)
  • Docs: H100 Tensor Cores vs Intel (fa86663)
  • Fix: Illegal instruction for AMX (a7243dd)
  • Fix: Duplicate .global symbols (c732234)
  • Docs: Recommended Eigen macros (7be2d58)
  • Fix: Missing tops_u8_neon (d97bbfc)
  • Fix: Missing tops_f64_neon (4afa7e3)
  • Improve: Shorter TOPS names (be0c94b)

Release v0.1.1

17 Jan 20:08
Compare
Choose a tag to compare

Release: v0.1.1 [skip ci]

Patch

  • Docs: Renaming small_string benchmark (1ad50d1)
  • Improve: Validate sorting result (77808a6)
  • Improve: Log bytes/sec for trigonometry (ebc67d6)
  • Docs: Placeholders (84e7710)
  • Improve: BENCHMARK_CAPTURE w/out template (d8fb261)