Built from scratch in 3 months, this FP32 GEMM accelerator on a tiny 2011 FPGA (Artix-7 XC7A35T) outperforms an optimized matrix multiplication library on a quad-core Intel CPU from 2014.
The XC7A35T is a small part (about 33k logic cells, 90 DSP slices, and 225 KB of block RAM) and costs less than $100 new. Despite this, the design sustains 1.6 GB/s of DMA bandwidth over PCIe Gen2 ×4, with sub-2 µs kernel launch latency. It supports arbitrary matrix sizes at runtime through automatic tiling.
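To illustrate how automatic tiling lets a fixed-size compute block handle arbitrary matrix dimensions, here is a minimal software sketch. It is not the project's implementation: the names `kTile`, `tile_kernel`, and `tiled_gemm` are hypothetical, the tile edge of 8 is an assumption, and `tile_kernel` is a plain CPU loop standing in for one accelerator launch. Edge tiles are zero-padded so the kernel always sees a full tile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile edge (an assumption, not taken from the project).
constexpr std::size_t kTile = 8;

// Stand-in for one accelerator launch: c += a * b on full kTile x kTile tiles.
void tile_kernel(const float* a, const float* b, float* c) {
    for (std::size_t i = 0; i < kTile; ++i)
        for (std::size_t k = 0; k < kTile; ++k)
            for (std::size_t j = 0; j < kTile; ++j)
                c[i * kTile + j] += a[i * kTile + k] * b[k * kTile + j];
}

// Row-major C (m x n) = A (m x k) * B (k x n) for arbitrary m, n, k.
std::vector<float> tiled_gemm(const std::vector<float>& A,
                              const std::vector<float>& B,
                              std::size_t m, std::size_t n, std::size_t k) {
    assert(A.size() == m * k && B.size() == k * n);
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i0 = 0; i0 < m; i0 += kTile) {
        const std::size_t rows = std::min(kTile, m - i0);
        for (std::size_t j0 = 0; j0 < n; j0 += kTile) {
            const std::size_t cols = std::min(kTile, n - j0);
            float c_tile[kTile * kTile] = {};  // accumulator for this output tile
            for (std::size_t k0 = 0; k0 < k; k0 += kTile) {
                const std::size_t depth = std::min(kTile, k - k0);
                float a_tile[kTile * kTile] = {};  // zero padding for edge tiles
                float b_tile[kTile * kTile] = {};
                for (std::size_t i = 0; i < rows; ++i)
                    for (std::size_t kk = 0; kk < depth; ++kk)
                        a_tile[i * kTile + kk] = A[(i0 + i) * k + (k0 + kk)];
                for (std::size_t kk = 0; kk < depth; ++kk)
                    for (std::size_t j = 0; j < cols; ++j)
                        b_tile[kk * kTile + j] = B[(k0 + kk) * n + (j0 + j)];
                tile_kernel(a_tile, b_tile, c_tile);
            }
            // Copy back only the valid part of the output tile.
            for (std::size_t i = 0; i < rows; ++i)
                for (std::size_t j = 0; j < cols; ++j)
                    C[(i0 + i) * n + (j0 + j)] = c_tile[i * kTile + j];
        }
    }
    return C;
}
```

Calling `tiled_gemm(A, B, 9, 9, 10)` splits the product into four output tiles with two partial products each. On real hardware the copy loops would presumably be replaced by DMA transfers into on-chip buffers.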
Tested across 10,000 randomized matrix multiplications, the FPGA completes FP32 GEMM operations in roughly half the runtime of OpenBLAS running on a quad-core Intel i7-4790, while drawing only ~3 W against the CPU's ~110 W. That is nearly 37 times less power, and combined with the roughly 2× speedup it implies about 70 times the performance per watt.
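As a quick check on the efficiency figures, performance per watt can be computed from runtime and power together. The sketch below assumes performance is simply the inverse of runtime for the same workload, and it uses only the rounded numbers quoted above (half the runtime, ~3 W versus ~110 W), so the result is approximate. `perf_per_watt_ratio` is a hypothetical helper, not part of the project's API.

```cpp
#include <cassert>

// How many times better device A is than device B in performance per watt,
// for the same workload. With performance = 1 / runtime,
// perf/W = 1 / (runtime * power), which is the inverse of energy per job.
double perf_per_watt_ratio(double runtime_a, double power_a,
                           double runtime_b, double power_b) {
    assert(runtime_a > 0.0 && power_a > 0.0);
    assert(runtime_b > 0.0 && power_b > 0.0);
    return (runtime_b * power_b) / (runtime_a * power_a);
}
```

With equal runtimes, `perf_per_watt_ratio(1.0, 3.0, 1.0, 110.0)` is just the power ratio, about 36.7. Folding in the runtime advantage, `perf_per_watt_ratio(0.5, 3.0, 1.0, 110.0)` evaluates to 110 / 1.5, about 73.3.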
The design features an 8×8 systolic array, double-buffered on-chip SRAM, a custom Linux character driver, and a straightforward header-only C++ API. Every commit is verified with over 10,000 cycles of simulation, and the Vivado build runs from a single command.
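For readers unfamiliar with systolic arrays, the following cycle-level software model shows how an 8×8 grid of multiply-accumulate cells can compute an 8×8 matrix product. It assumes an output-stationary dataflow (each cell keeps one element of C while A flows left to right and B flows top to bottom). The source does not state which dataflow the design uses, so treat this as one common arrangement rather than a description of the actual RTL. `kN`, `Tile`, and `systolic_matmul` are hypothetical names.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kN = 8;  // array edge, matching the 8x8 array
using Tile = std::array<std::array<float, kN>, kN>;

// Software model of an output-stationary kN x kN systolic array.
// Cell (i, j) accumulates C[i][j]. Row i of A enters at the left edge delayed
// by i cycles; column j of B enters at the top edge delayed by j cycles, so
// matching operands A[i][k] and B[k][j] meet in cell (i, j) at cycle i + j + k.
Tile systolic_matmul(const Tile& A, const Tile& B) {
    Tile a_reg{};  // operand registers moving left to right
    Tile b_reg{};  // operand registers moving top to bottom
    Tile acc{};    // stationary accumulators, one per cell
    const std::size_t cycles = 3 * kN - 2;  // last operands meet at cycle 3*kN - 3
    for (std::size_t t = 0; t < cycles; ++t) {
        // Walk the grid in descending order so each cell reads its neighbour's
        // value from the previous cycle before that neighbour is overwritten.
        for (std::size_t i = kN; i-- > 0;) {
            for (std::size_t j = kN; j-- > 0;) {
                const float a_in = (t >= i && t - i < kN) ? A[i][t - i] : 0.0f;
                const float b_in = (t >= j && t - j < kN) ? B[t - j][j] : 0.0f;
                a_reg[i][j] = (j == 0) ? a_in : a_reg[i][j - 1];
                b_reg[i][j] = (i == 0) ? b_in : b_reg[i - 1][j];
                acc[i][j] += a_reg[i][j] * b_reg[i][j];
            }
        }
    }
    return acc;
}
```

In this model one tile product takes 3·8 − 2 = 22 cycles including fill and drain. A hardware design would typically overlap consecutive tiles to hide that latency, which is one reason double buffering helps.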
This project shows that careful co-design of computation, memory, and interfacing enables even older, low-end FPGAs to beat CPUs on dense linear algebra workloads.