stevenewald/TPU-From-Scratch


Scratch-Built TPU on XC7A35T FPGA

Built from scratch in 3 months, this FP32 GEMM accelerator on a tiny 2011 FPGA (Artix-7 XC7A35T) outperforms an optimized matrix multiplication library on a quad-core Intel CPU from 2014.

Why it matters

The XC7A35T FPGA has limited resources (33k LUTs, 180 DSPs, 64 KiB BRAM) and costs less than $100 new. Despite this, the design achieves high-speed PCIe Gen2 ×4 DMA with 1.6 GB/s sustained bandwidth and sub-2 µs kernel launch latency. It supports arbitrary matrix sizes at runtime through automatic tiling.
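The README doesn't reproduce the tiling logic, but the idea can be sketched as a host-side reference: an arbitrary M×K by K×N GEMM is decomposed into the fixed 8×8 blocks the array consumes, with edge tiles clamped so no padding is required. The names `tiled_gemm` and `TILE` here are illustrative, not the project's actual API:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tile size fixed by the 8x8 systolic array.
constexpr std::size_t TILE = 8;

// Reference tiled GEMM: C (MxN) += A (MxK) * B (KxN), row-major.
// Each (TILE x TILE) block of C accumulates contributions from TILE-wide
// slices of A and B -- on the real device, each inner block would be one
// kernel launch against the accelerator.
void tiled_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t K,
                std::size_t N) {
  for (std::size_t i0 = 0; i0 < M; i0 += TILE)
    for (std::size_t j0 = 0; j0 < N; j0 += TILE)
      for (std::size_t k0 = 0; k0 < K; k0 += TILE)
        // Edge tiles are clamped so arbitrary sizes work without padding.
        for (std::size_t i = i0; i < std::min(i0 + TILE, M); ++i)
          for (std::size_t j = j0; j < std::min(j0 + TILE, N); ++j)
            for (std::size_t k = k0; k < std::min(k0 + TILE, K); ++k)
              C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```

Because the tile loop order leaves C blocks resident while A/B slices stream through, the same decomposition maps naturally onto the double-buffered on-chip SRAM described below.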

Performance

Tested across 10,000 randomized matrix multiplications, the FPGA completes FP32 GEMM operations in roughly half the runtime of OpenBLAS running on a quad-core Intel i7-4790, while consuming only ~3 W compared to the CPU's ~110 W, delivering nearly 37 times the performance-per-watt.

Technical Details

The design features an 8×8 systolic array, double-buffered on-chip SRAM, a custom Linux character driver, and a straightforward header-only C++ API. Verification includes over 10,000 simulation cycles per commit, with a streamlined single-command Vivado workflow.
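The RTL isn't shown here, but the dataflow of an output-stationary 8×8 systolic array can be modelled cycle-by-cycle in plain C++. A-operands stream in from the left edge (row i delayed by i cycles), B-operands from the top edge (column j delayed by j cycles), each PE multiply-accumulates locally and forwards its operands right/down. This is a behavioural sketch under assumed dataflow conventions, not the project's actual design (`DIM`, `systolic_matmul`, and the skewing scheme are illustrative):

```cpp
#include <array>

constexpr int DIM = 8;
using Mat = std::array<std::array<float, DIM>, DIM>;

// Cycle-by-cycle model of an output-stationary DIMxDIM systolic array.
// After 3*DIM - 2 cycles every accumulator holds its final dot product.
Mat systolic_matmul(const Mat& A, const Mat& B) {
  Mat C{};               // per-PE accumulators (output-stationary)
  Mat a_reg{}, b_reg{};  // operand registers held in each PE this cycle
  for (int t = 0; t < 3 * DIM - 2; ++t) {
    // Advance the A pipeline: each PE takes its left neighbour's value;
    // the left edge is fed the skewed A stream (row i delayed i cycles).
    for (int i = 0; i < DIM; ++i) {
      for (int j = DIM - 1; j > 0; --j) a_reg[i][j] = a_reg[i][j - 1];
      int k = t - i;
      a_reg[i][0] = (k >= 0 && k < DIM) ? A[i][k] : 0.0f;
    }
    // Advance the B pipeline: each PE takes its upper neighbour's value;
    // the top edge is fed the skewed B stream (column j delayed j cycles).
    for (int j = 0; j < DIM; ++j) {
      for (int i = DIM - 1; i > 0; --i) b_reg[i][j] = b_reg[i - 1][j];
      int k = t - j;
      b_reg[0][j] = (k >= 0 && k < DIM) ? B[k][j] : 0.0f;
    }
    // Every PE multiply-accumulates in place; the skewing guarantees that
    // matching A[i][k] and B[k][j] meet at PE(i, j) on the same cycle.
    for (int i = 0; i < DIM; ++i)
      for (int j = 0; j < DIM; ++j)
        C[i][j] += a_reg[i][j] * b_reg[i][j];
  }
  return C;
}
```

The fixed 3·DIM − 2 cycle count (22 cycles for an 8×8 tile) is what makes the array's throughput predictable enough to hide behind double-buffered SRAM loads.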

This project shows that careful co-design of computation, memory, and interfacing enables even older, low-end FPGAs to beat CPUs on dense linear algebra workloads.

About

A TPU-like tensor accelerator built on an FPGA, exposed over PCIe through a Linux driver.
