Scratch-Built TPU on XC7A35T FPGA

Built from scratch in 3 months, this FP32 GEMM accelerator runs on a small, 2011-era FPGA (Artix-7 XC7A35T) and outperforms OpenBLAS, an optimized matrix multiplication library, on a quad-core Intel i7-4790 CPU from 2014.

Why it matters

The XC7A35T is a small FPGA with limited resources (roughly 33k logic cells, 90 DSP slices, and 1,800 Kb of block RAM) and costs less than $100 new. Despite this, the design sustains 1.6 GB/s of DMA bandwidth over PCIe Gen2 ×4, with kernel launch latency under 2 µs. It supports arbitrary matrix sizes at runtime through automatic tiling.
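To make the tiling idea concrete, here is a minimal host-side sketch in C++. It is not the project's code: the tile edge `kTile`, the helper names, and the zero-padding strategy are all assumptions chosen for illustration, and `tile_kernel` is a plain CPU stand-in for what would be one accelerator launch on fixed-size operands.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tile edge, chosen to match the 8x8 array this README describes.
constexpr std::size_t kTile = 8;

// CPU stand-in for one accelerator launch: c += a * b on fixed kTile x kTile tiles.
void tile_kernel(const float* a, const float* b, float* c) {
    for (std::size_t i = 0; i < kTile; ++i)
        for (std::size_t k = 0; k < kTile; ++k)
            for (std::size_t j = 0; j < kTile; ++j)
                c[i * kTile + j] += a[i * kTile + k] * b[k * kTile + j];
}

// Copy one tile out of a row-major matrix, zero-padding past the matrix edges.
void load_tile(const std::vector<float>& src, std::size_t rows, std::size_t cols,
               std::size_t r0, std::size_t c0, float* tile) {
    for (std::size_t i = 0; i < kTile; ++i)
        for (std::size_t j = 0; j < kTile; ++j) {
            const std::size_t r = r0 + i, c = c0 + j;
            tile[i * kTile + j] = (r < rows && c < cols) ? src[r * cols + c] : 0.0f;
        }
}

// C (M x N) = A (M x K) * B (K x N), row-major, for any M, K, N.
std::vector<float> tiled_gemm(const std::vector<float>& A, const std::vector<float>& B,
                              std::size_t M, std::size_t K, std::size_t N) {
    std::vector<float> C(M * N, 0.0f);
    float a[kTile * kTile], b[kTile * kTile], c[kTile * kTile];
    for (std::size_t i0 = 0; i0 < M; i0 += kTile)
        for (std::size_t j0 = 0; j0 < N; j0 += kTile) {
            std::fill(c, c + kTile * kTile, 0.0f);
            // Accumulate partial products along the shared K dimension.
            for (std::size_t k0 = 0; k0 < K; k0 += kTile) {
                load_tile(A, M, K, i0, k0, a);
                load_tile(B, K, N, k0, j0, b);
                tile_kernel(a, b, c);
            }
            // Write back only the part of the tile that lies inside C.
            for (std::size_t i = 0; i < kTile && i0 + i < M; ++i)
                for (std::size_t j = 0; j < kTile && j0 + j < N; ++j)
                    C[(i0 + i) * N + (j0 + j)] = c[i * kTile + j];
        }
    return C;
}
```

Zero-padding keeps the kernel shape fixed, so edge tiles need no special case; the cost is some wasted work when a dimension is not a multiple of the tile size.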

Performance

Across 10,000 randomized matrix multiplications, the FPGA completes FP32 GEMM operations in roughly half the runtime of OpenBLAS on a quad-core Intel i7-4790, while drawing only ~3 W against the CPU's ~110 W. That is nearly 37 times less power, which combined with the roughly 2× speedup works out to about 73 times the performance-per-watt.

Technical Details

The design features an 8×8 systolic array, double-buffered on-chip SRAM, a custom Linux character driver, and a header-only C++ API. Each commit is verified with over 10,000 simulation cycles, and the Vivado build runs from a single command.
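For readers unfamiliar with systolic arrays, the following C++ sketch models one cycle by cycle in software. The README does not state which dataflow the hardware uses, so the output-stationary scheme here (A values move right, B values move down, each processing element keeps its own accumulator) is an assumption, as are all names in the code.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kDim = 8;  // array edge, matching the 8x8 array above
using Mat = std::array<std::array<float, kDim>, kDim>;

// Software model of an output-stationary systolic array computing A * B.
Mat systolic_matmul(const Mat& A, const Mat& B) {
    Mat acc{};    // one accumulator per processing element (PE)
    Mat a_reg{};  // A operand currently held in each PE, moving right
    Mat b_reg{};  // B operand currently held in each PE, moving down

    // Row i of A enters i cycles late and column j of B enters j cycles late,
    // so A[i][k] and B[k][j] meet at PE (i, j) on cycle i + j + k.
    // The last meeting is at cycle 3 * kDim - 3, hence 3 * kDim - 2 cycles.
    for (std::size_t t = 0; t < 3 * kDim - 2; ++t) {
        // Shift operands one PE right/down. Walking from the bottom-right
        // corner guarantees each PE reads its neighbour's previous value.
        for (std::size_t i = kDim; i-- > 0;)
            for (std::size_t j = kDim; j-- > 0;) {
                const bool a_valid = t >= i && t - i < kDim;
                const bool b_valid = t >= j && t - j < kDim;
                a_reg[i][j] = (j == 0) ? (a_valid ? A[i][t - i] : 0.0f) : a_reg[i][j - 1];
                b_reg[i][j] = (i == 0) ? (b_valid ? B[t - j][j] : 0.0f) : b_reg[i - 1][j];
            }
        // Every PE performs one multiply-accumulate per cycle.
        for (std::size_t i = 0; i < kDim; ++i)
            for (std::size_t j = 0; j < kDim; ++j)
                acc[i][j] += a_reg[i][j] * b_reg[i][j];
    }
    return acc;
}
```

The model shows why the skewed input schedule matters: operands are only ever passed between neighbouring PEs, yet each PE still sees the matching pair A[i][k], B[k][j] on the same cycle.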

This project shows that careful co-design of computation, memory, and interfacing enables even older, low-end FPGAs to beat CPUs on dense linear algebra workloads.

stevenewald/TPU-From-Scratch