SystemVerilog implementation of Nvidia's SIMT CUDA, Hybrid-Precision Tensor Core, and Google's Systolic Array TPU MXU GEMM Operations. These modules are by no means really emulating the actual microarchitecture executing CUDA/Tensor Core instructions, instead they're simply performing the same operation for direct usage in FPGA designs.
Go check out my work on the Vortex GPGPU's Tensor Core Unit (TCU) extension's DRL Floating Point RTL backend for a more optimized, realistic microarchitecture implementation.