See https://docs.nvidia.com/cuda/cutile-python/performance.html

<img width="646" height="573" alt="Image" src="https://github.com/user-attachments/assets/143672e6-52b6-4ee8-9782-96230909d0e6" />

x-ref https://github.com/JuliaGPU/cuTile.jl/pull/111#issuecomment-4040525522

iirc, the optimization hints (#25, e.g. `latency` and `allow_tma`) should also support this but I don't know if it's being used anywhere.