This repository has been archived by the owner on Mar 6, 2023. It is now read-only.

about winograd batched MM performance #7

Open
janboeye opened this issue Apr 21, 2018 · 1 comment

Comments

@janboeye

Hi @merrymercy,
I am working on Winograd convolution on CUDA.
I found that the batched MM in your Winograd implementation is slow on NVIDIA architectures. I guess this is because when C is large, it cannot exploit the GPU's parallelism.

Do you have any idea about this part?

Thanks

@merrymercy
Owner

merrymercy commented Apr 21, 2018

I am also working on Winograd for CUDA.

  1. The schedule for the Mali GPU cannot be reused on NVIDIA GPUs. The main difference is the use of shared memory. You need to implement completely different schedules for both the transformations and the batched MM. For batched GEMM, see https://github.com/dmlc/tvm/tree/master/topi/recipe/gemm for an example.
  2. On NVIDIA GPUs, to get the best performance we cannot re-layout the data several times as we do on Mali, because some stages can become memory bound. (Comparing NVIDIA GPUs with Mali GPUs, peak FLOPS is about 50~200x higher but memory bandwidth is only about 10x higher, so a stage needs roughly 5~20x more arithmetic per byte to stay compute bound.) According to the original paper, we should fuse the transforms and the batched GEMM into one block.
  3. Actually, I have not figured out how to fuse them to get the best performance. The open-source code from that paper (the neon library) is written in assembly and I cannot read it. So far I only have some preliminary results. For inference, if we do the kernel transformation in advance, our kernel can beat cuDNN's best Winograd when the kernel tensor is large (such as the last few layers of ResNet).

What is your CUDA background? It would help a lot if your team could contribute a fast (fused) Winograd kernel for CUDA.
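
For readers following the thread, here is a minimal pure-Python reference sketch of the three stages discussed above (kernel/data transforms, batched MM, inverse transform), using the standard F(2x2, 3x3) Winograd matrices. It is not the schedule from this repository and not a CUDA kernel; the function names (`winograd_conv`, `direct_conv`, `sandwich`) and the data layout are assumptions made for illustration. It shows that the batched MM stage is 16 independent (K x C) by (C x P) products, where K is the number of output channels, C the number of input channels, and P the number of tiles, so C is the reduction dimension of each product.

```python
import random

# Standard F(2x2, 3x3) Winograd transform matrices.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def sandwich(left, m):
    """Compute left @ m @ left^T."""
    return matmul(matmul(left, m), [list(r) for r in zip(*left)])


def winograd_conv(data, kernel):
    """data: C x H x W, kernel: K x C x 3 x 3; returns K x (H-2) x (W-2).

    Assumes H-2 and W-2 are even (hypothetical simplification, no padding).
    """
    C, K = len(data), len(kernel)
    H, W = len(data[0]), len(data[0][0])
    th, tw = (H - 2) // 2, (W - 2) // 2
    P = th * tw
    # Stage 1a: kernel transform, U[k][c] = G g G^T (4x4 each).
    U = [[sandwich(G, kernel[k][c]) for c in range(C)] for k in range(K)]
    # Stage 1b: data transform, V[c][p] = B^T d B for each 4x4 input tile.
    V = [[sandwich(BT, [row[2 * tx:2 * tx + 4]
                        for row in data[c][2 * ty:2 * ty + 4]])
          for ty in range(th) for tx in range(tw)] for c in range(C)]
    # Stage 2: batched MM. For each of the 16 positions (i, j), multiply
    # a K x C matrix by a C x P matrix; C is the reduction dimension.
    M = [[matmul([[U[k][c][i][j] for c in range(C)] for k in range(K)],
                 [[V[c][p][i][j] for p in range(P)] for c in range(C)])
          for j in range(4)] for i in range(4)]
    # Stage 3: inverse transform, Y = A^T m A gives a 2x2 output tile.
    out = [[[0.0] * (W - 2) for _ in range(H - 2)] for _ in range(K)]
    for k in range(K):
        for p in range(P):
            ty, tx = divmod(p, tw)
            m = [[M[i][j][k][p] for j in range(4)] for i in range(4)]
            y = sandwich(AT, m)
            for a in range(2):
                for b in range(2):
                    out[k][2 * ty + a][2 * tx + b] = y[a][b]
    return out


def direct_conv(data, kernel):
    """Reference 3x3 cross-correlation (no padding, stride 1)."""
    C, K = len(data), len(kernel)
    H, W = len(data[0]), len(data[0][0])
    return [[[sum(data[c][y + r][x + s] * kernel[k][c][r][s]
                  for c in range(C) for r in range(3) for s in range(3))
              for x in range(W - 2)] for y in range(H - 2)]
            for k in range(K)]


if __name__ == "__main__":
    rng = random.Random(0)
    d = [[[rng.randint(-3, 3) for _ in range(6)] for _ in range(6)]
         for _ in range(4)]
    g = [[[[rng.randint(-3, 3) for _ in range(3)] for _ in range(3)]
          for _ in range(4)] for _ in range(2)]
    w, r = winograd_conv(d, g), direct_conv(d, g)
    print(max(abs(a - b) for wk, rk in zip(w, r)
              for wr, rr in zip(wk, rk) for a, b in zip(wr, rr)))
```

In this formulation the independent work in stage 2 is 16 x K x P output elements, while a larger C only lengthens each dot product. That is consistent with the guess in the question, though how it maps to GPU occupancy depends on the actual schedule. The sketch also materializes `U`, `V`, and `M` in full between stages, which is the kind of repeated data movement that point 2 above suggests avoiding on NVIDIA GPUs by fusing.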
