[GPU] Poor performance with WarpReduction #19868
Comments
The tweak to collapse dims prevents a compilation timeout, but it has horrible effects on runtime performance. When there are multiple reduction ops and the dispatch goes down warp reduction, it has to be in a very specific state to get good results. Otherwise, compilation times out or the compiled dispatch is VERY slow (3x total SDXL runtime). See: iree-org#19868

I found that there are a few SDXL instances of:
1 = op with multiple uses
2 = consumer of "1" (transpose)
3 = consumer of "2" (bit extend)

However, there is a reshape that will get stuck between 1-2 or 2-3 depending on which pass you look at (maybe always 2-3). 1-2 could be fused with multi-use fusion.

Signed-off-by: Ian Wood <[email protected]>
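To make the shape of that chain concrete, here is a minimal hand-written sketch. The shapes, element types, and op bodies are illustrative assumptions rather than IR taken from the actual SDXL dispatches, and the reshape that ends up between the ops is omitted to keep it small:

```mlir
// Hypothetical sketch: (1) an op with multiple uses, (2) a transpose consuming (1),
// (3) a bit extend (f16 -> f32) consuming (2). All shapes and types are made up.
util.func public @chain(%arg0: tensor<2x10x4096xf16>) -> (tensor<2x4096x10xf32>, tensor<2x10x4096xf16>) {
  %cst = arith.constant 2.0 : f16
  %init0 = tensor.empty() : tensor<2x10x4096xf16>
  // (1) elementwise op; also returned below, so it has multiple uses.
  %0 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%arg0 : tensor<2x10x4096xf16>) outs(%init0 : tensor<2x10x4096xf16>) {
  ^bb0(%in: f16, %out: f16):
    %m = arith.mulf %in, %cst : f16
    linalg.yield %m : f16
  } -> tensor<2x10x4096xf16>
  // (2) transpose consumer of (1).
  %init1 = tensor.empty() : tensor<2x4096x10xf16>
  %1 = linalg.transpose ins(%0 : tensor<2x10x4096xf16>) outs(%init1 : tensor<2x4096x10xf16>) permutation = [0, 2, 1]
  // (3) bit-extend consumer of (2).
  %init2 = tensor.empty() : tensor<2x4096x10xf32>
  %2 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%1 : tensor<2x4096x10xf16>) outs(%init2 : tensor<2x4096x10xf32>) {
  ^bb0(%in: f16, %out: f32):
    %e = arith.extf %in : f16 to f32
    linalg.yield %e : f32
  } -> tensor<2x4096x10xf32>
  util.return %2, %0 : tensor<2x4096x10xf32>, tensor<2x10x4096xf16>
}
```

In the real traces a reshape sits between 1-2 or 2-3, as noted above; it is left out here to keep the sketch minimal.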
@pashu123 can you help look into this issue? I know you were trying to deprecate the warp reduction pipeline and use tile-and-fuse. That is the better overall solution, but if there is a quick fix in the warp reduction pipeline to get past this, it would be great.
sure!
Hi @IanWood1, the problem here is that since the size/rank of the first …
Following up from the discussion earlier: here's a smaller example that produces the same problem:

util.func public @test1(%arg0: tensor<32x102400xf32>, %arg1: tensor<32x10x10240xf32>) -> tensor<32x10x10240xf32> {
%0 = tensor.empty() : tensor<32xf32>
%1 = tensor.empty() : tensor<32x10x10240xf32>
%2 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%arg0 : tensor<32x102400xf32>) outs(%0 : tensor<32xf32>) {
^bb0(%in: f32, %out: f32):
%4 = arith.addf %in, %out : f32
linalg.yield %4 : f32
} -> tensor<32xf32>
%3 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%arg1, %2 : tensor<32x10x10240xf32>, tensor<32xf32>) outs(%1 : tensor<32x10x10240xf32>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%4 = arith.addf %in, %in_0 : f32
linalg.yield %4 : f32
} -> tensor<32x10x10240xf32>
util.return %3 : tensor<32x10x10240xf32>
}

Compile with:
iree-compile repro.mlir --iree-hal-target-backends=rocm --iree-hip-target=gfx1100 --iree-dispatch-creation-enable-aggressive-fusion -o /dev/null

The issue is that the lowering config is set on the reduction op and then propagated to the consumer. E.g.:

%7 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%3 : tensor<32x102400xf32>) outs(%6 : tensor<32xf32>) attrs =
{lowering_config = #iree_codegen.lowering_config<tile_sizes = [[1], [0, 4096]]>} {
^bb0(%in: f32, %out: f32):
%9 = arith.addf %in, %out : f32
linalg.yield %9 : f32
} -> tensor<32xf32>
%8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%4, %7 : tensor<32x10x10240xf32>, tensor<32xf32>) outs(%5 : tensor<32x10x10240xf32>) attrs =
{lowering_config = #iree_codegen.lowering_config<tile_sizes = [[1], [0, 4096]]>} {
^bb0(%in: f32, %in_0: f32, %out: f32):
%9 = arith.addf %in, %in_0 : f32
linalg.yield %9 : f32
} -> tensor<32x10x10240xf32>

But this leaves the dim of size 10 untiled. @pashu123 do you think it would be possible to set a separate config on the consumer? Otherwise, it seems like we shouldn't be forming these dispatches, and it has just been luck that the reduction tiling happened to match the parallel consumer. Also, a bit of a tangent, but I noticed that it was common for there to be doubly nested scf.for loops (from the tiled consumer), so I tried …
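For concreteness, a sketch of what a consumer-specific config could look like. This is only an illustration of the idea: the two-level tile_sizes layout mirrors the configs shown above, but the concrete sizes, and whether the pipeline would honor a separate config on the consumer, are assumptions rather than verified behavior:

```mlir
// Hypothetical: give the consumer its own lowering_config so all three of its
// dims are covered (including the size-10 dim), instead of inheriting the
// reduction op's config. The tile sizes here are guesses for illustration only.
%8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%4, %7 : tensor<32x10x10240xf32>, tensor<32xf32>) outs(%5 : tensor<32x10x10240xf32>) attrs =
  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[1], [0, 1, 4096]]>} {
^bb0(%in: f32, %in_0: f32, %out: f32):
  %9 = arith.addf %in, %in_0 : f32
  linalg.yield %9 : f32
} -> tensor<32x10x10240xf32>
```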
What happened?
I made some small changes to dispatch creation and was testing SDXL, but found that compilation times were excessively long. There is a dispatch that generates a large number of ops and has a large allocation. It is only slightly different from the previous dispatch; the only difference is the number of loops.
I tried to collapse the loops as much as possible, but that didn't seem to fix the problem. It seems the pipeline needs a very specific configuration of loops to work well.
Steps to reproduce your issue
iree-compile
What component(s) does this issue relate to?
No response
Version information
36e7593
Additional context
No response