
Fatal error of resnet18 example in README #77

Open
HahaLan97 opened this issue May 14, 2024 · 3 comments

Comments


HahaLan97 commented May 14, 2024

As the title says, I've run into this error when trying to run the scaleflow-pytorch-pipeline on the generated MLIR code.

samples/pytorch/resnet18/resnet18.mlir:314:11: error: 'memref.copy' op requires the same shape for all operands
    %58 = linalg.generic {indexing_maps = [#map5, #map1], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%57#1 : tensor<1x1xf32>) outs(%57#0 : tensor<1x512x1x1xf32>) {
          ^
samples/pytorch/resnet18/resnet18.mlir:314:11: note: see current operation: "memref.copy"(%52, %51) : (memref<1x1xf32>, memref<1x512x1x1xf32>) -> ()

I added the debug-point option to the pipeline and traced this error to SimplifyCopyPass, where a memref.copy is created from the ins and outs of a linalg.generic. As you can see above, the two operands do not have the same shape.
So I wonder whether the issue is in the GenericOp, in the pass itself, or perhaps in a wrong version of torch-mlir?

I hope the contributors of this repo could help ;D @hanchenye @signorgelato @jeonghm9764

@HahaLan97
Author

I found out that the MLIR program generated by torch-mlir already contains this generic op, whose input and output have different shapes:

%58 = linalg.generic {indexing_maps = [#map5, #map1], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%57#1 : tensor<1x1xf32>) outs(%57#0 : tensor<1x512x1x1xf32>) {
    ^bb0(%in: f32, %out: f32):
      %65 = arith.divf %out, %in : f32
      linalg.yield %65 : f32
    } -> tensor<1x512x1x1xf32>

But because the sizes are both 1, the SplitElementwiseGenericOp pattern in the simplify-copy pass still applies, which inevitably causes the error, since a valid memref::CopyOp cannot be created.
However, since you're using applyPatternsAndFoldGreedily, I added a check to the pattern: if the shapes are indeed different, the pattern fails to match and the pass simply moves on. With that change it works perfectly and generates the correct affine loops.
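To illustrate, here is a minimal standalone sketch of such a guard. The shapes are modeled as plain vectors and the function is only an illustrative stand-in for the real SplitElementwiseGenericOp rewrite, not the actual ScaleHLS code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Shape = std::vector<int64_t>;

// Illustrative model of the guard: when the generic op's input and
// output shapes differ, the pattern reports a match failure instead of
// emitting an invalid memref.copy, so applyPatternsAndFoldGreedily
// simply moves on to other patterns.
bool splitElementwiseGeneric(const Shape &inShape, const Shape &outShape) {
  if (inShape != outShape)
    return false; // match failure: no memref.copy with mismatched shapes
  // ... the original rewrite (replacing the op with a memref.copy) ...
  return true;
}
```

With the shapes from the failing resnet18 op, `splitElementwiseGeneric({1, 1}, {1, 512, 1, 1})` fails to match, while equal shapes still rewrite as before.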

Furthermore, resnet18 got stuck at the func-preprocess pass because of the patterns that convert two integer arithmetic operations, i.e. arith.addi and arith.muli. The cause is that you use:

if (auto lhs = add.getLhs().getDefiningOp<arith::ConstantIndexOp>(); isValidDim(add.getRhs()))

here, which only checks whether isValidDim is true; the result of the init-statement (lhs) is never tested. Changing it to this fixes the error:

auto lhs = add.getLhs().getDefiningOp<arith::ConstantIndexOp>();
if (lhs != nullptr && isValidDim(add.getRhs())) {
    // code
}
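The pitfall here is C++17's if-with-initializer: the init-statement before the semicolon only declares lhs, and only the condition after the semicolon is evaluated. A self-contained toy reproducing this behavior (getConstant and isValidDim are hypothetical stand-ins, not the real MLIR APIs):

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-ins for the real calls; only the control flow matters.
std::optional<int> getConstant(bool present) {
  return present ? std::optional<int>(1) : std::nullopt;
}
bool isValidDim(int) { return true; }

// Buggy form, mirroring the original snippet: the init-statement
// declares lhs, but only the condition after the ';' is tested, so the
// body runs even when no constant was found.
bool buggyMatch(bool hasConstant) {
  if (auto lhs = getConstant(hasConstant); isValidDim(0))
    return true; // taken even when lhs is empty
  return false;
}

// Fixed form: require both the constant and the valid dim.
bool fixedMatch(bool hasConstant) {
  auto lhs = getConstant(hasConstant);
  return lhs.has_value() && isValidDim(0);
}
```

buggyMatch(false) still returns true, which is why the pattern fired on operands without a constant; fixedMatch only succeeds when both conditions hold.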

The other three CNNs don't have the same problem as resnet18, because torch-mlir doesn't generate such a GenericOp for them and there are no integer ops. I put all the changes in my forked repo and opened this PR; I hope it looks good to you.

P.S.

  1. In each Python program, change torch_mlir.compile to torch_mlir.torchscript.compile. I don't know if this is a versioning issue.
  2. A minor change: in .gitmodules you're using the SSH link, which doesn't work on my side, so I changed it to HTTPS.
  3. Add scalehls-lsp-server: a very helpful tool that wasn't added to the project.
  4. A slight bug from Polygeist; you can check it here.


Szzer1 commented Aug 15, 2024

If I modify it as you suggested, I find that new issues arise. I suspect that the current versions of torch-mlir and scalehls might not be fully compatible, but I'm not sure which version would be more appropriate. If it's convenient for you, could you share your environment setup? I would greatly appreciate it.


HahaLan97 commented Aug 22, 2024

> If I modify it as you suggested, I find that new issues arise. I suspect that the current versions of torch-mlir and scalehls might not be fully compatible, but I'm not sure which version would be more appropriate. If it's convenient for you, could you share your environment setup? I would greatly appreciate it.

The versions are most likely the issue here. I tried "ScaleHLS 2.0" as well and the same problem still exists. I haven't worked on this for quite a while, but I hope this list helps you:

(torch-mlir) ➜  ~ conda list               
# packages in environment at /home/user/anaconda3/envs/torch-mlir:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
bzip2                     1.0.8                h5eee18b_6  
ca-certificates           2024.3.11            h06a4308_0  
filelock                  3.14.0                   pypi_0    pypi
fsspec                    2024.3.1                 pypi_0    pypi
jinja2                    3.1.4                    pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1  
libffi                    3.4.4                h6a678d5_1  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libstdcxx-ng              11.2.0               h1234567_1  
libuuid                   1.41.5               h5eee18b_0  
markupsafe                2.1.5                    pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
networkx                  3.3                      pypi_0    pypi
numpy                     1.26.4                   pypi_0    pypi
openssl                   3.0.13               h7f8727e_1  
packaging                 24.0                     pypi_0    pypi
pillow                    10.3.0                   pypi_0    pypi
pip                       24.0            py311h06a4308_0  
python                    3.11.9               h955ad1f_0  
readline                  8.2                  h5eee18b_0  
setuptools                69.5.1          py311h06a4308_0  
sqlite                    3.45.3               h5eee18b_0  
sympy                     1.12                     pypi_0    pypi
tk                        8.6.14               h39e8969_0  
torch                     2.4.0.dev20240505+cpu          pypi_0    pypi
torch-mlir                20240513.100             pypi_0    pypi
torchvision               0.19.0.dev20240505+cpu          pypi_0    pypi
typing-extensions         4.11.0                   pypi_0    pypi
tzdata                    2024a                h04d1e81_0  
wheel                     0.43.0          py311h06a4308_0  
xz                        5.4.6                h5eee18b_1  
zlib                      1.2.13               h5eee18b_1

And the Python version is 3.11.9 ;)
