
Fatal error of resnet18 example in README #77

Open
HahaLan97 opened this issue May 14, 2024 · 3 comments

Comments


HahaLan97 commented May 14, 2024

As the title says, I've run into this error when trying to run the scaleflow-pytorch-pipeline on the generated MLIR code.

samples/pytorch/resnet18/resnet18.mlir:314:11: error: 'memref.copy' op requires the same shape for all operands
    %58 = linalg.generic {indexing_maps = [#map5, #map1], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%57#1 : tensor<1x1xf32>) outs(%57#0 : tensor<1x512x1x1xf32>) {
          ^
samples/pytorch/resnet18/resnet18.mlir:314:11: note: see current operation: "memref.copy"(%52, %51) : (memref<1x1xf32>, memref<1x512x1x1xf32>) -> ()

I added the debug-point option to the pipeline and traced this error to SimplifyCopyPass, where a memref.copy is created from the ins and outs of a linalg.generic. As you can see above, the two operands do not have the same shape.
So I wonder whether the issue is in the GenericOp, in the pass itself, or perhaps in a wrong version of torch-mlir?

I hope the contributors of this repo could help ;D @hanchenye @signorgelato @jeonghm9764

@HahaLan97
Author

I found out that the MLIR program generated by torch-mlir already contains this generic op, whose input and output have different shapes:

%58 = linalg.generic {indexing_maps = [#map5, #map1], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%57#1 : tensor<1x1xf32>) outs(%57#0 : tensor<1x512x1x1xf32>) {
    ^bb0(%in: f32, %out: f32):
      %65 = arith.divf %out, %in : f32
      linalg.yield %65 : f32
    } -> tensor<1x512x1x1xf32>

But because the sizes are both 1, the SplitElementwiseGenericOp pattern in the simplify-copy pass still applies, which inevitably causes the error, since a valid memref::CopyOp cannot be created.
However, since you're using applyPatternsAndFoldGreedily, I added a check to the pattern: if the shapes are indeed different, the pattern fails to match and the pass simply moves on. With that change it works perfectly and generates the correct affine loops.
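To illustrate, here is a minimal standalone sketch of such a guard. The shapes are modeled as plain vectors and the function is only an illustrative stand-in for the real SplitElementwiseGenericOp rewrite, not the actual ScaleHLS code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Shape = std::vector<int64_t>;

// Illustrative model of the guard: when the generic op's input and
// output shapes differ, the pattern reports a match failure instead of
// emitting an invalid memref.copy, so applyPatternsAndFoldGreedily
// simply moves on to other patterns.
bool splitElementwiseGeneric(const Shape &inShape, const Shape &outShape) {
  if (inShape != outShape)
    return false; // match failure: no memref.copy with mismatched shapes
  // ... the original rewrite (replacing the op with a memref.copy) ...
  return true;
}
```

With the shapes from the failing resnet18 op, `splitElementwiseGeneric({1, 1}, {1, 512, 1, 1})` fails to match, while equal shapes still rewrite as before.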

Furthermore, resnet18 got stuck at the func-preprocess pass because of the patterns that convert two integer arithmetic operations, i.e. arith.addi and arith.muli. The cause is that you use:

if (auto lhs = add.getLhs().getDefiningOp<arith::ConstantIndexOp>(); isValidDim(add.getRhs()))

here, which only checks whether isValidDim is true; the result of the init-statement (lhs) is never tested. Changing it to this fixes the error:

auto lhs = add.getLhs().getDefiningOp<arith::ConstantIndexOp>();
if (lhs != nullptr && isValidDim(add.getRhs())) {
    // code
}
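The pitfall here is C++17's if-with-initializer: the init-statement before the semicolon only declares lhs, and only the condition after the semicolon is evaluated. A self-contained toy reproducing this behavior (getConstant and isValidDim are hypothetical stand-ins, not the real MLIR APIs):

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-ins for the real calls; only the control flow matters.
std::optional<int> getConstant(bool present) {
  return present ? std::optional<int>(1) : std::nullopt;
}
bool isValidDim(int) { return true; }

// Buggy form, mirroring the original snippet: the init-statement
// declares lhs, but only the condition after the ';' is tested, so the
// body runs even when no constant was found.
bool buggyMatch(bool hasConstant) {
  if (auto lhs = getConstant(hasConstant); isValidDim(0))
    return true; // taken even when lhs is empty
  return false;
}

// Fixed form: require both the constant and the valid dim.
bool fixedMatch(bool hasConstant) {
  auto lhs = getConstant(hasConstant);
  return lhs.has_value() && isValidDim(0);
}
```

buggyMatch(false) still returns true, which is why the pattern fired on operands without a constant; fixedMatch only succeeds when both conditions hold.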

The other three CNNs don't have the same problem as resnet18, because torch-mlir doesn't generate such a GenericOp for them and there are no integer ops. I put all the changes in my forked repo and opened this PR; I hope it looks good to you.

P.S.

  1. In each Python program, change torch_mlir.compile to torch_mlir.torchscript.compile. I don't know if this is a versioning issue.
  2. A minor change: in .gitmodules you're using the SSH link, which doesn't work on my side, so I changed it to HTTPS.
  3. Add scalehls-lsp-server: a very helpful tool that wasn't added to the project.
  4. A slight bug from Polygeist; you can check it here.


Szzer1 commented Aug 15, 2024

If I modify it as you suggested, I find that new issues arise. I suspect that the current versions of torch-mlir and scalehls might not be fully compatible, but I'm not sure which version would be more appropriate. If it's convenient for you, could you share your environment setup? I would greatly appreciate it.


HahaLan97 commented Aug 22, 2024

> If I modify it as you suggested, I find that new issues arise. I suspect that the current versions of torch-mlir and scalehls might not be fully compatible, but I'm not sure which version would be more appropriate. If it's convenient for you, could you share your environment setup? I would greatly appreciate it.

The versions are most likely the issue here. I tried "ScaleHLS 2.0" as well and the same problem still exists. I haven't worked on this for quite a while, but I hope this list helps you:

(torch-mlir) ➜  ~ conda list               
# packages in environment at /home/user/anaconda3/envs/torch-mlir:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
bzip2                     1.0.8                h5eee18b_6  
ca-certificates           2024.3.11            h06a4308_0  
filelock                  3.14.0                   pypi_0    pypi
fsspec                    2024.3.1                 pypi_0    pypi
jinja2                    3.1.4                    pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1  
libffi                    3.4.4                h6a678d5_1  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libstdcxx-ng              11.2.0               h1234567_1  
libuuid                   1.41.5               h5eee18b_0  
markupsafe                2.1.5                    pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
networkx                  3.3                      pypi_0    pypi
numpy                     1.26.4                   pypi_0    pypi
openssl                   3.0.13               h7f8727e_1  
packaging                 24.0                     pypi_0    pypi
pillow                    10.3.0                   pypi_0    pypi
pip                       24.0            py311h06a4308_0  
python                    3.11.9               h955ad1f_0  
readline                  8.2                  h5eee18b_0  
setuptools                69.5.1          py311h06a4308_0  
sqlite                    3.45.3               h5eee18b_0  
sympy                     1.12                     pypi_0    pypi
tk                        8.6.14               h39e8969_0  
torch                     2.4.0.dev20240505+cpu          pypi_0    pypi
torch-mlir                20240513.100             pypi_0    pypi
torchvision               0.19.0.dev20240505+cpu          pypi_0    pypi
typing-extensions         4.11.0                   pypi_0    pypi
tzdata                    2024a                h04d1e81_0  
wheel                     0.43.0          py311h06a4308_0  
xz                        5.4.6                h5eee18b_1  
zlib                      1.2.13               h5eee18b_1

And the Python version is 3.11.9 ;)
