Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NNPA] Check return values of zdnn calls. (follow-up of PR#2267) #2338

Draft
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

imaihal
Copy link
Collaborator

@imaihal imaihal commented Jun 27, 2023

This PR is follow-up of PR #2267.

Current main branch does not check the return values of zDNN functions and continues the execution without displaying any messages in onnx-mlir if the function returns errors . (We can see errors of zDNN, but they are not enough to identify the operations that generate errors.) This PR implements code to check it and display API name and the value when it returns errors. This message helps to identify the operation that generates errors and its causes. Also, this PR provides an option --func-call-error-exit to stop execution when the functions return an error.

Example error message in T5

onnx-mlir: Error in zDNN call(ZDNN_LOG): returned 0x20001

Signed-off-by: Haruki Imai [email protected]
Co-authored-by: Yasushi Negishi [email protected]

negiyas and others added 20 commits May 24, 2023 22:47
… krnl-to-llvm conversion.

Signed-off-by: Yasushi Negishi <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]>
@imaihal imaihal marked this pull request as ready for review June 29, 2023 16:00
Copy link
Collaborator

@tungld tungld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about having two options (feel free to change the option names):

  • nnpa-check-zdnn-return-value to emit code that checks zdnn return values
  • nnpa-stop-if-zdnn-failed to terminate if the zdnn function returns error, this is valid if nnpa-check-zddn-return-value is on.

// Exit
if (funcCallErrorExit) {
MLIRContext *context = rewriter.getContext();
Type int32Ty = IntegerType::get(context, 32);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use i64 since we have an issue with i32 on big endian machine. More info: #778

// compare the return value with ref
std::string errorMsg =
"onnx-mlir: Error in zDNN call(" + apiIdStr(apiId) + "): returned ";
equalOrExit(module, rewriter, loc, ref, ret, errorMsg, funcCallErrorExit);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since NNPA calls already check the return value, I recommend to make this check optional when we need by providing an option to onnx-mlir.

Type int32Ty = IntegerType::get(context, 32);
Value one = create.math.constant(int32Ty, 1);
FlatSymbolRefAttr exitRef = krnl::getOrInsertExit(rewriter, module);
create.llvm.call({}, exitRef, {one});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In onnx-mlir, we return null instead of exit. Could you change it to return null?

llvm::cl::desc("Execution failed when external function call failed."
" Currently only zDNN calls in NNPA are supported."),
llvm::cl::init(false), llvm::cl::cat(OnnxMlirOptions));

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this inside the NNPA folder, since it is for NNPA only. Perhaps, prefixing NNPA options with nnpa is easier for user.

%0 = memref.alloc() : memref<10x10xf32>
%1 = memref.alloc() : memref<1x1x32x64xf32>
"zlow.stick"(%0, %1) : (memref<10x10xf32>, memref<1x1x32x64xf32>) -> ()
return
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following tests are sharing the same pattern. It looks redundant to test for all operations.

Instead, you can have a test with two or three zlow ops. e.g.

y = zlow.stick(x)
z = zlow.relu(y)
out = zlow.unstick(z)
return out

// CHECK: llvm.call @printf([[VAR_73_1_]], [[VAR_69_1_]]) : (!llvm.ptr, i32) -> ()
// CHECK: [[VAR_74_1_:%.+]] = llvm.mlir.constant(1 : i32) : i32
// CHECK: llvm.call @exit([[VAR_74_1_]]) : (i32) -> ()
// CHECK: llvm.br ^bb2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you call exit, llvm.br ^bb2 looks like a dead code?

@AlexandreEichenberger
Copy link
Collaborator

@imaihal @tungld Hi level question: have we approached the zDNN folks to see if it makes more sense to have them generate an error log and/or abort?

@imaihal
Copy link
Collaborator Author

imaihal commented Aug 2, 2023

@imaihal @tungld Hi level question: have we approached the zDNN folks to see if it makes more sense to have them generate an error log and/or abort?

@AlexandreEichenberger They don't have any plan to update error handling in zDNN now. The zDNN returns error code, but currently onnx-mlir ignores it. So, this PR adds mechanism to handle it.

@tungld
Copy link
Collaborator

tungld commented Aug 2, 2023

@AlexandreEichenberger They don't have any plan to update error handling in zDNN now. The zDNN returns error code, but currently onnx-mlir ignores it. So, this PR adds mechanism to handle it.

@imaihal zDNN checks the error code and display an error message for every zdnn function: https://github.com/IBM/zDNN/blob/main/zdnn/aiu_ops.c#L151

if (status == ZDNN_OK) {
    if (ef & EF_RANGE_VIOLATION_MASK) {
      status =
          ZDNN_STATUS(ZDNN_ELEMENT_RANGE_VIOLATION,
                      "Range violation on tensor data", NO_ARG); /*
                               AIU operation returned a RANGE VIOLATION, set as
                               a warning code and continue processing */
    } else if (ef & ~EF_RANGE_VIOLATION_MASK) {
      return status = ZDNN_STATUS(ZDNN_UNSUPPORTED_AIU_EXCEPTION,
                                  "Unsupported exception on ZDNN operation",
                                  NO_ARG); /* AIU operation returned an
                               unexpected exception, return as a failure */
    }

It would be very easy to add the function_code into the ZDNN_STATUS's message. Do you know why they reject updating the message with function code?

@imaihal
Copy link
Collaborator Author

imaihal commented Aug 2, 2023

@tungld Sorry, correctly speaking, it is not rejected. @negiyas created an issue about IBM/zDNN#17, but not answered yet.
I remember there was a question about whether onnx-mlir checks the return value of zDNN. Currently onnx-mlir does not check, so, we implemented this mechanism.
Are you concerning about performance overhead?

@AlexandreEichenberger
Copy link
Collaborator

Are you concerning about performance overhead?

If you think about it, if you have 100 calls to zdnn matmul, you will have 100 insertion of code that does this check. If it is in the library, there is already some code checking for errors, and it would be only once per operation (e.g. only 1 copy in the matmul code regardless of how many instances of calls there is to the zdnn matmul).

@imaihal imaihal marked this pull request as draft August 28, 2023 06:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants