Infra Fixes to DLM for Onnxrt Wheel + Int8 #2468

Closed
TedThemistokleous opened this issue Nov 23, 2023 · 9 comments
Assignees
Labels
bugfix Fixes a bug found in the code. Continous Integration Pull request updates parts of continous integration pipeline onnxruntime PR changes interaction between MIGraphX and Onnxruntime

Comments

@TedThemistokleous
Collaborator

Several pieces are being reviewed/added to DLM to fix issues we're seeing in CI and in runs between QA and Dev.

@TedThemistokleous TedThemistokleous added onnxruntime PR changes interaction between MIGraphX and Onnxruntime bugfix Fixes a bug found in the code. Continous Integration Pull request updates parts of continous integration pipeline labels Nov 23, 2023
@TedThemistokleous TedThemistokleous self-assigned this Nov 23, 2023
@TedThemistokleous
Collaborator Author

Currently tracking down issues I've found in 6.0 builds through DLM

  1. Off the wheel files, we're running into failures in the unit tests. These were separated out via: https://github.com/ROCmSoftwarePlatform/DeepLearningModels/pull/1098

Seeing a lot of failures come up with EinSum with MIOpen calls

Currently working on 1 and doing a raw build without the wheel to determine if there's something on 6.0 that isn't built in correctly. This seems to affect the int8 quant set of changes as well when running tests through DLM.

  2. Failure with migx_ort_bert_distilled_benchmarks: related to a test failing because an upstream parameter added to eval_squad.py causes timeouts in CI. Solved by: https://github.com/ROCmSoftwarePlatform/DeepLearningModels/pull/1097

  3. Fix to path name (https://ontrack-internal.amd.com/browse/SWDEV-434010). Fixed and in review with: https://github.com/ROCmSoftwarePlatform/DeepLearningModels/pull/1100

@TedThemistokleous
Collaborator Author

Seeing this fun tidbit

2023-11-23 19:18:21.152706320 [E:onnxruntime:Default, rocm_call.cc:119 RocmCall] MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=aus-navi3x-02 ; file=/workspace/onnxruntime/onnxruntime/core/providers/rocm/reduction/reduction_ops.cc ; line=681 ; expr=miopenReduceTensor( RocmKernel::GetMiopenHandle(rocm_stream), reduce_desc, indices_rocm.get(), indices_bytes, workspace_rocm.get(), workspace_bytes, &one, input_tensor, reinterpret_cast<const HipT*>(input.Data<T>()), &zero, output_tensor, p_output); 

@TedThemistokleous
Collaborator Author

Looks like the result is the following after letting it fail:

[  FAILED  ] 135 tests, listed below:
[  FAILED  ] Einsum.ExplicitEinsumAsBatchedMatmulWithBroadcasting_1
[  FAILED  ] Einsum.ExplicitEinsumAsBatchedDiagonalOp_1
[  FAILED  ] Einsum.ImplicitEinsumAsBatchedDiagonalOp_1
[  FAILED  ] Einsum.EinsumTransposeMatMulTwoInputsTestSuite
[  FAILED  ] Einsum.EinsumTransposeMatMulThreeInputsTestSuite
[  FAILED  ] SoftmaxOperator.GH15949_regression_test
[  FAILED  ] PoolTest.GlobalMaxPool
[  FAILED  ] PoolTest.GlobalMaxPool3D
[  FAILED  ] PoolTest.GlobalAveragePool
[  FAILED  ] PoolTest.GlobalAveragePool_Large_128
[  FAILED  ] PoolTest.GlobalAveragePool_Large_256
[  FAILED  ] ReductionOpTest.ReductionVariationTest
[  FAILED  ] ReductionOpTest.ReduceL1_default_axes_keepdims
[  FAILED  ] ReductionOpTest.ReduceL1_do_not_keep_dims
[  FAILED  ] ReductionOpTest.ReduceL1_do_not_keep_dims_2
[  FAILED  ] ReductionOpTest.ReduceL1_keepdims
[  FAILED  ] ReductionOpTest.ReduceL1
[  FAILED  ] ReductionOpTest.ReduceL1_int32
[  FAILED  ] ReductionOpTest.ReduceL2_default_axes_keepdims
[  FAILED  ] ReductionOpTest.ReduceL2_default_axes_do_not_keep_dims
[  FAILED  ] ReductionOpTest.ReduceL2_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceL2_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ReduceL2_keepdims
[  FAILED  ] ReductionOpTest.ReduceL2
[  FAILED  ] ReductionOpTest.ReduceL2_int32
[  FAILED  ] ReductionOpTest.ReduceLogSum
[  FAILED  ] ReductionOpTest.ReduceLogSumExp_default_axes_keepdims
[  FAILED  ] ReductionOpTest.ReduceLogSumExp_default_axes_do_not_keep_dims
[  FAILED  ] ReductionOpTest.ReduceLogSumExp_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceLogSumExp_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ReduceLogSumExp_keepdims
[  FAILED  ] ReductionOpTest.ReduceLogSumExp
[  FAILED  ] ReductionOpTest.ReduceLogSumExp_half
[  FAILED  ] ReductionOpTest.ReduceMax_default_axes_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax_default_axes_do_not_keep_dims
[  FAILED  ] ReductionOpTest.ReduceMax_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ReduceMax_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax
[  FAILED  ] ReductionOpTest.ReduceMax_half
[  FAILED  ] ReductionOpTest.ReduceMax_int32
[  FAILED  ] ReductionOpTest.ReduceMax_int64
[  FAILED  ] ReductionOpTest.ReduceMax_int8
[  FAILED  ] ReductionOpTest.ReduceMax_uint8
[  FAILED  ] ReductionOpTest.ReduceMean_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceMean_keepdims
[  FAILED  ] ReductionOpTest.ReduceMean
[  FAILED  ] ReductionOpTest.ReduceMean_int32
[  FAILED  ] ReductionOpTest.ReduceMin_default_axes_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_default_axes_do_not_keep_dims
[  FAILED  ] ReductionOpTest.ReduceMin_default_axes_do_not_keep_dims_2D
[  FAILED  ] ReductionOpTest.ReduceMin_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ReduceMin_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin
[  FAILED  ] ReductionOpTest.ReduceMin_half
[  FAILED  ] ReductionOpTest.ReduceMin_int32
[  FAILED  ] ReductionOpTest.ReduceMin_int8
[  FAILED  ] ReductionOpTest.ReduceMin_uint8
[  FAILED  ] ReductionOpTest.ReduceSum
[  FAILED  ] ReductionOpTest.ReduceSumAxesInitializerOpset13
[  FAILED  ] ReductionOpTest.ReduceSum_axes02
[  FAILED  ] ReductionOpTest.ReduceSum_int32
[  FAILED  ] ReductionOpTest.ReduceSumHalfHalf_2
[  FAILED  ] ReductionOpTest.ReduceSumBFloat16_2
[  FAILED  ] ReductionOpTest.ReduceSum_int64
[  FAILED  ] ReductionOpTest.ReduceSum_keepdims
[  FAILED  ] ReductionOpTest.ReduceSum_int32_axes_input
[  FAILED  ] ReductionOpTest.ReduceSumSquare
[  FAILED  ] ReductionOpTest.ReduceSumSquare_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceSumSquare_keepdims
[  FAILED  ] ReductionOpTest.ReduceProd_default_axes_keepdims
[  FAILED  ] ReductionOpTest.ReduceProd_default_axes_do_not_keep_dims
[  FAILED  ] ReductionOpTest.ReduceProd_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceProd_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ReduceProd_keepdims
[  FAILED  ] ReductionOpTest.ReduceProd
[  FAILED  ] ReductionOpTest.ReduceProd_int32
[  FAILED  ] ReductionOpTest.ArgMax
[  FAILED  ] ReductionOpTest.ArgMax_do_not_keepdims
[  FAILED  ] ReductionOpTest.ArgMax_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ArgMax2D
[  FAILED  ] ReductionOpTest.ArgMin
[  FAILED  ] ReductionOpTest.ArgMin_do_not_keepdims
[  FAILED  ] ReductionOpTest.ArgMin_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ReduceInfMax
[  FAILED  ] ReductionOpTest.ReduceInfMin
[  FAILED  ] ReductionOpTest.ReduceInfLogSumExp
[  FAILED  ] ReductionOpTest.ReduceMax_KR_parallel
[  FAILED  ] ReductionOpTest.ReduceMax_KR
[  FAILED  ] ReductionOpTest.ReduceMax_KR_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax_RK
[  FAILED  ] ReductionOpTest.ReduceMax_RK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax_RK_parallel
[  FAILED  ] ReductionOpTest.ReduceMax_KRK
[  FAILED  ] ReductionOpTest.ReduceMax_KRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax_RKR
[  FAILED  ] ReductionOpTest.ReduceMax_RKR_parallel
[  FAILED  ] ReductionOpTest.ReduceMax_RKR_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax_RKRK
[  FAILED  ] ReductionOpTest.ReduceMax_RKRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMean_KRK
[  FAILED  ] ReductionOpTest.ReduceMean_KRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMean_RKR
[  FAILED  ] ReductionOpTest.ReduceMean_RKR_parallel
[  FAILED  ] ReductionOpTest.ReduceMean_RKR_keepdims
[  FAILED  ] ReductionOpTest.ReduceMean_RKRK
[  FAILED  ] ReductionOpTest.ReduceMean_RKRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_KR
[  FAILED  ] ReductionOpTest.ReduceMin_KR_parallel
[  FAILED  ] ReductionOpTest.ReduceMin_KR_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_RK
[  FAILED  ] ReductionOpTest.ReduceMin_RK_parallel
[  FAILED  ] ReductionOpTest.ReduceMin_RK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_KRK
[  FAILED  ] ReductionOpTest.ReduceMin_KRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_RKR
[  FAILED  ] ReductionOpTest.ReduceMin_RKR_parallel
[  FAILED  ] ReductionOpTest.ReduceMin_RKR_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_RKRK
[  FAILED  ] ReductionOpTest.ReduceMin_RKRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceSum_KRK
[  FAILED  ] ReductionOpTest.ReduceSum_KRK_parallel
[  FAILED  ] ReductionOpTest.ReduceSum_KRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceSum_KRK2
[  FAILED  ] ReductionOpTest.ReduceSum_KRK2_keepdims
[  FAILED  ] ReductionOpTest.ReduceSum_RKR
[  FAILED  ] ReductionOpTest.ReduceSum_RKR_parallel
[  FAILED  ] ReductionOpTest.ReduceSum_RKR_parallel_bigger
[  FAILED  ] ReductionOpTest.ReduceSum_RKR_keepdims
[  FAILED  ] ReductionOpTest.ReduceSum_RKR2
[  FAILED  ] ReductionOpTest.ReduceSum_RKR2_keepdims
[  FAILED  ] ReductionOpTest.ReduceSum_RKRK
[  FAILED  ] ReductionOpTest.ReduceSum_RKRK_keepdims
[  FAILED  ] Scatter.InvalidIndex

135 FAILED TESTS
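
To see which operator families dominate a failure list like the one above, the gtest summary lines can be grouped by suite name. This is a minimal sketch, assuming the log has been saved as plain text. The function name `count_failures_by_suite` and the `sample` string are illustrative and not part of DLM or onnxruntime.

```python
import re
from collections import Counter

# Matches gtest summary lines such as "[  FAILED  ] Suite.TestName".
# Lines with trailing text (for example a "(12 ms)" timing or the
# "135 tests, listed below:" header) are deliberately not matched.
FAILED_LINE = re.compile(r"^\[\s*FAILED\s*\]\s+(\w+)\.(\w+)\s*$")


def count_failures_by_suite(log_text):
    """Return a Counter mapping gtest suite name to number of failed tests."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# A short excerpt of the list above, used only to exercise the function.
sample = """\
[  FAILED  ] 135 tests, listed below:
[  FAILED  ] Einsum.ExplicitEinsumAsBatchedDiagonalOp_1
[  FAILED  ] PoolTest.GlobalMaxPool
[  FAILED  ] ReductionOpTest.ReduceL1
[  FAILED  ] ReductionOpTest.ReduceMax_int8
[  FAILED  ] Scatter.InvalidIndex
"""

for suite, n in count_failures_by_suite(sample).most_common():
    print(suite, n)
```

Run against the full list, this kind of tally makes it easy to see that most entries belong to ReductionOpTest, which would be consistent with the miopenReduceTensor error reported earlier in this thread.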

@TedThemistokleous
Collaborator Author

This is still open. Other items need to be completed and are still in review

@TedThemistokleous
Collaborator Author

Still need to sort out the int8 failure and what the unit tests added into DLM for onnxrt are picking up with ROCm.

I've rolled back builds and the wheel as far as -05 and am seeing the same behavior with the failing call and tests.

@TedThemistokleous
Collaborator Author

GELU tests are failing consistently for fp16, always at the end of the run. Not sure if this is related to the other issues we're seeing on the int8 quant side.

2023-11-25 01:41:12.673824527 [V:onnxruntime:, sequential_executor.cc:534 ExecuteThePlan] Number of streams: 1
2023-11-25 01:41:12.673833077 [V:onnxruntime:, sequential_executor.cc:184 SessionScope] Begin execution
2023-11-25 01:41:12.673884857 [V:onnxruntime:, sequential_executor.cc:518 ExecuteKernel] stream 0 launch kernel with idx 5
Output 0, diff=0.021484375 index=(0, 0, 360) ort=2.603515625 torch=2.625000000
[FAILED] Passed_cases=0/100; Max_diff=0.021484375; Diff_count=100
F
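
To put the reported `diff=0.021484375` in context, it can help to express it in units of fp16 spacing (ULPs) near the reference value. This is a rough sketch using only the two numbers from the log line above. The helper names `fp16_ulp` and `diff_in_fp16_ulps` are hypothetical, and the calculation assumes values in the normal fp16 range (10 fraction bits).

```python
import math


def fp16_ulp(x):
    """Spacing between adjacent normal fp16 values near x.

    Only valid for normal fp16 magnitudes (about 6.1e-5 up to 65504);
    zero and subnormals are not handled in this sketch.
    """
    _, exp = math.frexp(abs(x))  # abs(x) = m * 2**exp with 0.5 <= m < 1
    return 2.0 ** (exp - 11)     # 10 fraction bits plus the frexp offset


def diff_in_fp16_ulps(actual, expected):
    """Absolute difference expressed in fp16 ULPs at the expected value."""
    return abs(actual - expected) / fp16_ulp(expected)


# Values copied from the log line: ort=2.603515625 torch=2.625000000
ort, torch_ref = 2.603515625, 2.625
print(abs(ort - torch_ref))               # 0.021484375
print(diff_in_fp16_ulps(ort, torch_ref))  # 11.0
```

Under these assumptions the mismatch is 11 fp16 steps at this magnitude, so it is more than a last-bit rounding difference.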

@TedThemistokleous
Collaborator Author

https://github.com/ROCmSoftwarePlatform/DeepLearningModels/pull/1101 fixes issue with DLM conv_to_onnx as well.

@TedThemistokleous
Collaborator Author

ROCm/onnxruntime#25 fixes the GELU test failure. We invoke fast_math on fp16, and this appears to lose enough accuracy on our Navi-based cards to cause the failure.

Defaulting this to false and adding the proper env vars to toggle this as part of our runs.
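
The "default to false, opt in through the environment" pattern described above could look roughly like the following in a test harness. This is only a sketch: the variable name `EXAMPLE_FP16_FAST_MATH` is hypothetical and is not the name used in ROCm/onnxruntime#25, and the helper `fast_math_enabled` is likewise illustrative.

```python
import os

# Hypothetical variable name, for illustration only; it is NOT the
# environment variable introduced by ROCm/onnxruntime#25.
FAST_MATH_ENV = "EXAMPLE_FP16_FAST_MATH"


def fast_math_enabled(environ=None):
    """Return True only when the toggle is explicitly set to a truthy value.

    An unset or unrecognized value falls back to False, matching the
    "default to false" behavior described in the comment above.
    """
    env = os.environ if environ is None else environ
    value = env.get(FAST_MATH_ENV, "")
    return value.strip().lower() in {"1", "true", "on", "yes"}


print(fast_math_enabled({}))                      # False
print(fast_math_enabled({FAST_MATH_ENV: "1"}))    # True
```

Passing the environment as a parameter keeps the check easy to exercise in tests without mutating the real process environment.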

@TedThemistokleous
Collaborator Author

Blocked from closing this out until we get the RC5 fixes to HIP to test on.
