Infra Fixes to DLM for Onnxrt Wheel + Int8 #2468

TedThemistokleous · 2023-11-23T18:18:41Z

Several pieces are being reviewed or added to DLM to fix issues we're seeing in CI and in runs between QA and Dev.

TedThemistokleous · 2023-11-23T18:41:50Z

Currently tracking down issues I've found in 6.0 builds through DLM:

1. Off the wheel files, we're running into the following failure in the unit tests. These were separated out via: https://github.com/ROCmSoftwarePlatform/DeepLearningModels/pull/1098

Seeing a lot of failures come up with EinSum with MIOpen calls.

Currently working on 1 and doing a raw build without the wheel to determine if there's something on 6.0 that isn't built in correctly. This seems to affect the int8 quant set of changes as well when running tests through DLM.

2. Failure with migx_ort_bert_distilled_benchmarks: related to an upstream parameter added to eval_squad.py that causes timeouts in CI. Solved by: https://github.com/ROCmSoftwarePlatform/DeepLearningModels/pull/1097
3. Fix to path name: https://ontrack-internal.amd.com/browse/SWDEV-434010 . Fixed and in review with: https://github.com/ROCmSoftwarePlatform/DeepLearningModels/pull/1100

TedThemistokleous · 2023-11-23T19:18:55Z

Seeing this fun tidbit

2023-11-23 19:18:21.152706320 [E:onnxruntime:Default, rocm_call.cc:119 RocmCall] MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=aus-navi3x-02 ; file=/workspace/onnxruntime/onnxruntime/core/providers/rocm/reduction/reduction_ops.cc ; line=681 ; expr=miopenReduceTensor( RocmKernel::GetMiopenHandle(rocm_stream), reduce_desc, indices_rocm.get(), indices_bytes, workspace_rocm.get(), workspace_bytes, &one, input_tensor, reinterpret_cast<const HipT*>(input.Data<T>()), &zero, output_tensor, p_output);
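For triage, the fields in a RocmCall error line like the one above can be pulled apart mechanically. The following is only a sketch: the helper name `parse_rocm_failure` is hypothetical, the sample line abbreviates the `expr=` field, and it assumes the " ; " separated `key=value` layout seen in this one log line holds for other failures too.

```python
import re

# Sample taken from the log above; the expr= argument list is abbreviated here.
LOG_LINE = (
    "2023-11-23 19:18:21.152706320 [E:onnxruntime:Default, rocm_call.cc:119 RocmCall] "
    "MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=aus-navi3x-02 ; "
    "file=/workspace/onnxruntime/onnxruntime/core/providers/rocm/reduction/reduction_ops.cc ; "
    "line=681 ; expr=miopenReduceTensor(...);"
)


def parse_rocm_failure(line):
    """Return a dict of fields from a '<LIB> failure <code>: <status> ; k=v ; ...' line.

    Hypothetical helper; returns None when the line has no failure header.
    """
    header = re.search(r"(\w+) failure (\d+): (\w+)", line)
    if header is None:
        return None
    fields = {
        "library": header.group(1),
        "code": int(header.group(2)),
        "status": header.group(3),
    }
    # Everything after the header is " ; "-separated key=value pairs.
    for part in line[header.end():].split(" ; "):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key] = value
    return fields


if __name__ == "__main__":
    info = parse_rocm_failure(LOG_LINE)
    print(info["library"], info["code"], info["status"])
    print(info["file"], info["line"])
```

Grouping CI logs by the `file` and `line` fields is one way to check whether many failing tests trace back to the same call site.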

TedThemistokleous · 2023-11-23T20:07:04Z

Looks like the result is the following after letting it fail:

[  FAILED  ] 135 tests, listed below:
[  FAILED  ] Einsum.ExplicitEinsumAsBatchedMatmulWithBroadcasting_1
[  FAILED  ] Einsum.ExplicitEinsumAsBatchedDiagonalOp_1
[  FAILED  ] Einsum.ImplicitEinsumAsBatchedDiagonalOp_1
[  FAILED  ] Einsum.EinsumTransposeMatMulTwoInputsTestSuite
[  FAILED  ] Einsum.EinsumTransposeMatMulThreeInputsTestSuite
[  FAILED  ] SoftmaxOperator.GH15949_regression_test
[  FAILED  ] PoolTest.GlobalMaxPool
[  FAILED  ] PoolTest.GlobalMaxPool3D
[  FAILED  ] PoolTest.GlobalAveragePool
[  FAILED  ] PoolTest.GlobalAveragePool_Large_128
[  FAILED  ] PoolTest.GlobalAveragePool_Large_256
[  FAILED  ] ReductionOpTest.ReductionVariationTest
[  FAILED  ] ReductionOpTest.ReduceL1_default_axes_keepdims
[  FAILED  ] ReductionOpTest.ReduceL1_do_not_keep_dims
[  FAILED  ] ReductionOpTest.ReduceL1_do_not_keep_dims_2
[  FAILED  ] ReductionOpTest.ReduceL1_keepdims
[  FAILED  ] ReductionOpTest.ReduceL1
[  FAILED  ] ReductionOpTest.ReduceL1_int32
[  FAILED  ] ReductionOpTest.ReduceL2_default_axes_keepdims
[  FAILED  ] ReductionOpTest.ReduceL2_default_axes_do_not_keep_dims
[  FAILED  ] ReductionOpTest.ReduceL2_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceL2_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ReduceL2_keepdims
[  FAILED  ] ReductionOpTest.ReduceL2
[  FAILED  ] ReductionOpTest.ReduceL2_int32
[  FAILED  ] ReductionOpTest.ReduceLogSum
[  FAILED  ] ReductionOpTest.ReduceLogSumExp_default_axes_keepdims
[  FAILED  ] ReductionOpTest.ReduceLogSumExp_default_axes_do_not_keep_dims
[  FAILED  ] ReductionOpTest.ReduceLogSumExp_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceLogSumExp_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ReduceLogSumExp_keepdims
[  FAILED  ] ReductionOpTest.ReduceLogSumExp
[  FAILED  ] ReductionOpTest.ReduceLogSumExp_half
[  FAILED  ] ReductionOpTest.ReduceMax_default_axes_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax_default_axes_do_not_keep_dims
[  FAILED  ] ReductionOpTest.ReduceMax_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ReduceMax_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax
[  FAILED  ] ReductionOpTest.ReduceMax_half
[  FAILED  ] ReductionOpTest.ReduceMax_int32
[  FAILED  ] ReductionOpTest.ReduceMax_int64
[  FAILED  ] ReductionOpTest.ReduceMax_int8
[  FAILED  ] ReductionOpTest.ReduceMax_uint8
[  FAILED  ] ReductionOpTest.ReduceMean_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceMean_keepdims
[  FAILED  ] ReductionOpTest.ReduceMean
[  FAILED  ] ReductionOpTest.ReduceMean_int32
[  FAILED  ] ReductionOpTest.ReduceMin_default_axes_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_default_axes_do_not_keep_dims
[  FAILED  ] ReductionOpTest.ReduceMin_default_axes_do_not_keep_dims_2D
[  FAILED  ] ReductionOpTest.ReduceMin_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ReduceMin_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin
[  FAILED  ] ReductionOpTest.ReduceMin_half
[  FAILED  ] ReductionOpTest.ReduceMin_int32
[  FAILED  ] ReductionOpTest.ReduceMin_int8
[  FAILED  ] ReductionOpTest.ReduceMin_uint8
[  FAILED  ] ReductionOpTest.ReduceSum
[  FAILED  ] ReductionOpTest.ReduceSumAxesInitializerOpset13
[  FAILED  ] ReductionOpTest.ReduceSum_axes02
[  FAILED  ] ReductionOpTest.ReduceSum_int32
[  FAILED  ] ReductionOpTest.ReduceSumHalfHalf_2
[  FAILED  ] ReductionOpTest.ReduceSumBFloat16_2
[  FAILED  ] ReductionOpTest.ReduceSum_int64
[  FAILED  ] ReductionOpTest.ReduceSum_keepdims
[  FAILED  ] ReductionOpTest.ReduceSum_int32_axes_input
[  FAILED  ] ReductionOpTest.ReduceSumSquare
[  FAILED  ] ReductionOpTest.ReduceSumSquare_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceSumSquare_keepdims
[  FAILED  ] ReductionOpTest.ReduceProd_default_axes_keepdims
[  FAILED  ] ReductionOpTest.ReduceProd_default_axes_do_not_keep_dims
[  FAILED  ] ReductionOpTest.ReduceProd_do_not_keepdims
[  FAILED  ] ReductionOpTest.ReduceProd_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ReduceProd_keepdims
[  FAILED  ] ReductionOpTest.ReduceProd
[  FAILED  ] ReductionOpTest.ReduceProd_int32
[  FAILED  ] ReductionOpTest.ArgMax
[  FAILED  ] ReductionOpTest.ArgMax_do_not_keepdims
[  FAILED  ] ReductionOpTest.ArgMax_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ArgMax2D
[  FAILED  ] ReductionOpTest.ArgMin
[  FAILED  ] ReductionOpTest.ArgMin_do_not_keepdims
[  FAILED  ] ReductionOpTest.ArgMin_do_not_keepdims_2
[  FAILED  ] ReductionOpTest.ReduceInfMax
[  FAILED  ] ReductionOpTest.ReduceInfMin
[  FAILED  ] ReductionOpTest.ReduceInfLogSumExp
[  FAILED  ] ReductionOpTest.ReduceMax_KR_parallel
[  FAILED  ] ReductionOpTest.ReduceMax_KR
[  FAILED  ] ReductionOpTest.ReduceMax_KR_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax_RK
[  FAILED  ] ReductionOpTest.ReduceMax_RK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax_RK_parallel
[  FAILED  ] ReductionOpTest.ReduceMax_KRK
[  FAILED  ] ReductionOpTest.ReduceMax_KRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax_RKR
[  FAILED  ] ReductionOpTest.ReduceMax_RKR_parallel
[  FAILED  ] ReductionOpTest.ReduceMax_RKR_keepdims
[  FAILED  ] ReductionOpTest.ReduceMax_RKRK
[  FAILED  ] ReductionOpTest.ReduceMax_RKRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMean_KRK
[  FAILED  ] ReductionOpTest.ReduceMean_KRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMean_RKR
[  FAILED  ] ReductionOpTest.ReduceMean_RKR_parallel
[  FAILED  ] ReductionOpTest.ReduceMean_RKR_keepdims
[  FAILED  ] ReductionOpTest.ReduceMean_RKRK
[  FAILED  ] ReductionOpTest.ReduceMean_RKRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_KR
[  FAILED  ] ReductionOpTest.ReduceMin_KR_parallel
[  FAILED  ] ReductionOpTest.ReduceMin_KR_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_RK
[  FAILED  ] ReductionOpTest.ReduceMin_RK_parallel
[  FAILED  ] ReductionOpTest.ReduceMin_RK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_KRK
[  FAILED  ] ReductionOpTest.ReduceMin_KRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_RKR
[  FAILED  ] ReductionOpTest.ReduceMin_RKR_parallel
[  FAILED  ] ReductionOpTest.ReduceMin_RKR_keepdims
[  FAILED  ] ReductionOpTest.ReduceMin_RKRK
[  FAILED  ] ReductionOpTest.ReduceMin_RKRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceSum_KRK
[  FAILED  ] ReductionOpTest.ReduceSum_KRK_parallel
[  FAILED  ] ReductionOpTest.ReduceSum_KRK_keepdims
[  FAILED  ] ReductionOpTest.ReduceSum_KRK2
[  FAILED  ] ReductionOpTest.ReduceSum_KRK2_keepdims
[  FAILED  ] ReductionOpTest.ReduceSum_RKR
[  FAILED  ] ReductionOpTest.ReduceSum_RKR_parallel
[  FAILED  ] ReductionOpTest.ReduceSum_RKR_parallel_bigger
[  FAILED  ] ReductionOpTest.ReduceSum_RKR_keepdims
[  FAILED  ] ReductionOpTest.ReduceSum_RKR2
[  FAILED  ] ReductionOpTest.ReduceSum_RKR2_keepdims
[  FAILED  ] ReductionOpTest.ReduceSum_RKRK
[  FAILED  ] ReductionOpTest.ReduceSum_RKRK_keepdims
[  FAILED  ] Scatter.InvalidIndex

135 FAILED TESTS
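A quick way to see where the failures cluster is to count them per test suite. This is a minimal sketch that assumes the googletest summary format shown above (`[  FAILED  ] Suite.Test`); the function name `failures_by_suite` and the short sample are illustrative only.

```python
import re
from collections import Counter

# Matches summary lines such as "[  FAILED  ] ReductionOpTest.ReduceL1".
# The "135 tests, listed below:" header has no Suite.Test token and is skipped.
FAILED_RE = re.compile(r"^\[\s*FAILED\s*\]\s+(\w+)\.(\w+)\s*$")


def failures_by_suite(gtest_output):
    """Count failed tests per suite from a googletest summary block."""
    counts = Counter()
    for line in gtest_output.splitlines():
        match = FAILED_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


SAMPLE = """\
[  FAILED  ] 135 tests, listed below:
[  FAILED  ] Einsum.ExplicitEinsumAsBatchedDiagonalOp_1
[  FAILED  ] PoolTest.GlobalMaxPool
[  FAILED  ] ReductionOpTest.ReduceL1
[  FAILED  ] ReductionOpTest.ReduceL2
[  FAILED  ] Scatter.InvalidIndex
"""

if __name__ == "__main__":
    for suite, count in failures_by_suite(SAMPLE).most_common():
        print(f"{suite}: {count}")
```

Counting the full list above by hand gives ReductionOpTest 123, Einsum 5, PoolTest 5, SoftmaxOperator 1, and Scatter 1, which adds up to the reported 135. Most failures sit in reduction-style tests, which at least looks consistent with the miopenReduceTensor error logged earlier.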

TedThemistokleous · 2023-11-24T14:04:52Z

This is still open. Other items need to be completed and are still in review

TedThemistokleous · 2023-11-25T01:04:50Z

Still need to sort out the issue with int8 failing and what the UTs added into DLM for onnxrt are picking up with ROCm.

I've rolled back builds and the wheel as far as -05 and am seeing the same behavior with the failing call and tests.

TedThemistokleous · 2023-11-25T01:42:46Z

GELU tests are failing consistently for fp16, and always at the end of the run. Not sure if this is related to the other issues we're seeing on the int8 quant side as well.

2023-11-25 01:41:12.673824527 [V:onnxruntime:, sequential_executor.cc:534 ExecuteThePlan] Number of streams: 1
2023-11-25 01:41:12.673833077 [V:onnxruntime:, sequential_executor.cc:184 SessionScope] Begin execution
2023-11-25 01:41:12.673884857 [V:onnxruntime:, sequential_executor.cc:518 ExecuteKernel] stream 0 launch kernel with idx 5
Output 0, diff=0.021484375 index=(0, 0, 360) ort=2.603515625 torch=2.625000000
[FAILED] Passed_cases=0/100; Max_diff=0.021484375; Diff_count=100
F
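To put the reported diff in context, it can be expressed in fp16 units in the last place (ULPs). The sketch below is only a back-of-the-envelope check using the two values from the log above; it assumes normal (non-subnormal) fp16 values and says nothing about the tolerance the test actually applies, which is not shown here.

```python
import math


def fp16_spacing(x):
    """Gap between adjacent normal fp16 values near x.

    IEEE half precision has 10 explicit mantissa bits, so values in
    [2**e, 2**(e+1)) are spaced 2**(e - 10) apart.
    """
    exponent = math.floor(math.log2(abs(x)))
    return 2.0 ** (exponent - 10)


# Values copied from the failing GELU output line above.
ort_value = 2.603515625
torch_value = 2.625

diff = abs(ort_value - torch_value)
ulps = diff / fp16_spacing(torch_value)

if __name__ == "__main__":
    print(f"diff={diff} spacing={fp16_spacing(torch_value)} ulps={ulps}")
```

Both values are exactly representable in fp16 (1333/512 and 1344/512), so the gap works out to 11 fp16 ULPs. That is larger than a single rounding step, which fits an approximate math path being involved rather than rounding alone.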

TedThemistokleous · 2023-11-29T05:11:28Z

https://github.com/ROCmSoftwarePlatform/DeepLearningModels/pull/1101 fixes issue with DLM conv_to_onnx as well.

TedThemistokleous · 2023-11-29T05:12:50Z

ROCm/onnxruntime#25 fixes the issues seen with our GELU test failing. The failure is due to how we invoke fast_math on fp16: we seem to lose enough accuracy on our Navi-based cards to cause the test to fail.

Defaulting this to false and adding the proper env vars to toggle it as part of our runs.

TedThemistokleous · 2023-12-06T18:55:37Z

Blocked from closing this out until we get the RC5 fixes to HIP to test on.

TedThemistokleous added the onnxruntime (PR changes interaction between MIGraphX and Onnxruntime), bugfix (Fixes a bug found in the code.), and Continous Integration (Pull request updates parts of continous integration pipeline) labels Nov 23, 2023

TedThemistokleous self-assigned this Nov 23, 2023

pramenku closed this as completed Nov 24, 2023

TedThemistokleous reopened this Nov 24, 2023

TedThemistokleous closed this as completed Nov 24, 2023

TedThemistokleous reopened this Nov 24, 2023

TedThemistokleous closed this as completed Nov 25, 2023

TedThemistokleous reopened this Nov 25, 2023

TedThemistokleous mentioned this issue Dec 7, 2023

Enable ORT accuracy tests to verify int8 #1904

Closed

TedThemistokleous closed this as completed Dec 7, 2023
