Skip to content

MIOpen error tolerance #3638

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

Beanavil
Copy link
Member

Scope

While testing the variant 0 for bnorm backward spatial single, we observed that the error tolerance in MIOpenDriver is sub-optimal. For instance, the following results were observed

Backwards prop batch norm verification passed on dx (0.151171)
Backwards prop batch norm verification passed on dscale (0.277024)
Backwards prop batch norm verification passed on dbias (0.217736)
Backwards Prop Batch Norm Verifies on CPU and GPU.

Notice that the execution is successful, according to the driver checks. However, the error reported is high so this should actually be failing.

Notes for the reviewer

  • Fixed the maxrms definition because it was too big compared to the rms values being computed (due to the rms normalization done). The VerifyForward method seems to also be using the same value for maxrms as the new implementation in VerifyBackward.
  • Fixed the normalization of the rms. Previously we were finding the maximum absolute value from both the reference values and the results, but this may not be correct because it reduces the significance of the rms: if the results differ a lot from the reference values (e.g. ref values are order of 10 and results are order 1000) then the normalization will divide the rms obtained (which will be order of 100) by a number order of 1000 and then the normalized rms will be 100/1000 = 0.1 which is way lower than what it should be.
    • The new implementation only takes into account the reference results for computing the normalization factor

@Beanavil Beanavil changed the title MiOpen error tolerance MIOpen error tolerance Mar 17, 2025
test/verify.hpp Outdated
@@ -211,6 +211,7 @@ std::size_t mismatch_diff(R1&& r1, R2&& r2, T diff)
float_equal, diff, std::bind(abs_diff, std::placeholders::_1, std::placeholders::_2)));
}

// Reference values are r1, measured values are r2
template <class R1, class R2>
double rms_range(R1&& r1, R2&& r2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rename these arguments to make it clear that r1 is intended to be the reference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

test/verify.hpp Outdated
Comment on lines 224 to 227
double mag2 = static_cast<double>(*std::max_element(r2.begin(), r2.end(), compare_mag));
double mag =
std::max({std::fabs(mag1), std::fabs(mag2), std::numeric_limits<double>::min()});
return std::sqrt(square_difference) / (std::sqrt(n) * mag);
double normalizer = std::max({std::fabs(mag1), std::numeric_limits<double>::min()});
return std::sqrt(square_difference) / (std::sqrt(n) * normalizer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe this will be a spicy one.
Interesting find though!

@BrianHarrisonAMD
Copy link
Contributor

Starting CI!

Copy link
Contributor

@averinevg averinevg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Beanavil Beanavil force-pushed the miopen-error-tolerance branch from dd06398 to 389e109 Compare March 21, 2025 15:59
@BrianHarrisonAMD
Copy link
Contributor

Restarting CI again.

@BrianHarrisonAMD
Copy link
Contributor

Failing on formatting.
Please run the formatting script.

@Beanavil Beanavil force-pushed the miopen-error-tolerance branch from 389e109 to fea5f7d Compare March 31, 2025 08:21
@Beanavil
Copy link
Member Author

@BrianHarrisonAMD opsie, missed to run this before. Should be ok now!

@Beanavil
Copy link
Member Author

Btw I see that some smoke tests are failing for gfx90a (MI200), but I have double-checked on our side and they pass on our MI200s, how should we proceed for those?

@BrianHarrisonAMD
Copy link
Contributor

Btw I see that some smoke tests are failing for gfx90a (MI200), but I have double-checked on our side and they pass on our MI200s, how should we proceed for those?

Looks like it was stopped / aborted before fully finishing.
Ill run the internal CI to see if it passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants