Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CudaKernelTest.*SoftmaxGrad* part of SWDEV-477109 #63

Merged

Conversation

xinyazhang
Copy link

Description

Raise the tolerance of CudaKernelTest.SoftmaxGrad tests.

Motivation and Context

Temporary workaround of https://ontrack-internal.amd.com/browse/SWDEV-477109 to make CI happy.

@jeffdaily
Copy link
Collaborator

Is the plan to upstream this? Can we explain why we need the relaxed tolerance?

@xinyazhang
Copy link
Author

Is the plan to upstream this? Can we explain why we need the relaxed tolerance?

No we don't have a plan for now. Mainly due to we don't have a good explanation.

Notably LogSoftmaxGrad_LargeTensor_LastAxis_Float16_NoPowerOfTwo cannot be reproduce with MIOpenDriver and need extra efforts to inspect the very specific input tensors.

Copy link

@TedThemistokleous TedThemistokleous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.

@xinyazhang Do we want to backport this fix as well to the 6.2.x builds as well by putting this to rocm6.2_internal_testing. If so cherry-pick, open PR and I can approve.

@xinyazhang
Copy link
Author

xinyazhang commented Sep 26, 2024

@xinyazhang Do we want to backport this fix as well to the 6.2.x builds as well by putting this to rocm6.2_internal_testing. If so cherry-pick, open PR and I can approve.

I'm leading to keep it in current branch because some tolerance has been raised to an unacceptable level (1.0 atol for LogSoftmaxGrad_LargeTensor_LastAxis_Float16_NoPowerOfTwo).

This should be considered as a temporary fix to overcome CI's unhappiness due to the sudden bump of MIOpens' numerical error in 6.3. If 6.2's CI goes green we don't need to raise their tolerance.

@TedThemistokleous
Copy link

@xinyazhang Do we want to backport this fix as well to the 6.2.x builds as well by putting this to rocm6.2_internal_testing. If so cherry-pick, open PR and I can approve.

I'm leading to keep it in current branch because some tolerance has been raised to an unacceptable level (1.0 atol for LogSoftmaxGrad_LargeTensor_LastAxis_Float16_NoPowerOfTwo).

This should be considered as a temporary fix to overcome CI's unhappiness due to the sudden bump of MIOpens' numerical error in 6.3. If 6.2's CI goes green we don't need to raise their tolerance.

Understood. All good then.

@xinyazhang xinyazhang merged commit 5da7c5a into rocm6.3_internal_testing Oct 18, 2024
10 of 15 checks passed
TedThemistokleous pushed a commit that referenced this pull request Jan 3, 2025
Raise the tolerance of CudaKernelTest.SoftmaxGrad tests.

This is a temporary workaround of https://ontrack-internal.amd.com/browse/SWDEV-477109 to make CI happy.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants