
LayerNorm unit test fails in NGC Docker 24.10 environment #2

Open
fanzhongyi opened this issue Nov 12, 2024 · 5 comments

Comments

@fanzhongyi

I am encountering an issue where the LayerNorm unit tests fail in the NGC Docker 24.10 environment. Specifically, the gradient comparison between the Triton-based TritonLayerNorm and the standard torch.nn.LayerNorm does not pass: it seems that the gradients for the weight and bias parameters in the custom Triton-based LayerNorm implementation are not being calculated properly. The assertion error message is:

tests/test_layernorm.py:69: AssertionError
=========================== short test summary info ===========================
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[1-128-256] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[8-512-1024] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[16-256-512] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[4-1024-768] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[8-1024-1024] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[16-1024-1024] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[32-512-1024] - AssertionError: LayerNorm weight gradients don't match!
==================== 7 failed, 36 passed in 93.55s (0:01:33) ====================
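For reference, a minimal sketch of the kind of gradient-matching check that fails here (the helper name `grads_match` and its arguments are my own; the repository's actual test in tests/test_layernorm.py may differ):

```python
# Hypothetical sketch of a backward-match check between a custom LayerNorm
# module and torch.nn.LayerNorm. Names here are illustrative assumptions,
# not the repository's actual test code.
import torch
import torch.nn as nn


def grads_match(custom_ln, batch, seq, dim, atol=1e-4):
    """Run identical inputs through both layers and compare parameter grads."""
    ref_ln = nn.LayerNorm(dim)
    # Start both layers from the same weight and bias.
    with torch.no_grad():
        ref_ln.weight.copy_(custom_ln.weight)
        ref_ln.bias.copy_(custom_ln.bias)

    x = torch.randn(batch, seq, dim, requires_grad=True)
    x_ref = x.detach().clone().requires_grad_(True)

    custom_ln(x).sum().backward()
    ref_ln(x_ref).sum().backward()

    # The failing assertions correspond to these comparisons.
    return (torch.allclose(custom_ln.weight.grad, ref_ln.weight.grad, atol=atol)
            and torch.allclose(custom_ln.bias.grad, ref_ln.bias.grad, atol=atol))
```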

@dame-cell
Owner

Ah yes, I changed the layernorm kernel a few weeks ago; I might have to check the tests again.

Thank you so much for reporting.
Will fix it by today.

@dame-cell
Owner

Hey, I was able to fix it.
The problem was that the backward kernel returned only the input gradients; the weight and bias gradients were no longer output.

You can now run the tests and they should pass:

============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.3.3, pluggy-1.5.0
plugins: time-machine-2.14.1, typeguard-4.3.0, anyio-4.4.0
collected 16 items                                                             

test_layernorm.py ................                                       [100%]

============================== 16 passed in 8.63s ==============================
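The bug described above (a backward that drops the weight and bias gradients) can be sketched in plain PyTorch, without the Triton kernels, like this. `LayerNormFn` and the math helpers are illustrative assumptions, not the repository's actual implementation:

```python
# Minimal autograd.Function sketch of a LayerNorm backward that returns
# gradients for input, weight, AND bias. Returning None for dw/db (the bug
# described above) would leave weight.grad and bias.grad unset.
import torch


class LayerNormFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, bias, eps=1e-5):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        rstd = (var + eps).rsqrt()
        x_hat = (x - mean) * rstd
        ctx.save_for_backward(x_hat, weight, rstd)
        return x_hat * weight + bias

    @staticmethod
    def backward(ctx, dy):
        x_hat, weight, rstd = ctx.saved_tensors
        # Per-feature parameter grads: reduce over all leading dims.
        lead = tuple(range(dy.dim() - 1))
        dw = (dy * x_hat).sum(dim=lead)
        db = dy.sum(dim=lead)
        # Input grad for y = x_hat * w + b with x_hat = (x - mean) * rstd.
        wdy = dy * weight
        dx = (wdy
              - wdy.mean(-1, keepdim=True)
              - x_hat * (wdy * x_hat).mean(-1, keepdim=True)) * rstd
        # Returning (dx, None, None, None) here was the failure mode:
        # weight and bias would then never receive gradients.
        return dx, dw, db, None
```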

@fanzhongyi
Author

Thank you for your prompt response. I have tested the fix, but unfortunately I found a bug in the test code after the recent commit. The issue occurs in the following lines, where the assertion always passes, even though it shouldn't. After I fixed this, the test still fails.

Additionally, I believe it would be beneficial to add a test case that explicitly compares the gradients of weight and bias between the Triton-based TritonLayerNorm and PyTorch's torch.nn.LayerNorm. This would ensure that the gradient calculations are consistent across both implementations.

Thanks again for your support.

@dame-cell dame-cell reopened this Nov 13, 2024
@dame-cell
Owner

I see, ok. I will try to add a test case that explicitly compares the weight and bias gradients between the Triton-based TritonLayerNorm and PyTorch's torch.nn.LayerNorm,

and I'll check my implementation again.

@fanzhongyi
Author

Looking forward to your updates, and thank you very much for your open-source work.

@dame-cell dame-cell pinned this issue Nov 13, 2024