
LayerNorm unit test fails in NGC Docker 24.10 environment #2

Open
fanzhongyi opened this issue Nov 12, 2024 · 5 comments

Comments

@fanzhongyi

I am encountering an issue where the LayerNorm unit tests fail in the NGC Docker 24.10 environment. Specifically, the gradient comparison between the Triton-based TritonLayerNorm and the standard torch.nn.LayerNorm does not pass: it seems that the gradients for the weight and bias parameters in the custom Triton-based LayerNorm implementation are not being calculated properly. The assertion error message is:

tests/test_layernorm.py:69: AssertionError
=========================== short test summary info ===========================
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[1-128-256] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[8-512-1024] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[16-256-512] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[4-1024-768] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[8-1024-1024] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[16-1024-1024] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[32-512-1024] - AssertionError: LayerNorm weight gradients don't match!
==================== 7 failed, 36 passed in 93.55s (0:01:33) ====================
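For reference, a minimal sketch of the kind of gradient-matching check that fails here (the helper name `grads_match` and its arguments are my own; the repository's actual test in tests/test_layernorm.py may differ):

```python
# Hypothetical sketch of a backward-match check between a custom LayerNorm
# module and torch.nn.LayerNorm. Names here are illustrative assumptions,
# not the repository's actual test code.
import torch
import torch.nn as nn


def grads_match(custom_ln, batch, seq, dim, atol=1e-4):
    """Run identical inputs through both layers and compare parameter grads."""
    ref_ln = nn.LayerNorm(dim)
    # Start both layers from the same weight and bias.
    with torch.no_grad():
        ref_ln.weight.copy_(custom_ln.weight)
        ref_ln.bias.copy_(custom_ln.bias)

    x = torch.randn(batch, seq, dim, requires_grad=True)
    x_ref = x.detach().clone().requires_grad_(True)

    custom_ln(x).sum().backward()
    ref_ln(x_ref).sum().backward()

    # The failing assertions correspond to these comparisons.
    return (torch.allclose(custom_ln.weight.grad, ref_ln.weight.grad, atol=atol)
            and torch.allclose(custom_ln.bias.grad, ref_ln.bias.grad, atol=atol))
```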

@dame-cell
Owner

Ah yes, I changed the layernorm kernel a few weeks ago; I might have to check the tests again.

Thank you so much for reporting.
Will fix it by today.

@dame-cell
Owner

Hey, I was able to fix it.
The problem was that the backward kernel returned only the input gradients; the weight and bias gradients were no longer output.

You can now run the tests and they should pass:

============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.3.3, pluggy-1.5.0
plugins: time-machine-2.14.1, typeguard-4.3.0, anyio-4.4.0
collected 16 items                                                             

test_layernorm.py ................                                       [100%]

============================== 16 passed in 8.63s ==============================
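The bug described above (a backward that drops the weight and bias gradients) can be sketched in plain PyTorch, without the Triton kernels, like this. `LayerNormFn` and the math helpers are illustrative assumptions, not the repository's actual implementation:

```python
# Minimal autograd.Function sketch of a LayerNorm backward that returns
# gradients for input, weight, AND bias. Returning None for dw/db (the bug
# described above) would leave weight.grad and bias.grad unset.
import torch


class LayerNormFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, bias, eps=1e-5):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        rstd = (var + eps).rsqrt()
        x_hat = (x - mean) * rstd
        ctx.save_for_backward(x_hat, weight, rstd)
        return x_hat * weight + bias

    @staticmethod
    def backward(ctx, dy):
        x_hat, weight, rstd = ctx.saved_tensors
        # Per-feature parameter grads: reduce over all leading dims.
        lead = tuple(range(dy.dim() - 1))
        dw = (dy * x_hat).sum(dim=lead)
        db = dy.sum(dim=lead)
        # Input grad for y = x_hat * w + b with x_hat = (x - mean) * rstd.
        wdy = dy * weight
        dx = (wdy
              - wdy.mean(-1, keepdim=True)
              - x_hat * (wdy * x_hat).mean(-1, keepdim=True)) * rstd
        # Returning (dx, None, None, None) here was the failure mode:
        # weight and bias would then never receive gradients.
        return dx, dw, db, None
```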

@fanzhongyi
Author

Thank you for your prompt response. I have tested the fix, but unfortunately I found a bug in the test code after the recent commit. The issue occurs in the following lines, where the assertion always passes, even though it shouldn't. After I fixed this, the test still fails.

Additionally, I believe it would be beneficial to add a test case that explicitly compares the gradients of weight and bias between the Triton-based TritonLayerNorm and PyTorch's torch.nn.LayerNorm. This would ensure that the gradient calculations are consistent across both implementations.

Thanks again for your support.

@dame-cell dame-cell reopened this Nov 13, 2024
@dame-cell
Owner

I see, ok. I will try to add a test case that explicitly compares the weight and bias gradients between the Triton-based TritonLayerNorm and PyTorch's torch.nn.LayerNorm,

and I'll check my implementation again.

@fanzhongyi
Author

Looking forward to your updates, and thank you very much for your open-source work.

@dame-cell dame-cell pinned this issue Nov 13, 2024