take_along_axis validation error (race condition?) #4003

Open
naoyam opened this issue Mar 3, 2025 · 0 comments

naoyam commented Mar 3, 2025

Reported by @csarofeen.

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ScatterGatherTest
[ RUN      ] ScatterGatherTest.TakeAlongAxisIntermediateTensorNormalizationAndReduction2
unknown file: Failure
C++ exception with description " INTERNAL ASSERT FAILED at "/opt/pytorch/Fuser/tests/cpp/validator.cpp":115, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues.

Validation error in output 0 on line 994 in file /opt/pytorch/Fuser/tests/cpp/test_scatter_gather.cpp.
  Detected max abs error of: 189.495
    absolute tolerance was set to 0.00343043
    and relative tolerance set to 3.43043e-05
Exception raised from testValidate at /opt/pytorch/Fuser/tests/cpp/validator.cpp:115 (most recent call first):
frame #0: <unknown function> + 0x1c5300 (0xc89c41405300 in ./bin/test_nvfuser)
frame #1: <unknown function> + 0x46cbf0 (0xc89c416acbf0 in ./bin/test_nvfuser)
frame #2: <unknown function> + 0x1116858 (0xc89c42356858 in ./bin/test_nvfuser)
frame #3: <unknown function> + 0x1055ad0 (0xc89c42295ad0 in ./bin/test_nvfuser)
frame #4: <unknown function> + 0x1153b10 (0xc89c42393b10 in ./bin/test_nvfuser)
frame #5: <unknown function> + 0x113be24 (0xc89c4237be24 in ./bin/test_nvfuser)
frame #6: <unknown function> + 0x113c318 (0xc89c4237c318 in ./bin/test_nvfuser)
frame #7: <unknown function> + 0x113c914 (0xc89c4237c914 in ./bin/test_nvfuser)
frame #8: <unknown function> + 0x1149bc0 (0xc89c42389bc0 in ./bin/test_nvfuser)
frame #9: <unknown function> + 0x113caf0 (0xc89c4237caf0 in ./bin/test_nvfuser)
frame #10: <unknown function> + 0x223700 (0xc89c41463700 in ./bin/test_nvfuser)
frame #11: <unknown function> + 0x284c4 (0xe2373de584c4 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #12: __libc_start_main + 0x98 (0xe2373de58598 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #13: <unknown function> + 0x223d70 (0xc89c41463d70 in ./bin/test_nvfuser)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1741024131 NVFUSER_TEST_ATEN_RANDOM_SEED=0 test_nvfuser --gtest_filter='ScatterGatherTest.TakeAlongAxisIntermediateTensorNormalizationAndReduction2'
[  FAILED  ] ScatterGatherTest.TakeAlongAxisIntermediateTensorNormalizationAndReduction2 (487 ms)
[----------] 1 test from ScatterGatherTest (487 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (487 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ScatterGatherTest.TakeAlongAxisIntermediateTensorNormalizationAndReduction2
@naoyam naoyam self-assigned this Mar 3, 2025
@naoyam naoyam added the bug Something isn't working label Mar 3, 2025
naoyam added a commit that referenced this issue Mar 3, 2025
Failing likely due to a race condition (#4003)

This should be re-enabled as part of the ongoing work on cross
entropy.