bug in testLifoStorage #3355

Open
rrsettgast opened this issue Sep 13, 2024 · 8 comments
Labels
type: bug Something isn't working type: new A new issue has been created and requires attention


@rrsettgast
Member

Describe the bug
testLifoStorage occasionally fails in the CI. The failure is not reproducible on demand, and rerunning the job passes most of the time. However, this isn't a good thing to have lying around. It can be seen here:

https://github.com/GEOS-DEV/GEOS/actions/runs/10854274728/job/30124368903?pr=3340

The output for the failed testLifoStorage is here:

 90/211 Test  #90: testLifoStorage ......................................Subprocess aborted***Exception:   0.71 sec
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from LifoStorageTest
[ RUN      ] LifoStorageTest.LifoStorageBufferOnCUDA
Allocated    40.0 B to the HOST  : LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Allocated    40.0 B to the DEVICE: LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Moved    40.0 B to the DEVICE: LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer> 
 LIFO : maximum size 10 buffers 
 LIFO : buffer size 3.8147e-05MB
 LIFO : allocating 3 buffers on host
 LIFO : allocating 2 buffers on device
Allocated   120.0 B to the HOST  : LvArray::Array<float, 2, camp::int_seq<long, 0l, 1l>, long, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Allocated    80.0 B to the DEVICE: LvArray::Array<float, 2, camp::int_seq<long, 0l, 1l>, long, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Freed    80.0 B to the DEVICE: LvArray::Array<float, 2, camp::int_seq<long, 0l, 1l>, long, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Freed   120.0 B to the HOST  : LvArray::Array<float, 2, camp::int_seq<long, 0l, 1l>, long, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Freed    40.0 B to the HOST  : LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Freed    40.0 B to the DEVICE: LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
[       OK ] LifoStorageTest.LifoStorageBufferOnCUDA (99 ms)
[ RUN      ] LifoStorageTest.LifoStorageBufferOnCUDAlarge
Allocated    3.8 MB to the HOST  : LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Allocated    3.8 MB to the DEVICE: LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Moved    3.8 MB to the DEVICE: LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer> 
 LIFO : maximum size 10000 buffers 
 LIFO : buffer size 3.8147MB
 LIFO : allocating 3 buffers on host
 LIFO : allocating 2 buffers on device
Allocated   11.4 MB to the HOST  : LvArray::Array<float, 2, camp::int_seq<long, 0l, 1l>, long, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
***** Controlling expression (should be false): dataPointer[ i ] != (float)(totalNumberOfBuffers-j-1)*elemCnt+i
***** MSG: "" << "\n" << "Expected " << "dataPointer[ i ]" << " " << "==" << " " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << "\n" << "  " << "dataPointer[ i ]" << " = " << dataPointer[ i ] << "\n" << "  " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << " = " << (float)(totalNumberOfBuffers-j-1)*elemCnt+i << "\n"

***** ERROR
***** LOCATION: /tmp/geos/src/coreComponents/common/unitTests/testLifoStorage.cpp:110
***** Block: [0, 0, 0]
***** Thread: [5, 0, 0]
***** Controlling expression (should be false): dataPointer[ i ] != (float)(totalNumberOfBuffers-j-1)*elemCnt+i
***** MSG: "" << "\n" << "Expected " << "dataPointer[ i ]" << " " << "==" << " " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << "\n" << "  " << "dataPointer[ i ]" << " = " << dataPointer[ i ] << "\n" << "  " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << " = " << (float)(totalNumberOfBuffers-j-1)*elemCnt+i << "\n"

***** ERROR
***** LOCATION: /tmp/geos/src/coreComponents/common/unitTests/testLifoStorage.cpp:110
***** Block: [0, 0, 0]
***** Thread: [6, 0, 0]
***** Controlling expression (should be false): dataPointer[ i ] != (float)(totalNumberOfBuffers-j-1)*elemCnt+i
***** MSG: "" << "\n" << "Expected " << "dataPointer[ i ]" << " " << "==" << " " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << "\n" << "  " << "dataPointer[ i ]" << " = " << dataPointer[ i ] << "\n" << "  " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << " = " << (float)(totalNumberOfBuffers-j-1)*elemCnt+i << "\n"

***** ERROR
***** LOCATION: /tmp/geos/src/coreComponents/common/unitTests/testLifoStorage.cpp:110
***** Block: [0, 0, 0]
***** Thread: [7, 0, 0]
***** Controlling expression (should be false): dataPointer[ i ] != (float)(totalNumberOfBuffers-j-1)*elemCnt+i
***** MSG: "" << "\n" << "Expected " << "dataPointer[ i ]" << " " << "==" << " " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << "\n" << "  " << "dataPointer[ i ]" << " = " << dataPointer[ i ] << "\n" << "  " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << " = " << (float)(totalNumberOfBuffers-j-1)*elemCnt+i << "\n"

***** ERROR
***** LOCATION: /tmp/geos/src/coreComponents/common/unitTests/testLifoStorage.cpp:110
***** Block: [0, 0, 0]
***** Thread: [8, 0, 0]
***** Controlling expression (should be false): dataPointer[ i ] != (float)(totalNumberOfBuffers-j-1)*elemCnt+i
***** MSG: "" << "\n" << "Expected " << "dataPointer[ i ]" << " " << "==" << " " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << "\n" << "  " << "dataPointer[ i ]" << " = " << dataPointer[ i ] << "\n" << "  " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << " = " << (float)(totalNumberOfBuffers-j-1)*elemCnt+i << "\n"

***** ERROR
***** LOCATION: /tmp/geos/src/coreComponents/common/unitTests/testLifoStorage.cpp:110
***** Block: [0, 0, 0]
***** Thread: [9, 0, 0]
***** Controlling expression (should be false): dataPointer[ i ] != (float)(totalNumberOfBuffers-j-1)*elemCnt+i
***** MSG: "" << "\n" << "Expected " << "dataPointer[ i ]" << " " << "==" << " " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << "\n" << "  " << "dataPointer[ i ]" << " = " << dataPointer[ i ] << "\n" << "  " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << " = " << (float)(totalNumberOfBuffers-j-1)*elemCnt+i << "\n"

Freed   120.0 B to the HOST  : LvArray::Array<float, 2, camp::int_seq<long, 0l, 1l>, long, LvArray::ChaiBuffer>  Free memory on device: 0.0 B
Freed    40.0 B to the HOST  : LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 0.0 B
Freed    40.0 B to the DEVICE: LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 0.0 B
terminate called after throwing an instance of 'umpire::runtime_error'
  what():  ! Umpire runtime_error [/tmp/build/chai/src/chai/src/tpl/umpire/src/umpire/alloc/CudaMallocAllocator.hpp:62]: cudaFree( ptr = 0x7fd642600000 ) failed with error: unspecified launch failure
    Backtrace: 19 frames
    0 0x7fd678bfbe81 No dladdr: /tmp/geos-build/lib/libcommon.so(_ZN6umpire4util49_GLOBAL__N__70505b0d_16_ArrayManager_cpp_ab41d17d15build_backtraceEv+0x31) [0x7fd678bfbe81]
    1 0x7fd678c02e60 No dladdr: /tmp/geos-build/lib/libcommon.so(_ZNK6umpire13runtime_error7messageB5cxx11Ev+0x20) [0x7fd678c02e60]
    2 0x7fd678c02a4b No dladdr: /tmp/geos-build/lib/libcommon.so(_ZN6umpire13runtime_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_i+0x13b) [0x7fd678c02a4b]
    3 0x7fd678c5a3e4 No dladdr: /tmp/geos-build/lib/libcommon.so(_ZN6umpire5alloc19CudaMallocAllocator10deallocateEPv+0x2d4) [0x7fd678c5a3e4]
    4 0x7fd678c59936 No dladdr: /tmp/geos-build/lib/libcommon.so(_ZN6umpire8resource24CudaDeviceMemoryResource10deallocateEPvm+0x266) [0x7fd678c59936]
    5 0x7fd678c023f4 No dladdr: /tmp/geos-build/lib/libcommon.so(_ZN6umpire9Allocator13do_deallocateEPv+0x294) [0x7fd678c023f4]
    6 0x7fd678bfe36e No dladdr: /tmp/geos-build/lib/libcommon.so(_ZN4chai12ArrayManager4freeEPNS_13PointerRecordENS_14ExecutionSpaceE+0x24e) [0x7fd678bfe36e]
    7 0x441ffb No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7LvArray10ChaiBufferIfE4freeEv+0x3b) [0x441ffb]
    8 0x43d49b No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN4geos15testLifoStorageIN4RAJA6policy4cuda18cuda_exec_explicitINS1_17iteration_mapping6DirectENS1_4cuda11IndexGlobalILNS1_9named_dimE0ELi32ELi0EEENS7_23MaxOccupancyConcretizerELm1ELb0EEEEEviiii+0xa5b) [0x43d49b]
    9 0x49f799 No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x49) [0x49f799]
    10 0x483e08 No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing4Test3RunEv+0xd8) [0x483e08]
    11 0x484e00 No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing8TestInfo3RunEv+0x130) [0x484e00]
    12 0x485905 No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing9TestSuite3RunEv+0x2d5) [0x485905]
    13 0x495cbd No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x41d) [0x495cbd]
    14 0x4a03e9 No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x49) [0x4a03e9]
    15 0x49586a No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing8UnitTest3RunEv+0x5a) [0x49586a]
    16 0x43c1dc No dladdr: /tmp/geos-build/tests/testLifoStorage(main+0x1c) [0x43c1dc]
    17 0x7fd6764637e5 No dladdr: /usr/lib64/libc.so.6(__libc_start_main+0xe5) [0x7fd6764637e5]
    18 0x43b99e No dladdr: /tmp/geos-build/tests/testLifoStorage(_start+0x2e) [0x43b99e]
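
For readers unfamiliar with the check at testLifoStorage.cpp:110, the following is a minimal host-only sketch of the property that the controlling expression in the log appears to verify: after pushing `totalNumberOfBuffers` buffers, pop number `j` should return buffer `totalNumberOfBuffers-j-1`. Only the comparison expression is taken from the log. The fill pattern, the `std::vector` stand-in for the LIFO storage, and the names `bufferValue` and `countLifoMismatches` are assumptions made for illustration; this is not the code of the actual test and does not exercise CUDA or asynchronous copies.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: value stored at element i of buffer number `buffer`.
// The fill pattern is an assumption chosen to match the shape of the
// expression in the failed check; it is not taken from testLifoStorage.cpp.
inline float bufferValue( int buffer, int elemCnt, int i )
{
  return (float)( buffer ) * elemCnt + i;
}

// Push `totalNumberOfBuffers` buffers, pop them all, and count the elements
// that violate the LIFO expectation.
inline int countLifoMismatches( int totalNumberOfBuffers, int elemCnt )
{
  assert( totalNumberOfBuffers >= 0 && elemCnt >= 0 );

  // Stand-in for the LIFO storage (assumption: a plain in-memory stack).
  std::vector< std::vector< float > > stack;

  for( int j = 0; j < totalNumberOfBuffers; ++j )
  {
    std::vector< float > data( elemCnt );
    for( int i = 0; i < elemCnt; ++i )
    {
      data[ i ] = bufferValue( j, elemCnt, i );
    }
    stack.push_back( data );
  }

  int mismatches = 0;
  for( int j = 0; j < totalNumberOfBuffers; ++j )
  {
    std::vector< float > const dataPointer = stack.back();
    stack.pop_back();
    for( int i = 0; i < elemCnt; ++i )
    {
      // Same comparison as the controlling expression printed in the log.
      if( dataPointer[ i ] != (float)( totalNumberOfBuffers - j - 1 ) * elemCnt + i )
      {
        ++mismatches;
      }
    }
  }
  return mismatches;
}
```

Calling `countLifoMismatches( 10, 10 )` uses 10 elements per buffer, which corresponds to the 40.0 B float arrays in the log; the buffer count of 10 is a hypothetical choice. A correct LIFO yields zero mismatches, so a nonzero count in the real test would mean a popped buffer held data other than what was pushed in that position.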
@rrsettgast rrsettgast added type: bug Something isn't working type: new A new issue has been created and requires attention labels Sep 13, 2024
@rrsettgast
Member Author

@sframba @acitrain @jiemeng-total
Is someone available to look into this? It is becoming a problem because the test now fails frequently.

@sframba
Contributor

sframba commented Sep 20, 2024

> @sframba @acitrain @jiemeng-total Is someone available to look into this? It is becoming a problem because the test now fails frequently.

We're on it; I hope we can fix this quickly.

@sframba
Contributor

sframba commented Sep 20, 2024

If I'm not mistaken, only the Clang build fails, not the GCC one (same CUDA version).

@sframba
Contributor

sframba commented Sep 20, 2024

It seems hard to reproduce; so far I have never seen the test fail on Rocky Linux:
https://github.com/GEOS-DEV/GEOS/actions/runs/10962440334/job/30441863550

@sframba
Contributor

sframba commented Sep 27, 2024

@rrsettgast, did you notice an improvement after #3362? If so, maybe we can close the issue.

@rrsettgast
Member Author

Hello. The problem still occurs. I am not at a computer now, but if you look at the recent actions you should see failures.

@CusiniM
Collaborator

CusiniM commented Sep 27, 2024

> Hello. The problem still occurs. I am not at a computer now, but if you look at the recent actions you should see failures.

It failed again on this develop run.

https://github.com/GEOS-DEV/GEOS/actions/runs/11033321924

@sframba
Contributor

sframba commented Oct 2, 2024

OK, it seems that it's always the LifoStorageBufferOnCUDANoDeviceBuffer test that fails. Maybe we can disable it for now.
