Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PAPI rocm_r component segfaults in intercept mode #74

Closed
gcongiu opened this issue Feb 16, 2022 · 18 comments
Closed

PAPI rocm_r component segfaults in intercept mode #74

gcongiu opened this issue Feb 16, 2022 · 18 comments

Comments

@gcongiu
Copy link

gcongiu commented Feb 16, 2022

Testing PAPI rocm_r component (https://bitbucket.org/congiu/papi/branch/2022.01.11_rocm-rewrite) with the code at this link: https://bitbucket.org/congiu/papi/src/b9533e4c207f20d0477174d097bec2df73867f02/src/components/rocm_r/tests/hip_matmul_single_gpu.cpp

on MI100 GPUs with rocm-4.5.0 and rocm-5.0.0 generates the behaviour following reported.

Following is the kernel running with PAPI rocm_r component in sample mode

$ ./hip_matmul_single_gpu
./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
First kernel run...
rocm:::SQ_INSTS_VALU:device=0 : 77329334272
rocm:::SQ_INSTS_SALU:device=0 : 17188257792
rocm:::SQ_WAVES:device=0 : 1048576
Second kernel run...
rocm:::SQ_INSTS_VMEM_RD:device=0 : 17179869184
rocm:::SQ_INSTS_VMEM_WR:device=0 : 1048576

And with PAPI rocm_r component in intercept mode

$ ROCP_HSA_INTERCEPT=1 ./hip_matmul_single_gpu
./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
Segmentation fault (core dumped)

Rerunning the above with gdb:

$ ROCP_HSA_INTERCEPT=1 gdb ./hip_matmul_single_gpu
...Starting program: /home/gcongiu/papi/src/components/rocm_r/tests/./hip_matmul_single_gpu
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
/home/gcongiu/papi/src/components/rocm_r/tests/./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
[New Thread 0x7fffd303d700 (LWP 68100)]

Program received signal SIGSEGV, Segmentation fault.
0x00000000000127e0 in ?? ()
Missing separate debuginfos, use: debuginfo-install elfutils-libelf-0.176-5.el7.x86_64 glibc-2.17-325.el7_9.x86_64 libdrm-2.4.97-2.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 numactl-libs-2.0.12-5.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  0x00000000000127e0 in ?? ()
#1  0x00007fffd30718b0 in ?? () from /opt/rocm-4.5.0/rocprofiler/lib/librocprofiler64.so
#2  0x00007fffd3077d6d in ?? () from /opt/rocm-4.5.0/rocprofiler/lib/librocprofiler64.so
#3  0x00007ffff6885947 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#4  0x00007ffff6899ce5 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#5  0x00007ffff68812ca in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#6  0x00007ffff686faa8 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#7  0x00007ffff68143d1 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#8  0x00007ffff6814af8 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#9  0x00007ffff68155da in hipStreamCreate () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#10 0x0000000000206bce in main ()

Interestingly, if I use MALLOC_CHECK_=1:

$ MALLOC_CHECK_=1 ROCP_HSA_INTERCEPT=1 ./hip_matmul_single_gpu
./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
First kernel run...
rocm:::SQ_INSTS_VALU:device=0 : 77329334272
rocm:::SQ_INSTS_SALU:device=0 : 17188257792
rocm:::SQ_WAVES:device=0 : 1048576
Error! Failed starting eventset, error=-8 -> 'Event exists, but cannot be counted due to hardware resource limits’

The segmentation fault disappears. This seems to indicate a memory error in librocprofiler.

Ignore the “Error! …” line. This is generated by PAPI and is due to the fact that the EventSet that initially contained the VALU, SALU and WAVES events has been cleaned up and reused with different events (i.e. VMEM). Since rocprofiler does not allow changing the dispatch callbacks after they have been set PAPI throws an error.

@rahulmula
Copy link

Hi
I'm trying to build https://bitbucket.org/congiu/papi/src/b9533e4c207f20d0477174d097bec2df73867f02/src/components/rocm_r/tests/hip_matmul_single_gpu.cpp
i'm observing below error, pls let me know if you have any instructions on how to build hip_matmul_single_gpu.cpp.
Errors:
papi/src/components/rocm_r/tests# hipcc -O3 -o test hip_matmul_single_gpu.cpp
ld.lld: error: undefined symbol: PAPI_library_init

referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_create_eventset

referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_add_named_event

referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)
referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)
referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)
referenced 2 more times

ld.lld: error: undefined symbol: PAPI_start

referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)
referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_stop

referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)
referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_cleanup_eventset

referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)
referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_destroy_eventset

referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)
referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_shutdown

referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_strerror

referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)
referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)
referenced by hip_matmul_single_gpu.cpp
/tmp/hip_matmul_single_gpu-aed288.o:(main)
referenced 7 more times
clang-14: error: linker command failed with exit code 1 (use -v to see invocation)

@gcongiu
Copy link
Author

gcongiu commented Feb 28, 2022

Hi @rahulmula. Sorry for the late reply. Can you try pulling the latest version of the code and try again?

@gcongiu
Copy link
Author

gcongiu commented Mar 3, 2022

@rahulmula you should clone and build papi using the following options:

git clone https://bitbucket.org/congiu/papi.git
cd papi/src
git checkout 2022.01.11_rocm-rewrite
./configure --with-components="rocm" --with-debug=yes && make -j 8

The above will also build the tests in components/rocm/tests. The one you are interested in has been renamed to hip_matmul_single_gpu.cpp.

@kikimych
Copy link

Is it ok when PAPI_create_eventset( int *EventSet ) returns null pointer and result status == PAPI_OK?

@gcongiu
Copy link
Author

gcongiu commented Mar 14, 2022

Hi @kikimych, you need to set EventSet = PAPI_NULL before passing it to PAPI_create_eventset. If PAPI returns PAPI_OK you should also get a valid event set handle. If that is not the case can you provide some more information on what you are doing?

@ammarwa
Copy link
Collaborator

ammarwa commented Mar 15, 2022

ROCP_HSA_INTERCEPT=0,1,2

Let me first explain the use of the environment flag, the flag was implemented in order to use the profiler tool usually located in "/opt/rocm/rocprofiler/tool/libtool.so" in order to collect counters for all kernels running and post the output to the terminal screen or optionally in a file.

let me use an example to give more information about its usage:

1- We need first to set some environment variables in order for that feature to work:
*if rocprofiler is installed in /opt/rocm then you can replace every "PATH_TO_" with /opt/rocm
export HSA_TOOLS_LIB=PATH_TO_lib/librocprofiler64.so
export ROCP_TOOL_LIB=PATH_TO_rocprofiler/tool/libtool.so
export ROCP_METRICS=PATH_TO_rocprofiler/lib/metrics.xml
export ROCP_INPUT=pmc_input.xml (Example for pmc_input.xml can be found here: https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/test/tool/pmc_input.xml)

2- Optionally, to store the results in a file:
export ROCP_OUTPUT_DIR=./ (That will store the output in a file in the current directory)

3- export ROCP_HSA_INTERCEPT=0 OR 1 OR 2
0 means interception is disabled
1 means interception is enabled without timestamps
2 means interception is enabled with timestamps

4- Finally, you can run the application normally, for example, ./MatrixTranspose

And these steps are the steps taken by the rocprof script to run rocprofiler on any application to collect counters for that specific application.

So, I don't think this is the functionality needed here, since, you are already collecting counters as shown here:

$ ./hip_matmul_single_gpu ./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192 First kernel run... rocm:::SQ_INSTS_VALU:device=0 : 77329334272 rocm:::SQ_INSTS_SALU:device=0 : 17188257792 rocm:::SQ_WAVES:device=0 : 1048576 Second kernel run... rocm:::SQ_INSTS_VMEM_RD:device=0 : 17179869184 rocm:::SQ_INSTS_VMEM_WR:device=0 : 1048576

@gcongiu
Copy link
Author

gcongiu commented Mar 15, 2022

Hi @ammarwa, right. The tool load mechanism in hsa_init is useful when you have an application that does not contain any performance measurement instrumentation but you still want to profile its behaviour. The only application's entry point for rocprofiler to do that is hsa_init, through its tool load mechanism. Thus, the loaded tool defines the metrics and the callbacks for hsa so that this can hook them to the dispatch and completion queues.

In our case, however, we have applications instrumented with PAPI calls (either manually by the user or automatically by tools, e.g. TAU). What we want is let PAPI users access per-kernel information through the intercept mode of rocprofiler. From the PAPI user point of view the way rocprofiler intercept mode is initialized (and works) is irrelevant. The users only know that to get per-kernel performance counters they need to set ROCP_HSA_INTERCEPT=1 and call PAPI_start/PAPI_stop/PAPI_read as they would normally do for sample mode.

@kikimych
Copy link

kikimych commented Mar 15, 2022

I'm checking value of eventset after https://bitbucket.org/congiu/papi/src/b9533e4c207f20d0477174d097bec2df73867f02/src/components/rocm_r/tests/hip_matmul_single_gpu.cpp#lines-36

(gdb) r
Starting program: /home/user/workspace/papi/src/components/rocm_r/tests/hip_matmul_single_gpu
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
/home/user/workspace/papi/src/components/rocm_r/tests/hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192

Breakpoint 1, PAPI_create_eventset (EventSet=0x7fffffffdc6c) at papi.c:2020
2020 {
(gdb) finish
Run till exit from #0 PAPI_create_eventset (EventSet=0x7fffffffdc6c) at papi.c:2020
0x0000000000217d8d in main (argc=1, argv=0x7fffffffdd98) at hip_matmul_single_gpu.cpp:43
43 papi_errno = PAPI_create_eventset(&eventset);
Value returned is $1 = 0
p eventset
$2 = 0

Same behaviour with unset value of ROCP_HSA_INTERCEPT

@gcongiu
Copy link
Author

gcongiu commented Mar 16, 2022

@kikimych that is a valid handle value for eventset.

@kikimych
Copy link

@kikimych that is a valid handle value for eventset.

I have an issue with restarting program in hipStreamDestroy.
That's what i see in debugger:
(gdb) p callbacks_ $1 = {dispatch = 0x7ffff5d6d260 <dispatch_callback_cb>, create = 0x7fffffffd9b0, destroy = 0x217bd0 <_start>}

callbacks_.destroy is set to main accidentally

Could you please check this fix #82 ?

@gcongiu
Copy link
Author

gcongiu commented Mar 18, 2022

@kikimych I have tried your patch. It still segfaults. Following my gdb session output:

Program received signal SIGSEGV, Segmentation fault.
0x000000000001acf0 in ?? ()
Missing separate debuginfos, use: debuginfo-install elfutils-libelf-0.176-5.el7.x86_64 glibc-2.17-325.el7_9.x86_64 libdrm-2.4.97-2.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 numactl-libs-2.0.12-5.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  0x000000000001acf0 in ?? ()
#1  0x00007fffd2fb0325 in rocprofiler::InterceptQueue::InterceptQueueCreate (agent=..., size=1024, type=0, callback=0x7ffff687fb90,
    data=0x5323f0, private_segment_size=4294967295, group_segment_size=4294967295, queue=0x7fffffffb6f8, tracker_on=@0x7fffffffb6af: false)
    at /home/gcongiu/ROCm/rocprofiler/src/core/intercept_queue.h:129
#2  0x00007fffd2fd8cd9 in rocprofiler::InterceptQueue::QueueCreate (agent=..., size=1024, type=0, callback=0x7ffff687fb90, data=0x5323f0,
    private_segment_size=4294967295, group_segment_size=4294967295, queue=0x7fffffffb6f8)
    at /home/gcongiu/ROCm/rocprofiler/src/core/intercept_queue.h:141
#3  0x00007ffff6884947 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#4  0x00007ffff6898ce5 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#5  0x00007ffff68802ca in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#6  0x00007ffff686eaa8 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#7  0x00007ffff68133d1 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#8  0x00007ffff6813af8 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#9  0x00007ffff68145da in hipStreamCreate () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#10 0x0000000000218433 in main (argc=1, argv=0x7fffffffbe88) at hip_test_single_kernel.cpp:72
(gdb) up
#1  0x00007fffd2fb0325 in rocprofiler::InterceptQueue::InterceptQueueCreate (agent=..., size=1024, type=0, callback=0x7ffff687fb90,
    data=0x5323f0, private_segment_size=4294967295, group_segment_size=4294967295, queue=0x7fffffffb6f8, tracker_on=@0x7fffffffb6af: false)
    at /home/gcongiu/ROCm/rocprofiler/src/core/intercept_queue.h:129
129	      status = callbacks_.create(*queue, callback_data_);
(gdb) print callbacks_
$1 = {dispatch = 0x7ffff5f7a35e <dispatch_callback_cb>, create = 0x1acf0, destroy = 0x18}

It looks like callbacks_ is not correctly initialized.

@kikimych
Copy link

kikimych commented Mar 21, 2022

I have traced it a little bit. So what happens on my machine:

Looks like aslr is disabled somewhere and I have repeating stack segment address from run to run.

Thread 1 "hip_matmul_sing" hit Hardware watchpoint 13: *0x7fffffffd950`

Old value = 3
New value = 2194384
0x00007fffdade6061 in __libc_calloc (n=3, elem_size=8) at malloc.c:3376
3376 malloc.c: No such file or directory.
(gdb) bt
#0 0x00007fffdade6061 in __libc_calloc (n=3, elem_size=8) at malloc.c:3376
#1 0x00007ffff5d6c993 in intercept_ctx_open (ntv_table=0x7ffff5fb1150 <ntv_table>, events_id=0x48e160, num_events=3, rocp_ctx=0x486138) at components/rocm/rocp.c:924
#2 0x00007ffff5d6bdfe in rocp_ctx_open (ntv_table=0x7ffff5fb1150 <ntv_table>, events_id=0x48e160, num_events=3, rocp_ctx=0x486138) at components/rocm/rocp.c:612
#3 0x00007ffff5d6a0ae in rocm_start (ctx=0x3bed70, ctl=0x486110) at components/rocm/rocm.c:428
#4 0x00007ffff5d05989 in PAPI_start (EventSet=0) at papi.c:2799
#5 0x00000000002180f7 in main (argc=1, argv=0x7fffffffdd98) at hip_matmul_single_gpu.cpp:76

Memory address on stack 0x7fffffffd950 is set to _start by standart library calloc function:
(gdb) x/1g *0x7fffffffd950 0x217bd0 <_start>: 0x89485ed18949ed31

later in init callback: https://bitbucket.org/congiu/papi/src/b9533e4c207f20d0477174d097bec2df73867f02/src/components/rocm_r/rocp.c#lines-1035

(gdb) x/3g &dispatch_cb 0x7fffffffd940: 0x00000000004fc420 0x00007fffffffd9b0 0x7fffffffd950: 0x0000000000217bd0

dispatch_cb is a structure allocated on stack. You reset value of dispatch_cb.dispatch, but value of dispatch_cb.destroy and dispatch_cb.create fields are inherrited from intercept_ctx_open() frame.

Here: https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/src/core/intercept_queue.h#L521 this structure is copied to rocm internal intercept queue. When I build papi as is program segfaults on first call of dispatch_cb.create callback, when i initialize dispatch_cb.create but forget to initialize dispatch_cb.destroy program restarts when context pool is destroyed in hipDestroyStream and this repeats in infinite loop.

Could you please initialize all three callback values or set unused ones to NULL and check?

@gcongiu
Copy link
Author

gcongiu commented Mar 22, 2022

Hi @kikimych, I moved the dispatch_cb to global scope and now it works. Thanks for the clarification.

@gcongiu
Copy link
Author

gcongiu commented Mar 23, 2022

@kikimych
Copy link

Brace initialization sets all fields to zero by default, so it's safe.

@gcongiu
Copy link
Author

gcongiu commented Mar 23, 2022

Right but callbacks_ptrs is still allocated on the stack. When that goes out of context the callbacks value will be overwritten by whatever is put on the stack and potentially cause problems. Am I getting this wrong?

@kikimych
Copy link

Sure, but callbacks is automatic variable and can't be used after leaving scope, compiler ensures it.

@gcongiu
Copy link
Author

gcongiu commented Mar 23, 2022

Ok then I think this issue can be closed. Thanks a lot for the help @kikimych.

@gcongiu gcongiu closed this as completed Mar 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants