| 1 | +Hacking and Debugging RocSHMEM |
| 2 | +============================== |
| 3 | + |
| 4 | +This documentation is mainly for core RocSHMEM developers, though power users may also find it useful. |
| 5 | + |
| 6 | +How to debug parallel programs |
| 7 | +------------------------------ |
| 8 | + |
| 9 | +When using Open MPI as the launch mechanism, set `OMPI_MCA_mpi_abort_delay=-1` to keep parallel processes alive after a crash. You can then run `ssh -t $nodename rocgdb -p $pid` to attach `rocgdb` to the failed process. Some errors are caught by UCX instead; in those cases, setting `UCX_HANDLE_ERRORS=freeze` has the same effect. |
| 10 | + |
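The workflow above can be sketched as a short shell session. This is only an illustration: `attach_cmd`, the node name `node01`, the PID `12345`, and the binary `./my_app` are hypothetical placeholders, and the `mpirun` line is left commented out because it depends on your cluster setup.

```shell
# Keep ranks alive after an abort so a debugger can attach.
export OMPI_MCA_mpi_abort_delay=-1
# Freeze the process when UCX catches the error instead of Open MPI.
export UCX_HANDLE_ERRORS=freeze

# Launch as usual (hypothetical binary; -x forwards the UCX variable to all ranks):
# mpirun -np 2 -x UCX_HANDLE_ERRORS ./my_app

# Hypothetical helper: print the attach command for a given node and PID.
attach_cmd() {
    printf 'ssh -t %s rocgdb -p %s\n' "$1" "$2"
}

# Placeholder node name and PID; substitute the values reported by the failed rank.
attach_cmd node01 12345
```
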
| 11 | +See `scripts/functional_tests/GDB_README` for an alternative technique that opens multiple `xterm` windows to run gdb on the parallel processes. |
| 12 | + |
| 13 | +How to use the address sanitizer (ASAN) |
| 14 | +--------------------------------------- |
| 15 | + |
| 16 | +Refer to the [general documentation for ASAN on AMD GPUs][1]. |
| 17 | + |
| 18 | +### Compiling with ASAN |
| 19 | + |
| 20 | +If this is a fresh build directory, add `-DASAN=ON` to the `cmake` invocation: |
| 21 | + `cmake . <...> -DASAN=ON` |
| 22 | + |
| 23 | +If you are enabling ASAN in a previously used build directory, use `ccmake` to alter the CMake cache: |
| 24 | + `ccmake .` |
| 25 | + |
| 26 | +In the `ccmake` interface: |
| 27 | +1. find and toggle `ASAN` ON |
| 28 | +2. find and delete `COMPILING_TARGETS` (keybind `d`) |
| 29 | + |
| 30 | +Do not forget to delete `COMPILING_TARGETS` again when disabling ASAN (otherwise xnack will remain active, impacting performance). |
| 31 | + |
| 32 | +### Running with ASAN |
| 33 | + |
| 34 | +You may need to add the directory containing `libclang_rt.asan-x86_64.so` to `LD_LIBRARY_PATH` by hand. Depending on the ROCm version, the library may be in an unusual place, e.g., `$ROCM_ROOT/lib/llvm/lib/clang/19/lib/linux/libclang_rt.asan-x86_64.so`; running `find /opt/rocm -name libclang_rt.asan-x86_64.so` can locate it. |
| 35 | + |
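As a minimal sketch of that lookup, the helper below searches a ROCm root for the ASAN runtime and prepends its directory to `LD_LIBRARY_PATH`. The function name `add_asan_runtime_path` is hypothetical, and `/opt/rocm` is only the default used in the text above; pass your own install root if it differs.

```shell
# Hypothetical helper: prepend the directory holding the ASAN runtime to
# LD_LIBRARY_PATH. The first argument is the ROCm install root to search.
add_asan_runtime_path() {
    rocm_root=${1:-/opt/rocm}
    lib=$(find "$rocm_root" -name 'libclang_rt.asan-x86_64.so' 2>/dev/null | head -n 1)
    # Nothing found: leave LD_LIBRARY_PATH untouched and report failure.
    [ -n "$lib" ] || return 1
    # LD_LIBRARY_PATH takes directories, so strip the file name.
    LD_LIBRARY_PATH=$(dirname "$lib")${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
    export LD_LIBRARY_PATH
}

add_asan_runtime_path /opt/rocm || echo "ASAN runtime not found; LD_LIBRARY_PATH unchanged"
```

If several copies exist under the root, this sketch simply takes the first one `find` reports, so check the chosen path when multiple ROCm versions are installed.
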
| 36 | +ASAN may [crash when using Open MPI][2]. If that happens, set the environment variable `OMPI_MCA_memory=^patcher`. Do not forget to unset this variable when not using ASAN (it will impact performance). |
| 37 | + |
| 38 | +When running the program, the behavior of ASAN can be controlled with the `ASAN_OPTIONS` environment variable. |
| 39 | + |
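One way to avoid forgetting to unset these variables is to apply them to a single command only. The wrapper below is a sketch under that assumption: `run_with_asan` is a hypothetical name, and `abort_on_error=1` is just an example default for `ASAN_OPTIONS` (options are colon-separated `key=value` pairs), used only when the variable is not already set.

```shell
# Hypothetical wrapper: run one command with the ASAN-specific settings
# applied to that command only, so nothing needs to be unset afterwards.
run_with_asan() {
    env OMPI_MCA_memory='^patcher' \
        ASAN_OPTIONS="${ASAN_OPTIONS:-abort_on_error=1}" \
        "$@"
}

# Demonstration: the child process sees the setting, the calling shell does not.
run_with_asan sh -c 'echo "child sees OMPI_MCA_memory=$OMPI_MCA_memory"'
echo "parent sees OMPI_MCA_memory=${OMPI_MCA_memory:-<unset>}"
```

In practice the wrapped command would be the usual launch line for the ASAN build rather than the `sh -c` demonstration.
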
| 40 | + |
| 41 | + |
| 42 | +### References |
| 43 | + |
| 44 | +[1]: https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.4.0/conceptual/using-gpu-sanitizer.html |
| 45 | +[2]: https://github.com/open-mpi/ompi/issues/13069 |