This repository has been archived by the owner on Mar 22, 2019. It is now read-only.

CodeXLGpuProfiler hangs with -C options #1

Open
PXAHyLee opened this issue Jun 5, 2016 · 11 comments
Comments

@PXAHyLee

PXAHyLee commented Jun 5, 2016

Hi developer,

My environment is Ubuntu 14.04.4 with the ROCm platform on a Carrizo APU.

Recently, I used CodeXLGpuProfiler to profile a program that finishes in about 14 seconds without profiling. With the -C option, the profiler hangs for over 20 minutes. With the -A option, the profiler finishes the task in about 15 seconds. The atp file is attached.
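
A hang like this is easier to report when the run is bounded by a timeout. The following is a minimal sketch, not part of the original report: `run_with_timeout` is a hypothetical helper, and the actual profiler command line is left to the caller because the exact invocation is not given here.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of arguments) and report whether it finished in time.

    Returns ("finished", returncode) or ("timed_out", None).
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising on timeout.
        return ("timed_out", None)
    return ("finished", completed.returncode)
```

Passing the profiler invocation with -C and a limit of a few minutes would separate "slow" from "hung" without waiting 20 minutes each time.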

MonteCarloAsianDP.txt

Thanks,
Li

@chesik-amd
Contributor

Hi Li,

Would it be possible for you to share your application so we can investigate why it hangs when collecting performance counters?

Thanks,
Chris

@PXAHyLee
Author

PXAHyLee commented Jun 7, 2016

Hi Chris,

Here is the application I ran. As the README states, only the hsaco version hangs. The README has more information about this benchmark.

MonteCarloAsianDP.tar.gz

Thanks,
Li

@chesik-amd
Contributor

chesik-amd commented Jun 8, 2016

Hi Li,

Do you have the most recent version of cloc? When I try to build your sample, I get warnings/errors that I need to use the cl_khr_fp64 extension.

I have to execute:

/opt/rocm/cloc/bin/cloc.sh -clopts -DKHR_DP_EXTENSION MonteCarloAsianDP_Kernels_hsaco.cl

However, when I do that I get the following errors:

relocation Unknown cannot be used when making a shared object; recompile with -fPIC.
[the same message is repeated many more times]
clang-3.9: error: ld.lld command failed with exit code 1 (use -v to see invocation)
ERROR: The following command failed with return code 1.
/opt/amd/llvm/bin/clang -I /opt/rocm/libamdgcn/include -include clc/clc.h -Dcl_clang_storage_class_specifiers -Dcl_khr_fp64 -target amdgcn--amdhsa -mcpu=carrizo -Xclang -mlink-bitcode-file -Xclang /opt/rocm/libamdgcn/lib/libamdgcn.carrizo.bc -o /home/jenkins/chris_test/montecarlo/MonteCarloAsianDP/MonteCarloAsianDP_Kernels_hsaco.hsaco /home/jenkins/chris_test/montecarlo/MonteCarloAsianDP/MonteCarloAsianDP_Kernels_hsaco.cl

@PXAHyLee
Author

PXAHyLee commented Jun 9, 2016

Hi, Chris

I use cloc 1.0.10.

I get this warning too, but even when I ignore the pragma warning, the program still passes the CPU-side verification. This pragma appears in several OpenCL kernels that use the double type, and none of the benchmarks I ran fail the CPU-side verification.

I tried the command
/opt/rocm/cloc/bin/cloc.sh -clopts -DKHR_DP_EXTENSION MonteCarloAsianDP_Kernels_hsaco.cl
to suppress the warning, and the compiler doesn't emit the error message that you mentioned. The attached program can still be executed with no error.

[EDIT] After upgrading the ROCm toolchain, I tried compiling the case with cloc 1.0.11, and it fails to compile.

@chesik-amd
Contributor

Hi Li,

I will talk to the folks who work on cloc to see if they can help here. In the meantime, if you find a workaround for the cloc compiler problems, please let me know, so I can look at the profiler issue.

@PXAHyLee
Author

Hi Chris,

I tried CLOC 1.0.13, but in vain. Would it be viable to use CLOC 1.0.10 to investigate this issue? I guess the latest compiler toolchain fails to compile this case because of the decision to switch to code object v2 and use ld.lld as the linker. The previous toolchain used amdphdrs (now obsolete, as mentioned in another repo, LLVM-AMDGPU-Assembler-Extra).

@chesik-amd
Contributor

Thanks Li,

I'll see if I can track down 1.0.10 to try out. In the meantime, would it be possible for you to share your executable (and any dependencies it might need at runtime)? I may be able to investigate the profiler issue if I have the executable.

@PXAHyLee
Author

PXAHyLee commented Jun 15, 2016

Hi Chris,

Thanks for your reply. Here is the executable I ran. The README and Environment have more information.

In short,

  1. make run_hsaco: Execute the application without profiling.
     make run_profiler: Execute the application with the profiler (uses the shell script ./run-profiler.sh).
  2. There is no special dependency on other packages.

MonteCarloAsianDP2.tar.gz

@chesik-amd
Contributor

Hi Li,

There is a new version of cloc available on the CLOC GitHub repo (1.0.14). This version added a new switch to cloc.sh (-noshared) that allows the hsaco version of the kernel to be built. If I update your makefile to include the -noshared switch on the cloc.sh command line, I can build your application.

However, when I run the application I get a warning displayed at the terminal:

Warning: sizeof(KernelArg) => 48 kernarg_segment_size => 40

Also, if I manually run the application using "./MonteCarloAsian-hsaco -deseriealize MonteCarloAsianDP_Kernels_hsaco.hsaco" instead of the "make run_hsaco" command, I often see a SEGFAULT when "rand()" is called from line 659. I have also seen the SEGFAULT reported in a different location. I haven't looked into the source code, but it feels like there might be some uninitialized data somewhere. So you see a SEGFAULT when running the following command:

"./MonteCarloAsian-hsaco -deseriealize MonteCarloAsianDP_Kernels_hsaco.hsaco"

In cases where I don't see the segfault, I am able to collect performance counters without a problem.
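
Because the crash is intermittent, it helps to quantify how often it happens. The sketch below is an illustration only: `count_segfaults` is a hypothetical helper, and it relies on the POSIX convention that Python's `subprocess` reports a child killed by signal N with return code -N.

```python
import signal
import subprocess


def count_segfaults(cmd, runs):
    """Run cmd `runs` times and return how many runs were killed by SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        completed = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # On POSIX, a negative return code -N means "terminated by signal N".
        if completed.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes
```

Calling it with the application's command line as an argument list and, say, 20 runs gives a crash rate that can be compared across CLOC versions.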

Thanks,
Chris

@PXAHyLee
Author

PXAHyLee commented Jun 21, 2016

Hi Chris,

I think the host-side program is OK. The "Warning: sizeof(KernelArg) => 48 kernarg_segment_size => 40" message is about a structure alignment issue in the host program, which I fixed in the second attachment. In the second attachment (the one that contains only the executable), the hsaco code was generated with CLOC 1.0.10.
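
To illustrate how alignment padding alone can produce a 48-versus-40 mismatch, here is a small sketch using Python's `ctypes`, which follows the native C layout rules. The field names and order are hypothetical and chosen only to reproduce the two sizes; they are not the real `KernelArg` from the benchmark.

```python
import ctypes


class KernelArgAsDeclared(ctypes.Structure):
    # Hypothetical layout: 4-byte members between 8-byte members force padding.
    _fields_ = [
        ("buf_a", ctypes.c_uint64),
        ("count_a", ctypes.c_int32),
        ("buf_b", ctypes.c_uint64),
        ("count_b", ctypes.c_int32),
        ("scale", ctypes.c_double),
        ("steps", ctypes.c_int32),
    ]


class KernelArgReordered(ctypes.Structure):
    # Same members, 8-byte ones first, so only tail padding remains.
    _fields_ = [
        ("buf_a", ctypes.c_uint64),
        ("buf_b", ctypes.c_uint64),
        ("scale", ctypes.c_double),
        ("count_a", ctypes.c_int32),
        ("count_b", ctypes.c_int32),
        ("steps", ctypes.c_int32),
    ]


def layout(struct_type):
    """Return (field name, byte offset) pairs plus the total size in bytes."""
    offsets = [(name, getattr(struct_type, name).offset)
               for name, _ in struct_type._fields_]
    return offsets, ctypes.sizeof(struct_type)
```

On a typical 64-bit Linux ABI, the first layout occupies 48 bytes and the second 40, although both hold the same 36 bytes of data.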

So, may I ask whether make run_profiler, in which the host executes the old hsaco code, is able to collect the performance counters?

With the new CLOC toolchain, the program receives a SEGFAULT after the first kernel launch, while it prepares the random data for the second kernel. That is, the program launches the kernel and then hits the SEGFAULT a short time later, no matter what it is doing at that moment. (In this case, it happens to be preparing the random data by calling rand().)

The only thing that changed is how the hsaco is generated (CLOC 1.0.10, 1.0.11, 1.0.14). I am still trying to work around the issue by modifying the kernel code so that the case can be compiled by the latest CLOC 1.0.14 and executed on the machine.

Thanks,
Li

@PXAHyLee
Author

Hi Chris,

I investigated the benchmark and (finally) found at least one builtin math function that gives me wrong answers when I use the -noshared option (there may be other issues that I haven't found). I opened an issue about it. These wrong answers crash my application.
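
A check of this kind can be sketched as a comparison of device results against a CPU reference with a tolerance. The helper below is a hypothetical illustration; the tolerances are assumptions, and the thread does not say which builtin or which inputs were affected.

```python
import math


def find_mismatches(inputs, device_results, reference_fn,
                    rel_tol=1e-9, abs_tol=1e-12):
    """Return (input, device value, reference value) for every mismatch."""
    mismatches = []
    for x, got in zip(inputs, device_results):
        expected = reference_fn(x)
        # math.isclose is False for NaN, so NaN results are also flagged.
        if not math.isclose(got, expected, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((x, got, expected))
    return mismatches
```

For example, `find_mismatches(xs, gpu_exp_values, math.exp)` would list the inputs for which a device-side `exp` disagrees with the host, where `xs` and `gpu_exp_values` are placeholders for data read back from the kernel.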

Thanks,
Li
