
AMDGPU tweaks #63

Closed
luraess opened this issue Apr 5, 2024 · 2 comments · Fixed by #64 or #69

luraess commented Apr 5, 2024

Looking into the scripts, I see you are using AMDGPU v0.8, which now follows the same "convention" as CUDA with respect to using threads and blocks as kernel launch parameters; gridsize = blocks, and thus

threads = min(N, numThreads)
blocks = ceil(Int, N / threads)
@roc groupsize = threads gridsize = threads * blocks _parallel_for_amdgpu(f, x...)

should be:

@roc groupsize = threads gridsize = blocks _parallel_for_amdgpu(f, x...) 
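
For context, here is a minimal self-contained sketch of a launch under the v0.8 convention; the kernel, array and sizes are hypothetical, purely for illustration:

using AMDGPU

function _scale_kernel!(y, a)   # hypothetical element-wise kernel
    i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    if i <= length(y)
        @inbounds y[i] *= a
    end
    return
end

N = 2^20
y = ROCArray(ones(Float32, N))
numThreads = 256
threads = min(N, numThreads)
blocks = ceil(Int, N / threads)
# gridsize now counts workgroups (blocks), as in CUDA, not threads * blocks
@roc groupsize = threads gridsize = blocks _scale_kernel!(y, 2.0f0)
AMDGPU.synchronize()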

Regarding the issue JuliaGPU/AMDGPU.jl#614, it could be that using weakdeps (adding a Project.toml to /test) in the tests would solve it: since JACC is using extensions, one could make sure the tests rely on the actual extension mechanism rather than on conditional loading.
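
For instance, with AMDGPU listed in a test/Project.toml, the test suite could assert that the extension really gets loaded; a sketch, where the extension module name JACCAMDGPU is an assumption on my side:

using JACC
using AMDGPU   # loading the weak dependency should trigger the package extension
# extension name below is hypothetical, to be replaced by JACC's actual one
@assert Base.get_extension(JACC, :JACCAMDGPU) !== nothing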

Also, it looks like you are running the AMDGPU CI on Julia 1.9. There used to be issues because of LLVM on Julia 1.9, so Julia 1.10 could generally be preferred (although, depending on the GPU, 1.9 may work fine).

@pedrovalerolara
Collaborator

This is an important point, thank you @luraess !!
We have to make a few modifications to the AMDGPU backend to match the "new" syntax.
Apart from changing groupsize and gridsize, I saw that there are some changes for device synchronization. I am not sure whether kernel launches are still assumed to be synchronous by default; that is another thing to test before moving to the new version of AMDGPU.jl.
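
One quick way to check that would be to time the launch separately from the synchronisation; a rough sketch with a placeholder kernel:

using AMDGPU

function _noop_kernel!(x)       # placeholder kernel doing trivial work
    i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    if i <= length(x)
        @inbounds x[i] += 1.0f0
    end
    return
end

x = ROCArray(zeros(Float32, 2^24))
# warm-up launch so compilation time does not skew the measurement
@roc groupsize = 256 gridsize = cld(length(x), 256) _noop_kernel!(x)
AMDGPU.synchronize()

t_launch = @elapsed @roc groupsize = 256 gridsize = cld(length(x), 256) _noop_kernel!(x)
t_sync   = @elapsed AMDGPU.synchronize()
# t_launch much smaller than t_sync would indicate launches are asynchronous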


luraess commented Apr 10, 2024

We have to make a few modifications to the AMDGPU backend to match the "new" syntax.

Yes, and I suspect this may address part of the performance issue you report with respect to tuning the kernel launch parameters.

changes for device synchronization

The behaviour of AMDGPU with respect to synchronisation should be fairly similar to CUDA. Unless specified otherwise, kernels are launched on the task-local default device and normal-priority stream. On the default device and default stream, kernel execution is ordered (and does not need explicit synchronisation). AMDGPU.synchronize(), as in CUDA, syncs the stream and not the device. A lower-level device sync is also available but is a heavier operation.
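
As a rough sketch of those two levels, assuming the AMDGPU v0.8 API (the array and sizes are arbitrary, for illustration only):

using AMDGPU

a = ROCArray(ones(Float32, 1024))
b = a .+ 1.0f0              # work ordered on the task-local default stream

AMDGPU.synchronize()        # waits on the task-local stream, like CUDA.synchronize()
AMDGPU.device_synchronize() # heavier: waits for all outstanding work on the device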
