
AMDGPU tweaks #63

Closed
luraess opened this issue Apr 5, 2024 · 2 comments · Fixed by #64 or #69

luraess commented Apr 5, 2024

Looking into the scripts, I see you are using AMDGPU v0.8, which now follows the same "convention" as CUDA with respect to using threads and blocks as kernel launch parameters; gridsize = blocks, and thus

threads = min(N, numThreads)
blocks = ceil(Int, N / threads)
@roc groupsize = threads gridsize = threads * blocks _parallel_for_amdgpu(f, x...)

should be:

@roc groupsize = threads gridsize = blocks _parallel_for_amdgpu(f, x...) 
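
For context, here is a minimal self-contained sketch of a launch under the v0.8 convention; the kernel, array and sizes are hypothetical, purely for illustration:

using AMDGPU

function _scale_kernel!(y, a)   # hypothetical element-wise kernel
    i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    if i <= length(y)
        @inbounds y[i] *= a
    end
    return
end

N = 2^20
y = ROCArray(ones(Float32, N))
numThreads = 256
threads = min(N, numThreads)
blocks = ceil(Int, N / threads)
# gridsize now counts workgroups (blocks), as in CUDA, not threads * blocks
@roc groupsize = threads gridsize = blocks _scale_kernel!(y, 2.0f0)
AMDGPU.synchronize()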

Regarding the issue JuliaGPU/AMDGPU.jl#614, it could be that using weakdeps (adding a Project.toml to /test) in the tests would solve it: since JACC is using extensions, one could make sure the tests rely on the actual extension mechanism rather than on conditional loading.
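
For instance, with AMDGPU listed in a test/Project.toml, the test suite could assert that the extension really gets loaded; a sketch, where the extension module name JACCAMDGPU is an assumption on my side:

using JACC
using AMDGPU   # loading the weak dependency should trigger the package extension
# extension name below is hypothetical, to be replaced by JACC's actual one
@assert Base.get_extension(JACC, :JACCAMDGPU) !== nothing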

Also, it looks like you are running the AMDGPU CI on Julia 1.9. There used to be issues because of LLVM on Julia 1.9, so Julia 1.10 could generally be preferred (although, depending on the GPU, 1.9 may work fine).

@pedrovalerolara
Collaborator

This is an important point, thank you @luraess !!
We have to make a few modifications to the AMDGPU backend to match the "new" syntax.
Apart from changing groupsize and gridsize, I saw that there are some changes for device synchronization. I am not sure whether kernel launches are still assumed to be synchronous by default; that is another thing to test before moving to the new version of AMDGPU.jl.
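
One quick way to check that would be to time the launch separately from the synchronisation; a rough sketch with a placeholder kernel:

using AMDGPU

function _noop_kernel!(x)       # placeholder kernel doing trivial work
    i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    if i <= length(x)
        @inbounds x[i] += 1.0f0
    end
    return
end

x = ROCArray(zeros(Float32, 2^24))
# warm-up launch so compilation time does not skew the measurement
@roc groupsize = 256 gridsize = cld(length(x), 256) _noop_kernel!(x)
AMDGPU.synchronize()

t_launch = @elapsed @roc groupsize = 256 gridsize = cld(length(x), 256) _noop_kernel!(x)
t_sync   = @elapsed AMDGPU.synchronize()
# t_launch much smaller than t_sync would indicate launches are asynchronous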


luraess commented Apr 10, 2024

We have to make a few modifications to the AMDGPU backend to match the "new" syntax.

Yes, and I suspect this may address part of the performance issue you report with respect to tuning the kernel launch parameters.

changes for device synchronization

The behaviour of AMDGPU with respect to synchronisation should be fairly similar to CUDA. Unless specified otherwise, kernels are launched on the task-local default device and normal-priority stream. On the default device and default stream, kernel execution is ordered (and does not need explicit synchronisation). AMDGPU.synchronize(), as in CUDA, syncs the stream and not the device. A lower-level device sync is also available but is a heavier operation.
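
As a rough sketch of those two levels, assuming the AMDGPU v0.8 API (the array and sizes are arbitrary, for illustration only):

using AMDGPU

a = ROCArray(ones(Float32, 1024))
b = a .+ 1.0f0              # work ordered on the task-local default stream

AMDGPU.synchronize()        # waits on the task-local stream, like CUDA.synchronize()
AMDGPU.device_synchronize() # heavier: waits for all outstanding work on the device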
