Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set -mllvm -amdgpu-function-calls=true for HIP builds #569

Closed
BenWibking opened this issue Mar 18, 2024 · 7 comments
Closed

set -mllvm -amdgpu-function-calls=true for HIP builds #569

BenWibking opened this issue Mar 18, 2024 · 7 comments
Assignees
Labels
AMDGPU affects AMD GPUs compiler bug caused by a bug in a compiler, not in the code itself enhancement New feature or request

Comments

@BenWibking
Copy link
Collaborator

Describe the proposal
We should automatically add -mllvm -amdgpu-function-calls=true to the compiler flags when -DAMREX_GPU_BACKEND=HIP (AMD GPUs). This works around compiler bugs for large GPU kernels (e.g., chemistry networks).

Describe alternatives you've considered
Alternatively, don't use AMD GPUs.

Additional context
See also: AMReX-Astro/Microphysics#1489.

@BenWibking BenWibking added enhancement New feature or request compiler bug caused by a bug in a compiler, not in the code itself labels Mar 18, 2024
@BenWibking BenWibking added the AMDGPU affects AMD GPUs label Mar 18, 2024
@psharda
Copy link
Contributor

psharda commented Mar 18, 2024

I like the alternative you have considered.

@psharda
Copy link
Contributor

psharda commented Mar 18, 2024

On a more serious note, this is easy to do for a single machine (like moth) because one can just update the corresponding profile. However, it does not seem to work in general because amrex sets amdgpu-function-calls=false in its CMake files, which does not seem to be overridden if we set the opposite in the CMake for quokka. This is the case for Setonix -- I had to go into quokka/extern/amrex and manually change it to true, which is of course sub-optimal.

@BenWibking
Copy link
Collaborator Author

BenWibking commented Mar 18, 2024

On a more serious note, this is easy to do for a single machine (like moth) because one can just update the corresponding profile. However, it does not seem to work in general because amrex sets amdgpu-function-calls=false in its CMake files, which does not seem to be overridden if we set the opposite in the CMake for quokka. This is the case for Setonix -- I had to go into quokka/extern/amrex and manually change it to true, which is of course sub-optimal.

Can you open an issue on the AMReX GitHub repo for this problem? Allowing downstream codes to change this is something they need to fix in their CMake settings.

@BenWibking BenWibking mentioned this issue Jul 11, 2024
7 tasks
github-merge-queue bot pushed a commit that referenced this issue Jul 12, 2024
### Description
This updates the Setonix profile to use the latest module versions that
are available. Pawsey updated the software stack in mid-June 2024 to
Cray PE 23.09.

I've compiled and tested the ShockCloud problem on Setonix GPUs. It
appears to work.

### Related issues
#569

### Checklist
_Before this pull request can be reviewed, all of these tasks should be
completed. Denote completed tasks with an `x` inside the square brackets
`[ ]` in the Markdown source below:_
- [x] I have added a description (see above).
- [x] I have added a link to any related issues see (see above).
- [x] I have read the [Contributing
Guide](https://github.com/quokka-astro/quokka/blob/development/CONTRIBUTING.md).
- [ ] I have added tests for any new physics that this PR adds to the
code.
- [ ] I have tested this PR on my local computer and all tests pass.
- [ ] I have manually triggered the GPU tests with the magic comment
`/azp run`.
- [x] I have requested a reviewer for this PR.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@BenWibking
Copy link
Collaborator Author

The underlying problem is that the AMD GPU compiler does register allocation incorrectly: https://discourse.llvm.org/t/the-current-state-of-spilling-function-calls-and-related-problems/2863.

Unfortunately, there is nothing we can do about this until they rewrite their compiler.

@BenWibking
Copy link
Collaborator Author

@psharda This issue is supposed to be fixed in ROCm 6.3.1, which is now on moth. Do you have a short test case for PopIII that I can run that didn't work before?

@BenWibking
Copy link
Collaborator Author

Mike says it works now for their nuclear networks without any of the old workarounds.

@BenWibking
Copy link
Collaborator Author

It looks like this is no longer needed as of ROCm 6.3.1, so I'll close this issue.

@BenWibking BenWibking closed this as not planned Won't fix, can't repro, duplicate, stale Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AMDGPU affects AMD GPUs compiler bug caused by a bug in a compiler, not in the code itself enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants