Error Codes: Return scalars from kernels #2844

Open
christophermaynard opened this issue Jan 13, 2025 · 0 comments
Labels
LFRic Issue relates to the LFRic domain

Comments

@christophermaynard
Collaborator

Some of the new transport scheme kernels have logging functionality, as the algorithm can potentially fail if the conditions are too severe. Apparently, it is not trivial to ascertain this before calling the kernel. Thus these kernels fail gracefully by calling the logger with an error. For kernel execution on a GPU, having CPU functionality inside the kernel is not a good idea; basically, it won't work.
The solution is for the kernel to return an error code. This means returning a scalar, which can then be tested higher up the call stack for success or failure, so that execution halts gracefully on the CPU. Currently, this is not supported. One reason is that it is not thread safe: if the kernel is called from a threaded region, one column may fail and another may not. Which is correct? In this case, if any column fails, execution should halt. If the scalar were tested in the PSy layer, the failing column could be included in the error message. Perhaps a better solution is to perform a local reduction on the scalar, which would then trigger the exit in the Algorithm layer.
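
To make the idea of a returned status scalar concrete, here is a minimal sequential sketch in Python. It only models the control flow; the real kernels are Fortran, and every name here (`departure_kernel`, `psy_layer`, `algorithm_layer`, the `max_wind` failure condition) is hypothetical and not taken from LFRic or PSyclone.

```python
# Hypothetical sketch: none of these names come from LFRic or PSyclone.
KERNEL_OK = 0
KERNEL_FAILED = 1


def departure_kernel(wind, max_wind):
    """Stand-in for a column kernel: return a status code instead of logging."""
    if abs(wind) > max_wind:
        return KERNEL_FAILED  # assumed failure condition, for illustration only
    return KERNEL_OK


def psy_layer(winds, max_wind):
    """Loop over columns and return (status, list of failing column indices)."""
    failed = [col for col, wind in enumerate(winds)
              if departure_kernel(wind, max_wind) != KERNEL_OK]
    status = KERNEL_FAILED if failed else KERNEL_OK
    return status, failed


def algorithm_layer(winds, max_wind):
    """Test the status on the host and halt with a message naming the columns."""
    status, failed = psy_layer(winds, max_wind)
    if status != KERNEL_OK:
        raise RuntimeError(f"transport kernel failed in columns {failed}")
    return status
```

The kernel itself does no logging, so nothing inside it depends on host-side functionality; the decision to halt is taken by the caller.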
Specifically, if the scalar were set to zero before the horizontal domain loop and set to one in a failing kernel, then a reduction across threads after kernel execution (for example, a maximum) would yield one if any column failed and zero otherwise. This could then be tested in the Algorithm layer.
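
The reduction described above could look roughly like the following Python sketch, which uses a thread pool as a stand-in for a threaded region. The names (`column_kernel`, `process_chunk`, `run_columns`) and the `limit` failure condition are assumptions made for illustration. A maximum is used as the reduction operator so the result stays at one however many columns fail; a sum would instead give the number of failing columns.

```python
from concurrent.futures import ThreadPoolExecutor


def column_kernel(value, limit):
    """Hypothetical kernel: return 1 if this column fails, 0 otherwise."""
    return 1 if abs(value) > limit else 0


def process_chunk(chunk, limit):
    """Work done by one thread: keep a thread-local flag, initialised to zero."""
    local_flag = 0
    for value in chunk:
        local_flag = max(local_flag, column_kernel(value, limit))
    return local_flag


def run_columns(values, limit, n_threads=4):
    """Split the columns across threads, then max-reduce the local flags."""
    chunks = [values[i::n_threads] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        local_flags = list(pool.map(lambda c: process_chunk(c, limit), chunks))
    # Reduction across threads: 1 if any column failed, 0 otherwise.
    return max(local_flags, default=0)
```

Because each thread only writes its own local flag and the flags are combined after the loop, no thread races on a shared scalar, and the caller sees a single value to test.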

Example transport kernel is common/hori_dep_dist_ffsl_kernel_mod.F90

@christophermaynard christophermaynard added the LFRic Issue relates to the LFRic domain label Jan 13, 2025