Automatic NaN checking as a debugging feature #4181

Open
glwagner opened this issue Mar 7, 2025 · 0 comments
Labels
feature 🌟 Something new and shiny


glwagner commented Mar 7, 2025

After chatting with @charleskawczynski, I was wondering whether a feature that automatically checks for and reports NaNs appearing after a kernel launch might be useful.

It's not too hard to implement such a feature.

Basically, it just means inserting a check into _launch!:

@inline function _launch!(arch, grid, workspec, kernel!, first_kernel_arg, other_kernel_args...;
                          exclude_periphery = false,
                          reduced_dimensions = (),
                          active_cells_map = nothing)
    location = Oceananigans.location(first_kernel_arg)
    loop!, worksize = configure_kernel(arch, grid, workspec, kernel!;
                                       location,
                                       exclude_periphery,
                                       reduced_dimensions,
                                       active_cells_map)
    # Don't launch kernels with no size
    haswork = if worksize isa OffsetStaticSize
        length(worksize) > 0
    elseif worksize isa Number
        worksize > 0
    else
        true
    end
    if haswork
        loop!(first_kernel_arg, other_kernel_args...)
    end
    return nothing
end

after loop!, which would be something like

if check_for_nans
    args = (first_kernel_arg, other_kernel_args...)
    for n in 1:length(args)
        if args[n] isa AbstractArray
            # NaN == NaN is false, so test with isnan rather than comparing against NaN
            found_nan = any(isnan, args[n])
            found_nan && error("Found a NaN in the $(n)th argument to $kernel!")
        end
    end
end
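
One detail worth keeping in mind for a check like this: under IEEE 754 a NaN does not compare equal to anything, including itself, so an elementwise comparison against NaN never matches, while `isnan` does. A small illustration on a plain `Vector` (the array here is made up for the example):

```julia
# A made-up array containing one NaN
a = [1.0, NaN, 3.0]

# Elementwise equality against NaN is false everywhere, so the NaN is missed
equality_check = any(a .== NaN)

# isnan tests each element directly, so the NaN is detected
isnan_check = any(isnan, a)

println("a .== NaN finds a NaN: ", equality_check)
println("isnan finds a NaN:     ", isnan_check)
```

The `any(isnan, a)` form also avoids allocating the temporary `Bool` array that `a .== NaN` creates.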

In terms of how to implement this, I think the least invasive way is through a global variable, sort of like a log level.
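
A minimal sketch of the global-variable approach, assuming a `Ref{Bool}` toggle. The names `CHECK_FOR_NANS`, `check_for_nans!`, and `maybe_check_for_nans` are hypothetical and do not exist in Oceananigans; the check is restricted to floating-point arrays as a simplifying assumption:

```julia
# Hypothetical global toggle, off by default (similar in spirit to a log level)
const CHECK_FOR_NANS = Ref(false)

# Hypothetical setter for switching the check on or off
check_for_nans!(flag::Bool) = (CHECK_FOR_NANS[] = flag; nothing)

# Hypothetical helper that could be called right after a kernel runs
function maybe_check_for_nans(kernel_name, args...)
    CHECK_FOR_NANS[] || return nothing   # do nothing when the toggle is off
    for (n, arg) in enumerate(args)
        if arg isa AbstractArray{<:AbstractFloat} && any(isnan, arg)
            error("Found a NaN in argument $n to $kernel_name")
        end
    end
    return nothing
end
```

With the toggle off, the helper returns immediately, so the cost in normal runs would be a single branch per launch. Calling `check_for_nans!(true)` at the top of a script would then turn the check on for debugging.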

But a more general design would add info to arch. One could even allow general callbacks in arch:

struct CPU{C}
    launch_callback :: C
end

CPU() = CPU(nothing)
has_callback(::CPU{Nothing}) = false
has_callback(::CPU) = true

so that within _launch!,

if has_callback(arch)
    arch.launch_callback(same_args_that_launch_gets...)
end
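
To make the callback idea concrete, here is a self-contained toy version that can be run outside Oceananigans. `SketchCPU`, `sketch_launch!`, `nan_callback`, and `divide!` are all hypothetical stand-ins, and the "kernel" is just an ordinary function rather than a KernelAbstractions kernel:

```julia
# Hypothetical stand-in for an architecture that carries a launch callback
struct SketchCPU{C}
    launch_callback :: C
end

SketchCPU() = SketchCPU(nothing)
has_callback(::SketchCPU{Nothing}) = false
has_callback(::SketchCPU) = true

# Hypothetical stand-in for _launch!: run the kernel, then the callback if one is attached
function sketch_launch!(arch, kernel!, args...)
    kernel!(args...)
    has_callback(arch) && arch.launch_callback(kernel!, args...)
    return nothing
end

# A callback that throws if any floating-point array argument contains a NaN
function nan_callback(kernel!, args...)
    for (n, arg) in enumerate(args)
        if arg isa AbstractArray{<:AbstractFloat} && any(isnan, arg)
            error("Found a NaN in argument $n to $kernel!")
        end
    end
    return nothing
end

# Toy "kernel": elementwise division, which yields NaN for 0.0 / 0.0
divide!(c, a, b) = (c .= a ./ b; nothing)
```

Here `sketch_launch!(SketchCPU(), divide!, c, a, b)` runs silently, while `sketch_launch!(SketchCPU(nan_callback), divide!, c, a, b)` throws as soon as `c` contains a NaN. Because `has_callback` dispatches on the type parameter, the no-callback branch can be resolved at compile time.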

It'd also be fun to print the index that the NaN(s) were found at.
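
For reporting indices, `findall(isnan, A)` is one option on ordinary arrays. A small sketch on a made-up 3×4 array (the helper name `nan_report` is hypothetical):

```julia
# A made-up array with NaNs at two known locations
A = zeros(3, 4)
A[2, 3] = NaN
A[3, 1] = NaN

# For a multidimensional array, findall returns CartesianIndex values in column-major order
nan_indices = findall(isnan, A)

# Hypothetical helper that formats the locations into a message
nan_report(name, inds) = "Found $(length(inds)) NaN(s) in $name at " * join(Tuple.(inds), ", ")

println(nan_report("A", nan_indices))
```

The indices are relative to whatever array is passed in, so for a field with halo regions they would still need to be translated into the field's own indexing.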

@glwagner glwagner added the feature 🌟 Something new and shiny label Mar 7, 2025