Fix rounding problems in _mean function #453

Open
wants to merge 1 commit into base: master
Conversation

gabrielpreviato

As reported in JuliaGPU/CUDA.jl#1773, the current _mean function produces rounding problems for larger arrays.

julia> A = CUDA.ones((640, 640, 32, 1))
640×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:

julia> T = float(eltype(A))
Float32

julia> λ = convert(T, inv(_mean_denom(A, dims)))
0.0015625f0

julia> sum(Base.Fix1(*,λ), A; dims)
1×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994

[:, :, 2, 1] =
 0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994

[:, :, 3, 1] =
 0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994

;;; …

[:, :, 30, 1] =
 0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994

[:, :, 31, 1] =
 0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994

[:, :, 32, 1] =
 0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994  0.999994

This PR changes the order of operations: multiplying each element by λ before summing can cause a loss of precision with small numbers, so the sum is taken first and the result is scaled afterwards.

julia> A = CUDA.ones((640, 640, 32, 1))
640×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:

julia> T = float(eltype(A))
Float32

julia> λ = convert(T, inv(_mean_denom(A, dims)))
0.0015625f0

julia> sum(A; dims) .* λ
1×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 2, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 3, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

;;; …

[:, :, 30, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 31, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 32, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

Performance-wise, I can't say what difference this may cause (I'm not an expert on Julia performance). If anyone has ideas on how to benchmark it, I'm open to learning and trying.
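For reference, here is a minimal sketch of the two variants being compared. This is an illustration, not the exact NNlib source: the _denom helper and the function names below are made-up stand-ins, with _denom assumed to return the number of elements the reduction collapses (what _mean_denom provides).

# Hypothetical stand-in for NNlib's _mean_denom.
_denom(A, dims) = dims === Colon() ? length(A) : prod(d -> size(A, d), dims)

# Before this PR (the #443 version): scale every element by λ inside the reduction.
function _mean_before(A; dims = :)
    T = float(eltype(A))
    λ = convert(T, inv(_denom(A, dims)))
    sum(Base.Fix1(*, λ), A; dims)
end

# After this PR: reduce first, then scale the (much smaller) result once.
function _mean_after(A; dims = :)
    T = float(eltype(A))
    λ = convert(T, inv(_denom(A, dims)))
    sum(A; dims) .* λ
end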

@mcabbott
Contributor

Note that I can't reproduce this, I get 1.0f0. If details of the GPU change how mapreduce gets done, then I would generically expect small floating-point differences. Whether 100 eps is small or not in this context I'm not sure.
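(As an aside, for scale: the 0.999994 printed in the transcript above sits roughly 100 units in the last place below 1.0f0, which is presumably where the "100 eps" figure comes from. A quick check:

x = 0.999994f0      # the value shown in the transcript above
ulp = eps(x)        # Float32 spacing just below 1.0 is 2^-24 ≈ 6.0e-8
(1.0f0 - x) / ulp   # roughly 100
)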

What #443 was trying to avoid is allocating two arrays for the output.

I think it will also be more prone to overflow, as here: JuliaGPU/CUDA.jl#1773 (comment)
For a given size in memory, overflow will be a problem first for the complete mean (no dims); that case (a few lines up) does the simple sum-then-divide, and #443 did not change it.
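A small CPU illustration of that overflow concern (a sketch only; Float16 is used because it makes overflow easy to reach, whereas the transcripts above use Float32):

x = fill(Float16(100), 1_000)   # true mean is exactly 100

sum(x) / length(x)              # the running sum passes floatmax(Float16) ≈ 6.55e4,
                                # so this likely overflows to Inf16

λ = Float16(inv(length(x)))
sum(Base.Fix1(*, λ), x)         # pre-scaled summands are ≈ 0.1, so partial sums
                                # stay far below floatmax(Float16)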

@maleadt
Member

maleadt commented Feb 14, 2023

Without fully understanding what the actual cause is, we shouldn't just revert the change from #443.

@maleadt
Member

maleadt commented Feb 14, 2023

What #443 was trying to avoid is allocating two arrays for the output.

That we can do by using rmul! or sum(f, A; dims) .*= λ
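A sketch of that in-place variant, with A, dims, and λ as placeholders matching the transcripts above; the reduction's output is freshly allocated, so scaling it in place avoids the second output array:

using LinearAlgebra: rmul!

y = sum(A; dims)   # the one allocation: the reduced array
rmul!(y, λ)        # scale it in place, no second array
# or equivalently, with broadcasting:
y .*= λ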

@mcabbott
Contributor

rmul! or sum(f, A; dims) .*= λ

Indeed. Maybe #443 was concerned about making more kernel launches? I'm never very sure what's going to be fast or not in GPU land.

more prone to overflow

Thinking more... acting on ones(T, L), the naive way adds values up to L first, while the present code adds values of size 1/L. If T(L) overflows, then probably T(1/L) is saved from underflow only by subnormals. The better approach would probably be to scale by 1/sqrt(L) before summing, and by another 1/sqrt(L) after. Base does not do this.
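To illustrate that point in Float16 (a sketch, not part of the comment above):

L = 70_000
Float16(L)                    # Inf16: L itself overflows
Float16(1 / L)                # ≈ 1.4e-5, below floatmin(Float16) ≈ 6.1e-5
issubnormal(Float16(1 / L))   # true: representable only as a subnormal

# Two-stage scaling as sketched above: 1/sqrt(L) before the sum and again after,
# keeping both factors (≈ 3.8e-3 here) well inside the normal range.
μ = Float16(inv(sqrt(L)))
two_stage_mean(x) = sum(xi -> xi * μ, x) * μ   # assumes length(x) == L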
