Latency #352

Open
willtebbutt opened this issue Jan 5, 2023 · 6 comments
@willtebbutt
Member

willtebbutt commented Jan 5, 2023

With all of the exciting stuff happening with code caching in 1.9, I thought I'd take a look at our latency for some common tasks.

Consider the following code:

using Pkg
pkg"activate ."

# package loading
@time using AbstractGPs, KernelFunctions, Random, LinearAlgebra

# first evaluation
@time begin
    X = randn(5, 25)
    x = ColVecs(X)
    f = GP(SEKernel())
    fx = f(x, 0.1)
    y = rand(fx)
    logpdf(fx, y)

    f_post = posterior(fx, y)
    y_post = rand(f_post(x, 0.1))
end;

# second evaluation
@time begin
    X = randn(5, 25)
    x = ColVecs(X)
    f = GP(SEKernel())
    fx = f(x, 0.1)
    y = rand(fx)
    logpdf(fx, y)

    f_post = posterior(fx, y)
    y_post = rand(f_post(x, 0.1))
end;

It times package loading, and the first and second evaluations of some pretty standard AbstractGPs code.
On 1.9, I see the following results:

# package loading
 1.036674 seconds (1.77 M allocations: 115.412 MiB, 3.65% gc time, 17.33% compilation time)

# first evaluation
 1.934089 seconds (3.70 M allocations: 251.913 MiB, 6.21% gc time, 141.09% compilation time)

# second evaluation
 0.000115 seconds (61 allocations: 112.578 KiB)

Overall, this doesn't seem too bad.

However, we're not taking advantage of pre-compilation anywhere within the JuliaGPs ecosystem, so I wanted to know what would happen if we tried that. To this end, I added the following precompile statements in AbstractGPs:

kernels = [SEKernel(), Matern12Kernel(), Matern32Kernel(), Matern52Kernel()]
xs = [ColVecs(randn(2, 3)), RowVecs(randn(3, 2)), randn(3)]
for k in kernels, x in xs
    precompile(kernelmatrix, (typeof(k), typeof(x)))
    precompile(kernelmatrix, (typeof(k), typeof(x), typeof(x)))
end

for x in xs
    precompile(
        _posterior_computations,
        (typeof(Zeros(5)), Matrix{Float64}, typeof(x), Vector{Float64}),
    )
end

# Pre-compile various AbstractGPs-specific things.
precompile(Diagonal, (typeof(Fill(0.1, 10)), ))
precompile(_rand, (typeof(Random.GLOBAL_RNG), typeof(Zeros(5)), Matrix{Float64}, Int))
precompile(_rand, (Xoshiro, typeof(Zeros(5)), Matrix{Float64}, Int))
precompile(_logpdf, (typeof(Zeros(5)), Matrix{Float64}, Vector{Float64}))

I've tried to add precompile statements only for low-level code that doesn't get involved in combinations of things. For example, I don't think it makes sense to add a precompile statement for kernelmatrix for a sum of kernels, because you'd have to compile a separate method instance for each combination of kernel types you ever encounter, and I want to avoid a combinatorial explosion.

_logpdf, _rand, and _posterior_computations are bits of code I've pulled out of logpdf, rand, and posterior which are GP-independent, i.e. they just depend on matrix types etc. This feels fair, because they don't need to be re-compiled for every new kernel that's used, only when the output of kernelmatrix isn't a Matrix{Float64} or whatever.
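As a rough sketch of this pattern (hypothetical signatures; the actual AbstractGPs internals differ in the details, and mean_and_cov here is a stand-in for whatever extracts the mean and covariance), the idea is to funnel the kernel-dependent work into the construction of plain arrays, and keep the numerical core generic:

```julia
using LinearAlgebra

# Kernel-dependent part: extracting the mean vector and covariance matrix from
# `fx` requires compiling code specific to the GP / kernel types involved.
logpdf_sketch(fx, y) = _logpdf(mean_and_cov(fx)..., y)

# GP-independent part: only ever needs compiling for plain array types such as
# (Vector{Float64}, Matrix{Float64}, Vector{Float64}), no matter which kernel
# produced the covariance matrix, so a single precompile statement covers it.
function _logpdf(m::AbstractVector, C_mat::AbstractMatrix, y::AbstractVector{<:Real})
    C = cholesky(Symmetric(C_mat))
    # Whitened residual: sum(abs2, z) equals (y - m)' * inv(C_mat) * (y - m).
    z = C.U' \ (y - m)
    return -(length(y) * log(2π) + logdet(C) + sum(abs2, z)) / 2
end
```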

Anyway, the results are:

# code loading
 1.083321 seconds (1.89 M allocations: 123.372 MiB, 3.33% gc time, 16.66% compilation time)

# first execution
0.466714 seconds (957.73 k allocations: 65.151 MiB, 3.31% gc time, 137.49% compilation time)

# second execution
 0.000111 seconds (61 allocations: 112.578 KiB)

So it looks like by pre-compiling, we can get a really substantial 4x reduction in time-to-first-inference, or whatever we're calling it.

If you use the slightly more complicated kernel

0.1 * SEKernel() + 0.5 * Matern32Kernel()

you see (without pre-compilation):

# first evaluation
2.236816 seconds (4.05 M allocations: 275.188 MiB, 5.90% gc time, 139.38% compilation time)

# second evaluation
0.000161 seconds (100 allocations: 190.422 KiB)

With pre-compilation you see something like:

# first execution
0.553095 seconds (1.06 M allocations: 72.149 MiB, 5.66% gc time, 138.60% compilation time)

# second execution
0.000157 seconds (95 allocations: 241.000 KiB)

So here we see a similar performance boost: because we've pre-compiled all of the code to compute the kernelmatrices for the SEKernel and the Matern32Kernel, only the code for the kernelmatrix of their sum needs to be compiled on the fly.

It does look like there's a small penalty paid in load time, but I think it will typically be outweighed substantially by the compilation savings.

I wonder whether there's a case for adding a for-loop to KernelFunctions that pre-compiles the kernelmatrix, kernelmatrix_diag etc methods for each "simple" kernel, where by "simple" I basically just mean anything that's not a composite kernel, and adding the kinds of method I've discussed above to AbstractGPs. It might make the user experience substantially more pleasant 🤷 . I for one would love to have these savings.
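Such a loop might look roughly like the following (a sketch only; the exact set of kernels and input types to cover is a design decision, and the instances below exist purely to obtain concrete types):

```julia
# Hypothetical precompile loop, to be run at the bottom of the KernelFunctions
# module definition, covering each "simple" (non-composite) kernel.
let
    simple_kernels = (SEKernel(), Matern12Kernel(), Matern32Kernel(), Matern52Kernel())
    inputs = (ColVecs(randn(2, 3)), RowVecs(randn(3, 2)), randn(3))
    for k in simple_kernels, x in inputs
        precompile(kernelmatrix, (typeof(k), typeof(x)))
        precompile(kernelmatrix, (typeof(k), typeof(x), typeof(x)))
        precompile(kernelmatrix_diag, (typeof(k), typeof(x)))
    end
end
```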

@devmotion
Member

Wasn't SnoopPrecompile designed for this precompilation scenario?
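For reference, SnoopPrecompile's approach is to execute a representative workload during precompilation rather than listing individual method signatures. A sketch for AbstractGPs might look like this (assuming it sits inside the package module, with KernelFunctions available):

```julia
using SnoopPrecompile

@precompile_setup begin
    # Setup code runs during precompilation but is not itself cached.
    X = randn(5, 10)
    @precompile_all_calls begin
        # Everything called in here gets compiled and cached in the pkgimage.
        x = ColVecs(X)
        f = GP(SEKernel())
        fx = f(x, 0.1)
        y = rand(fx)
        logpdf(fx, y)
        rand(posterior(fx, y)(x, 0.1))
    end
end
```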

@devmotion
Member

In SciML we discussed adding Preferences-based precompilation statements to allow users to switch on/off precompilation in a somewhat granular way, depending on their use cases.
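With Preferences.jl, such a switch could look roughly like this (the preference name here is hypothetical):

```julia
using Preferences

# Users could opt out before precompilation with something like
#   using Preferences, AbstractGPs
#   set_preferences!(AbstractGPs, "precompile_workloads" => false)
const PRECOMPILE_WORKLOADS = @load_preference("precompile_workloads", true)

if PRECOMPILE_WORKLOADS
    # ...precompile statements / workload go here...
end
```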

@willtebbutt
Member Author

In SciML we discussed adding Preferences-based precompilation statements to allow users to switch on/off precompilation in a somewhat granular way, depending on their use cases.

Excellent idea! I was just wondering about this. I've not looked too hard at SnoopPrecompile, so will have to do so.

I was wondering in particular about preferences-based precompilation for anything Zygote-related, but I've not been able to get Zygote to precompile properly yet. For example, I've tried adding this code block to AbstractGPs:

for k in kernels, x in xs
    precompile(kernelmatrix, (typeof(k), typeof(x)))
    precompile(kernelmatrix, (typeof(k), typeof(x), typeof(x)))
    @show typeof(k), typeof(x), typeof(kernelmatrix)
    (() -> Zygote._pullback(Zygote.Context(), KernelFunctions.kernelmatrix, k, x))()
    # precompile(Zygote._pullback, (typeof(Zygote.Context()), typeof(kernelmatrix), typeof(k), typeof(x)))
    # out, pb = Zygote._pullback(Zygote.Context(), kernelmatrix, k, x)
end

but when I run

using Pkg
pkg"activate ."

@time @eval using AbstractGPs, KernelFunctions, Random, LinearAlgebra, Zygote

@time @eval begin
    X = randn(5, 25)
    x = ColVecs(X)
    f = GP(SEKernel())
    fx = f(x, 0.1)
    y = rand(fx)
    logpdf(fx, y)
    out, pb = Zygote._pullback(Zygote.Context(), KernelFunctions.kernelmatrix, SEKernel(), x)

    # out, pb = Zygote._pullback(Zygote.Context(), logpdf, fx, y)
    # Zygote.gradient(logpdf, fx, y)
end;

it still takes around 15s. I'm not entirely clear why this should be the case (e.g. I don't see how the method could be invalidated), but maybe I'm missing something obvious.

@simsurace
Member

You may be aware of this by now, but I just wanted to note that in order to precompile the calls in other libraries you need to use SnoopPrecompile. I have been able to do so with massive benefits in TTFX when Zygote is involved.

@willtebbutt
Member Author

You may be aware of this by now, but I just wanted to note that in order to precompile the calls in other libraries you need to use SnoopPrecompile. I have been able to do so with massive benefits in TTFX when Zygote is involved.

I was very much not aware of this (although maybe that's what David is alluding to above). Have you managed to e.g. get the entirety of the forwards- and reverse-passes to compile nicely?

@simsurace
Member

Yes, exactly. My journey is documented here.
