-
-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Q] A few questions on additional variable metadata and C++ codegen #812
Comments
Ah, I guess it would also depend on the design in the surface language, right? What are you planning to do in that respect? I'd always had a design in mind where we add types like And we could implement any functions that operate on the GPU with the corresponding types then, e.g. Then, we would just propagate these types throughout the AST and MIR. Would that work or is it not at all related to what you're thinking about? |
(Sorry if what I'm saying is naive. I've been out of the loop for a year, so have no idea what the current state is of GPU support.) |
Sorry, I should have explained all of this a bit better. So there are two projects nearing its final stages, one is varmat and the other is the OpenCL support.
This introduces a new way of representing matrices, vectors and row vectors in Stan Math. Instead of The majority of Stan Math functions already support varmat, support for some will be added in the near future, some functions will probably never support varmat (like functions where we dont define the reverse mode and just use some underlying implementation of Eigen).
The backend currently supports all We also have some support in stanc3 for OpenCL, right now only for distributions as those return scalars and we dont need to worry about what does or does not stay on the OpenCL device. OpenCL transformations are part of the transform MIR phase. What currently happens is as follows: data {
int<lower=0> N;
int<lower=0> n_redcards[N];
int<lower=0> n_games[N];
vector[N] rating;
}
parameters {
vector[2] beta;
}
model {
beta[1] ~ normal(0, 10);
beta[2] ~ normal(0, 1);
target += binomial_logit_lpmf(n_redcards | n_games, beta[1] + beta[2] * rating);
}
beta[1] ~ normal(0, 10);
beta[2] ~ normal(0, 1);
target += binomial_logit_lpmf(n_redcards_opencl__ | n_games_opencl__, to_matrix_cl(beta[1] + beta[2] * rating)); This is fine and can provide nice speedups in the ~5 range for moderate sizes. But we could also do more by also supporting other functions. Like: beta[1] ~ normal(0, 10);
beta[2] ~ normal(0, 1);
target += binomial_logit_lpmf(n_redcards_opencl__ | n_games_opencl__, beta[1] + beta[2] * rating_opencl__); gets us a bit more: (This is done with develop Cmdstan and my local branch of stanc3 that does the additional transform MIR steps). The plan was to try and hide this computational details from the user (they just switch the flag on). Could still be convinced otherwise though I think the fact that Stan hides a lot of these computational details from the user is a big plus compared to other tools. |
Very cool! (Sorry, just seeing this now. I need to make sure github emails end up in my inbox.) I very much agree that the strength of Stan is that it is a high level language and that the user does not have to think much about low level details. Part of my motivation with stanc3 was also to push further in that direction by automating various optimizations and program transformations. (Of course, just reproducing the basic functionality of stanc2 took most of the time I had on my contract. Though I wrote a fair amount of static analysis and optimization code, it wasn't hooked up yet when my contract ended.) In this particular case, aren't there situations where you'd end up copying back and forth a lot of stuff to the GPU to the point that it's not giving you a speedup anymore? I can see that if all the (high dimensional) arguments are data, you'd probably always want to run the operation on the GPU if possible, as you only need to copy the data there once. But what happens if some of the high dimensional arguments are parameters. Wouldn't the copying get quite expensive in that case? It looks like it's still worth it in the case about with the rating parameter, but aren't there cases where the cost of copying is simply too high? (Perhaps not, just playing the devil's advocate here.) Also, could you maybe explain a bit more about the difference between the rating and rating_opencl__ cases above? I'm not sure I understand the difference? rating would probably need to get moved to the GPU at some point anyway, right? Or does the top one only perform part of the binomial_logit_lpmf calculation on the GPU and the rest on the CPU? |
Also, those speed-ups are insane! So cool! This'll allow Baysian inference to scale to totally new applications once it's done. |
Even with this design where the compiler essentially decides what gets sent to GPU, perhaps it's useful to just have explicit types for GPU-matrices etc in the MIR? They wouldn't be in the AST then, as the user cannot write them, but one of the optimization passes could then optimize some of the code to operate on GPU-matrices instead and implement the required explicit casts to transfer data to the GPU? |
Yes, you wouldnt just send Right now I split functions in two types:
I then do a pass and just move all fast functions to be run on the GPU. Then do subsequent passes that move additional function to the GPU. And the final pass moves any remaining parameters from the GPU to the CPU. See this summary of benchmarks: https://github.com/rok-cesnovar/stan-math-opencl-benchmarks/blob/main/summaries/radeonvii_vs_i7_distributions.md. For data args we discard the data transfer (because the transfer happens once during sampling so it can be ignored), parameter data transfers are included. If any of the parameters are already on the GPU/OpenCL device the speedup of that function is then similar to the all-data numbers, which is also what we are seeing in the example figure I posted above. For example for the binomial_lpmf(int[], int[], vector) we have this figure: If we do You can see that almost all lpxf functions are faster on the GPU even for the worst case. The worst case is that all arguments are parameters, you copy them just for the lpxf function and that is it. This probably isnt a common case as there is at least some expression on any of the parameters or one of the arguments is data.
The matrixl_cl<int> n_redcards_opencl__;
matrixl_cl<int> n_games_opencl__;
matrixl_cl<double> rating_opencl__; and the constructor gains these at the end: n_redcards_opencl__ = to_matrixl_cl(n_redcards);
n_games_opencl__ = to_matrixl_cl(n_games);
rating_opencl__ = to_matrixl_cl(rating); So for: target += binomial_logit_lpmf(n_redcards_opencl__ | n_games_opencl__, beta[1] + beta[2] * rating_opencl__); we run
on the GPU (there is a multiply(scalar data/param on the CPU, anything on the GPU) overload, then also do the addition on the GPU and finally call this signature in C++: binomial_logit_lpmf(matrix_cl<int>, matrix_cl<int>, var_value<matrix_cl<double>>);
Oh, I didnt realize that was an option. I would be intersted in that yes. For both this and varmat. We now work with these |
There are similar figures to the binomial one for almost all lpdf/lpmf functions: https://unilj-my.sharepoint.com/:f:/g/personal/rokcesnovar_fri1_uni-lj_si/EgPP9IsCdFJJg0Ygm4_4-HcB9FOozC3FayacTGXeaT6LEg?e=ULRnwA Not sure anyone is interested but posting anyway :) |
Of course people are interested! This is so cool! Is there any reason the GLMs are missing? Aren't those things you typically would want to do on the GPU? |
Yeah, I see no reason why you couldn't just add some types that are only in the MIR and not in the AST. At the very least, doing this in a type safe way using some kind of tags rather than using suffixes seems like a good idea. It's one of the things that OCaml gets us I guess, that we have easy variant types. |
GLMs are missing because we knew those are faster always (we did those benchmarks a long time ago), but I have been meaning to run the same benchmarks for those as well. Will add. Will see if I can make the MIR-only type work if not I will bug you :) Thanks! |
@rok-cesnovar , thanks for the explanation of your two pass approach! That's so cool! Really clever to take this approach of first moving some of the stuff to GPU that should always be on GPU, and then seeing what's already on GPU and what further optimizations that enables! That makes sense. |
Thanks! Its a bit conservative. We have been circling around this idea for like two years which is what it took to support all the functions and make the kernel generator. Now the fun/language stuff. |
Also, I see that you guys have been looking at doing automatic kernel fusion. Is that implemented in Stan already? Is the fusion done in C++ or in OCaml? I'd also love to understand how the GPU primitives interact with the reduce_sum. Is there any meaningful interaction? Is there a more general reduce (a parallel fold) for any associative operation, or is there only a sum? |
I bugged @seantalts a year and a half ago with this, which in hindsight was too early :) Underestimated how long the backend would take.
Yes, that is implemented and I think most of the distributions are implemented using this (some require separate manual kernels). @t4c1 is the mastermind behind that operation. Its done in C++ with expression templating. There is a presentation on that from last year https://www.youtube.com/watch?v=0wUxv6Mv61g&ab_channel=IWOCL%26SYCLcon For examples of its use see examples in a distribution (https://github.com/stan-dev/math/blob/develop/stan/math/opencl/prim/bernoulli_logit_lpmf.hpp) or a non-distributuon function (https://github.com/stan-dev/math/blob/develop/stan/math/opencl/rev/mdivide_left_tri_low.hpp). The rest of the kernel generator code is at https://github.com/stan-dev/math/tree/develop/stan/math/opencl/kernel_generator
We havent really tried it with We know this helps because we have seen that GPU vs CPU speedup increases with more chains. Meaning that CPU/GPU speedup is higher for 10 chains than 4 chains, meaning that the GPU is more efficient with more chains (obviously until we hit a limit). The below chart shows for example how many chains you can run in time of a single CPU chain with increasing N: This was done for the same example model from above (the binomial_logit_lpmf). |
There is also some docs on the kernel fusion available: https://mc-stan.org/math/d2/d3c/group__opencl__kernel__generator.html |
Cool! Thanks! It's pretty unbelievable what you guys have been able to achieve in 2 years. :-D |
I warned Sean about Tadej at the Cambridge StanCon :) |
@rok-cesnovar , about your two optimization passes for automating what gets moved to GPU: are they implemented/prototyped yet? If not, perhaps I can help with that? |
I have a minimal prototype for it, will clean it up and make a PR early next week so we can discuss. If you think its good enough we can start with that. Would definitely appreciate help with that. Improving it, using the MIR type approach you mentioned, reviewing, whatever you want. My goal is to have it ready for the May release, which now finally seems like a realistic goal.
These are some of the top of my head. |
Sounds good, @rok-cesnovar ! Just let me know when your PR is ready or if you want to talk anything through with me. Also always happy to do a video call or look over the code. It'd be fun to help out with this and get into stanc3 a bit again. |
I have a few questions on how to improve the implementations for the upcoming varmat and OpenCL PRs and make them more maintainable. This would also cleanup how we do the transform MIR with OpenCL.
Example:
For
matrix[N, N] a;
in the parameters/TP/model block the default codegen isbut with "varmat" we may want to generate that as
conditional_var_value_t<local_scalar_t__, Eigen::MatrixXd> a;
if applicable (conditional_var_value becomes
var_value<Eigen::MatrixXd>
iflocal_scalar_t__
=var
.Additionally if we want this parameter to reside on the GPU, that would be
We need info on what the underlying C++ type will be in the codegen as well as in the transform MIR passes where we try to figure out if varmat/opencl is applicable for a variable. Current OpenCL implementation and also my development branches rely un suffixes (opencl_ and varmat__) to store this info during transform MIR phase and to generate the different C++. But that is not ideal and also not maintainable I guess.
We need lists of supported functions for both varmat and OpenCL. One idea would be to just have separate lists but I was hoping we could add this to the Stan Math signatures somehow.
The text was updated successfully, but these errors were encountered: