Emulator Refactor #118
This looks really exciting. I have a few questions, since I'm just getting acquainted with TensorFlow myself.
I haven't yet had success locating the functional form of the TF ARD kernel. Do you happen to have a link?
Regardless, I'm super excited about this proof of concept. It seems like it could be a really powerful improvement for the code!
@iancze Thanks :) . I'm actually going to try and see what a PyTorch implementation would look like. The TF code is a lot of boilerplate, and PyTorch might end up with significantly cleaner code. As for the TF ARD kernel, you can't find it because I had to custom-make it with the help of a TFP engineer. See here for discussion and implementation. PyTorch's Pyro (which is to PyTorch what TensorFlow Probability is to TensorFlow, it seems) has a native ARD kernel for its GP module. I think I am mistaken in what I mean by the "hard way". As I look at it again, the likelihood is just used for training, and then we can use the trained hyperparameters in the standard GP to sample our weights. Correct me if I'm wrong on that.
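For reference, the standard ARD squared-exponential (RBF) kernel, which is the form Pyro's RBF takes when given a vector lengthscale, looks like this (the custom TFP implementation may differ in parametrization):

```latex
k(\boldsymbol{\theta}_i, \boldsymbol{\theta}_j) =
    \sigma^2 \exp\left( -\frac{1}{2} \sum_{d=1}^{D}
    \frac{(\theta_{i,d} - \theta_{j,d})^2}{\ell_d^2} \right)
```

The per-dimension lengthscales \(\ell_d\) are what make it "automatic relevance determination": each input dimension (here Teff, logg, [Fe/H]) gets its own smoothness scale.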
I've gotten up and running with PyTorch/Pyro and so far I much, much, much prefer it to TensorFlow. Here is an example of doing a naive fit:

```python
import torch
import tqdm
import pyro
import pyro.contrib.gp as gp
import pyro.distributions as dist
# grid interface and PCA decomposition come from Starfish
grid = HDF5Interface()
pca = PCAGrid.create(grid)
X = torch.from_numpy(pca.gparams)
y = torch.from_numpy(pca.w)
amplitude = torch.tensor(20.)
lengthscale = torch.tensor([300., 30., 30.])  # one lengthscale per parameter (Teff, logg, [Fe/H])
kernel = gp.kernels.RBF(input_dim=3, variance=amplitude, lengthscale=lengthscale)
gpr = gp.models.GPRegression(X, y, kernel, noise=torch.tensor(1.))
optimizer = torch.optim.Adam(gpr.parameters(), lr=0.01)
loss_fn = pyro.infer.Trace_ELBO().differentiable_loss
losses = []
num_steps = int(1e4)
for i in tqdm.trange(num_steps):
    optimizer.zero_grad()
    loss = loss_fn(gpr.model, gpr.guide)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

The biggest problem I have right now is that I don't currently have an easy way of batching this. In other words, it's fitting one GP for all weights instead of (in this example) 4 independent GPs. I have opened a discussion with Pyro to discuss this. As far as speed goes, on my CPU (8 cores) it performs about 50 optimization steps per second, so the 10,000 steps above take ~3 minutes, which is phenomenal.
This looks pretty promising! You're absolutely right about the emulator likelihood being used just for training. Once you know the GP hyperparameters, the likelihood is not used again (in both the v1 and v2 versions). I'm not at all familiar with PyTorch, so I'm a little confused about your comment that you can't fit things independently. Can you set up a separate 3D Gaussian process (Teff, log g, [Fe/H]) for each eigenspectrum weight separately? Once you have the hyperparameters trained for each GP per weight, then you're done, right? When you query the emulator for a spectrum, you just need to ask each GP what mu and sigma are at the stellar parameters (the posterior predictive distribution), and then these go into Eqns 49 + 50 (of the v2 Starfish paper, or A17 of the v1 Starfish paper) to calculate the likelihood for a given set of stellar parameters. I'll write more notes about the Habib et al. likelihood function on #119 in a moment.
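To make the query step concrete, the posterior predictive from Pyro's GPRegression looks something like the sketch below, using the trained `gpr` from the earlier snippet (the stellar-parameter values are placeholders):

```python
# Posterior predictive distribution of the weights at one point in
# (Teff, logg, [Fe/H]) space; noiseless=False includes the noise term
Xstar = torch.tensor([[7000., 4.0, 0.0]], dtype=torch.float64)
with torch.no_grad():
    mu, cov = gpr(Xstar, full_cov=True, noiseless=False)
```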
So, to clarify, what I'm currently doing is training ONE Gaussian process to fit all of the eigenspectra weights. This means there is one amplitude and one length-3 lengthscale parameter fit for all the data points. This is what is produced above, and IMO it doesn't seem like a bad fit. I could fit a separate GP for each weight instead, though.
Ah, I see. I think this could work, but it will probably end up sacrificing a bit of accuracy in the interpolated spectra, since the amplitudes will need to be different for each weight. For example, in your plot above, the GP looks pretty good for the first weight, but less so for the others, whose amplitudes are much smaller.
Okay, so I've trained (in my case) 6 GPs completely separately, using a for-loop.

```python
import numpy as np

amplitudes = 20 * torch.ones(pca.m)
lengthscales = torch.tensor([300., 30., 30.]).expand(pca.m, 3)
kernels = [gp.kernels.RBF(input_dim=3, variance=amplitudes[i], lengthscale=lengthscales[i]) for i in range(pca.m)]
priors = torch.tensor(
    [[2, 2, 2],
     [0.0075, 0.075, 0.075]]
)
[kernel.set_prior("lengthscale", dist.Gamma(priors[0], priors[1]).to_event()) for kernel in kernels]
[kernel.set_prior("variance", dist.Uniform(10, 150)) for kernel in kernels]
gprs = [gp.models.GPRegression(X, yi, kernel, noise=torch.tensor(1.)) for yi, kernel in zip(y, kernels)]
optimizers = [torch.optim.Adam(gpr.parameters(), lr=0.01) for gpr in gprs]
loss_fn = pyro.infer.Trace_ELBO().differentiable_loss
num_steps = int(1e4)
losses = np.zeros((pca.m, num_steps))
for i, (gpr, opt) in enumerate(zip(gprs, optimizers)):
    for j in tqdm.trange(num_steps):
        opt.zero_grad()
        loss = loss_fn(gpr.model, gpr.guide)
        loss.backward()
        opt.step()
        losses[i][j] = loss.item()
```

The parameters for each kernel are independent. I really like how these plots look, though I'm not sure why. I ought to be able to find a way to vectorize these cheaply.
Yes, this looks much better, and I think it completely reproduces the v1 emulator. About how long did the training take?
Each 1e4-step fit took ~2:45 on my machine, so about 15 minutes total for 6 eigenspectra. As you can see from the losses, this could probably be reduced to 5e3 steps each, cutting the time in half.
That's great! If you're happy with the PyTorch version, then I'd say this is a great candidate for merging into the codebase as sort of the go-to emulator version. Folks who need the complexity of the Habib emulator can spend more time training it, but for most inference problems I think this faster version will suffice.
I've been doing a lot of testing using PyTorch modules to replace the PCA and Emulator classes. I think, actually, that I'd prefer to maintain the Habib likelihood, in the hope that, with enough speed, it is not too painful to train. I've done a lot of prototyping in a Jupyter notebook, and while I've basically driven myself insane with tensor math, I've grown more comfortable with tensors.

Timing

I've done some benchmarking of a version of the Habib likelihood function I've managed to write in PyTorch, with and without CUDA speedup. Below is a table of the speeds I've tested at multiple numbers of eigenspectra components (all times in ms).

There is a significant speedup in the PyTorch version; however, the GPU code is actually slower than the CPU code. This is not a straightforward representation of the time to minimize, because the gradient approximations by the SciPy minimizer and the PyTorch minimizer require different calculations.

Implementing

I think we could GREATLY improve the speed of the Habib likelihood if we could rewrite the equations using tensor math, so that we can use more of PyTorch's built-in methods, which likely have LAPACK or BLAS implementations, instead of our for-loops and Cython implementations. My current verdict is that we should attempt conversion of the PCA and Emulator classes to PyTorch, based on the dramatic speed improvements. One huge disadvantage of this (in my opinion) is that it significantly decreases the ease of access for users. To install PyTorch, you have to choose a version that is compliant with whatever version of CUDA (or no CUDA) is on that specific machine. The pip package only points to one of about 12 possible active versions. This means we have to pass the installation on to the users and throw errors if PyTorch isn't installed. This is suboptimal in my opinion, but I can't see a way of easily detecting a user's CUDA version from within Python. Do you think it is appropriate for us to move towards PyTorch?

Questions

I am running into a lot of issues in implementing the Emulator code in PyTorch. More importantly, above we clarified that the Habib method would only be used for optimizing the covariance kernel parameters; however, the literature seems a little different on this point.

Happy holidays, as well!
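A follow-up on the tensor-math point above: one concrete example of trading a hand-rolled loop for a built-in is the block-diagonal construction. Newer PyTorch releases (1.5+) provide `torch.block_diag`, so a sketch with stand-in blocks looks like:

```python
import torch

# Stand-in covariance blocks, one per eigenspectrum component
blocks = [torch.eye(3) * (i + 1) for i in range(4)]
sigma = torch.block_diag(*blocks)  # (12, 12) block-diagonal matrix in one call
```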
I've also been thinking about the overall organization of the emulator code. As of right now, I think we should not put as much emphasis on the PCAGrid class. I think the two points where users should work are either a GridInterface + Interpolator or an Emulator. The intermediate PCAGrid class is merely a wrapper for the linear algebra that facilitates the PCA and ought to only be used in the Emulator. This doesn't mean we can't have a class to do the work, but I don't think we should be saving the PCAGrid to disk; rather, we should be saving emulators to disk. At a bare minimum, the only things we want to persist for the emulator are the kernel hyperparameters. I don't believe we need to actually store the eigenspectra/weights on disk. I did a toy test by PCA-decomposing the same set of grid fluxes 100 times and analyzing the weights to look for differences, along the lines of

```python
# This is a new PCA that I've written in PyTorch, ~2 times faster than sklearn
# (and eliminates a dependency)
pca = PCA(n_comp=4)
eig, weights = pca(fluxes) # fluxes is from my HDF5Interface
for _ in range(100):
    _, w = pca(fluxes)
    np.testing.assert_array_equal(w, weights)
```

and it passed without assertion errors.
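The PyTorch PCA class itself isn't shown above; as a rough sketch of the idea (the name and shapes are my assumptions, not the actual implementation), a PCA via `torch.linalg.svd` could look like:

```python
import torch

def pca_decompose(fluxes: torch.Tensor, n_comp: int = 4):
    """Split an (n_spectra, n_pix) flux matrix into eigenspectra and weights."""
    centered = fluxes - fluxes.mean(dim=0)
    # LAPACK's SVD is deterministic for a fixed input, consistent with the
    # bit-for-bit repeatability test above
    U, S, Vh = torch.linalg.svd(centered, full_matrices=False)
    eigenspectra = Vh[:n_comp]             # (n_comp, n_pix)
    weights = centered @ eigenspectra.T    # (n_spectra, n_comp)
    return eigenspectra, weights
```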
If we rewrite the emulator in PyTorch, all of the hyperparameters of the model exist in a dictionary for the emulator. This would make it extremely easy to save to JSON, like you've done for the sampler parameters. So, my proposal would look a bit more like this:

```python
# untrained
grid = HDF5Interface()
emu = Emulator(grid, filename='test_emu.json')
```

At this point, the Emulator would do the PCA decomposition of the grid and calculate any static values like PhiPhi. From here:

```python
# Try calling the untrained emulator
fl = emu.load_flux([7023, 4.122, 0.001])
# raises ValueError: Before calling the emulator you must train it using Emulator.train()
emu.train(max_iter=1000, max_processes=10, cuda=False)
```

The training loop could either save parameters to disk every X iterations (and provide some resume functionality) or only save after training is complete (at some standard TBD). So the user does not have to explicitly use a PCAGrid at all. We can use a similar pattern for loading:

```python
# This should be okay since instantiating an emulator shouldn't occur anywhere time-sensitive
emu = Emulator.open('test_emu.json')
```

Finally:

```python
fl = emu.load_flux([7023, 4.122, 0.001])  # stay consistent with the other interfaces
# Formally the Emulator is a GP, so calling it directly should act as such
mu, cov = emu([7023, 4.122, 0.001])
```
This is great to hear, and thanks for all of the work. Happy Holidays to you (and all our GitHub watchers :) ) as well!

[this is a response to two comments ago]

v1 emulator PyTorch

Regarding the tensor notation: if you could write your implementation up in a TeX document somewhere, this might make it easier to figure out how things fit together, and we could iterate on an implementation. I'm a bit rusty with tensor math, but maybe this will help us converge on the best way to calculate the Habib likelihood with something like TensorFlow or PyTorch.

Regarding V12, let me refresh my memory with what I wrote 3+ years ago... Ok, this is a lot of notation jargon just to explain what's going on. Essentially, we're just doing standard GP posterior prediction of the eigenspectra weights onto a single point in stellar-parameter space.
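For reference, the "standard GP posterior prediction" being described is the usual Gaussian conditional, written here in generic textbook notation rather than the paper's exact symbols. With kernel matrix \(K = k(X, X)\), cross-covariance \(k_\star = k(X, \theta_\star)\), training weights \(\mathbf{w}\), and noise variance \(\sigma_n^2\):

```latex
\mu_\star = k_\star^{\mathsf{T}} \left( K + \sigma_n^2 I \right)^{-1} \mathbf{w},
\qquad
\sigma_\star^2 = k(\theta_\star, \theta_\star)
    - k_\star^{\mathsf{T}} \left( K + \sigma_n^2 I \right)^{-1} k_\star
```

These per-weight \((\mu_\star, \sigma_\star)\) are exactly what feed Eqns 49 + 50 / A17 mentioned earlier.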
For the last point: the "standard kernel" is the same as A.10 (arXiv version), which is used to create the block matrix.
[this is in response to your new but not newest comment]
This is great that you are thinking about reorganizing the emulator / interpolator into a more parallel type of functionality. In general, I think your concepts are great, and JSON saving is always a good idea (though I always wonder if there are better scientific formats for this kind of stuff). I have a few comments about envisioned use cases and how a user might go about training things.

PCA decomposition

Second, the pattern in which I used Starfish was the following: store a ~5 GB library of synthetic spectra on my home disk somewhere, use PCA to decompose these into eigenspectra, and train the emulator. Then, transfer the eigenspectra and hyperparameters to a cluster where I fit an actual spectrum. In that case, a trained emulator (or even just the PCA spectra themselves) is very easy to transfer from a staging computer to HPC without needing the full synthetic grid. There may be better use patterns, but at least I found this one convenient.

Some other patterns to consider

Or, say a user trains an emulator using ...

[response to newest comment]
The GP hyperparameters determined from training using A.23 (and kernel equation A.10) are the same ones used to do emulation via Eqn A.30. I think everything is consistent here between the training and utilization steps. Hopefully that makes sense, but please keep asking questions if I'm struggling to make clear what's going on.
It would be just as easy to ship an emulator as HDF5 with the eigenspectra built in, too. I wanted to try and see what the bare minimum model persistence would require. So I think we can still structure this such that you create an Emulator directly from a grid.

I believe that adding extra grid points should not affect the GP hyperparameters if they are close to the original grid. Since we are using a very smooth GP kernel, the hyperparameters we've trained should define the function space of the grid subset pretty well, a sort of pseudo-global optimum. When we actually use the GP to sample a function and get our weights, this function is sampled conditioned on the input (training) data. By adding extra training points after training, the function space is automatically constrained, due to the kernel being conditionally sampled on the training data. I suspect this would cause issues too far from the original grid subset, but this is all a naive guess anyway.
+1 for the HDF5 idea. I think you're probably right on the hyperparameter settings, but the reason I was thinking there might be a substantial change is that even if you keep the kernel hyperparameters fixed, adding grid points changes the PCA decomposition itself, so the eigenspectra will change.
That's a really good point I hadn't considered. It would be interesting to see the actual variability of the eigenspectra given different grid points around the same area. From this beautiful Stack Overflow answer: the eigenspectra should contain the features that vary the most in the spectra, so how much will, say, the 0th eigenspectrum change given one more grid point? It might be drastic, or not at all. Definitely a good thought experiment.
Also, on a separate note, it would be a great test of scalability if we could create an emulator for an entire model library (like all the PHOENIX alpha=0 models) truncated to a certain instrument. That would be really cool for the use case of someone wanting to process many different spectra all from the same instrument (this is one of my use cases, actually). My gut tells me that actually creating the PCA decomposition wouldn't fail, but trying to optimize would be a nightmare. Although, if optimization takes a few days on a server, that might still be worth it if you then have a trained emulator for any possible spectrum you might use.
I think your intuition on the scaling is correct. If you keep all of the PCA components, you can reconstruct the grid flawlessly; the problem is the judgment call of how many you need to keep to get the accuracy you desire vs. how few you can get away with, since the training time for Habib goes as ~m^3, I think. Limiting the spectra to a narrow wavelength range (the overlap with the instrument you're using, for example, or even a single echelle order) can also increase accuracy with fewer eigenspectra, since there are fewer spectral features for the eigenspectra to explain. This training scaling is much more favorable with the v1 emulator, since the GPs are independent and we are just training on weights rather than on the flux pixels of the synthetic library.
Long time since I've visited this issue, but ...
This is an open issue moving on from #105.
I have working TensorFlow models for both the naive GP optimization and the Habib log-likelihood function. Here is, in general, what the code looks like:
Naive
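(The original code blocks were lost in formatting. As a rough reconstruction rather than the author's actual code, a naive TFP GP fit along these lines would match the description; using `FeatureScaled` as the ARD wrapper is my assumption.)

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
psd = tfp.math.psd_kernels

X = np.random.rand(50, 3)  # stand-in grid parameters (Teff, logg, Z)
y = np.random.rand(50)     # stand-in weights for one eigenspectrum

# log-parametrize so the trained values stay positive
amplitude = tf.exp(tf.Variable(np.log(20.), dtype=tf.float64))
length_scale = tf.exp(tf.Variable(np.log([300., 30., 30.]), dtype=tf.float64))

kernel = psd.FeatureScaled(
    psd.ExponentiatedQuadratic(amplitude=amplitude), scale_diag=length_scale)
gp = tfd.GaussianProcess(kernel=kernel, index_points=X,
                         observation_noise_variance=np.float64(1.))
loss = -gp.log_prob(y)

train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_op)
```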
This optimizes pretty well, and takes ~50 s for 1000 iterations. The log-likelihood-by-iteration plot shows that we can probably get away with optimizing for 200 iterations.
After training, we can save the hyperparameters and use them to recreate the GP and sample from it, very similarly to how the current code works. I wish the weights were easier to visualize with plots, so I could show the process; something I could definitely work on.
Habib
This is a little more complex, but works about the same.
So, the main divergence from the current Habib implementation is that I've opted to use an ARD kernel from TensorFlow to get the covariance instead of `Starfish.covariance.k`. I then have some garbage to create the block diagonals (TF doesn't have a `tf.block_diag`, but I've submitted a feature request), and finally, I simply swap out the numpy functions for the TensorFlow functions in the linear algebra. Here is a plot of the loss function per iteration, as before:
This method would require sampling weights the "hard" way, by calculating the many matrices and sampling. Or perhaps we can use the optimized parameters with the available GP code (like above) for sampling, since it offers a `GP.sample()` method.

Notes
Using TensorFlow as a backend means that none of the above code actually evaluates anything. Instead, it creates a graph of every operation and optimally distributes them among the CPU and GPU. To actually evaluate the code, a simple wrapper has to be used, along the lines of the snippet below.
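(The original example was lost; the standard TF1 pattern is a `tf.Session` wrapper.)

```python
import tensorflow as tf

a = tf.constant(2.0)
b = tf.constant(3.0)
total = a + b  # builds a graph node; nothing has been computed yet

with tf.Session() as sess:
    print(sess.run(total))  # only now is the graph evaluated -> 5.0
```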
In addition, there is a lot more concern about datatypes, as you can see in the code above, as well as special concerns for constraining the trainable values. Something that is super cool, though, is that the graphs TensorFlow builds can actually be visualized and interpreted (a la TensorBoard), and there are built-in methods for saving models and exploring them.
I do have an MCMC implementation using `tfp.mcmc` to explore the posterior of the hyperparameters. Unfortunately, it runs somewhat slowly, and I would guess it could take an hour to get actual convergence. This could just be my poor understanding of MCMC samplers and their options, but at first glance I'm not impressed by the performance. HOWEVER, as a proof of concept, this shows that we can completely drop the `emcee` dependence from the emulator code.

Does this look good enough to bring into the emulator code and add TensorFlow as a dependence?