Emulator Refactor #118

Closed
mileslucas opened this issue Dec 19, 2018 · 23 comments
@mileslucas (Member) commented Dec 19, 2018

This is an open issue following on from #105.

I have working TensorFlow models for both the naive GP optimization and the Habib log-likelihood function. In general, the code looks like this:

Naive

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfk = tfp.positive_semidefinite_kernels  # tfp.math.psd_kernels in newer TFP releases

X = tf.Variable(pca.gparams)
Y = tf.Variable(pca.w)
conf_priors = np.array(Starfish.config.PCA["priors"]).T

# Create the trainable variables in our model. Constrain to be strictly
# positive by wrapping in softplus and adding a little nudge away from zero.
amplitude = (np.finfo(np.float64).tiny +
           tf.nn.softplus(
               tf.Variable(10. * tf.ones(pca.m, dtype=tf.float64)),
               name='amplitude'))
length_scale = (np.finfo(np.float64).tiny +
               tf.nn.softplus(
                  tf.Variable(np.tile([300., 30., 30.], (pca.m, 1)), dtype=np.float64),
                  name='length_scale'))

# Put Gamma prior on the length scales
rv_scale = tfd.Gamma(
  concentration=conf_priors[0],
  rate=conf_priors[1],
  name='length_scale_prior')


# Define ARD kernel and GP likelihood over observations
kernel = InputScaledKernel(tfk.ExponentiatedQuadratic(amplitude),
                         length_scale,
                         name='ARD_kernel')
gp = tfd.GaussianProcess(kernel=kernel, index_points=X)

# Joint log prob of length_scale params and data
log_likelihood = (tf.reduce_sum(gp.log_prob(Y)) +
                tf.reduce_sum(rv_scale.log_prob(length_scale)))

# Optimization target (neg log likelihood) and optimizer. Use a fairly
# generous learning rate (adaptive optimizer will adapt :))
loss = -log_likelihood
opt = tf.train.AdamOptimizer(.1).minimize(loss)

This optimizes pretty well, and takes ~50s for 1000 iterations. Here is the log-like by iteration plot which shows that we can get away with optimizing for probably 200 iterations
[figure: log-likelihood vs. iteration]

After training, we can save the hyperparameters and use them to recreate the GP and sample from it, very similarly to how the current code works. I wish the weights were easier to visualize with plots so I could show the process; that's something I could definitely work on.

Habib

This is a little more complex, but works about the same.

def habib_joint_log_prob(lam_xi, a, len_scales):
    """
    habib lnlike
    """    
    gparams = pca.gparams
    PhiPhi = pca.PhiPhi.astype('f8')
    w_hat = pca.w_hat
    M = pca.M
    m = pca.m
    
    conf_priors = np.array(Starfish.config.PCA['priors']).T
    
    # Put Gamma prior on the length scales
    rv_scale = tfd.Gamma(
          concentration=conf_priors[0],
          rate=conf_priors[1],
          name='length_scale_prior')


    # Define ARD kernel and GP likelihood over observations
    kernel = InputScaledKernel(tfk.ExponentiatedQuadratic(a),
                             len_scales,
                             name='ARD_kernel')
    sigs = kernel.matrix(gparams, gparams)
   
    # This is garbage that I have to do to create a block diagonal
    paddings = tf.constant([[0, (m-1)*len(gparams)],[0, 0]])
    Sig_w = tf.pad(sigs[0], paddings)
    for i in range(1, m):
        paddings = tf.constant([[i*len(gparams), (m-i-1)*len(gparams)],[0, 0]])
        zeros = tf.pad(sigs[i], paddings)
        Sig_w = tf.concat([Sig_w, zeros], 1)
        
    C = (1. / lam_xi) * PhiPhi + Sig_w
    
    sign, pref = tf.linalg.slogdet(C)
    rhs = tf.linalg.solve(C, np.expand_dims(w_hat, -1))
    central = tf.matmul(tf.expand_dims(w_hat, 0), rhs)
    lnp = -0.5 * (pref + central + M * m * np.log(2. * np.pi))
    
    return tf.dtypes.cast(tf.squeeze(lnp) + tf.reduce_sum(rv_scale.log_prob(len_scales)), tf.float64)

lam_xi = (np.finfo(np.float64).tiny +
             tf.nn.softplus(tf.Variable(1., dtype=tf.float64), name='lambda_xi'))
amplitude = (np.finfo(np.float64).tiny +
           tf.nn.softplus(
               tf.Variable(10. * tf.ones(pca.m, dtype=tf.float64)),
               name='amplitude'))
length_scale = (np.finfo(np.float64).tiny +
               tf.nn.softplus(
                  tf.Variable(np.tile([300., 30., 30.], (pca.m, 1)), dtype=np.float64),
                  name='length_scale'))

loss = -habib_joint_log_prob(lam_xi, amplitude, length_scale)
opt = tf.train.AdamOptimizer(.1).minimize(loss)

So, the main divergence from the current Habib implementation is that I've opted to use an ARD kernel from TensorFlow to get the covariance instead of Starfish.covariance.k. I then have some garbage to create the block diagonals (TF doesn't have a tf.block_diag, but I've submitted a feature request), and finally I simply swap out the NumPy functions for the TensorFlow functions for the linear algebra.
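As a possible alternative to the tf.pad loop above, here is a sketch using TensorFlow's linear-operator API to build the block diagonal (assuming `sigs` is the (m, n, n) stack returned by kernel.matrix(gparams, gparams) and `m` is the number of eigenspectra, as in the function above):

# Sketch: build the block-diagonal covariance without manual padding.
blocks = [tf.linalg.LinearOperatorFullMatrix(sigs[i]) for i in range(m)]
Sig_w = tf.linalg.LinearOperatorBlockDiag(blocks).to_dense()  # shape (m*n, m*n)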

Here is a plot of the loss function per iteration as before-

[figure: loss per iteration]

This method would require sampling the weights the "hard" way, by building the full matrices and sampling from them. Alternatively, we could take the optimized parameters and use the available GP code (like above) for sampling, since it offers a GP.sample() method.
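A rough sketch of that second option, reusing the kernel built above with the trained hyperparameters in TFP's posterior-predictive GP (here `test_points` is a hypothetical set of stellar parameters to emulate at, and the batch shape over the m weights may need extra massaging):

# Sketch: posterior predictive over the weights at new parameters.
gprm = tfd.GaussianProcessRegressionModel(
    kernel=kernel,                      # kernel with trained amplitude/length_scale
    index_points=test_points,           # hypothetical parameters to emulate at
    observation_index_points=X,         # grid parameters
    observations=Y,                     # PCA weights
    observation_noise_variance=1e-8)
weight_draws = gprm.sample(10)          # 10 draws of the emulated weights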

Notes

Using TensorFlow as a backend means that none of the above code actually executes anything. Instead, it builds a graph of every operation and optimally distributes the work across the CPU and GPU. To actually evaluate the graph, a simple wrapper has to be used, like

import tqdm

num_iters = 1000
loss_ = np.empty(num_iters, dtype=np.float64)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in tqdm.trange(num_iters):
        lx_, amp_, len_scales_, loss_[i], _ = sess.run([lam_xi, amplitude, length_scale, loss, opt])

In addition, there is a lot more concern about datatypes, as you can see in the code above, as well as about constraining the trainable values. Something that is super cool, though, is that the graphs TensorFlow builds can be visualized and interpreted (à la TensorBoard), with built-in methods for saving and exploring models.
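For example, the usual TF 1.x pattern for dumping the graph for TensorBoard and checkpointing the variables is roughly this (a sketch; the log directory is a placeholder):

# Sketch: write the graph for TensorBoard and checkpoint the trainable variables.
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('./tf_logs', sess.graph)  # view with `tensorboard --logdir ./tf_logs`
    # ... run the training loop here ...
    saver.save(sess, './tf_logs/emulator.ckpt')
    writer.close()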

I do have an MCMC implementation using tfp.mcmc to explore the posterior of the hyperparameters. Unfortunately, it runs somewhat slowly and I would guess it could take an hour to reach actual convergence. This could just be my poor understanding of MCMC samplers and their options, but at first glance I'm not impressed by the performance. HOWEVER, as a proof of concept this shows that we can completely drop the emcee dependency from the emulator code.
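For reference, the kind of tfp.mcmc usage I mean is roughly the following (a sketch: the step size and chain lengths are placeholders, and the states would need to be unconstrained variables rather than the softplus-transformed tensors above):

# Sketch: sample the hyperparameter posterior with TFP's Hamiltonian Monte Carlo.
samples, kernel_results = tfp.mcmc.sample_chain(
    num_results=1000,
    num_burnin_steps=500,
    current_state=[lam_xi, amplitude, length_scale],
    kernel=tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=habib_joint_log_prob,
        step_size=0.01,
        num_leapfrog_steps=3))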

Does this look good enough to bring into the emulator code and add TensorFlow as a dependency?

@mileslucas (Member, Author) commented Dec 19, 2018

To extend this, I want to create a few plots showing samples from the GPs as well as some sort of timing evaluation, especially against the number of components of the PCA.

Update:
This is using the Naive optimization. Here is an example of a slice of the weights across a subset of T for each eigenspectrum.
Before: [figure: weight slices before training]

After: [figure: weight slices after training]

@iancze (Collaborator) commented Dec 19, 2018

This looks really exciting. I have a few questions, since I'm just getting acquainted with Tensorflow myself.

  1. The naive optimization (I think what we were calling emulator v1) looks pretty good to me. If the results you plotted for the weights are typical, then I'd say this is a good candidate for merging and providing the v1 option to users. Nice work!

I haven't yet had success locating what the functional form of the TF ARD kernel is. Do you happen to have a link?

  2. For the Habib implementation, I'm not sure I understand what you mean by sampling the weights the hard way. Does this mean that training the emulator will be more difficult? Or does it mean we can't marginalize the likelihood over the emulated spectrum when we go and run a fit to a spectrum?

Regardless, I'm super excited about this proof of concept. It seems like it could be a really powerful improvement for the code!

@mileslucas (Member, Author)

@iancze Thanks :). I'm actually going to try and see what a PyTorch implementation would look like. The TF code is a lot of boilerplate, and PyTorch might end up with significantly cleaner code.

As for the TF ARD kernel: you can't find it because I had to make it custom with the help of a TFP engineer. See here for discussion and implementation. PyTorch's Pyro (the equivalent of the TensorFlow/TensorFlow Probability relationship, it seems) has a native ARD kernel in its GP module.

I think I am mistaken in what I mean by the "hard way". As I look at it again, the likelihood is just used for training, and then we can use the trained hyperparameters in the standard GP to sample our weights. Correct me if I'm wrong on that.

@mileslucas (Member, Author)

I've gotten up and running with PyTorch/Pyro and so far I much, much prefer it to TensorFlow. Here is an example of doing a naive fit:

import torch
import tqdm
import pyro
import pyro.contrib.gp as gp
import pyro.distributions as dist

grid = HDF5Interface()
pca = PCAGrid.create(grid)

X = torch.from_numpy(pca.gparams)
y = torch.from_numpy(pca.w)

amplitude = torch.tensor(20.)
lengthscale = torch.tensor([300., 30., 30.])

kernel = gp.kernels.RBF(input_dim=3, variance=amplitude, lengthscale=lengthscale)
gpr = gp.models.GPRegression(X, y, kernel, noise=torch.tensor(1.))
optimizer = torch.optim.Adam(gpr.parameters(), lr=0.01)
loss_fn = pyro.infer.Trace_ELBO().differentiable_loss
losses = []
num_steps = int(1e4)
for i in tqdm.trange(num_steps):
    optimizer.zero_grad()
    loss = loss_fn(gpr.model, gpr.guide)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

Here are the fits:
[figure: GP fits to the weights]

The biggest problem I have right now is that I don't currently have an easy way of batching this. In other words, it's fitting one GP for all weights instead of (in this example) 4 independent GPs. I have opened a discussion with Pyro about this. As far as speed, on my CPU (8 cores) it performs about 50 optimization steps per second. The 10,000 steps above take me ~3 minutes, which is phenomenal.

mileslucas reopened this Dec 20, 2018

@mileslucas (Member, Author)

Update: I have the GP working with priors properly, but not batching correctly (so it still fits 1 GP broadcast to m weight functions). The best-fit parameters seem a lot more appropriate than what I was seeing beforehand. Also, this is a very pretty plot and I wanted to share it:
[figure: GP fit with priors]

@iancze (Collaborator) commented Dec 20, 2018

This looks pretty promising!

You're absolutely right about the emulator likelihood being used just for training. Once you know the GP hyperparameters, the likelihood is not used again (for both the v1 and v2 versions).

I'm not at all familiar with PyTorch, so I'm a little confused about your comment that you can't fit things independently. Can you set up a separate 3D Gaussian process (Teff, log g, [Fe/H]) for each eigenspectrum weight separately? Once you have the hyperparameters trained for each GP per weight, then you're done, right? When you query the emulator for a spectrum, you just need to ask each GP for what mu and sigma is at the stellar parameters (the posterior predictive distribution), and then these go into Eqn 49 + 50 (of v2 starfish paper, or A17 of v1 starfish paper) to calculate the likelihood for a given set of stellar parameters.

I'll write more notes about the Habib et al. likelihood function on #119 in a moment.

@mileslucas (Member, Author)

So, to clarify, what I'm currently doing is training ONE Gaussian process to fit all of the eigenspectra weights. This means there is one amplitude and one length-3 lengthscale parameter fit for all the data points. This is what is produced above, and IMO doesn't seem like a bad fit. I could fit m separate GPs but I was hoping for a less tedious way of achieving the same effect within Pyro.

@iancze (Collaborator) commented Dec 21, 2018

Ah I see. I think this could work, but will probably end up sacrificing a bit of accuracy in the interpolated spectra, since the amplitudes will need to be different for each weight. For example, in your plot above, the GP looks pretty good for w0, but will be conveying too much uncertainty in all the other eigenspectra weights. If you wanted to reduce dimensionality it might work if you used similar length scale parameters for all weights but used a different amplitude for each one.

@mileslucas (Member, Author)

Okay, so I've trained (in my case) 6 GPs completely separately, using a for-loop.

amplitudes = 20 * torch.ones(pca.m)
lengthscales = torch.tensor([300., 30., 30.]).expand(pca.m, 3)

kernels = [gp.kernels.RBF(input_dim=3, variance=amplitudes[i], lengthscale=lengthscales[i]) for i in range(pca.m)]
priors = torch.tensor(
    [[2, 2, 2], 
    [0.0075, 0.075, 0.075]]
)
[kernel.set_prior("lengthscale", dist.Gamma(priors[0], priors[1]).to_event()) for kernel in kernels]
[kernel.set_prior("variance", dist.Uniform(10, 150)) for kernel in kernels]

gprs = [gp.models.GPRegression(X, yi, kernel, noise=torch.tensor(1.)) for yi, kernel in zip(y, kernels)]

optimizers = [torch.optim.Adam(gpr.parameters(), lr=0.01) for gpr in gprs]
loss_fn = pyro.infer.Trace_ELBO().differentiable_loss
num_steps = int(1e4)
losses = np.zeros((pca.m, num_steps))
for i, (gpr, opt) in enumerate(zip(gprs, optimizers)):
    for j in tqdm.trange(num_steps):
        opt.zero_grad()
        loss = loss_fn(gpr.model, gpr.guide)
        loss.backward()
        opt.step()
        losses[i][j] = loss.item()

Losses: [figure: loss per step for each of the m GPs]

Fits: [figure: GP fit for each eigenspectrum weight]

The parameters for each kernel are independent. I really like how these plots look, though I'm not sure why. I ought to be able to parallelize this cheaply using torch.multiprocessing to process each kernel independently, along the lines of the sketch below.
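Something like this is what I have in mind (an untested sketch; in practice the models would probably need to be constructed inside the worker rather than pickled across processes):

import torch.multiprocessing as mp

def train_single(args):
    # Hypothetical helper: run the same training loop as above for one GP
    # and hand back its loss history.
    gpr, num_steps = args
    optimizer = torch.optim.Adam(gpr.parameters(), lr=0.01)
    loss_fn = pyro.infer.Trace_ELBO().differentiable_loss
    losses = []
    for _ in range(num_steps):
        optimizer.zero_grad()
        loss = loss_fn(gpr.model, gpr.guide)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

if __name__ == '__main__':
    with mp.Pool(processes=len(gprs)) as pool:
        all_losses = pool.map(train_single, [(gpr, num_steps) for gpr in gprs])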

@iancze (Collaborator) commented Dec 21, 2018

Yes, this looks much better, and I think completely reproduces the v1 emulator. About how long did the training take?

@mileslucas (Member, Author)

Each 1e4-step fit took ~2:45 on my machine, so about 15 minutes total for 6 eigenspectra. As you can see from the losses, this can probably be reduced to 5e3 steps each, cutting the time in half.

@iancze (Collaborator) commented Dec 21, 2018

That's great! If you're happy with the PyTorch version, then I'd say this is a great candidate for merging into the codebase as sort of the go-to emulator version. Folks who need the complexity of the Habib emulator can spend more time training it, but for most inference problems I think this faster version will suffice.

@mileslucas (Member, Author) commented Dec 25, 2018

@iancze

I've been doing a lot of testing using PyTorch modules for replacing the PCA and Emulator class.

I think, actually, that I'd prefer to maintain the Habib likelihood, with the hope that, with enough speed, it is not too painful to train. I've done a lot of prototyping in a Jupyter notebook, and so far I've basically driven myself insane with tensor math, but I've grown more comfortable with tensors.

Timing

I've done some benchmarking of a version of the Habib likelihood function I've managed to write in PyTorch, with and without CUDA speedup. Below is a table of the speeds I've tested at multiple numbers of eigenspectra components (all times in ms):

n_comp           1      2      3      4      5      6      7      Avg. per n
numpy            263    496    744    1000   1250   1520   1790   252
PyTorch          7.23   30.8   68     123    203    305    457    30.4
PyTorch w/CUDA   22.3   48.2   116    249    425    694    1070   71.5

or in beautiful graph form-
[figure: timing vs. number of components]

System Specs
python version 3.6.1 (Anaconda)
CUDA version 9.0
CPU Intel i7 6700k @ 4.7 GHz
RAM 16 GB @ 3200 MHz (dual channel)
GPU Geforce GTX 1080 8GB

There is a significant speed-up in the PyTorch version; however, the GPU code is actually slower than the CPU code. This is not a straightforward representation of the time to minimize, because the gradient approximations by the SciPy minimizer and the PyTorch minimizer require different calculations. In my testing of the n=6 PCA optimization, the time per iteration for each independent GP in the naive implementation was around 10 ms, while the time per iteration in the Habib likelihood (which effectively fits all GPs at the same time) was around 1.5 s.

Implementing

I think we could GREATLY improve the speed of the Habib likelihood if we rewrote the equations using tensor math, so that we can use more of PyTorch's built-in methods (which are backed by LAPACK/BLAS) instead of our for-loops and Cython implementations. A rough sketch of what I mean is below.
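As a concrete (hypothetical) example of the kind of rewrite I mean, the per-eigenspectrum kernel matrices can all be built at once with broadcasting instead of a Python loop:

import torch

def batched_ard_kernel(X, amplitudes, lengthscales):
    """Sketch: batched ARD squared-exponential covariance.

    X: (n, 3) grid parameters, amplitudes: (m,), lengthscales: (m, 3);
    returns a (m, n, n) stack of covariance matrices, one per eigenspectrum.
    """
    scaled = X.unsqueeze(0) / lengthscales.unsqueeze(1)   # (m, n, 3)
    diff = scaled.unsqueeze(2) - scaled.unsqueeze(1)      # (m, n, n, 3)
    return amplitudes.view(-1, 1, 1) * torch.exp(-0.5 * diff.pow(2).sum(-1))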

My current verdict is that we should attempt conversion of the PCA and Emulator classes to PyTorch, based on the dramatic speed improvements. One huge disadvantage of this (in my opinion) is that it significantly decreases the ease of access for users. To install PyTorch, you have to choose a build that is compatible with whatever version of CUDA (or no CUDA) is on that specific machine. The pip package only points to one of about 12 possible active builds. This means we have to pass the installation on to the users and throw errors if PyTorch isn't installed. This is suboptimal in my opinion, but I can't see a way of easily detecting a user's CUDA version from within Python.
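At a minimum, we could fail gracefully at import time with a pointed message (a sketch of the kind of guard I mean):

# Sketch: treat PyTorch as an optional dependency and raise a helpful error if missing.
try:
    import torch
except ImportError as err:
    raise ImportError(
        "The Starfish emulator requires PyTorch. Please install the build that "
        "matches your CUDA setup (or CPU-only) from https://pytorch.org"
    ) from err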

Do you think it is appropriate for us to move towards PyTorch?

Questions

I am running into a lot of issues in implementing the Emulator code in PyTorch. The V12 augmented matrix for use in the joint multivariate Gaussian is not straightforward, and I can't seem to get a version working that takes advantage of tensor math instead of iterating one by one through the entire kernel. Can you shed some light on the matrix operations that define the construction of V12?

More importantly, above we clarified that the Habib method could only be used for optimizing the covariance kernel parameters; however, the literature seems a little different, because there is no implicit dependence of the covariance kernel on lam_xi. We could consider lam_xi to be the noise of the Gaussian likelihood of the GPs (effectively adding it to the diagonal of the covariance matrix), but this is not how lam_xi is optimized in the Habib likelihood (it is the noise of the reconstructed flux likelihood rather than of the weights likelihood). So, do we have to represent the conditional P(w|w^) as a joint distribution, or can we get away with it being P(w | w_grid) and just use the standard kernel?

Happy holidays, as well!

@mileslucas (Member, Author) commented Dec 27, 2018

I've also been thinking about the overall organization of the emulator code. As of right now, I think we should not put as much emphasis on the PCAGrid class. I think the two points where users should work is either a GridInterface + Interpolator or an Emulator. The intermediate PCAGrid class is merely a wrapper for the linear algebra that facilitates the PCA and ought to only be used in the Emulator.

This doesn't mean we can't have a class to do the work, but I don't think we should be saving the PCAGrid to disk but rather we should be saving emulators to disk. At a bare minimum, the only things we want to persist for the emulator are the kernel hyper-parameters.

I don't believe we need to actually store the eigenspectra/weights on disk. To check, I did a toy test: I PCA-decomposed the same set of grid fluxes 100 times and compared the weights for differences, along the lines of

# This is a new PCA that I've written in PyTorch ~2 times faster than SKlearn 
# (and eliminates a dependency)
pca = PCA(n_comp=4) 
eig, weights = pca(fluxes) # fluxes is from my HDF5Interface
for _ in range(100):
    _, w = pca(fluxes)
    np.testing.assert_array_equal(w, weights)

and it passed without assertion errors.

If we rewrite the emulator in PyTorch all of the hyperparameters of the model exist in a dictionary for the emulator. This would make it extremely easy to save to a JSON like you've done for the sampler parameters.

So, my proposal would look a bit more like this-

# untrained
grid = HDF5Interface()
emu = Emulator(grid, filename='test_emu.json') 

At this point, the Emulator would do the PCA decomposition of the grid, calculate any static values like v11, PhiPhi, w_hat, etc. and instantiate any other parameters (such as the kernel hyperparameters).

From here

# Try calling untrained emulator
fl = emu.load_flux([7023, 4.122, 0.001])
> ValueError: Before calling the emulator you must train it using `Emulator.train()`
emu.train(max_iter=1000, max_processes=10, cuda=False)

The training loop could either save parameters to disk every X iterations (and provide some resume functionality) or only after training is complete (at some standard TBD). So the user does not have to explicitly call a write function to save the parameters if we don't want them to. Internally there can be some is_trained attribute to check whether something is trained or not, although we may want this to be more than just a boolean set when train is called, to avoid using the emulator at a poor optimum. A rough sketch of the checkpointing idea is below.
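A minimal sketch of that checkpointing behaviour (all names here are hypothetical, not existing Starfish API):

# Sketch of Emulator.train with periodic checkpointing; method/attribute names
# are hypothetical placeholders.
def train(self, max_iter=1000, checkpoint_every=100):
    for i in range(max_iter):
        self._optimization_step()          # one optimizer step on the GP hyperparameters
        if i % checkpoint_every == 0:
            self.save(self.filename)       # persist current hyperparameters (e.g. to JSON)
    self.save(self.filename)
    self._trained = True                   # could later be replaced with a convergence check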

We can use a similar open class method like this-

# Reads the parameters and path to grid file from JSON, still recalculates the PCA, PhiPhi, etc. 
# This should be okay since instantiating an emulator shouldn't occur anywhere time-sensitive
emu = Emulator.open('test_emu.json')

Finally

fl = emu.load_flux([7023, 4.122, 0.001]) # Stay consistent with interfaces
# Formally the Emulator is a GP so calling it directly should act as such
mu, cov = emu([7023, 4.122, 0.001]) 

@iancze (Collaborator) commented Dec 27, 2018

This is great to hear, and thanks for all of the work. Happy Holidays to you (and all our github watchers :) ) as well!

[this is in response to two comments ago]

v1 emulator
You showed a nice implementation of this a few posts back. Given the install difficulties with PyTorch that you mentioned, I was wondering how hard it might be to implement this with a simple GP, without a PyTorch dependency. Perhaps via something that's a bit easier to install, whether it be celerite, george, scikit-learn, or even the custom Starfish implementation formerly in covariance.py? This stripped-down emulator might be a nice thing to include in a v0.3.0 release that most users could use without needing to worry about installing PyTorch if they don't need it.

PyTorch
That said, I think moving to support PyTorch would be a worthwhile research goal, especially given the speedups you have shown with the Habib et al. likelihood function. I think that this currently makes sense as an optional addition to the v1 emulator, and if it's so fast, it makes sense as the primary implementation of the Habib et al. likelihood function.

Regarding the tensor notation, if you could write your implementation up in a TeX document somewhere, this might make it easier to figure out how things fit together, and we could iterate on an implementation. I'm a bit rusty with tensor math, but maybe this will help us converge on the best way to calculate the Habib likelihood with something like TensorFlow or PyTorch.

Regarding V12, let me refresh my memory with what I wrote 3+ years ago...

"The Mm × m matrix V12 (and its transpose V21) describe the relations between the set of
parameters in the grid and the newly chosen parameters to interpolate at theta_star. The structure of V12 is set by evaluating [the emulator GP kernel] k_i (Equation (29, ApJ version)) across a series of rows of {theta_star}^grid like in Sigma^grid, for i = 1, 2, ..., m, and across m columns of theta_star."

Ok, this is a lot of notation jargon just to explain what's going on. Essentially, we're just doing standard GP posterior prediction of the eigenspectra weights onto a single point, theta_star, conditioned on all of the existing eigenspectra weights measured at the grid points. But, because the emulator output is multidimensional (there are m eigenspectra weights) that's why this formulation has some extra rows/columns compared to the standard GP prediction operation to a single point (see Rasmussen and Williams, 2005, eqns 2.21 - 2.24).
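For concreteness, the single-GP version of that prediction step (R&W eqns 2.23-2.24) is just the following (a NumPy sketch; the emulator adds the extra block structure over the m weights on top of this):

import numpy as np

def gp_predict(k_xx, k_xs, k_ss, w, noise=1e-8):
    """Sketch: GP posterior mean/variance at a single theta_star.

    k_xx: (n, n) covariance among grid points, k_xs: (n,) covariance between the
    grid points and theta_star, k_ss: prior variance at theta_star, w: (n,) weights.
    """
    K = k_xx + noise * np.eye(len(w))
    mu = k_xs @ np.linalg.solve(K, w)
    var = k_ss - k_xs @ np.linalg.solve(K, k_xs)
    return mu, var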

The lam_xi term is called a "nugget term" and is basically the white-noise scatter of the emulator, in flux space, like you mentioned. I'm not sure I fully understand your comment about the conditional and the standard kernel. The math leading up to Eqn 42, ApJ is introducing the GP prior to the grid of eigenspectra weights, and then marginalizing over all possible GP weights (i.e., getting the GP marginal likelihood). The thing that we actually need to calculate is Eqn 42 (and the priors, Eqn 43, 44, and 46). The whole posterior is put together in Eqn 45. Is there a fast way to implement a standard GP hyperparameter optimization in PyTorch and this is why you are asking about a standard kernel?

@mileslucas (Member, Author)

For the last point-

The "standard kernel" is the same as A.10 (arxiv) used to create block matrix Sig_w (eqn. A.13). Our hyperparameters train the amplitude and lengthscale of this standard kernel (across all batches and feature dimensions) in addition to the lam_xi. My initial understanding was that we could train these kernel parameters using the Habib likelihood and then just throw them into a standard GP (eqn. A13) for sampling the posterior function (i.e. for getting the weights once trained). My current understanding is that the Habib implementation samples the posterior function using a conditional likelihood that includes lam_xi and PhiPhi (arxiv eqn. A.30) which means we can't just naively throw our trained hyperparameters into a normal GP. Unless we can, which is my confusion.

@iancze (Collaborator) commented Dec 27, 2018

[this is in response to your new but not newest comment]

I've also been thinking about the overall organization of the emulator code. As of right now, I think we should not put as much emphasis on the PCAGrid class. I think the two points where users should work is either a GridInterface + Interpolator or an Emulator. The intermediate PCAGrid class is merely a wrapper for the linear algebra that facilitates the PCA and ought to only be used in the Emulator.

This is great that you are thinking about reorganizing the emulator / interpolator into a more parallel type of functionality. In general, I think your concepts are great and JSON saving is always a good idea. (Though, I always wonder if there are better scientific formats for this kind of stuff). I have a few comments about envisioned use cases and how a user might go about training things.

PCA decomposition
It's true that this actual process can be pretty quick, and in theory the eigenspectra don't need to be stored on disk. However, for analysis purposes, we definitely should have a use pattern where the actual eigenspectra can be stored to disk, for plotting purposes and to analyze what spectral features they might actually be tracking. People might want to experiment with how they are tuning the emulator by varying the number of eigenspectra weights, and having access to them in a persistent manner (with the settings used to generate the eigenspectra) is really important here. Also, if PyTorch goes and changes its PCA algorithm from one release to the next (or maybe even just its default parameters), we won't have a way to verify the reproducibility of the eigenspectra.

Second, the pattern in which I used Starfish was the following. Store a ~5 GB library of synthetic spectra on my home disk somewhere. Use PCA to decompose these into eigenspectra, and train the emulator. Then, transfer the eigenspectra and hyperparameters to a cluster where I fit an actual spectrum. In that case, a trained emulator (or even just the PCA spectra themselves) is very easy to transfer from a staging computer to HPC without needing the full synthetic grid. There may be better use patterns, but at least I found this convenient.

Some other patterns to consider
Say a user trains an emulator based on some subset of their synthetic grid. Then, they do a fit of a spectrum and find that their posterior butts up against the edge of their subset, and they need to retrain a new emulator. Is there anything they can easily use from the previous training? Maybe the answer's no, but it feels like there would be some hints for GP hyperparameter settings.

Or, say a user trains an emulator using m eigenspectra weights. Then, they decide that they need to be more (or less) accurate and want an emulator with m + 1 or m - 1 eigenspectra. Is there anything they can reuse from the first training?

[response to newest comment]

For the last point-
The "standard kernel" is the same as A.10 (arxiv) used to create block matrix Sig_w (eqn. A.13).

The GP hyperparameters determined from training using A.23 (and kernel equation A.10) are the same ones used to do emulation via eqn A.30. I think everything is consistent here among the training and utilization steps. Hopefully that makes sense but please keep asking questions if I'm struggling to make clear what's going on.

@mileslucas (Member, Author) commented Dec 27, 2018

It would be just as easy to ship an emulator as HDF5 with the eigenspectra built in, too. I wanted to try and see what the bare minimum model persistence would require. So I think we can still structure this such that you create a grid.hdf5 for the subset of your favorite model library, and then you create an emulator.hdf5 rather than a pca_grid.hdf5.
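A sketch of what that bare-minimum emulator.hdf5 could contain (dataset and group names here are placeholders):

import h5py

def save_emulator(filename, eigenspectra, weights, hyperparams):
    # Sketch: persist the trained emulator; dataset/group names are placeholders.
    with h5py.File(filename, "w") as f:
        f.create_dataset("eigenspectra", data=eigenspectra)
        f.create_dataset("weights", data=weights)
        grp = f.create_group("hyperparameters")
        for name, value in hyperparams.items():
            grp.create_dataset(name, data=value)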

I believe that adding extra grid points should not affect the GP hyperparameters if they are close to the original grid. Since we are using a very smooth GP kernel, the hyperparameters we've trained should pretty well define the function space of the grid subset, a sort of pseudo-global optimum. When we actually use the GP to sample a function and get our weights, this function is sampled conditioned on the input (training) data. By adding extra training points after training, the function space is automatically constrained, because the kernel is conditionally sampled on the training data. I suspect this would cause issues too far from the original grid subset, but this is all a naive guess anyway.

@iancze (Collaborator) commented Dec 27, 2018

+1 for the emulator.hdf5 idea. I think that makes a lot of sense.

I think you're probably right on the hyperparameter settings, but, the reason I was thinking there might be a substantial change is because even if you keep m total eigenspectra, if you expand your subgrid slightly, then the eigenspectra themselves will change. Thus, the GP hyperparameters might change as well even if you've only added a few spectra on the edge of the grid. If these spectra corresponded to a temperature range where some new important spectral line or feature was being introduced, this could substantially change the eigenspectra from the PCA. I guess we won't know for sure without some detailed tests, but back when I was constructing the emulator this seemed to have a big effect.

@mileslucas (Member, Author)

That's a really good point I hadn't considered. It would be interesting to see the actual variability of the eigenspectra given different grid points around the same area. From this beautiful Stack Overflow answer, the eigenspectra should contain the features that vary the most in the spectra, so how much will, say, the 0th eigenspectrum change given one more grid point? It might be drastic or not change at all; definitely a good thought experiment.

@mileslucas (Member, Author)

Also, on a separate note, it would be a great test of scalability if we could create an emulator for an entire model library (like all the PHOENIX Alpha=0 models) truncated to a certain instrument. That would be really cool to consider for the use-case of someone wanting to process many many different spectra all from the same instrument (this is one of my use-cases, actually). My gut, right now, tells me that actually creating the PCA decomposition wouldn't fail but trying to optimize would be a nightmare. Although if optimization takes a few days on a server, that might still be worth it if you then have a trained emulator for any possible spectrum you might use.

@iancze (Collaborator) commented Dec 27, 2018

I think your intuition on the scaling is correct. I mean, if you keep all of the PCA components you can reconstruct the grid flawlessly. The problem is a judgement call of how many you need to keep to get the accuracy you desire vs. how few you can get away with, since the training time for Habib goes as ~m^3, I think.

Limiting the spectra to a narrow wavelength range (the overlap with the instrument you're using, for example, or even a single echelle order) can also increase accuracy with fewer eigenspectra, since there are fewer spectral features for the eigenspectra to explain. The training scaling is much more favorable with the v1 emulator, since the GPs are independent and we are just training on weights rather than the flux pixels of the synthetic library.

@mileslucas (Member, Author)

It's been a long time since I've visited this issue, but refactor/core-numpy includes a fully functional (as far as I can tell through inspection and testing) rewrite of the emulator. Documentation will be included to showcase the new operation modes, alongside a Jupyter notebook example that provides some more insight into the background of the emulator.
