While a number of production-ready libraries already exist, such as:
- MOE
- Spearmint
- hyperopt
- GPyOpt

the pieces are simple enough to assemble ourselves. Simply put, a Bayesian optimizer requires:
**A Gaussian process** (`sklearn.gaussian_process` as `gp`)
- From which we will predict the posterior distribution of the target function (the error function of our models).
- Because a Gaussian distribution is completely defined by its mean and variance, a Gaussian process is likewise completely defined by its mean function and covariance function.
- Until math has another revolution and we discover that we knew nothing all along (which I seem to find a lot), we can treat this as a pretty safe assumption (GP ~ mean function & covariance function).

A GP is a popular probability model because it induces a posterior distribution over the loss function that is analytically tractable. This allows us to update our beliefs about what the loss function looks like after we have computed the loss for a new set of hyperparameters.
https://thuijskens.github.io/2016/12/29/bayesian-optimisation/
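As a minimal sketch of the Gaussian process piece with `sklearn.gaussian_process` (the Matérn kernel and the toy observations are assumptions for illustration, not anything prescribed here):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# A few observed (hyperparameter, loss) pairs -- placeholder values for illustration.
X_obs = np.array([[0.1], [0.4], [0.9]])
y_obs = np.array([0.52, 0.31, 0.47])

# Fit the GP; it is fully specified by its mean and covariance (kernel) functions.
gp_model = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
gp_model.fit(X_obs, y_obs)

# Posterior mean and standard deviation of the loss over candidate points.
X_cand = np.linspace(0, 1, 100).reshape(-1, 1)
mu, sigma = gp_model.predict(X_cand, return_std=True)
```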
**An acquisition function**
- Which decides at which point of our target function we want to sample next.
- A number of well-documented acquisition functions exist, listed below:
- **Probability of Improvement** (PI)
  - Looks where the function's **improvement is most likely**.
  - Can lead to odd behavior because it only considers whether a point is likely to beat the current minimum, not the magnitude of the possible improvement.
- **Expected Improvement (EI)** (also written MEI? should confirm)
  - Looks where the function **may improve the most**, aka *maximal expected utility*.
  - EI(x) = 𝔼[max{0, f(x) − f(x̂)}], where f(x̂) is the current best observed value (a code sketch of EI follows this list).
  - Crowd favorite.
- **Entropy search**
  - Improves the function by **minimizing the uncertainty** of any predicted optimum.
- **Upper Confidence Bound** (UCB)
  - Samples where the posterior's upper confidence bound (mean plus a multiple of the standard deviation) is highest.
  - Exploits uncertainty by looking where the upper bound is still poorly pinned down.
- **Maximum probability of improvement** (MPI)
- **PMAX**
- **IEMAX**
- **GP-Hedge**
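Since the notebook settles on Expected Improvement just below, here is a hedged sketch of EI in closed form. Note the formula above is written in the maximization convention; for minimizing an error the improvement term flips to f(x̂) − f(x). The `xi` exploration bonus and the helper name `expected_improvement` are my own choices, not from the sources above.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(X_cand, gp_model, y_best, xi=0.01):
    """Expected Improvement for a minimization problem.

    X_cand   : candidate points, shape (n, d)
    gp_model : a fitted sklearn GaussianProcessRegressor
    y_best   : lowest loss observed so far, f(x_hat)
    xi       : small exploration bonus (an assumption, not from the text)
    """
    mu, sigma = gp_model.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)          # avoid division by zero
    improvement = y_best - mu - xi           # expected gap below the incumbent
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)
```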
Following the crowd, this notebook will use the Expected Improvement function, for reasons I may revisit this notebook to explain.
With these two parts our program should:
- Given observed values of the target function f(x), update the posterior expectation of f using the Gaussian process.
- Find new_x that maximises EI: new_x = argmax EI(x).
- Compute the value of f(new_x).
- Update the posterior expectation of f.
- Repeat this process for n_iterations.
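Putting the two parts together, a minimal loop along the lines of the steps above might look like the sketch below. It reuses the `expected_improvement` helper sketched earlier; the kernel, the dense candidate grid, and the name `bayesian_optimize` are assumptions of this sketch, not a fixed recipe.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayesian_optimize(f, bounds, n_init=5, n_iterations=15, seed=0):
    """Minimal 1-D Bayesian optimization loop (minimization)."""
    rng = np.random.default_rng(seed)

    # Generate the priors: a handful of random evaluations of f.
    X = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))
    y = np.array([f(x[0]) for x in X])

    gp_model = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
    X_cand = np.linspace(bounds[0], bounds[1], 1000).reshape(-1, 1)

    for _ in range(n_iterations):
        # 1. Update the posterior expectation of f with the observed values.
        gp_model.fit(X, y)
        # 2. new_x = argmax EI(x) over the candidate grid.
        ei = expected_improvement(X_cand, gp_model, y_best=y.min())
        new_x = X_cand[np.argmax(ei)]
        # 3. Compute f(new_x) and fold it back into the observations.
        X = np.vstack([X, [new_x]])
        y = np.append(y, f(new_x[0]))

    best = np.argmin(y)
    return X[best, 0], y[best]
```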
We will ultimately be looking at hyperparameters, treating the score or error as a function of those parameters. Here we start by treating an example function as an optimization problem: finding the global minimum (of the error).
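To make that concrete, here is a hypothetical 1-D error function together with 15-point grid and random searches, the same budget the plots below use; the function itself is made up for illustration, and the real example in the figures may differ. It reuses `bayesian_optimize` from the sketch above.

```python
import numpy as np

# A hypothetical error surface with a local and a global minimum.
def example_loss(x):
    return np.sin(3 * x) + 0.3 * (x - 1.5) ** 2

bounds = (0.0, 4.0)

# Grid search: 15 evenly spaced parameter values.
grid_x = np.linspace(bounds[0], bounds[1], 15)
grid_best = grid_x[np.argmin(example_loss(grid_x))]

# Random search: 15 uniformly drawn parameter values.
rng = np.random.default_rng(42)
rand_x = rng.uniform(bounds[0], bounds[1], size=15)
rand_best = rand_x[np.argmin(example_loss(rand_x))]

# Bayesian optimization: 5 prior points + 10 acquisitions = the same 15 evaluations.
bo_best, bo_loss = bayesian_optimize(example_loss, bounds, n_init=5, n_iterations=10)
```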
Using 15 points, finding minima.

Grid search misses the true global minimum, as the true minimum lies outside its discrete grid points. Random search also misses the true global minimum, but could have done better than grid search.
Generating priors (initial 5).

Updating the process with each acquisition. Note that the red line maps to the acquisition function in the bottom panel and indicates where to look next.

The Bayesian optimizer finds the global optimum, confirms it is global, and homes in on the true minimum.
- Grid search would search the parameter space symmetrically and systematically.
  - thorough, inefficient; uniformity between samples may miss details
  - weak in higher-dimensional spaces
- Random search would search the parameter space randomly.
  - efficient, less thorough, reliant on sufficient iterations
  - stronger in higher-dimensional spaces

Neither learns from previously selected points in the parameter space.

- Bayesian optimization, however, does learn from previous points, and works effectively as the dimensionality of the space grows.
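For real hyperparameters the same contrast shows up in the APIs: scikit-learn ships grid and random search directly, while Bayesian search needs a separate tool such as the fmfn/BayesianOptimization package linked at the end. A small sketch of the first two, assuming an SVC on the iris data with placeholder parameter ranges:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Grid search: every combination, symmetric and systematic.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=5).fit(X_iris, y_iris)

# Random search: a fixed budget of random draws from the same space.
param_dists = {"C": loguniform(1e-1, 1e2), "gamma": loguniform(1e-2, 1e0)}
rand = RandomizedSearchCV(SVC(), param_dists, n_iter=12, cv=5, random_state=0).fit(X_iris, y_iris)

print(grid.best_params_, rand.best_params_)
```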
Again, by treating the score or error as a function of the parameters, we pose an optimization problem for an example function, this time finding the global maximum (of the score). In this particular example there are two maxima, which would otherwise lead grid and random search into confusion.
Using 25 points, finding maxima.
Generating priors (initial 5).

Updating the process with each acquisition. Note the point on the acquisition map, which indicates where to look next.

Bayesian (20 points): at this point the assumed function is fairly close to the true one.
Again...
- Grid search would search the parameter space symmetrically and systematically.
  - thorough, inefficient; uniformity between samples may miss details
  - weak in higher-dimensional spaces
- Random search would search the parameter space randomly.
  - efficient, less thorough, reliant on sufficient iterations
  - stronger in higher-dimensional spaces

Neither learns from previously selected points in the parameter space.

- Bayesian optimization, however, does learn from previous points, and works effectively as the dimensionality of the space grows.
https://towardsdatascience.com/shallow-understanding-on-bayesian-optimization-324b6c1f7083
https://thuijskens.github.io/2016/12/29/bayesian-optimisation/
https://github.com/fmfn/BayesianOptimization
https://sheffieldml.github.io/GPyOpt/