This post sheds light on uncertainty in deep learning models. We all realise that deep learning algorithms are growing in popularity and use by the greater engineering community. Maybe a deep learning model recommended this post to you, or the Spotify song you are listening to in the background. Soon, deep learning models might enter more sensitive domains. Autonomous driving, medical decision making or the justice system might adopt such models too, but what about the uncertainties that these models introduce in such applications?
Training a neural net gets you a point estimate of the weights. Testing a neural net gets you one sample of a softmax distribution. Think about the knowledge being lost in those two steps. Moreover, we know that these softmax outputs can be easily fooled: an imperceptible change to the input might flip a classification from schoolbus at 0.95 to ostrich at 0.95. This project will not focus on these adversarial examples, but their existence motivates us to take a more elaborate view of neural networks.
This project compares three approaches to uncertainty in neural networks. I talked to various researchers over the past months, and there is no consensus on a single approach for obtaining uncertainties. However, all researchers agreed that these three approaches point in the right direction.
Our three approaches are bootstrapping, MCMC and variational inference. Before we dive into the details of each, this section will sketch an overarching structure in which to understand these approaches.
The bootstrap follows from the assumption that there is one correct parameter for our model, and we estimate it from a random data source. How can we use this assumption to compute the uncertainty in our parameter? We subsample the training set many times and estimate one parameter from each subsample. We are uncertain about our model, so we maintain this set of estimated parameters. At test time, we average the outputs of the model under each parameter to get our prediction. The variance in the outputs represents the uncertainty. Chapter 8 of The Elements of Statistical Learning explains the bootstrap in more detail.
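To make this concrete, here is a minimal numpy sketch of the bootstrap applied to a toy estimator, the sample mean; the variable names are illustrative and not from this project:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=200)   # our single observed dataset

num_bootstraps = 500
estimates = []
for _ in range(num_bootstraps):
    # resample the dataset with replacement and re-estimate the parameter
    sample = rng.choice(data, size=len(data), replace=True)
    estimates.append(sample.mean())

estimates = np.array(estimates)
print("point estimate:", data.mean())
print("bootstrap uncertainty (std of estimates):", estimates.std())
```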
Another way to view the learning process is that there is one dataset and we learn a distribution over our model parameters. Bayes' rule formalises this reasoning. We distill our knowledge of the world into the prior. Through the likelihood we update this distribution according to the data. This gives rise to a posterior distribution over the parameters. However, this distribution is intractable to compute. Therefore, we resort to two approximations for this process:
- _Monte Carlo sampling_: rather than evaluating the distribution, we draw samples from it. We can then evaluate any function of the distribution via these samples.
- _Variational inference_: rather than evaluating the distribution, we find a close approximation to it. This approximation has a form over which we can easily perform calculations.
Both approximations come with disadvantages. For Monte Carlo methods, our estimate may vary from the expected value if we have few samples; more samples reduce this variance. For variational inference, we find the best approximation within our chosen family exactly, but we do not know how far this approximate distribution is from the true distribution. In other words, Monte Carlo methods have variance, variational inference has bias.
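In symbols, the reasoning above amounts to Bayes' rule for the parameter posterior and a Monte Carlo estimate of the predictive distribution (a standard formulation, written out here for reference rather than taken from the project):

```latex
% Posterior over the parameters \theta given the dataset D
p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)}

% Predictive distribution for a new input x: intractable integral,
% approximated with M samples \theta_m drawn from the posterior
p(y \mid x, D) \;=\; \int p(y \mid x, \theta)\, p(\theta \mid D)\, d\theta
\;\approx\; \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, \theta_m),
\qquad \theta_m \sim p(\theta \mid D)
```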
All three approaches result in multiple samples of the parameter vector. Our interest lies in the output for a test input and its uncertainty. How do we get these quantities from the parameter samples?
Our model outputs a softmax distribution for each parameter sample. For the prediction, we take the average over all these softmax distributions, $\bar{p}(y \mid x) = \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, \theta_m)$.
For many applications, we need a decision. This will be the bin with the highest average softmax value, $\hat{y} = \arg\max_y \bar{p}(y \mid x)$.
Our estimate of the uncertainty is less clear. We are working with a softmax distribution, which has no single uncertainty number associated with it. In the literature, I came across three options:
- Softmax value: in this case, the value of the average softmax at the decision is used to represent uncertainty, so $u = \bar{p}(\hat{y} \mid x)$.
- Variance in the softmax: in this case, the variance of the softmax values at the decision across the different outputs is used to represent uncertainty. Define the set of all softmax values at the decision, $\{p(\hat{y} \mid x, \theta_m)\}_{m=1}^{M}$. Then the uncertainty is the variance in this set, $u = \mathrm{Var}_m\big[p(\hat{y} \mid x, \theta_m)\big]$. Section 4 of this paper uses this measure.
- Entropy in the average softmax: in this case, the entropy of the average distribution represents the uncertainty, so $u = -\sum_y \bar{p}(y \mid x) \log \bar{p}(y \mid x)$. Section 5.3 of this paper uses this measure.
In this project, we implement all three of them.
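As a sketch, assuming we have collected one softmax distribution per parameter sample in an array, the three measures could be computed as follows (names and shapes are illustrative, not the project's exact code):

```python
import numpy as np

def uncertainty_measures(softmaxes):
    """softmaxes: array of shape (num_samples, num_classes),
    one softmax distribution per parameter sample for a single input."""
    mean_softmax = softmaxes.mean(axis=0)          # average over the softmax distributions
    decision = int(mean_softmax.argmax())          # decision: bin with the highest average softmax

    softmax_value = mean_softmax[decision]                            # option 1: softmax value at the decision
    variance = softmaxes[:, decision].var()                           # option 2: variance of the softmax at the decision
    entropy = -np.sum(mean_softmax * np.log(mean_softmax + 1e-12))    # option 3: entropy of the average softmax

    return decision, softmax_value, variance, entropy

# toy example: 10 parameter samples, 5 classes
samples = np.random.dirichlet(np.ones(5), size=10)
print(uncertainty_measures(samples))
```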
In bootstrapping, we sample multiple datasets with replacement. The `Dataloader` object has a function `bootstrap_yourself` to resample the training set for a bootstrap. The model is then trained `num_runs` times to obtain the set of parameters.
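For illustration, here is a self-contained sketch of such a loop on toy data; `bootstrap_indices` and `train_one_model` are hypothetical stand-ins for the project's `bootstrap_yourself` and training code, not the actual implementation:

```python
import copy
import numpy as np
import torch

def bootstrap_indices(n, rng):
    """Sample n indices with replacement: one bootstrap resample of the training set."""
    return rng.integers(0, n, size=n)

def train_one_model(x, y):
    """Minimal training loop on one bootstrap resample; returns the fitted weights."""
    model = torch.nn.Linear(x.shape[1], int(y.max()) + 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(100):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return copy.deepcopy(model.state_dict())

rng = np.random.default_rng(0)
x = torch.randn(256, 8)
y = (x[:, 0] > 0).long()

num_runs = 5
parameter_samples = []
for _ in range(num_runs):
    idx = bootstrap_indices(len(x), rng)            # resample the training set with replacement
    parameter_samples.append(train_one_model(x[idx], y[idx]))
```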
We use Langevin dynamics to obtain samples from the posterior over parameters. This implementation exactly follows the paper by Welling and Teh. After a `burn_in` period, it saves a parameter vector every `steps_per_epoch` steps.
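As a rough sketch of a single stochastic gradient Langevin dynamics update, assuming `loss` is the mean negative log-likelihood of a minibatch and omitting the prior term for brevity (the project's actual implementation may differ):

```python
import math
import torch

def sgld_step(model, loss, eps, dataset_size):
    """One SGLD step: theta <- theta + (eps/2) * grad log-likelihood estimate + N(0, eps).
    `loss` is assumed to be the mean negative log-likelihood of the current minibatch;
    the prior term of the posterior gradient is left out of this sketch."""
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            # -dataset_size * grad(mean minibatch NLL) estimates the full-data log-likelihood gradient
            grad_log_lik = -dataset_size * p.grad
            noise = torch.randn_like(p) * math.sqrt(eps)
            p.add_(0.5 * eps * grad_log_lik + noise)
```

After the `burn_in` period, a copy of the current parameters would be appended to the posterior sample set every `steps_per_epoch` steps.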
Honestly, I currently lack some understanding of the variational approach. The implementation follows the papers here and here. At the moment, I understand this literature as a principled approach that leads to an intuitive implementation. We are all familiar with dropout and its dropping of weights in a neural network. We can interpret this as fitting a two-spike distribution to the parameter posterior (per weight) while constraining one spike to zero. We obtain samples from this distribution by sampling from these spikes, which amounts to running the model many times with different dropout masks. I hope to update this section as I gain more understanding. The researchers I chatted with on this project also pointed me to this paper.
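A minimal sketch of that procedure, assuming the model's only stochastic layers are dropout (in real code, `model.train()` also affects layers such as batch normalisation, so the project may toggle dropout more carefully):

```python
import torch

def mc_dropout_predict(model, x, num_samples=50):
    """Monte Carlo dropout: keep dropout active at test time and average
    the softmax over many stochastic forward passes (one dropout mask each)."""
    model.train()                      # keeps dropout layers sampling masks during inference
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(num_samples)])
    return probs.mean(dim=0), probs    # average softmax and the per-mask samples
```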
So how do we assess uncertainty in image classification? There is no ground-truth measure of uncertainty for any single image, as that would assume a model of the full (history of the) world. However, we can construct images for which we know that uncertainty should increase. We take two approaches: injecting Gaussian noise and rotating the image.
We experiment with different noise levels or angles of rotation and record the corresponding uncertainty metrics. For each perturbation method, we run `num_experiments` experiments on `batch_size_test` images.
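A sketch of the two mutilations, assuming images are stored as arrays of shape (N, H, W) with values in [0, 1]; the function names are illustrative, not the project's:

```python
import numpy as np
from scipy.ndimage import rotate

def add_gaussian_noise(images, sigma, rng):
    """Perturb a batch of images with Gaussian noise of standard deviation sigma."""
    noisy = images + rng.normal(scale=sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

def rotate_images(images, angle):
    """Rotate a batch of (N, H, W) images by `angle` degrees in the image plane."""
    return rotate(images, angle, axes=(1, 2), reshape=False, order=1)
```

Sweeping sigma or the rotation angle and recording the three uncertainty measures at each setting produces the curves shown in the diagrams below.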
These diagrams plot the risk numbers against the experiment variable. For different levels of injected noise and different rotation angles, we see the entropy, mean and standard deviation of the softmax. You can make this diagram with `plot_risk.py`.
We also want intuition for each mutilation and its effect on the uncertainty. Therefore, we made GIFs in which the mutilations increase. Red and green titles indicate incorrect and correct classifications, respectively.
In these results, there are some interesting observations:
- When rotating the images, the error quickly shoots up. At 90 degrees rotation the model misclassifies 80% of the images. It's interesting to see how the uncertainty numbers behave under such large error.
- The entropy of `mc_dropout` is larger than for the other two MC types. In parallel, we notice that its mean softmax value is lower.
- Even though the entropy and mean softmax of the bootstrap and Langevin samples are comparable, their standard deviation is lower.
At this point, we leave many open ends for this project. No researcher I contacted expressed a definitive conclusion on uncertainties in neural networks; lots of research remains to be done in this area. I hope these diagrams give you a starting point to think about these questions too.
As always, I am curious about any comments and questions. Reach me at [email protected]
- The original paper proposing dropout as a variational approximation to the posterior: Dropout as a Bayesian Approximation
- Outlining the difference between aleatoric and epistemic uncertainty: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
- A nice Reddit thread discussing the validity of dropout as an estimator of epistemic uncertainty: [D] What is the current state of dropout as Bayesian approximation?
- On Hamiltonian Monte Carlo: MCMC using Hamiltonian Dynamics
- The original paper outlining the sampling procedure for Langevin dynamics: Bayesian Learning via Stochastic Gradient Langevin Dynamics
- Chapter 8 of The Elements of Statistical Learning, on the bootstrap