I recently gave a talk about [[score-based-models]], mostly based on this blog post.
At their core, generative models produce samples from a (data) distribution. The first distinction to make is the principle by which you tackle the problem: treating it as an adversarial problem gives you [[generative-adversarial-networks]]; treating it probabilistically, in terms of likelihoods, gives you basically the rest. However, there's an additional distinction here that I think is underappreciated, which is how you go about producing the samples: you can do it either like a statistician or like a machine-learning person.
A statistician (or maybe more accurately a Bayesian) thinks of the classical approaches to this problem, namely using MCMC to sample from a distribution. That is, the goal is to set up a Markov chain such that, in the limit, you have a random process that is effectively generating samples from the distribution. Alternatively, you can think of it as generating correlated samples (see this Wiki page). The issues here are how rapidly this process mixes (or whether it gets stuck in some region), and, maybe equivalently, how correlated successive samples are.
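To make the statistician's picture concrete, here is a minimal sketch of MCMC using a random-walk Metropolis sampler on a toy one-dimensional target; the target, step size, and chain length are all made up purely for illustration.

```python
import numpy as np

def metropolis_chain(log_target, x0, n_steps, step=0.5, rng=None):
    """Random-walk Metropolis: successive states form correlated samples
    that, in the limit, are distributed according to exp(log_target)."""
    if rng is None:
        rng = np.random.default_rng(0)
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + step * rng.normal(size=np.shape(x))
        # Accept or reject based on the ratio of target densities.
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return np.array(samples)

# Toy target: a 1-D mixture of two Gaussians (illustrative only).
log_target = lambda x: np.logaddexp(-0.5 * (x - 2) ** 2, -0.5 * (x + 2) ** 2)
chain = metropolis_chain(log_target, x0=0.0, n_steps=5000)
```

If the step size is poorly chosen or the modes sit far apart, the chain can take a very long time to visit all of them, which is exactly the mixing and correlation concern above.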
On the other hand, ML tries to learn a model that takes random noise as input, and then outputs an image. Here, the randomness comes from the input noise.^[A computer scientist would think of this more like a seed.] You can effectively think of these models as doing something like what happens when you independently sample from a probability distribution (using inverse transform sampling): you learn a map from a simple distribution (uniform or Gaussian) to another distribution. You forgo the stochastic process for what is effectively a deterministic function, and you don't have to worry about convergence. The slight downside is that creating multiple samples requires you to restart the whole process. Most generative models fall under this second category.
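To make the analogy concrete, here's a small sketch of inverse transform sampling with a closed-form target (an exponential distribution, chosen only because its inverse CDF is known analytically): all the randomness sits in the uniform input, and the map itself is deterministic. A generative model effectively *learns* such a map when no closed form exists.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse transform sampling: push uniform noise through the inverse CDF
# of the target. Here the target is Exponential(rate), whose inverse CDF
# is known in closed form; a generative model instead learns such a map.
def sample_exponential(rate, n, rng):
    u = rng.uniform(size=n)          # the "noise" input, U(0, 1)
    return -np.log(1.0 - u) / rate   # deterministic map to Exp(rate)

samples = sample_exponential(rate=2.0, n=10_000, rng=rng)
```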
What's interesting (or confusing) is that you have this duality between score-based models and [[diffusion-models]], whereby they're effectively doing very similar things; the biggest difference is that score-based models follow (at least theoretically) the Bayesian MCMC perspective, while diffusion models take the ML approach. Except, they're kind of the same thing, no? Or, at the very least, there was a lot of fruitful cross-pollination of ideas, which suggests that fundamentally they're doing roughly the same thing. So how does this square with these fundamentally different ways of sampling?
If you look at how [[diffusion-models]] progress (this GIF gives a good visualization), they are clearly doing what they set out to do, which is to reverse a diffusion process. This places them firmly in the ML camp. However, it was the architectural changes inspired by [[score-based-models]] that made diffusion models actually usable; that is, [@NEURIPS2020_4c5bcfec] showed that with some tweaking, the ELBO in the reverse diffusion process was (effectively?) identical to the loss from noise-conditional score-based models. This suggests to me that somehow doing a combination of these two methods is the way to go. And actually, in some sense, that's what diffusion models are. The classic way of doing ML-based generative models is to have a very complicated model but learn that transform in one go (see [[variational-autoencoders]]). It turns out that a better way to do sampling is to do roughly what traditional MCMC methods do, which is to build processes that slowly mix to the right distribution.
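As a rough illustration of that connection (not the exact objective from the paper), here is a sketch of the simplified noise-prediction loss; `noise_pred_model` is a hypothetical stand-in for whatever network you'd actually train, and `alpha_bar_t` is the cumulative noise-schedule term at step `t`.

```python
import numpy as np

def ddpm_style_loss(noise_pred_model, x0, alpha_bar_t, t, rng):
    """Sketch of the simplified diffusion objective: corrupt x0 with
    Gaussian noise at level t and ask the model to predict that noise.
    Up to sign and scaling, predicting the noise is the same as
    estimating the score of the noised data distribution, which is the
    link to noise-conditional score matching."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    eps_hat = noise_pred_model(x_t, t)     # assumed signature: (x_t, t) -> noise
    return np.mean((eps - eps_hat) ** 2)   # || eps - eps_theta(x_t, t) ||^2

# Toy usage with a dummy "model" that predicts zero noise everywhere.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 16))              # a batch of fake data
dummy_model = lambda x_t, t: np.zeros_like(x_t)
loss = ddpm_style_loss(dummy_model, x0, alpha_bar_t=0.5, t=10, rng=rng)
```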
Note that [[normalising-flows]] are also iterative, but they are mainly iterative by construction: you build the transform as a composition of invertible functions, thereby making things tractable. In particular, there's no randomness involved in the transform itself, which I think is another crucial piece of the puzzle that makes one think more about MCMC.
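Here's a toy sketch of that construction, with elementwise affine layers standing in for real coupling layers; the point is just that the transform is a deterministic composition of invertible maps whose log-determinant accumulates tractably.

```python
import numpy as np

# Sketch of the normalising-flow idea: compose invertible maps with
# tractable Jacobians. The layers here are elementwise affine transforms
# (toy stand-ins for coupling layers); once the base sample z is drawn,
# everything downstream is deterministic.
def flow_forward(z, layers):
    log_det = 0.0
    for log_scale, shift in layers:
        z = np.exp(log_scale) * z + shift   # invertible by construction
        log_det += np.sum(log_scale)        # log |det Jacobian| of this layer
    return z, log_det

rng = np.random.default_rng(0)
layers = [(0.1 * rng.normal(size=3), rng.normal(size=3)) for _ in range(4)]
x, log_det = flow_forward(rng.normal(size=3), layers)
```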
The classic application of MCMC is computing Bayesian posterior distributions, which can get pretty ugly and somewhat high-dimensional. The more complicated and high-dimensional the posterior, the more difficult it is for MCMC methods to mix to the true distribution. Still, we're talking about (posteriors of) hierarchies of probability distributions.
Now consider the distribution of a vector that represents an image of a celebrity face. As with much of modern ML, we are very far removed from the world of nice probability distributions. These distributions are incredibly high-dimensional. They probably live on some low-dimensional manifold, but that manifold is probably not like the proper manifolds we think of in mathematics.