Diffusion models have recently proven successful, even rivalling previous methods like generative adversarial networks (GANs), in tasks like image synthesis.


The diffusion process as described in Denoising Diffusion Probabilistic Models by Ho et al. [arXiv] consists of two processes: a forward process and a reverse process.

The forward process is fixed and depends only on a set of hyperparameters (the noise schedule), while the reverse process is learned using a modified UNet.


A very brief explanation of diffusion, taken from a personal email that I sent to my advisor (changes in parentheses):

In the forward process, we generate a new image at $t + 1$ by sampling pixel values from a Gaussian centred around the previous image; with each step the accumulated variance grows and the mean is scaled further towards zero.

Because the Gaussians at each time step are independent, their sum is also a Gaussian. This fact allows us to generate a noised image at any time step $t$ directly from the original image, without computing steps $t - 1$, $t - 2$, and so on (the forward process is Markovian), simplifying the training process like you mentioned.
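This closed-form jump can be sketched in a few lines of NumPy. The sketch below assumes a linear $\beta$ schedule and uses the standard identity $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,\mathbf{x}_0, (1 - \bar\alpha_t)I)$ from the DDPM paper; the function name `forward_noise` is my own.

```python
import numpy as np

def forward_noise(x0, t, betas, rng=np.random):
    """Sample x_t directly from x_0 using the closed-form
    q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).

    Returns both the noised image and the noise used, since the
    network is trained to predict that noise.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]          # cumulative product up to step t
    eps = rng.randn(*x0.shape)                 # eps ~ N(0, I)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

# e.g. with 1000 steps, t near the end yields an almost pure-noise image:
# betas = np.linspace(1e-4, 0.02, 1000)
# x_t, eps = forward_noise(x0, 999, betas)
```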

As we reuse the same network for every time pair, we must somehow embed a timestamp with each image. This is done by encoding the time step using some function and then embedding it into the data so the network can learn the time encoding (sinusoidal transformer positional embeddings from Attention Is All You Need by Vaswani et al. [arXiv]).
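A minimal sketch of such a sinusoidal embedding, in the style of the transformer positional encoding (the name `timestep_embedding` and the base frequency of 10000 follow the Vaswani et al. convention; an even `dim` is assumed):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Encode an integer time step as a dim-length vector of
    sines and cosines at geometrically spaced frequencies.
    Assumes dim is even."""
    half = dim // 2
    # frequencies decay from 1 down to ~1/10000
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])
```

The resulting vector is typically passed through a small MLP and added to the UNet's intermediate activations.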

The decoder is trained by minimising the $L_2$ distance between the expected mean and the predicted mean for time step $t$ (although $L_1$ has been shown to work as well in other tasks). We can observe that $\mathbf{x} \sim \mathcal{N}(\mu, \sigma^2 I)$ is equivalent to $\mathbf{x} = \mu + \sigma\epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$. Using this observation, we can reparameterise the learning to optimise for $\epsilon$ instead of the mean.
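Put together with the closed-form forward step, this gives the DDPM "simple" objective: noise an image, then penalise the squared error between the true noise and the network's prediction. A sketch, assuming `model(x_t, t)` is any callable predicting $\epsilon$ (the name `ddpm_loss` is mine):

```python
import numpy as np

def ddpm_loss(model, x0, t, betas, rng=np.random):
    """MSE between the noise used to corrupt x0 and the model's
    predicted noise at step t (the epsilon-parameterised objective)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.randn(*x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    eps_pred = model(x_t, t)
    return np.mean((eps - eps_pred) ** 2)
```

In practice `t` is sampled uniformly per training example, so the network sees all noise levels.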

Earlier you asked me why we couldn't take large steps when predicting $\mathbf{x}$ given $\mathbf{x}'$. This is because multiple paths can converge at $\mathbf{x}'$, so the distribution $q(\mathbf{x} | \mathbf{x}')$ would not have an obvious prior step (the problem is intractable). By using a smaller variance step, there is a more prominent peak in the distribution, which corresponds to the most likely prior.