From “Novel View Synthesis with Diffusion Models” by Ho et al. [paper page] [arXiv].

Diffusion models have proven useful in many generative tasks, including many image-to-image applications. Novel view synthesis is an image-to-image task that had not been investigated with diffusion models until Ho et al.

Major contributions from this paper:

  • Stochastic conditioning
  • X-UNet
  • 3D Consistency Scoring

Pose-conditional Diffusion Models§

We are interested in modeling the distribution

$$ q(\mathbf{x}_1, \dots, \mathbf{x}_m | \mathbf{x}_{m+1}, \dots, \mathbf{x}_n) $$

without a complete description of a 3D scene.

This means:

  • There are no guarantees that samples will be consistent with one another, since the model never sees a complete description of the scene's geometry.
  • Generated views can depend on one another for 3D consistency.
    • Contrast with NeRF approaches, where rays are conditionally independent given a 3D representation of a scene.

Given a distribution $q(\mathbf{x}_1, \mathbf{x}_2)$ over pairs of views at poses $\mathbf{p}_1, \mathbf{p}_2 \in SE(3),$ Ho et al. describe an isotropic Gaussian process that adds increasing noise to the data as the log signal-to-noise ratio $\lambda$ decreases:

$$ q(\mathbf{z}_k^{(\lambda)} | \mathbf{x}_k) := \mathcal{N}(\mathbf{z}_k^{(\lambda)};\sigma(\lambda)^{\frac{1}{2}}\mathbf{x}_k, \sigma(-\lambda)\mathbf{I}) $$

where $\sigma(\cdot)$ is the sigmoid function.

Then by reparameterization,

$$ \mathbf{z}_k^{(\lambda)} = \sigma(\lambda)^{\frac{1}{2}}\mathbf{x}_k + \sigma(-\lambda)^{\frac{1}{2}}\epsilon, \epsilon \sim \mathcal{N}(0, \mathbf{I}) $$
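The reparameterized forward process translates directly into code. A minimal NumPy sketch (the function names are mine, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def q_sample(x, log_snr, rng):
    """Draw z^(lambda) ~ N(sigmoid(lambda)^(1/2) x, sigmoid(-lambda) I)."""
    eps = rng.standard_normal(x.shape)
    z = np.sqrt(sigmoid(log_snr)) * x + np.sqrt(sigmoid(-log_snr)) * eps
    return z, eps

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64, 3))            # a clean frame
z_hi, _ = q_sample(x, log_snr=10.0, rng=rng)    # high log-SNR: nearly clean
z_lo, _ = q_sample(x, log_snr=-10.0, rng=rng)   # low log-SNR: nearly pure noise
```

Note how the sigmoid parameterization makes the two extremes explicit: at large $\lambda$ the sample is essentially the clean frame, and at very negative $\lambda$ it is essentially unit Gaussian noise.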

Then, they learn to reverse the process for one of the two frames by optimizing:

$$ L = \mathbb{E}_{q(\mathbf{x}_1, \mathbf{x}_2)}\ \mathbb{E}_{\lambda, \mathbf{\epsilon}} ||\mathbf{\epsilon}_\theta(\mathbf{z}_2^{(\lambda)}, \mathbf{x}_1, \lambda, \mathbf{p}_1, \mathbf{p}_2) - \mathbf{\epsilon}||_2^2 $$

where $\mathbf{\epsilon}_\theta$ is the predicted noise added to the sample and $\mathbf{\epsilon} \sim \mathcal{N}(0, \mathbf{I})$ is the actual noise added. In the paper, Ho et al. write $\mathbf{\epsilon}_\theta(\mathbf{z}_2^{(\lambda)}, \mathbf{x}_1)$ as a shorthand.


Practically, this means that given two frames from a 3D scene and their poses, we attempt to remove the noise added to one of the two frames.

The network is trained to predict the same noise $\mathbf{\epsilon}$ that was added in the forward process.
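Concretely, one stochastic estimate of the loss $L$ looks like the following NumPy sketch. The zero-predicting `dummy_model` is a hypothetical stand-in for the real network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def training_loss(eps_model, x1, x2, p1, p2, rng):
    """One Monte Carlo estimate of L: noise frame 2, predict that noise given frame 1."""
    log_snr = rng.uniform(-10.0, 10.0)                 # sample a noise level
    eps = rng.standard_normal(x2.shape)                # the actual noise added
    z2 = np.sqrt(sigmoid(log_snr)) * x2 + np.sqrt(sigmoid(-log_snr)) * eps
    eps_hat = eps_model(z2, x1, log_snr, p1, p2)       # predicted noise
    return np.mean((eps_hat - eps) ** 2)

# Hypothetical stand-in for the network: always predicts zero noise.
dummy_model = lambda z2, x1, log_snr, p1, p2: np.zeros_like(z2)

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((2, 32, 32, 3))
loss = training_loss(dummy_model, x1, x2, None, None, rng)
```

Since the true noise is unit Gaussian, a model that always predicts zero earns a loss near $1$; training pushes the prediction toward the actual $\mathbf{\epsilon}$.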

Stochastic Conditioning§

Ideally, a 3D scene can be modelled with frames following the chain rule decomposition:

$$ p(\mathbf{x}) = \prod_i p(\mathbf{x}_i | \mathbf{x}_{< i}). $$

This allows them to model a distribution without making assumptions about conditional independence.

This means that each frame can be generated autoregressively by conditioning on all previous frames. Unfortunately, this is impractical – so they suggest using a $k$-Markovian model (keeping only the $k$ most recent frames).

They found that sample quality degrades as $k$ increases; experimentally, $k = 2$ still retains some 3D consistency.

Instead of using an actual Markovian sampler, they vary the conditioning frame at each denoising step.


  1. Start with a set $\mathcal{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_k\}$ of conditioning views of a static scene, with $k$ typically $1$ or very small.

  2. Generate a new frame $\mathbf{x}_{k+1}$ using a modified denoising diffusion reverse process over steps $\lambda_{\text{min}} = \lambda_T \lt \lambda_{T-1} \lt \dots \lt \lambda_0 = \lambda_{\text{max}}:$ $$ \begin{align*} \hat{\mathbf{x}}_{k+1} &= \frac{1}{\sigma(\lambda_t)^{\frac{1}{2}}}\left(\mathbf{z}^{(\lambda_t)}_{k+1} - \sigma(-\lambda_t)^{\frac{1}{2}}\mathbf{\epsilon}_\theta\left(\mathbf{z}_{k+1}^{(\lambda_t)}, \mathbf{x}_i\right)\right)\\\\ \mathbf{z}_{k+1}^{(\lambda_{t-1})} &\sim q\left(\mathbf{z}_{k+1}^{(\lambda_{t-1})} | \mathbf{z}_{k+1}^{(\lambda_t)}, \hat{\mathbf{x}}_{k+1}\right) \end{align*} $$

    where $i \sim \text{Unif}\{1, \dots, k\}$ is resampled at each step, allowing each step to be conditioned on a different random view from $\mathcal{X}.$

  3. Let $\mathcal{X} := \mathcal{X} \cup \{\mathbf{x}_{k+1}\}.$

  4. Repeat until all desired frames have been generated; with enough denoising steps, each new frame can be guided by every previous frame. Ho et al. suggest using 256 denoising steps for high sample quality and 3D consistency.

Notes: As usual, $\mathbf{z}_i^{(\lambda_T)} \sim \mathcal{N}(0, \mathbf{I})$, and the final step at $\lambda_0$ is taken noiselessly.
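Putting steps 1–4 together, a stripped-down stochastic-conditioning sampler might look like the following NumPy sketch. The `eps_model` is a hypothetical stand-in for the pose-conditional network (poses are omitted for brevity), and the ancestral step is simplified to sampling from the marginal $q(\mathbf{z}^{(\lambda_{t-1})} | \hat{\mathbf{x}})$ rather than the full posterior:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stochastic_conditioning_sample(eps_model, cond_views, shape, num_steps, rng):
    """Denoise a new frame, re-drawing the conditioning view at every step."""
    log_snrs = np.linspace(-10.0, 10.0, num_steps)  # lambda_T (min) ... lambda_0 (max)
    z = rng.standard_normal(shape)                  # z^(lambda_T) ~ N(0, I)
    for t in range(num_steps - 1):
        lam = log_snrs[t]
        x_cond = cond_views[rng.integers(len(cond_views))]  # i ~ Unif{1, ..., k}
        # Predict the clean frame from the current noisy latent.
        eps_hat = eps_model(z, x_cond, lam)
        x_hat = (z - np.sqrt(sigmoid(-lam)) * eps_hat) / np.sqrt(sigmoid(lam))
        # Step to the next (higher) log-SNR; the final step is noiseless.
        lam_next = log_snrs[t + 1]
        noise = rng.standard_normal(shape) if t < num_steps - 2 else 0.0
        z = np.sqrt(sigmoid(lam_next)) * x_hat + np.sqrt(sigmoid(-lam_next)) * noise
    return z

# Hypothetical stand-in network: always predicts zero noise.
dummy_eps = lambda z, x_cond, lam: np.zeros_like(z)

rng = np.random.default_rng(0)
views = [rng.standard_normal((8, 8, 3))]
sample = stochastic_conditioning_sample(dummy_eps, views, (8, 8, 3), num_steps=32, rng=rng)
```

The key line is the resampling of `x_cond` inside the loop: with enough steps, every conditioning view gets a chance to guide the denoising of the new frame.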

Discussion: More about the relation of stochastic conditioning to true autoregressive sampling can be found in $\S2.2$ of the paper.


X-UNet§

Unfortunately, previous approaches such as the “Concat-UNet” (Saharia et al. [arXiv]) yielded samples with severe 3D inconsistencies and poor alignment w.r.t. the conditioning image.

Discussion: Hypothesized cause of the Concat-UNet failure – limited model capacity and training data make it difficult to learn nonlinear image transforms through self-attention alone.

X-UNet is based on the UNet proposed in the original DDPM paper by Ho et al. [arXiv]: a UNet with resblocks and self-attention. It is modified to share weights over the two input frames for all convolutional and self-attention layers, with three further changes:

  1. Each frame has its own noise level; for the clean frame, a positional encoding at $\lambda_\text{max}$ is used.
  2. Each UNet block is modulated via FiLM (Dumoulin et al. [paper page]), using the sum of pose and noise-level positional encodings (instead of just noise-level).
    • Additionally, the pose encodings have the same dimensionality as the frames (i.e., they are camera rays).
  3. Instead of using a self-attention layer for time, use a cross-attention layer, letting each frame’s feature maps query the other frame’s activations.
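Modification 2 (FiLM) amounts to a per-channel affine transform of each block's feature maps, driven by the summed pose and noise-level encodings. A minimal sketch, where the weight matrices and encoding shapes are illustrative rather than the paper's:

```python
import numpy as np

def film(h, cond, w_scale, w_shift):
    """FiLM: scale and shift feature maps channel-wise, conditioned on `cond`."""
    scale = cond @ w_scale   # (C,) per-channel scale offsets
    shift = cond @ w_shift   # (C,) per-channel shifts
    return h * (1.0 + scale) + shift

rng = np.random.default_rng(0)
C, D = 8, 16                             # feature channels, conditioning dim
h = rng.standard_normal((4, 4, C))       # one UNet block's feature map
pose_emb = rng.standard_normal(D)        # hypothetical pose positional encoding
noise_emb = rng.standard_normal(D)       # hypothetical noise-level encoding
w_scale = 0.1 * rng.standard_normal((D, C))
w_shift = 0.1 * rng.standard_normal((D, C))
out = film(h, pose_emb + noise_emb, w_scale, w_shift)
```

With zero weights this reduces to the identity, so the modulation can be learned as a perturbation on top of the unconditioned block.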

3D Consistency Scores§

It is difficult to compare geometry-free novel view synthesis methods such as this one to other methods using standard metrics (PSNR, SSIM, FID, etc.) because those metrics cannot measure 3D consistency.

The 3D consistency score does not penalize outputs that deviate from the ground truth but remain 3D consistent; it does penalize outputs that are 3D inconsistent or fail to align with the conditioning views.

They propose sampling many views from the geometry-free model conditioned on a single view, training a watered-down NeRF-like neural field on a subset of these views, and then computing the usual metrics on the neural field’s renders against the remaining generated views.

To ensure that this metric penalizes outputs that fail to align with the conditioning views, they include the conditioning views used to generate the rest as part of the training data for the NeRF-like model.
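The scoring procedure can be summarized as a small harness. This is only a sketch: `train_fn` and `render_fn` stand in for the watered-down neural-field training and rendering, which the notes above do not specify, and the toy stand-ins below just fit a mean image:

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio between two images in [0, max_val]."""
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def consistency_score(cond_views, generated_views, train_fn, render_fn, holdout=0.2):
    """Fit a neural field on the conditioning views plus most generated views,
    then score its renders against the held-out generated views."""
    n_hold = max(1, int(len(generated_views) * holdout))
    # Conditioning views go into training on purpose: misaligned outputs get punished.
    train_views = cond_views + generated_views[:-n_hold]
    held_out = generated_views[-n_hold:]
    field = train_fn(train_views)
    return float(np.mean([psnr(render_fn(field, i), v)
                          for i, v in enumerate(held_out)]))

# Toy stand-ins: the "field" is just the mean image, rendered identically per view.
mean_field_train = lambda views: np.mean(views, axis=0)
mean_field_render = lambda field, i: field

rng = np.random.default_rng(0)
cond = [rng.uniform(size=(8, 8, 3))]
generated = [rng.uniform(size=(8, 8, 3)) for _ in range(10)]
score = consistency_score(cond, generated, mean_field_train, mean_field_render)
```

A 3D-consistent model produces generated views the field can fit and re-render accurately (high held-out PSNR), while inconsistent views leave the field unable to explain the held-out set.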