An attempt at concatenating additional channels on top of the RGB channels of an image to see whether image synthesis (particularly on CelebA-HQ) could be improved.


What I ended up trying was concatenating the pixel coordinates (normalized to $[-1, 1]$) as extra input channels, in an attempt to make the network aware that, for example, a pixel near the centre of a CelebA-HQ image typically belongs to a person’s face, perhaps an eye.

  • We hope that this additional information/context will aid in synthesis tasks.
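The coordinate concatenation described above can be sketched as follows. This is a minimal PyTorch sketch under my own assumptions (the function name and where it sits in the input pipeline are not from the original write-up):

```python
import torch


def add_coord_channels(x):
    """Concatenate normalized pixel coordinates as two extra channels.

    x: (B, C, H, W) image batch. Returns (B, C + 2, H, W), where the new
    channels hold the row (y) and column (x) coordinates, each normalized
    to [-1, 1].
    """
    b, _, h, w = x.shape
    ys = torch.linspace(-1.0, 1.0, h, device=x.device)
    xs = torch.linspace(-1.0, 1.0, w, device=x.device)
    # (H, W) grids: yy varies down the rows, xx across the columns.
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    # Broadcast the (2, H, W) coordinate stack over the batch dimension.
    coords = torch.stack([yy, xx]).expand(b, -1, -1, -1)
    return torch.cat([x, coords], dim=1)
```

With this, a CelebA-HQ batch of shape `(B, 3, H, W)` becomes `(B, 5, H, W)` before entering the UNet, so only the first convolution's input channel count needs to change.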

Discussion: “Concat-UNet” (Saharia et al. [arXiv])

This paper was mentioned in “Novel View Synthesis with Diffusion Models” by Ho et al. [paper page] [arXiv] [my notes on the topic] under the X-UNet discussion, and it was interesting to see that such experiments had already been tried (albeit for super-resolution rather than synthesis).

Our UNet only goes down to 16x16 at the bottleneck, rather than the 4x4 recommended in the DDPM paper [Ho et al. 2020 arXiv]. Originally this was due to memory constraints, which have since been fixed; I nevertheless trained the coordinate-based network with the same 16x16 bottleneck, since the original model had already been trained that way. This probably explains the very large FIDs below.


I trained two models, with and without concatenation, on Texas A&M HPRC’s Grace cluster for two days with two A100 GPUs, without mixed precision.


The FID scores below were computed using pytorch-fid.
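For reference, pytorch-fid is typically invoked as a module on two image directories (real samples vs. generated samples). The small helper below just assembles that command line; the directory names are placeholders, not the ones actually used:

```python
import subprocess
import sys


def fid_command(real_dir, fake_dir, device="cuda:0"):
    """Build the pytorch-fid CLI invocation comparing two image folders."""
    return [
        sys.executable, "-m", "pytorch_fid",
        real_dir, fake_dir,
        "--device", device,
    ]


# To actually compute the score (requires pytorch-fid installed):
# subprocess.run(fid_command("celeba_hq_real/", "samples/"), check=True)
```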

With concatenation of coordinates:
No concatenation of coordinates: