Notations
- $p(z|x)$: true posterior.
- $q_\phi(z|x)$: approximate posterior with parameters $\phi$.
- $x_0$: true data.
- $x_t,\ t \in [1,T]$: latent variables.
- $q(x_{t}|x_{t-1})$: forward process.
- $p(x_t|x_{t+1})$: reverse process.
Understanding Diffusion Models from the Perspective of VAE
This section briefly summarizes (Luo, 2022) [1] based on my own understanding, focusing in particular on the mathematical logic behind diffusion models. The original article by Calvin Luo is highly recommended! It helps a lot!
Evidence Lower Bound
What are likelihood-based generative models? What are autoregressive generative models? The transformer family belongs to autoregressive models.
For likelihood-based generative models, we want to model the true data distribution by maximizing the likelihood $p(x)$ of all observed data $x$. We can think of the data we observed as represented or generated by an associated unseen latent variable, which can be denoted by random variable $z$. There are two direct but difficult ways to compute the likelihood $p(x)$:
- $p(x) = \int p(x,z)dz$: intractable for complex models, since it involves integrating over all latent variables.
- $p(x) = \frac{p(x,z)}{p(z|x)}$: difficult, since it requires access to the ground-truth posterior $p(z|x)$.
Here I have a question: How do you know the true joint distribution $p(x,z)$?
However, we can derive a term called the Evidence Lower Bound (ELBO) using the above two equations. So what is the ELBO? As its name suggests, it is a lower bound of the evidence. But what is the evidence? The evidence is the log likelihood of the observed data. The relationship between the evidence and the ELBO can be mathematically written as:
$$
\begin{align}
\log p(x) &= \mathbb{E}_{q_\phi(z|x)}[\log \frac{p(x,z)}{q_\phi(z|x)}] + D_{KL}(q_\phi(z|x)||p(z|x)) \\
& \geq \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log \frac{p(x,z)}{q_\phi(z|x)}]}_\text{ELBO}
\end{align}
$$
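For completeness, the first equality above can be derived by multiplying and dividing by $q_\phi(z|x)$ inside an expectation of the (constant-in-$z$) evidence; the inequality then follows because the KL divergence is non-negative:
$$
\begin{align}
\log p(x) &= \mathbb{E}_{q_\phi(z|x)}[\log p(x)] \\
&= \mathbb{E}_{q_\phi(z|x)}\Big[\log \frac{p(x,z)}{p(z|x)}\Big] \\
&= \mathbb{E}_{q_\phi(z|x)}\Big[\log \frac{p(x,z)\, q_\phi(z|x)}{q_\phi(z|x)\, p(z|x)}\Big] \\
&= \mathbb{E}_{q_\phi(z|x)}\Big[\log \frac{p(x,z)}{q_\phi(z|x)}\Big] + D_{KL}(q_\phi(z|x)||p(z|x))
\end{align}
$$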
Why is the ELBO an objective we would like to maximize?
- The ELBO is indeed a lower bound of the evidence.
- Any maximization of the ELBO with respect to $\phi$ necessarily implies an equal minimization of $D_{KL}(q_\phi(z|x)||p(z|x))$, because the sum of the ELBO and the KL term is constant with respect to $\phi$ (the evidence $\log p(x)$ does not depend on $\phi$).
- As we will see below, the ELBO can be further dissected into several interpretable components, which can be computed analytically or approximated with existing estimation methods.
Variational Autoencoders
In the vanilla VAE, we directly maximize the ELBO. The ELBO can be further divided into two parts:
$$
\begin{equation}
\mathbb{E}_{q_\phi(z|x)}[\log \frac{p(x,z)}{q_\phi(z|x)}] = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) || p(z))
\end{equation}
$$
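This split follows from factoring the joint distribution as $p(x,z) = p_\theta(x|z)\,p(z)$ and splitting the logarithm:
$$
\begin{align}
\mathbb{E}_{q_\phi(z|x)}\Big[\log \frac{p(x,z)}{q_\phi(z|x)}\Big] &= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \mathbb{E}_{q_\phi(z|x)}\Big[\log \frac{p(z)}{q_\phi(z|x)}\Big] \\
&= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) || p(z))
\end{align}
$$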
Here $q_\phi(z|x)$ can be regarded as the encoder with parameters $\phi$ and $p_\theta(x|z)$ can be regarded as the decoder with parameters $\theta$. So there are two kinds of parameters to be optimized:
- Encoder parameters: $\phi$
- Decoder parameters: $\theta$
The name VAE breaks down as follows:
- Variational: optimize for the best $q_\phi(z|x)$ among a family of potential posterior distributions parameterized by $\phi$.
- AutoEncoder: reminiscent of a traditional autoencoder model.
How does the VAE optimize the ELBO jointly over the parameters $\phi$ and $\theta$? To my understanding, the VAE relies on three techniques:
- Gaussian assumptions.
- A Monte Carlo estimate.
- The reparameterization trick.
The encoder $q_\phi(z|x)$ and the prior $p(z)$ are commonly chosen to be Gaussian, which gives the KL divergence term of the ELBO a closed-form expression, i.e.
$$
\begin{align}
q_\phi(z|x) &= \mathcal{N}(z; \mu_\phi(x), \sigma_\phi^2(x)I) \\
p(z) &= \mathcal{N}(z; 0, I) \\
D_{KL}(q_\phi(z|x) || p(z)) & \text{ can be computed analytically}.
\end{align}
$$
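For reference, the closed form in this case is $D_{KL}(q_\phi(z|x) || p(z)) = \frac{1}{2}\sum_i\big(\mu_{\phi,i}^2(x) + \sigma_{\phi,i}^2(x) - 1 - \log\sigma_{\phi,i}^2(x)\big)$. The following NumPy sketch (mine, not from the article; the encoder outputs are arbitrary stand-in values) checks the closed form against a Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
mu = rng.normal(size=d)            # stand-in for the encoder mean mu_phi(x)
sigma = rng.uniform(0.5, 1.5, d)   # stand-in for the encoder std sigma_phi(x)

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
kl_closed = 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

# Monte Carlo estimate: E_{z ~ q}[log q(z|x) - log p(z)]
z = mu + sigma * rng.normal(size=(100_000, d))
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma**2), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = (log_q - log_p).mean()

print(kl_closed, kl_mc)  # the two values should agree closely
```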
The first term of the ELBO can be approximated with a Monte Carlo estimate, i.e.
$$
\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \approx \frac{1}{L}\sum_{l=1}^{L}\log p_{\theta}(x|z^{(l)})
$$
where $\{z^{(l)}\}_{l=1}^{L}$ are sampled from $q_\phi(z|x)$ for every observation $x$ in the dataset, and $L$ is often set to $1$ in practice.
The reparameterization trick writes a random variable as a deterministic function of an auxiliary noise variable; this isolates the randomness in $\epsilon$, so the sampled $z$ can be optimized with respect to $\phi$ through gradient descent, i.e.
$$
z = \mu_{\phi}(x) + \sigma_{\phi}(x)\odot \epsilon \quad \text{with } \epsilon \sim \mathcal{N}(\epsilon;0, \textbf{I})
$$
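The following PyTorch sketch (mine, not from the article; the encoder outputs are hard-coded stand-ins rather than outputs of a real network) shows the point of the trick: after reparameterization, the gradient of a downstream loss flows back to $\mu_\phi$ and $\sigma_\phi$ through the sampled $z$:

```python
import torch

# Stand-ins for the encoder outputs mu_phi(x) and log sigma_phi(x) for one observation x
mu = torch.tensor([0.3, -1.2], requires_grad=True)
log_sigma = torch.tensor([-0.5, 0.1], requires_grad=True)

# Reparameterization: z is a deterministic function of (mu, sigma) and external noise eps
eps = torch.randn(2)
z = mu + torch.exp(log_sigma) * eps

# Any downstream loss (here a toy stand-in for -log p_theta(x|z)) is differentiable w.r.t. phi
loss = (z ** 2).sum()
loss.backward()
print(mu.grad, log_sigma.grad)  # gradients flow through the sampling step
```

If $z$ were instead sampled directly from $q_\phi(z|x)$, the sampling operation would block backpropagation, and the gradient with respect to $\phi$ would have to be estimated by other means (e.g. score-function estimators).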
The objective function of the VAE is the ELBO, which contains the reconstruction term $\log p_\theta(x|z)$. Which kind of loss is commonly used in practice for this reconstruction term? MSE? Some related reading:
- The real reason you use MSE and cross-entropy loss functions
- Modern Latent Variable Models and Variational Inference
- Why don’t we use MSE as a reconstruction loss for VAE ?
A Hierarchical VAE (HVAE) is a generalization of a VAE that extends to multiple hierarchies of latent variables. A special case of the HVAE is the Markovian HVAE, in which the generative process is a Markov chain.
Variational Diffusion Models
A variational diffusion model can be regarded as a special case of a Markovian HVAE with three restrictions:
- The latent dimension is exactly equal to the data dimension.
- The encoder at each timestep is pre-defined as a linear Gaussian model.
- The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at the final timestep $T$ is a standard Gaussian.
The variational diffusion model can be optimized by maximizing the ELBO, which can be derived as:
$$
\begin{align}
\log p(x) & \geq \mathbb{E}_{q(x_{1:T}|x_0)}[\log \frac{p(x_{0:T})}{q(x_{1:T}|x_0)}] \\
& = \mathbb{E}_{q(x_1|x_0)}[\log p_\theta(x_0|x_1)] - \mathbb{E}_{q(x_{T-1}|x_0)}[D_{KL}(q(x_T|x_{T-1}) || p(x_T))] \\
& \quad - \sum_{t=1}^{T-1}\mathbb{E}_{q(x_{t-1},x_{t+1}|x_0)}[D_{KL}(q(x_t|x_{t-1}) || p_\theta(x_t|x_{t+1}))]
\end{align}
$$
Currently, I don’t know how to derive the third term of the above equation, i.e.
$$ q(x_{t-1}, x_t, x_{t+1}|x_0) \overset{\text{?}}{=} q(x_{t-1},x_{t+1}|x_0) q(x_t|x_{t-1}) $$
As pointed out by Calvin Luo, optimizing the ELBO using the terms above is suboptimal, because the third term is an expectation over two random variables for every timestep, and the variance of its Monte Carlo estimate can be higher than that of a term involving only one random variable per timestep.
Using the Markov property and Bayes' rule, we have
$$
\begin{align}
q(x_t|x_{t-1}) &= q(x_t|x_{t-1}, x_0) \\
&= \frac{q(x_{t-1}|x_t,x_0) q(x_t|x_0)}{q(x_{t-1}|x_0)}
\end{align}
$$
Armed with the above equation, the ELBO can be re-derived as:
$$
\begin{align}
\log p(x) & \geq \mathbb{E}_{q(x_{1:T}|x_0)}[\log \frac{p(x_{0:T})}{q(x_{1:T}|x_0)}] \\
& = \underbrace{\mathbb{E}_{q(x_1|x_0)}[\log p_\theta(x_0|x_1)]}_{\text{reconstruction term}} - \underbrace{D_{KL}(q(x_T|x_0) || p(x_T))}_{\text{prior matching term}} - \sum_{t=2}^{T}\underbrace{\mathbb{E}_{q(x_t|x_0)}[D_{KL}(q(x_{t-1}|x_t,x_0) || p_\theta(x_{t-1}|x_t))]}_{\text{denoising matching term}}
\end{align}
$$
Now, to optimize the ELBO, the main difficulty comes from the denoising matching term. However, under the Gaussian assumptions of the variational diffusion model, a nice property emerges: $q(x_{t-1}|x_t, x_0)$ has a closed form, and that form is Gaussian! Let's explain how this nice property arises.
Step 1. Rewrite using Bayes rule and Markov property.
$$
q(x_{t-1}|x_t, x_0) = \frac{q(x_t|x_{t-1}, x_0) q(x_{t-1}|x_0)}{q(x_t|x_0)} = \frac{q(x_t|x_{t-1}) q(x_{t-1}|x_0)}{q(x_t|x_0)}
$$
Step 2. $q(x_t|x_{t-1})$ is Gaussian.
$$
q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t} x_{t-1}, (1-\alpha_t)I)
$$
Step 3. $q(x_t|x_0)$ can be recursively derived through repeated applications of the reparameterization trick. And it is also Gaussian!
$$
q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t)I)
$$
where $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$.
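To make "repeated applications of the reparameterization trick" concrete: writing out two consecutive forward steps and merging the two independent Gaussian noise terms (their variances add: $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1}$) gives
$$
\begin{align}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} \\
&= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\bar{\epsilon}_{t-2} \\
&= \dots = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
\end{align}
$$
with $\epsilon_{t-1}, \bar{\epsilon}_{t-2}, \epsilon \sim \mathcal{N}(0, I)$. The following NumPy sketch (mine, using a hypothetical linear schedule for $\beta_t = 1-\alpha_t$) checks numerically that running the forward chain step by step matches sampling directly from $q(x_t|x_0)$:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # hypothetical noise schedule beta_t = 1 - alpha_t
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{i <= t} alpha_i

x0 = rng.normal(size=(50_000, 2))    # toy "data"
t = 300                              # an arbitrary timestep (1-indexed)

# Sequential forward process: apply q(x_t | x_{t-1}) step by step
x_seq = x0.copy()
for i in range(t):
    x_seq = np.sqrt(alphas[i]) * x_seq + np.sqrt(1.0 - alphas[i]) * rng.normal(size=x_seq.shape)

# One-shot sampling from the closed form q(x_t | x_0)
x_direct = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * rng.normal(size=x0.shape)

# Both should have the same correlation with x0, namely sqrt(alpha_bar_t)
print(np.mean(x_seq * x0), np.mean(x_direct * x0), np.sqrt(alpha_bars[t - 1]))
```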
Step 4. Putting it all together, we can calculate the Gaussian form of $q(x_{t-1}|x_t, x_0)$.
$$
\begin{align}
q(x_{t-1}|x_t, x_0) & \propto \mathcal{N}(x_{t-1}; \underbrace{\frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1}) x_t + \sqrt{\bar{\alpha}_{t-1}} (1-\alpha_t) x_0}{1-\bar{\alpha}_t}}_{\mu_q(x_t,x_0)}, \underbrace{\frac{(1-\alpha_t) (1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}I}_{\sigma_q^2(t)I}) \\
& = \mathcal{N}(x_{t-1}; \mu_q(x_t, x_0), \sigma_q^2(t)I)
\end{align}
$$
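Where do $\mu_q$ and $\sigma_q^2(t)$ come from? A sketch of the completion-of-squares step (not spelled out above): viewed as functions of $x_{t-1}$, the two factors in the numerator of Step 1 are Gaussian in $x_{t-1}$ with precisions $\frac{\alpha_t}{1-\alpha_t}$ and $\frac{1}{1-\bar{\alpha}_{t-1}}$ and means $\frac{x_t}{\sqrt{\alpha_t}}$ and $\sqrt{\bar{\alpha}_{t-1}}\,x_0$, respectively. Multiplying two Gaussians in the same variable adds their precisions and their precision-weighted means:
$$
\begin{align}
\frac{1}{\sigma_q^2(t)} &= \frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}} = \frac{1-\bar{\alpha}_t}{(1-\alpha_t)(1-\bar{\alpha}_{t-1})} \\
\mu_q(x_t, x_0) &= \sigma_q^2(t)\Big(\frac{\sqrt{\alpha_t}}{1-\alpha_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\,x_0\Big)
\end{align}
$$
which simplifies to exactly the mean and variance shown above.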
Now we have derived the form of $q(x_{t-1}|x_t, x_0)$. To calculate the KL divergence between $q(x_{t-1}|x_t, x_0)$ and $p_\theta(x_{t-1}|x_t)$, we need to model $p_\theta(x_{t-1}|x_t)$ first. But how? In the variational diffusion model, $p_\theta(x_{t-1}|x_t)$ is modeled as:
- a Gaussian,
- with mean $\mu_\theta(x_t, t)$ parameterized by $\theta$ and set to a form analogous to $\mu_q(x_t, x_0)$, with $x_0$ replaced by a prediction $\hat{x}_\theta(x_t, t)$:
$\mu_\theta(x_t, t) = \frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1}) x_t + \sqrt{\bar{\alpha}_{t-1}} (1-\alpha_t) \hat{x}_\theta(x_t, t)}{1-\bar{\alpha}_t}$
- and with the same variance as $q(x_{t-1}|x_t, x_0)$, i.e. $\sigma_q^2(t)I$.
$$ p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_q^2(t)I) $$
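Although the focus here is the training objective, this parameterization also tells us how to generate data once $\hat{x}_\theta$ is trained: sample $x_T \sim p(x_T)$ and repeatedly draw $x_{t-1} \sim p_\theta(x_{t-1}|x_t)$. A minimal PyTorch sketch of this (mine, not from the article; `model(x_t, t)` is a hypothetical trained network standing in for $\hat{x}_\theta(x_t, t)$):

```python
import torch

@torch.no_grad()
def ancestral_sample(model, shape, betas):
    """Sample x_0 by iterating x_{t-1} ~ p_theta(x_{t-1} | x_t) from t = T down to 1."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    T = betas.shape[0]
    x = torch.randn(shape)                               # x_T ~ p(x_T) = N(0, I)
    for t in range(T, 0, -1):                            # t = T, ..., 1 (1-indexed)
        a_t, ab_t = alphas[t - 1], alpha_bars[t - 1]
        ab_prev = alpha_bars[t - 2] if t > 1 else torch.tensor(1.0)
        x0_hat = model(x, t)                             # \hat{x}_theta(x_t, t)
        # mu_theta(x_t, t) from the parameterization above
        mu = (torch.sqrt(a_t) * (1 - ab_prev) * x
              + torch.sqrt(ab_prev) * (1 - a_t) * x0_hat) / (1 - ab_t)
        if t > 1:
            sigma = torch.sqrt((1 - a_t) * (1 - ab_prev) / (1 - ab_t))  # sigma_q(t)
            x = mu + sigma * torch.randn_like(x)
        else:
            x = mu                                       # last step: return the mean
    return x
```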
Back to the objective: since both $q(x_{t-1}|x_t, x_0)$ and $p_\theta(x_{t-1}|x_t)$ are Gaussians with the same covariance $\sigma_q^2(t)I$, the KL divergence between them reduces to a scaled squared distance between their means:
$$
\begin{align}
D_{KL}(q(x_{t-1}|x_t,x_0) || p_\theta(x_{t-1}|x_t)) &= \frac{1}{2\sigma_q^2(t)}[||\mu_\theta(x_t, t) - \mu_q(x_t, x_0)||_2^2] \\
& = \frac{1}{2\sigma_q^2(t)} \frac{\bar{\alpha}_{t-1}(1-\alpha_t)^2}{(1-\bar{\alpha}_t)^2} [||\hat{x}_\theta(x_t, t) - x_0||_2^2]
\end{align}
$$
In practice, $\hat{x}_\theta(x_t, t)$ is parameterized by a neural network.
In summary, we have shown that optimizing a variational diffusion model boils down to learning a neural network to predict the ground truth data from a noised version of it.
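To make this summary concrete, here is a minimal PyTorch sketch (mine, not from the article) of a stochastic estimate of the denoising matching terms, drawing a random timestep per sample, with a hypothetical network `model(x_t, t)` standing in for $\hat{x}_\theta(x_t, t)$:

```python
import torch

def vdm_loss(model, x0, betas):
    """Random-timestep estimate of the denoising matching terms (up to a constant factor T - 1).
    x0: batch of data with shape (B, ...); betas: noise schedule of shape (T,)."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    B, T = x0.shape[0], betas.shape[0]

    # Pick a random timestep t in {2, ..., T} per sample (stored 0-indexed as t - 1)
    t_idx = torch.randint(1, T, (B,))
    a_t, ab_t, ab_prev = alphas[t_idx], alpha_bars[t_idx], alpha_bars[t_idx - 1]

    # Sample x_t ~ q(x_t | x_0) in one shot via the closed form
    shape = (B,) + (1,) * (x0.dim() - 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(ab_t).view(shape) * x0 + torch.sqrt(1.0 - ab_t).view(shape) * eps

    # Per-sample weight taken from the KL expression derived above
    sigma2_q = (1.0 - a_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    weight = ab_prev * (1.0 - a_t) ** 2 / (2.0 * sigma2_q * (1.0 - ab_t) ** 2)

    x0_hat = model(x_t, t_idx + 1)                        # predict x_0 from the noised x_t
    sq_err = ((x0_hat - x0) ** 2).flatten(1).sum(dim=1)   # ||x_hat_theta - x_0||^2 per sample
    return (weight * sq_err).mean()
```

In practice (e.g. in DDPM-style training), the per-timestep weight is often dropped and a simple unweighted squared error is minimized instead.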
Awesome Materials
For further resources and deeper insights, check out the following (non-exhaustive) list of blogs:
- Jianlin Su, Diffusion models (Chinese, 中文)
- Calvin Luo, Understanding Diffusion Models: A Unified Perspective
- Yang Song, Generative Modeling by Estimating Gradients of the Data Distribution
- Hugging Face, The Annotated Diffusion Model
- Lilian Weng, What are Diffusion Models?
References
[1] Luo, Calvin. "Understanding Diffusion Models: A Unified Perspective." arXiv preprint arXiv:2208.11970 (2022).