DDPM
CS492: Diffusion Models and Their Applications (Fall 2024)
Denoising Diffusion Probabilistic Models
1. Markovian Hierarchical VAE
Joint distribution: $$p_\theta(\mathbf{x}_{0:T})=p_\theta(\mathbf{x}_T)\prod^T_{t=1}p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$$
Variational posterior: $$q_\phi(\mathbf{x}_{1:T}|\mathbf{x}_0)=\prod^T_{t=1}q_\phi(\mathbf{x}_t|\mathbf{x}_{t-1})$$
$$\ln p_\theta(\mathbf{x}_0)=\ln \int p_\theta(\mathbf{x}_{0:T})d\mathbf{x}_{1:T}\\ =\ln \int p_\theta (\mathbf{x}_{0:T})\frac{q_\phi(\mathbf{x}_{1:T}|\mathbf{x}_0)}{q_\phi(\mathbf{x}_{1:T}|\mathbf{x}_0)}d\mathbf{x}_{1:T}\\ =\ln \mathbb{E}_{q_\phi(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[ \frac{p_\theta(\mathbf{x}_{0:T})}{q_\phi(\mathbf{x}_{1:T}| \mathbf{x}_0)}\right]\\ \ge \mathbb{E}_{q_\phi(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\ln\frac{p_\theta(\mathbf{x}_{0:T})}{q_\phi(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right]$$
The last step is Jensen's inequality; the right-hand side is the ELBO.
2. Diffusion Model
A diffusion model is a special case of the Markovian hierarchical VAE in which:
- the latent dimension is the same as the data dimension
- the variational posteriors $q_\phi(\mathbf{x}_{t+1}|\mathbf{x}_t)$ are not learned but predefined: $$q_\phi(\mathbf{x}_{t+1}|\mathbf{x}_t)\rightarrow q(\mathbf{x}_{t+1}|\mathbf{x}_t)$$
3. Forward process (predefined)
$$q(\mathbf{x}_{1:T}|\mathbf{x}_0)=\prod^T_{t=1}q(\mathbf{x}_t|\mathbf{x}_{t-1})$$
In the forward process, the transition distribution $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ is specifically predefined as: $$q(\mathbf{x}_t|\mathbf{x}_{t-1})=\mathcal{N}\left( \mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t I \right)$$ where $\{ \beta_t \in (0,1)\}^T_{t=1}$ and $\beta_1\le \beta_2\le \cdots \le \beta_T$.
This corresponds to iteratively adding a small amount of Gaussian noise at each step.
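A minimal sketch of one forward-process step. The linear schedule ($T=1000$, $\beta_1=10^{-4}$ to $\beta_T=0.02$) is the common DDPM choice and is an assumption here, not something fixed by these notes.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)    # beta_1 <= beta_2 <= ... <= beta_T

def forward_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    beta_t = betas[t - 1]                # t is 1-indexed, as in the notes
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise
```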
4. Reverse process (learned)
$$p_\theta(\mathbf{x}_{0:T})=p_\theta(\mathbf{x}_T)\prod^T_{t=1}p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$$
To maximize $\ln p_\theta(\mathbf{x}_0)$, we maximize the ELBO.
5. ELBO
$$-\ln p_\theta(\mathbf{x}_0)=-\ln \int p_\theta(\mathbf{x}_{0:T})d\mathbf{x}_{1:T}\le \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[ -\ln \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \right]\\ =\cdots =$$
$-\mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)}\left[\ln p_\theta(\mathbf{x}_0|\mathbf{x}_1)\right]$: Reconstruction term $\mathcal{L}_0$
$+ D_{KL}\left( q(\mathbf{x}_T|\mathbf{x}_0) || p(\mathbf{x}_T)\right) $: Prior matching term $\mathcal{L}_T$
$+\sum^{T}_{t=2} \mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\left[D_{KL}\left( q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)||p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \right)\right]$: Denoising matching term $\mathcal{L}_{t-1}$
6. Reconstruction Term $\mathcal{L}_0$
The same loss term as in a VAE, but applied only to the final reverse step.
But why is this term ignored?
Because unlike a VAE, the model does not reconstruct the data in a single step; it reconstructs it progressively through noise prediction.
7. Prior Matching Term $\mathcal{L}_T$
Identical to the KL divergence term in a VAE.
$q(\mathbf{x}_T|\mathbf{x}_0)$ and $p(\mathbf{x}_T)$ are predefined.
Since every forward transition $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ is Gaussian, $q(\mathbf{x}_T|\mathbf{x}_0)$ is also Gaussian (see 7.A).
As $T\rightarrow \infty$, $q(\mathbf{x}_T|\mathbf{x}_0)$ converges to the standard normal distribution $\mathcal{N}(\mathbf{x};\mathbf{0},\mathbf{I})=p(\mathbf{x}_T)$.
Hence $\mathcal{L}_T\rightarrow 0$, and since this term contains no learnable parameters it can be ignored during training.
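A quick numeric sanity check of this claim, assuming the linear schedule from the earlier sketch: with $T=1000$, $\bar{\alpha}_T$ is on the order of $10^{-5}$, so $q(\mathbf{x}_T|\mathbf{x}_0)$ is essentially $\mathcal{N}(\mathbf{0},\mathbf{I})$ and $\mathcal{L}_T$ is negligible.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # assumed linear schedule
alpha_bar_T = torch.prod(1.0 - betas)       # \bar{alpha}_T, roughly 4e-5 here

x0 = torch.randn(3, 32, 32)                 # any data point; random stand-in
mu = torch.sqrt(alpha_bar_T) * x0           # mean of q(x_T | x_0)
var = 1.0 - alpha_bar_T                     # per-dimension variance of q(x_T | x_0)
d = x0.numel()

# Closed-form KL( N(mu, var*I) || N(0, I) ): a tiny fraction of a nat in total
kl = 0.5 * (d * var + mu.pow(2).sum() - d - d * torch.log(var))
print(alpha_bar_T.item(), kl.item())
```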
7.A $q(\mathbf{x}_T|\mathbf{x}_0)$ is Gaussian
Let $\alpha_t=1-\beta_t$.
$$q(\mathbf{x}_1|\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_1;\sqrt{\alpha_1}\mathbf{x}_0, (1-\alpha_1)\mathbf{I})\\ q(\mathbf{x}_2|\mathbf{x}_1)=\mathcal{N}(\mathbf{x}_2;\sqrt{\alpha_2}\mathbf{x}_1, (1-\alpha_2)\mathbf{I})$$
Use "reparamaterization trick": $$\begin{cases}\mathbf{x}_1= \sqrt{\alpha_1}\mathbf{x}_0+\sqrt{1-\alpha_1}\mathbf{\epsilon}_0 \\ \mathbf{x}_2= \sqrt{\alpha_2}\mathbf{x}_1+\sqrt{1-\alpha_2}\mathbf{\epsilon}_1 \end{cases}\quad \mathbf{\epsilon}_0,\mathbf{\epsilon}_1\sim \mathcal{N}(\mathbf{0},\mathbf{I})$$
Then $$\mathbf{x}_2=\sqrt{\alpha_2}\mathbf{x}_1+\sqrt{1-\alpha_2}\mathbf{\epsilon}_1\\ =\sqrt{\alpha_2}(\sqrt{\alpha_1}\mathbf{x}_0+\sqrt{1-\alpha_1}\mathbf{\epsilon}_0)+\sqrt{1-\alpha_2}\mathbf{\epsilon}_1\\ =\sqrt{\alpha_2\alpha_1}\mathbf{x}_0 + \sqrt{\alpha_2(1-\alpha_1)}\mathbf{\epsilon}_0+\sqrt{1-\alpha_2}\mathbf{\epsilon}_1\\ =\sqrt{\alpha_2\alpha_1}\mathbf{x}_0+\sqrt{1-\alpha_2\alpha_1}\bar{\mathbf{\epsilon}}_0,$$ where $\bar{\mathbf{\epsilon}}_0\sim \mathcal{N}(\mathbf{0},\mathbf{I})$, since the sum of two independent zero-mean Gaussians is Gaussian with summed variances. Finally, $$q(\mathbf{x}_2|\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_2;\sqrt{\alpha_2\alpha_1}\mathbf{x}_0, (1-\alpha_2\alpha_1)\mathbf{I})$$
Applying this recursively, $$\mathbf{x}_t=\sqrt{\alpha_t}\mathbf{x}_{t-1}+\sqrt{1-\alpha_t}\mathbf{\epsilon}_{t-1}\\ = \sqrt{\alpha_t}(\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2}+\sqrt{1-\alpha_{t-1}}\mathbf{\epsilon}_{t-2})+\sqrt{1-\alpha_t}\mathbf{\epsilon}_{t-1}\\ = \sqrt{\alpha_t\alpha_{t-1}} \mathbf{x}_{t-2}+\sqrt{\alpha_t (1-\alpha_{t-1})}\mathbf{\epsilon}_{t-2}+\sqrt{1-\alpha_t}\mathbf{\epsilon}_{t-1}\\ = \sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2}+\sqrt{(1-\alpha_t\alpha_{t-1})}\bar{\mathbf{\epsilon}}_{t-2}\\ =\cdots\\ =\sqrt{\prod^t_{i=1}\alpha_i}\mathbf{x}_0+\sqrt{\left( 1-\prod^t_{i=1}\alpha_i\right)}\bar{\mathbf{\epsilon}}_0$$
In summary, if $q(\mathbf{x}_t|\mathbf{x}_{t-1})=\mathcal{N}(\sqrt{\alpha_t}\mathbf{x}_{t-1},(1-\alpha_t)\mathbf{I})$, then $$q(\mathbf{x}_t|\mathbf{x}_0)=\mathcal{N}\left( \sqrt{\bar{\alpha}_t}\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I} \right)$$ where $\bar{\alpha}_t=\prod^t_{i=1}\alpha_i$
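A minimal sketch of this closed-form "forward jump" under the same assumed linear schedule; sampling $\mathbf{x}_t$ directly from $\mathbf{x}_0$ avoids iterating the $t$ individual steps.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t = prod_{i<=t} alpha_i

def forward_jump(x0: torch.Tensor, t: int):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I) in one shot."""
    abar_t = alpha_bars[t - 1]
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(abar_t) * x0 + torch.sqrt(1.0 - abar_t) * eps
    return x_t, eps                          # eps becomes the regression target later
```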
8. Denoising Matching Term $\mathcal{L}_{t-1}$
$$+\sum^{T}_{t=2} \mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\left[D_{KL}\left( q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)||p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \right)\right]$$
The variational distribution $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ should be close to $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ for each $t$.
$$q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)=q(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_0)}{q(\mathbf{x}_t|\mathbf{x}_0)}$$
$q(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)=q(\mathbf{x}_t|\mathbf{x}_{t-1})$, since the forward process is Markovian.
We know $$q(\mathbf{x}_t|\mathbf{x}_{t-1})=\mathcal{N}\left( \sqrt{\alpha_t}\mathbf{x}_{t-1},(1-\alpha_t)\mathbf{I} \right)\\ q(\mathbf{x}_{t-1}|\mathbf{x}_0)=\mathcal{N}\left( \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0,(1-\bar{\alpha}_{t-1})\mathbf{I} \right)\\ q(\mathbf{x}_t|\mathbf{x}_0)=\mathcal{N}\left( \sqrt{\bar{\alpha}_t}\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I} \right)$$
Then, $$q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\propto \exp\left( -\frac{1}{2}\left( \frac{(\mathbf{x}_t-\sqrt{\alpha_t}\mathbf{x}_{t-1})^2}{1-\alpha_t}+\frac{(\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}}-\frac{(\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0)^2}{1-\bar{\alpha}_t} \right) \right),$$ which is a normal distribution $\mathcal{N}(\mathbf{x}_{t-1};\tilde{\mu}(\mathbf{x}_t,\mathbf{x}_0),\tilde{\sigma}^2_t\mathbf{I})$.
The mean $$\tilde{\mu}(\mathbf{x}_t,\mathbf{x}_0)=\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t+\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0$$is a function of both $\mathbf{x}_t$ and $\mathbf{x}_0$.
The covariance $$\tilde{\sigma}^2_t\mathbf{I}=\left( \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t \right)\mathbf{I}$$ is fixed by the user-defined schedule $\{\beta_t\}^T_{t=1}$.
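A small sketch computing $\tilde{\mu}(\mathbf{x}_t,\mathbf{x}_0)$ and $\tilde{\sigma}^2_t$; the schedule arrays `betas`, `alphas`, `alpha_bars` are the assumed ones built in the earlier sketches.

```python
import torch

def posterior_params(x_t, x0, t, betas, alphas, alpha_bars):
    """Return (mu_tilde, sigma_tilde^2) of q(x_{t-1} | x_t, x_0), for 1-indexed t >= 2."""
    beta_t = betas[t - 1]
    alpha_t = alphas[t - 1]
    abar_t = alpha_bars[t - 1]
    abar_prev = alpha_bars[t - 2]
    mu = (torch.sqrt(alpha_t) * (1 - abar_prev) / (1 - abar_t)) * x_t \
        + (torch.sqrt(abar_prev) * beta_t / (1 - abar_t)) * x0
    var = (1 - abar_prev) / (1 - abar_t) * beta_t
    return mu, var
```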
8.1 $\mu_\theta$ Predictor
Define the variational distribution $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ as $$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)=\mathcal{N}(\mu_\theta(\mathbf{x}_t,t),\tilde{\sigma}^2_t\mathbf{I}),$$ where $\mu_\theta(\mathbf{x}_t,t)$ is the mean predictor.
Then $$\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\left[D_{KL}\left( q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)||p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \right)\right]\\ =\frac{1}{2\tilde{\sigma}_t^2}\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}[||\mu_\theta(\mathbf{x}_t,t)-\tilde{\mu}(\mathbf{x}_t,\mathbf{x}_0)||^2]$$
8.1.A KL divergence of two Gaussians
$$p(\mathbf{x})=\mathcal{N}(\mathbf{x}; \mathbf{\mu}_p, \sigma^2\mathbf{I})\\ q(\mathbf{x})=\mathcal{N}(\mathbf{x}; \mathbf{\mu}_q,\sigma^2\mathbf{I})\\ D_{KL}(p\ ||\ q)=\frac{1}{2\sigma^2}||\mathbf{\mu}_q-\mathbf{\mu}_p||^2$$
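A quick numeric check of this identity using `torch.distributions`; the dimension and means below are arbitrary toy values.

```python
import torch
from torch.distributions import MultivariateNormal, kl_divergence

d, sigma2 = 4, 0.3                           # toy dimension and shared variance
mu_p, mu_q = torch.randn(d), torch.randn(d)
p = MultivariateNormal(mu_p, sigma2 * torch.eye(d))
q = MultivariateNormal(mu_q, sigma2 * torch.eye(d))

closed_form = (mu_q - mu_p).pow(2).sum() / (2 * sigma2)
print(kl_divergence(p, q).item(), closed_form.item())   # the two values agree
```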
8.2 $\mathbf{x}_0$ Predictor
What if we have an $\mathbf{x}_0$ predictor $\hat{\mathbf{x}}_\theta(\mathbf{x}_t,t)$ instead of the mean predictor $\mu_\theta(\mathbf{x}_t,t)$?
Note that $$\tilde{\mu}(\mathbf{x}_t,\mathbf{x}_0)=\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t+\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0$$
Setting $\mu_\theta(\mathbf{x}_t,t)=\tilde{\mu}(\mathbf{x}_t,\hat{\mathbf{x}}_\theta(\mathbf{x}_t,t))$, i.e. substituting the prediction for $\mathbf{x}_0$ in $\tilde{\mu}$, gives $$\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\left[D_{KL}\left( q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)||p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \right)\right]\\ = \frac{1}{2\tilde{\sigma}_t^2}\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}[||\mu_\theta(\mathbf{x}_t,t)-\tilde{\mu}(\mathbf{x}_t,\mathbf{x}_0)||^2]\\ =\frac{1}{2\tilde{\sigma}^2_t}\frac{\bar{\alpha}_{t-1}\beta^2_t}{(1-\bar{\alpha}_t)^2}\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}[||\hat{\mathbf{x}}_\theta(\mathbf{x}_t,t)-\mathbf{x}_0||^2]\\ =\omega_t\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}[||\hat{\mathbf{x}}_\theta(\mathbf{x}_t,t)-\mathbf{x}_0||^2] $$
$\mathbf{x}_t$ is sampled from $\mathbf{x}_0$ via the forward jump $q(\mathbf{x}_t|\mathbf{x}_0)$.
From $\mathbf{x}_t$, the network predicts the expected $\mathbf{x}_0$ that would have produced $\mathbf{x}_t$ through that forward jump.
8.3 $\mathbf{\epsilon}_t$ Predictor
What if we have an $\mathbf{\epsilon}_t$ predictor $\hat{\mathbf{\epsilon}}_\theta(\mathbf{x}_t,t)$ instead of the mean predictor $\mu_\theta(\mathbf{x}_t,t)$?
Note that substituting $\mathbf{x}_0=\frac{1}{\sqrt{\bar{\alpha}_t}}\left( \mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\mathbf{\epsilon}_t \right)$ into $\tilde{\mu}(\mathbf{x}_t,\mathbf{x}_0)$ gives $$\tilde{\mu}(\mathbf{x}_t,\mathbf{\epsilon}_t)=\frac{1}{\sqrt{\alpha_t}}\left( \mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\mathbf{\epsilon}_t \right)$$
Setting $\mu_\theta(\mathbf{x}_t,t)=\frac{1}{\sqrt{\alpha_t}}\left( \mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\hat{\mathbf{\epsilon}}_\theta(\mathbf{x}_t,t) \right)$, we get $$\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\left[D_{KL}\left( q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)|| p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \right)\right] \\ =\frac{1}{2\tilde{\sigma}_t^2}\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}[||\mu_\theta(\mathbf{x}_t,t)-\tilde{\mu}(\mathbf{x}_t,\mathbf{x}_0)||^2]\\ =\frac{1}{2\tilde{\sigma}_t^2}\frac{(1-\alpha_t)^2}{\alpha_t(1-\bar{\alpha}_t)}\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}[||\hat{\mathbf{\epsilon}}_\theta(\mathbf{x}_t,t)-\mathbf{\epsilon}_t||^2]\\ =\omega'_t\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}[||\hat{\mathbf{\epsilon}}_\theta(\mathbf{x}_t,t)-\mathbf{\epsilon}_t||^2] $$
From $\mathbf{x}_t$, predict the expected value of $\mathbf{\epsilon}_t$ that would result in sampling $\mathbf{x}_t$ from $\mathbf{x}_0$ through the forward jump.
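A small sketch of this parameterization: from a hypothetical noise predictor `eps_model`, recover the implied $\hat{\mathbf{x}}_0$ and the corresponding mean $\mu_\theta$; the schedule arrays are the assumed ones from the earlier sketches.

```python
import torch

def p_mean_from_eps(eps_model, x_t, t, alphas, alpha_bars):
    """Form mu_theta(x_t, t) from a predicted noise eps_hat (and the implied x0_hat)."""
    alpha_t = alphas[t - 1]
    abar_t = alpha_bars[t - 1]
    eps_hat = eps_model(x_t, t)              # hypothetical noise-predictor network
    x0_hat = (x_t - torch.sqrt(1 - abar_t) * eps_hat) / torch.sqrt(abar_t)
    mu = (x_t - (1 - alpha_t) / torch.sqrt(1 - abar_t) * eps_hat) / torch.sqrt(alpha_t)
    return mu, x0_hat
```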
9. Training
We choose the $\hat{\mathbf{\epsilon}}_\theta$ predictor.
Minimize $$\mathbb{E}_{\mathbf{x}_0\sim q(\mathbf{x}_0),\ t>1,\ \mathbf{x}_t\sim q(\mathbf{x}_t|\mathbf{x}_0)}[||\hat{\mathbf{\epsilon}}_\theta(\mathbf{x}_t,t)-\mathbf{\epsilon}_t||^2]$$
Repeat:
1. Take a random $\mathbf{x}_0$.
2. Sample $t\sim \mathcal{U}(\{ 1,\cdots, T\})$.
3. Sample $\mathbf{\epsilon}_t\sim \mathcal{N}(\mathbf{0},\mathbf{I})$.
4. Compute $\mathbf{x}_t=\sqrt{\bar{\alpha}_t}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\mathbf{\epsilon}_t$.
5. Take gradient descent step on $\nabla_\theta||\hat{\mathbf{\epsilon}}_\theta(\mathbf{x}_t,t)-\mathbf{\epsilon}_t||^2$.
Steps 2-4 sample $\mathbf{x}_t\sim q(\mathbf{x}_t|\mathbf{x}_0)=\mathcal{N}(\sqrt{\bar{\alpha}_t}\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I})$.
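A minimal training-step sketch of steps 1-5. The tiny MLP noise predictor and the 2-D toy data shape are placeholder assumptions (DDPM uses a U-Net on images), but the sampling of $t$, $\mathbf{\epsilon}_t$ and the loss are as above.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class EpsMLP(nn.Module):
    """Toy noise predictor for 2-D data; the timestep enters as an extra input feature."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t, t):
        t_feat = (t.float() / T).unsqueeze(-1)               # normalized timestep in [0, 1]
        return self.net(torch.cat([x_t, t_feat], dim=-1))

model = EpsMLP(dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x0: torch.Tensor) -> float:                   # x0: (batch, 2) data samples
    t = torch.randint(1, T + 1, (x0.shape[0],))               # step 2: t ~ U({1, ..., T})
    eps = torch.randn_like(x0)                                 # step 3: eps ~ N(0, I)
    abar = alpha_bars[t - 1].unsqueeze(-1)
    x_t = torch.sqrt(abar) * x0 + torch.sqrt(1 - abar) * eps   # step 4: forward jump
    loss = (model(x_t, t) - eps).pow(2).mean()                 # step 5: noise-prediction MSE
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```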
10. Generation
Proceed the reverse process with the learned reverse transitional distribution $$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)=\mathcal{N}\left(\frac{1}{\sqrt{\alpha_t}}\left( \mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\hat{\mathbf{\epsilon}}_\theta(\mathbf{x}_t,t) \right),\tilde{\sigma}_t^2\mathbf{I}\right)$$
1. Sample $\mathbf{x}_T\sim \mathcal{N}(\mathbf{0},\mathbf{I})$.
2. For $t=T,\cdots,1$, repeat:
2.1 Compute $\tilde{\mu}=\frac{1}{\sqrt{\alpha_t}}\left( \mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\hat{\mathbf{\epsilon}}_\theta(\mathbf{x}_t,t) \right)$.
2.2 Sample $\mathbf{z}_t\sim\mathcal{N}(\mathbf{0},\mathbf{I})$.
2.3 Compute $\mathbf{x}_{t-1}=\tilde{\mu}+\tilde{\sigma}_t\mathbf{z}_t$.
Steps 2.1-2.3 sample $\mathbf{x}_{t-1}\sim p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$.
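A matching sampling-loop sketch, reusing `model`, `betas`, `alpha_bars`, and `T` from the training sketch above. Following the common DDPM convention (an assumption here), no noise is added at the final step $t=1$.

```python
import torch

alphas = 1.0 - betas
# tilde_sigma_t^2 = (1 - abar_{t-1}) / (1 - abar_t) * beta_t; the t = 1 entry is set
# to beta_1 (a common choice) and is unused because no noise is added at t = 1.
sigma2 = torch.cat([betas[:1], (1 - alpha_bars[:-1]) / (1 - alpha_bars[1:]) * betas[1:]])

@torch.no_grad()
def sample(n: int) -> torch.Tensor:
    x_t = torch.randn(n, 2)                               # step 1: x_T ~ N(0, I)
    for t in range(T, 0, -1):                             # step 2: t = T, ..., 1
        eps_hat = model(x_t, torch.full((n,), t))
        mu = (x_t - (1 - alphas[t - 1]) / torch.sqrt(1 - alpha_bars[t - 1]) * eps_hat) \
            / torch.sqrt(alphas[t - 1])                   # step 2.1: posterior mean
        z = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)   # step 2.2
        x_t = mu + torch.sqrt(sigma2[t - 1]) * z          # step 2.3
    return x_t
```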