VAE
CS492: Diffusion Models and Their Applications (Fall 2024)
1. Sampling
If the probability density function (PDF) of the distribution is known, we can sample from it directly.
First, define the cumulative distribution function (CDF) $F_X(x)$ from the PDF $p(x)$:
$$F_X(x)=Pr(X\le x)=\int^x_{-\infty} p(t)dt$$
Then pick uniformly sampled $u\sim \mathcal{U}(0,1)$.
Take $x=F^{-1}_X(u)$.
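As a concrete illustration, here is a minimal sketch of this inverse-CDF (inverse transform) procedure for an exponential distribution, whose inverse CDF is known in closed form (NumPy assumed; the rate `lam` is an illustrative choice):

```python
import numpy as np

# Inverse transform sampling for an exponential distribution with rate lam:
# PDF  p(x) = lam * exp(-lam * x),  x >= 0
# CDF  F(x) = 1 - exp(-lam * x)  =>  F^{-1}(u) = -ln(1 - u) / lam
lam = 2.0
u = np.random.uniform(0.0, 1.0, size=10_000)   # u ~ U(0, 1)
x = -np.log(1.0 - u) / lam                     # x = F^{-1}(u)

print(x.mean())  # should be close to 1 / lam = 0.5
```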
If the inverse of the CDF cannot be computed, we use the acceptance-rejection method or MCMC instead.
2. Neural Network
A neural network maps the simple (e.g., Gaussian) latent distribution $p(\mathbf{z})$ to the data distribution $p(\mathbf{x})$.
3. Generative Adversarial Network (GAN)
Train the Discriminator (to classify real vs. fake) for defense.
Train the Generator (to create realistic fakes) for attack.
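For reference, a minimal PyTorch-style sketch of one adversarial step; the generator `G`, discriminator `D`, optimizers, and `latent_dim` are assumed placeholders, not part of the lecture:

```python
import torch
import torch.nn.functional as F

# One adversarial step, assuming G: z -> fake image, D: image -> real/fake logit.
def gan_step(G, D, real, opt_D, opt_G, latent_dim=64):
    z = torch.randn(real.size(0), latent_dim)

    # Discriminator ("defense"): classify real as 1, fake as 0.
    d_real = D(real)
    d_fake = D(G(z).detach())
    loss_D = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator ("attack"): make fakes that the discriminator classifies as real.
    d_fake = D(G(z))
    loss_G = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```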
4. Variational Autoencoder (VAE)
Use the conditional distribution $p(\mathbf{x}|\mathbf{z})$.
Consider decoder (generator) as predicting the mean of the conditional distribution $p(\mathbf{x}|\mathbf{z})$: $$p(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}; D(\mathbf{z}),\sigma^2\mathbf{I})$$ (Fixed variance)
4.1 Marginal Distribution
The marginal distribution of a subset of a set of random variables is the probability distribution of the variables contained in the subset. $$p(\mathbf{x})=\int p(\mathbf{x},\mathbf{z})d\mathbf{z}$$
4.2 Expected Value
The expected value is the arithmetic mean of the possible values a random variable can take, weighted by the probability of those outcomes. $$\mathbb{E}_{p(\mathbf{x})}[\mathbf{x}]=\int \mathbf{x}\cdot p(\mathbf{x})d\mathbf{x}$$
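In practice such expectations are usually estimated by a Monte Carlo average over samples, which is exactly how the ELBO will be approximated later. A tiny NumPy sketch with an illustrative Gaussian:

```python
import numpy as np

# E[x] under p(x) = N(3, 1), estimated by averaging samples drawn from p.
samples = np.random.normal(loc=3.0, scale=1.0, size=100_000)
print(samples.mean())  # close to the true expected value, 3.0
```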
4.3. Bayes' Rule
Bayes' rule is a mathematical formula used to determine the conditional probability of events. $$p(\mathbf{z}|\mathbf{x})=\frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}$$
$p(\mathbf{z}|\mathbf{x})$: Posterior
$p(\mathbf{x}|\mathbf{z})$: Likelihood
$p(\mathbf{z})$: Prior
$p(\mathbf{x})$: Marginal
4.4 Kullback-Leibler (KL) Divergence
Kullback-Leibler (KL) divergence is a measure of how one probability distribution $p$ is different from a reference probability distribution $q$: $$D_{KL}(p\ ||\ q)=\int p(\mathbf{x})\ln \left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right) d\mathbf{x}=\mathbb{E}_{p(\mathbf{x})}\left[ \ln \left( \frac{p(\mathbf{x})}{q(\mathbf{x})}\right)\right]$$
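As a concrete example, the KL divergence between two univariate Gaussians has a closed form; the short sketch below (NumPy assumed, parameter values illustrative) also checks it against a Monte Carlo estimate of $\mathbb{E}_{p}[\ln(p/q)]$:

```python
import numpy as np

# KL( N(mu1, s1^2) || N(mu2, s2^2) ) in closed form.
def kl_gauss(mu1, s1, mu2, s2):
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
analytic = kl_gauss(mu1, s1, mu2, s2)

# Monte Carlo estimate: E_p[ ln p(x) - ln q(x) ] with x ~ p.
x = np.random.normal(mu1, s1, size=200_000)
log_p = -0.5 * ((x - mu1) / s1)**2 - np.log(s1 * np.sqrt(2 * np.pi))
log_q = -0.5 * ((x - mu2) / s2)**2 - np.log(s2 * np.sqrt(2 * np.pi))
print(analytic, (log_p - log_q).mean())  # the two values should roughly agree
```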
4.5. Jensen's Inequality
When $\mathbf{x}$ is a random variable and $f$ is a convex function, then $$f(\mathbb{E}_{p(\mathbf{x})}[\mathbf{x}])\le \mathbb{E}_{p(\mathbf{x})}[f(\mathbf{x})]$$
5. VAE Algorithm
For all given real images $\mathbf{x}$, we want to maximize the marginal probability: $$p(\mathbf{x})=\int p(\mathbf{x},\mathbf{z})d\mathbf{z}=\int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z}$$
We want to maximize $p(\mathbf{x})$.
However, the integral is hard to compute. (MCMC can approximate it, but it is not tractable in practice.)
Another way is $$p(\mathbf{x})=\frac{p(\mathbf{x},\mathbf{z})}{p(\mathbf{z}|\mathbf{x})}=\frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})}$$
We cannot directly maximize $p(\mathbf{x})=\frac{p(\mathbf{x},\mathbf{z})}{p(\mathbf{z}|\mathbf{x})}$ either, since the true posterior $p(\mathbf{z}|\mathbf{x})$ is unknown.
5.1 Evidence Lower Bound (ELBO)
Let's think about a lower bound of $\ln p(\mathbf{x})$: $$\ln p(\mathbf{x})\ge \mathbb{E}_{q_\phi (\mathbf{z}|\mathbf{x})}\left[\ln\frac{p(\mathbf{x},\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]$$ We can maximize $p(\mathbf{x})$ by maximizing this lower bound, the evidence lower bound (ELBO).
$q_\phi(\mathbf{z}|\mathbf{x})$ is a variational (proxy) distribution with parameters $\phi$. (Different from the true posterior $p(\mathbf{z}|\mathbf{x})$.)
Proof)
$$\ln p(\mathbf{x})=\ln \int p(\mathbf{x},\mathbf{z})d\mathbf{z}=\ln \int p(\mathbf{x},\mathbf{z})\frac{q_\phi (\mathbf{z}|\mathbf{x})}{q_\phi (\mathbf{z}|\mathbf{x})}d\mathbf{z} \\ =\ln\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\frac{p(\mathbf{x},\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]\ge \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \ln \frac{p(\mathbf{x},\mathbf{z})}{q_\phi (\mathbf{z}|\mathbf{x})}\right]$$ (The last inequality is Jensen's inequality: $\ln$ is concave, so $\ln\mathbb{E}[\cdot]\ge\mathbb{E}[\ln(\cdot)]$.)
We can decompose ELBO. $$\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \ln \frac{p(\mathbf{x},\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]=\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \ln \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]\\ =\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\ln p(\mathbf{x}|\mathbf{z})]-\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \ln \frac{q_\phi(\mathbf{z}|\mathbf{x})}{p(\mathbf{z})}\right]\\ =\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\ln p(\mathbf{x}|\mathbf{z})]-D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x})||p(\mathbf{z})\right)$$
The first term is the "reconstruction term", and the second term is the "prior matching term".
5.2 Architecture
Data distribution: $p(\mathbf{x})$
Encoder: $q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}(\mathbf{z};\mathbf{\mu}_\phi(\mathbf{x}), \sigma^2_\phi(\mathbf{x})\mathbf{I})$
Latent distribution: $p(\mathbf{z})=\mathcal{N}(\mathbf{z};\mathbf{0},\mathbf{I})$
Decoder: $p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x};D_\theta(\mathbf{z}),\sigma^2\mathbf{I})$
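Putting the pieces together, a minimal PyTorch-style sketch of this architecture; the MLP layers and the `data_dim`/`latent_dim`/`hidden` sizes are illustrative assumptions, not from the lecture:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, data_dim=784, latent_dim=16, hidden=256):
        super().__init__()
        # Encoder q_phi(z|x): predicts mu_phi(x) and log sigma^2_phi(x).
        self.enc = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        # Decoder p_theta(x|z): predicts the mean D_theta(z) (fixed variance).
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, data_dim))

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    def decode(self, z):
        return self.dec(z)
```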
5.3 Training
Maximize ELBO
$$\text{argmax}_{\theta,\phi}\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\ln p(\mathbf{x}|\mathbf{z})]-D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x})||p(\mathbf{z})\right)$$
Approximate it using a Monte Carlo estimate: $$\text{argmax}_{\theta,\phi}\frac{1}{N}\sum^N_{i=1}\ln p_\theta(\mathbf{x}|\mathbf{z}^{(i)})-D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x})||p(\mathbf{z})\right)$$ where $\mathbf{z}^{(i)}\sim q_\phi(\mathbf{z}|\mathbf{x})$ for given $\mathbf{x}$.
Set $\mathbf{z}^{(i)}=\mathbf{\mu}_\phi(\mathbf{x})+\sigma_\phi(\mathbf{x})\mathbf{\epsilon}$ where $\mathbf{\epsilon}\sim \mathcal{N}(\mathbf{0},\mathbf{I})$ (the reparameterization trick), so gradients can flow through the sampling step into $\phi$.
Recall $p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x};D_\theta(\mathbf{z}),\sigma^2\mathbf{I})$. $$\ln p_\theta(\mathbf{x}|\mathbf{z}^{(i)})=\ln \left( \frac{1}{\sqrt{(2\pi\sigma^2)^d}}\exp \left( -\frac{||\mathbf{x}-D_\theta(\mathbf{z})||^2}{2\sigma^2}\right)\right)\\ =-\frac{1}{2\sigma^2}||\mathbf{x}-D_\theta(\mathbf{z})||^2-\ln\sqrt{(2\pi\sigma^2)^d}$$
The second term is constant (and can be ignored); the first term is the "reconstruction" error, i.e., a scaled squared error between $\mathbf{x}$ and $D_\theta(\mathbf{z})$.
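A hedged sketch of the resulting loss (the negative ELBO), continuing the assumed `VAE` sketch above: the reconstruction term becomes a squared error (constants dropped), the prior matching term uses the closed-form KL between a diagonal Gaussian and $\mathcal{N}(\mathbf{0},\mathbf{I})$, and $\mathbf{z}$ is drawn with the reparameterization trick.

```python
import torch

def negative_elbo(model, x):
    mu, logvar = model.encode(x)

    # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I).
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps

    # Reconstruction term: -ln p_theta(x|z) ∝ ||x - D_theta(z)||^2 (constants dropped).
    recon = ((x - model.decode(z)) ** 2).sum(dim=1)

    # Prior matching term: KL( N(mu, sigma^2 I) || N(0, I) ), closed form.
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1)

    return (recon + kl).mean()
```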
5.4 Training Process
A. Feed a data point $\mathbf{x}$ to the encoder to predict $\mathbf{\mu}_\phi(\mathbf{x})$ and $\sigma^2_\phi(\mathbf{x})$.
B. Sample a latent variable $\mathbf{z}$ from $q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}\left( \mathbf{z};\mathbf{\mu}_\phi(\mathbf{x}),\sigma^2_\phi (\mathbf{x})\mathbf{I}\right)$.
C. Feed $\mathbf{z}$ to the decoder to predict $\hat{\mathbf{x}}=D_\theta(\mathbf{z})$.
D. Compute gradients of the negative ELBO and update $\theta$ and $\phi$ via gradient descent (see the sketch below).
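Steps A–D correspond to a standard training loop; a minimal sketch, continuing the assumed `VAE` and `negative_elbo` sketches with a hypothetical `data_loader` of flattened images:

```python
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for x in data_loader:                 # x: batch of shape (B, data_dim)
    loss = negative_elbo(model, x)    # A-C: encode, sample z, decode, compute -ELBO
    opt.zero_grad()
    loss.backward()                   # D: gradients of the negative ELBO
    opt.step()
```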
5.5 Generating Process
A. Sample a latent variable $\mathbf{z}$ from $p(\mathbf{z})=\mathcal{N}(\mathbf{z};\mathbf{0},\mathbf{I})$.
B. Feed $\mathbf{z}$ to the decoder to predict $\hat{\mathbf{x}}=D_\theta(\mathbf{z})$.
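Generation only needs the decoder; a minimal sketch under the same assumptions (16 samples, `latent_dim=16` from the sketch above):

```python
with torch.no_grad():
    z = torch.randn(16, 16)       # A: z ~ N(0, I)
    x_hat = model.decode(z)       # B: x_hat = D_theta(z)
```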
6. Limitations
The true posterior $p(\mathbf{z}|\mathbf{x})$ is approximated by a Gaussian $q_\phi(\mathbf{z}|\mathbf{x})$, which limits how expressive the model can be.