VAE

CS492: Diffusion Models and Their Applications (Fall 2024)

 

 1. Sampling

If the probability density function (PDF) of the distribution is known, we can sample from it directly.

First, define the cumulative distribution function (CDF) $F_X(x)$ from the PDF $p(t)$:

$$F_X(x)=Pr(X\le x)=\int^x_{-\infty} p(t)dt$$

Then draw a uniform sample $u\sim \mathcal{U}(0,1)$.

Take $x=F^{-1}_X(u)$; then $x$ follows the distribution with PDF $p$.
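A minimal sketch of inverse transform sampling, assuming an exponential target distribution (whose inverse CDF is known in closed form; the choice of distribution is illustrative):

```python
import numpy as np

def sample_exponential(lam, n, seed=0):
    """Inverse transform sampling for an Exponential(lam) distribution.

    CDF: F(x) = 1 - exp(-lam * x)  =>  F^{-1}(u) = -ln(1 - u) / lam.
    """
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size=n)  # u ~ U(0, 1)
    return -np.log(1.0 - u) / lam      # x = F^{-1}(u)

samples = sample_exponential(lam=2.0, n=10_000)
print(samples.mean())  # close to 1 / lam = 0.5
```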

 

If the inverse of the CDF cannot be computed, we use the acceptance-rejection method (or MCMC).
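A minimal acceptance-rejection sketch, assuming a standard normal target and a uniform proposal on $[-5, 5]$ (the proposal and the constant $M$ are illustrative choices):

```python
import numpy as np

def rejection_sample_normal(n, seed=0):
    """Acceptance-rejection sampling of N(0, 1), truncated to [-5, 5]."""
    rng = np.random.default_rng(seed)
    p = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)  # target PDF
    q = 1.0 / 10.0                                          # Uniform(-5, 5) density
    M = p(0.0) / q                                          # ensures p(x) <= M * q
    samples = []
    while len(samples) < n:
        x = rng.uniform(-5.0, 5.0)       # propose x ~ q
        u = rng.uniform(0.0, 1.0)
        if u <= p(x) / (M * q):          # accept with probability p(x) / (M q(x))
            samples.append(x)
    return np.array(samples)

print(rejection_sample_normal(10_000).std())  # close to 1.0
```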



2. Neural Network

A neural network learns a mapping from a simple (e.g., Gaussian) latent distribution $p(\mathbf{z})$ to the data distribution $p(\mathbf{x})$.

 

 

 

3. Generative Adversarial Network (GAN)

Train the discriminator (to classify real vs. fake) for defense.

Train the generator (to create real-looking fakes) for attack.


 

 

4. Variational Autoencoder (VAE)

Use the conditional distribution $p(\mathbf{x}|\mathbf{z})$.

Consider decoder (generator) as predicting the mean of the conditional distribution $p(\mathbf{x}|\mathbf{z})$: $$p(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}; D(\mathbf{z}),\sigma^2\mathbf{I})$$ (Fixed variance)



4.1 Marginal Distribution

The marginal distribution of a subset of a set of random variables is the probability distribution of the variables contained in the subset. $$p(\mathbf{x})=\int p(\mathbf{x},\mathbf{z})d\mathbf{z}$$



4.2 Expected Value

The expected value is the arithmetic mean of the possible values a random variable can take, weighted by the probability of those outcomes. $$\mathbb{E}_{p(\mathbf{x})}[\mathbf{x}]=\int \mathbf{x}\cdot p(\mathbf{x})d\mathbf{x}$$



4.3. Bayes' Rule

Bayes' rule is a mathematical formula used to determine the conditional probability of events. $$p(\mathbf{z}|\mathbf{x})=\frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}$$

$p(\mathbf{z}|\mathbf{x})$: Posterior

$p(\mathbf{x}|\mathbf{z})$: Likelihood

$p(\mathbf{z})$: Prior

$p(\mathbf{x})$: Marginal



4.4 Kullback-Leibler (KL) Divergence

Kullback-Leibler (KL) divergence is a measure of how one probability distribution $p$ is different from a reference probability distribution $q$: $$D_{KL}(p\ ||\ q)=\int p(\mathbf{x})\ln \left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right) d\mathbf{x}=\mathbb{E}_{p(\mathbf{x})}\left[ \ln \left( \frac{p(\mathbf{x})}{q(\mathbf{x})}\right)\right]$$
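As a sanity check, the KL divergence between two univariate Gaussians has a closed form; comparing it with a Monte Carlo estimate of $\mathbb{E}_{p}[\ln p - \ln q]$ (the parameter values are arbitrary, for illustration only):

```python
import numpy as np

def kl_gauss(mu_p, sig_p, mu_q, sig_q):
    """Closed-form KL(p || q) for univariate Gaussians p and q."""
    return np.log(sig_q / sig_p) + (sig_p**2 + (mu_p - mu_q)**2) / (2 * sig_q**2) - 0.5

mu_p, sig_p, mu_q, sig_q = 1.0, 0.5, 0.0, 1.0
rng = np.random.default_rng(0)
x = rng.normal(mu_p, sig_p, size=200_000)  # x ~ p
log_p = -0.5 * ((x - mu_p) / sig_p) ** 2 - np.log(sig_p * np.sqrt(2 * np.pi))
log_q = -0.5 * ((x - mu_q) / sig_q) ** 2 - np.log(sig_q * np.sqrt(2 * np.pi))
print(kl_gauss(mu_p, sig_p, mu_q, sig_q))  # ~ 0.818
print((log_p - log_q).mean())              # Monte Carlo estimate, also ~ 0.818
```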



4.5. Jensen's Inequality

When $\mathbf{x}$ is a random variable and $f$ is a convex function, then $$f(\mathbb{E}_{p(\mathbf{x})}[\mathbf{x}])\le \mathbb{E}_{p(\mathbf{x})}[f(\mathbf{x})]$$
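A quick numerical check with the convex function $f(x)=x^2$ (the distribution is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)  # x ~ N(1, 4)
f = lambda t: t ** 2                              # convex function

print(f(x.mean()))   # f(E[x]) ~ 1.0
print(f(x).mean())   # E[f(x)] ~ 5.0  >=  f(E[x])
```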




5. VAE Algorithm

For all given real images $\mathbf{x}$, we want to maximize the marginal probability: $$p(\mathbf{x})=\int p(\mathbf{x},\mathbf{z})d\mathbf{z}=\int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z}$$


We want to maximize $p(\mathbf{x})$.


However, this integral is hard to compute. (MCMC can approximate it, but it is not tractable.)
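To see the difficulty, consider the naive Monte Carlo estimate $p(\mathbf{x})\approx\frac{1}{N}\sum_i p(\mathbf{x}|\mathbf{z}^{(i)})$ with $\mathbf{z}^{(i)}\sim p(\mathbf{z})$: almost all prior samples explain $\mathbf{x}$ poorly, so the estimate is dominated by a few samples and has huge variance. A toy sketch (the linear "decoder" and the dimensions are placeholders, not the lecture's model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, sigma = 64, 8, 0.1                    # toy dimensions (illustrative)
W = rng.normal(size=(d_x, d_z))                 # stand-in decoder: D(z) = W z
x = W @ rng.normal(size=d_z) + sigma * rng.normal(size=d_x)  # one "data point"

def log_p_x_given_z(x, z):
    """log N(x; Wz, sigma^2 I): the decoder likelihood."""
    diff = x - W @ z
    return -0.5 * diff @ diff / sigma**2 - 0.5 * d_x * np.log(2 * np.pi * sigma**2)

z = rng.normal(size=(10_000, d_z))              # z^{(i)} ~ p(z) = N(0, I)
log_w = np.array([log_p_x_given_z(x, zi) for zi in z])
w = np.exp(log_w - log_w.max())                 # relative contributions
print((w.sum() ** 2) / (w ** 2).sum())          # effective sample size: ~1 out of 10,000
```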


Another way is $$p(\mathbf{x})=\frac{p(\mathbf{x},\mathbf{z})}{p(\mathbf{z}|\mathbf{x})}=\frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})}$$


We cannot directly maximize $p(\mathbf{x})=\frac{p(\mathbf{x},\mathbf{z})}{p(\mathbf{z}|\mathbf{x})}$ either, since the true posterior $p(\mathbf{z}|\mathbf{x})$ is unknown.



5.1 Evidence Lower Bound (ELBO)

Let's think about a lower bound of $\ln p(\mathbf{x})$: $$\ln p(\mathbf{x})\ge \mathbb{E}_{q_\phi (\mathbf{z}|\mathbf{x})}\left[\ln\frac{p(\mathbf{x},\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]$$ This lower bound is the evidence lower bound (ELBO); we can maximize $p(\mathbf{x})$ by maximizing the ELBO.


$q_\phi(\mathbf{z}|\mathbf{x})$ is a variational (proxy) distribution with parameters $\phi$. (It is different from the true posterior $p(\mathbf{z}|\mathbf{x})$.)


Proof)

$$\ln p(\mathbf{x})=\ln \int p(\mathbf{x},\mathbf{z})d\mathbf{z}=\ln \int p(\mathbf{x},\mathbf{z})\frac{q_\phi (\mathbf{z}|\mathbf{x})}{q_\phi (\mathbf{z}|\mathbf{x})}d\mathbf{z} \\ =\ln\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\frac{p(\mathbf{x},\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]\ge \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \ln \frac{p(\mathbf{x},\mathbf{z})}{q_\phi (\mathbf{z}|\mathbf{x})}\right]$$ (The last step is Jensen's inequality, using that $\ln$ is concave.)


We can decompose ELBO. $$\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \ln \frac{p(\mathbf{x},\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]=\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \ln \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]\\ =\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\ln p(\mathbf{x}|\mathbf{z})]-\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \ln \frac{q_\phi(\mathbf{z}|\mathbf{x})}{p(\mathbf{z})}\right]\\ =\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\ln p(\mathbf{x}|\mathbf{z})]-D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x})||p(\mathbf{z})\right)$$

The first term is the "reconstruction term", and the second term is the "prior matching term".



5.2 Architecture

Data distribution: $p(\mathbf{x})$

Encoder: $q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}(\mathbf{z};\mathbf{\mu}_\phi(\mathbf{x}), \sigma^2_\phi(\mathbf{x})\mathbf{I})$

Latent distribution: $p(\mathbf{z})=\mathcal{N}(\mathbf{z};\mathbf{0},\mathbf{I})$

Decoder: $p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x};D_\theta(\mathbf{z}),\sigma^2\mathbf{I})$
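A minimal PyTorch sketch of this architecture (the MLP structure and layer sizes are illustrative assumptions, not specified in the lecture):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_phi(z | x) = N(z; mu_phi(x), sigma_phi(x)^2 I)."""
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)       # mu_phi(x)
        self.log_var = nn.Linear(hidden, z_dim)  # ln sigma_phi(x)^2

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """p_theta(x | z) = N(x; D_theta(z), sigma^2 I); the network predicts the mean."""
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, z):
        return self.net(z)  # D_theta(z), the predicted mean of x
```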



5.3 Training

Maximize ELBO

$$\text{argmax}_{\theta,\phi}\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\ln p(\mathbf{x}|\mathbf{z})]-D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x})||p(\mathbf{z})\right)$$


Approximate the expectation using a Monte Carlo estimate: $$\text{argmax}_{\theta,\phi}\frac{1}{N}\sum^N_{i=1}\ln p_\theta(\mathbf{x}|\mathbf{z}^{(i)})-D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x})||p(\mathbf{z})\right)$$ where $\mathbf{z}^{(i)}\sim q_\phi(\mathbf{z}|\mathbf{x})$ for a given $\mathbf{x}$.

Set $\mathbf{z}^{(i)}=\mathbf{\mu}_\phi(\mathbf{x})+\sigma_\phi(\mathbf{x})\mathbf{\epsilon}$ where $\mathbf{\epsilon}\sim \mathcal{N}(\mathbf{0},\mathbf{I})$ (the reparameterization trick, which lets gradients flow through the sampling step to $\phi$).


Recall $p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x};D_\theta(\mathbf{z}),\sigma^2\mathbf{I})$. $$\ln p_\theta(\mathbf{x}|\mathbf{z}^{(i)})=\ln \left( \frac{1}{\sqrt{(2\pi\sigma^2)^d}}\exp \left( -\frac{||\mathbf{x}-D_\theta(\mathbf{z}^{(i)})||^2}{2\sigma^2}\right)\right)\\ =-\frac{1}{2\sigma^2}||\mathbf{x}-D_\theta(\mathbf{z}^{(i)})||^2-\ln\sqrt{(2\pi\sigma^2)^d}$$

The second term is a constant (it can be ignored); the first term is the reconstruction error.
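Putting this together, the per-batch negative ELBO can be sketched as below, using the reparameterization trick and the standard closed-form KL between a diagonal Gaussian and $\mathcal{N}(\mathbf{0},\mathbf{I})$ (the `Encoder`/`Decoder` interfaces follow the sketch in 5.2; the $1/(2\sigma^2)$ weight on the reconstruction term is kept explicit here):

```python
import torch

def negative_elbo(x, encoder, decoder, sigma=1.0):
    """Negative ELBO = reconstruction term + KL(q_phi(z|x) || N(0, I))."""
    mu, log_var = encoder(x)                                  # parameters of q_phi(z | x)
    eps = torch.randn_like(mu)                                # eps ~ N(0, I)
    z = mu + torch.exp(0.5 * log_var) * eps                   # reparameterization trick

    x_hat = decoder(z)                                        # D_theta(z)
    recon = ((x - x_hat) ** 2).sum(dim=1) / (2 * sigma ** 2)  # -ln p_theta(x|z) up to a constant

    # Closed form: KL(N(mu, diag(sigma_phi^2)) || N(0, I))
    kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(dim=1)
    return (recon + kl).mean()
```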



5.4 Training Process

A. Feed a data point $\mathbf{x}$ to the encoder to predict $\mathbf{\mu}_\phi(\mathbf{x})$ and $\sigma^2_\phi(\mathbf{x})$.

B. Sample a latent variable $\mathbf{z}$ from $q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}\left( \mathbf{z};\mathbf{\mu}_\phi(\mathbf{x}),\sigma^2_\phi (\mathbf{x})\mathbf{I}\right)$.

C. Feed $\mathbf{z}$ to the decoder to predict $\hat{\mathbf{x}}=D_\theta(\mathbf{z})$.

D. Update $\theta$ and $\phi$ with gradient descent on the negative ELBO (see the sketch below).
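A sketch of steps A-D as a training loop (the data here is a random placeholder; `Encoder`, `Decoder`, and `negative_elbo` are the sketches above; the optimizer and hyperparameters are arbitrary choices):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.rand(1024, 784)                      # placeholder "images"
loader = DataLoader(TensorDataset(data), batch_size=64, shuffle=True)

encoder, decoder = Encoder(), Decoder()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for epoch in range(10):
    for (x,) in loader:
        loss = negative_elbo(x, encoder, decoder)  # steps A-C happen inside
        opt.zero_grad()
        loss.backward()                            # step D: gradients of the negative ELBO
        opt.step()                                 # gradient descent update of theta and phi
```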



5.5 Generating Process

A. Sample a latent variable $\mathbf{z}$ from $p(\mathbf{z})=\mathcal{N}(\mathbf{z};\mathbf{0},\mathbf{I})$.

B. Feed $\mathbf{z}$ to the decoder to predict $\hat{\mathbf{x}}=D_\theta(\mathbf{z})$.
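Generation uses only the prior and the decoder; a sketch (reusing the `Decoder` from 5.2, which in practice would be trained first):

```python
import torch

decoder = Decoder()              # assume trained parameters theta
z = torch.randn(16, 16)          # A. 16 latents z ~ N(0, I), each of dimension 16
x_hat = decoder(z)               # B. x_hat = D_theta(z)
print(x_hat.shape)               # (16, 784): predicted means of p_theta(x | z)
```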




6. Limitation

The true posterior $p(\mathbf{z}|\mathbf{x})$ is approximated by a Gaussian $q_\phi(\mathbf{z}|\mathbf{x})$, which limits how accurately the model can match the data distribution.