Variational Inference

 Reference: Prof. Dr. Stephan Günnemann, Data Analytics and Machine Learning


Latent Variable Models

1. Generate the latent variable $\mathbf{z}$ $$\mathbf{z}\sim p_\mathbf{\theta}(\mathbf{z})$$

2. Generate the data $\mathbf{x}$ conditional on $\mathbf{z}$ $$\mathbf{x}\sim p_\mathbf{\theta} (\mathbf{x}|\mathbf{z})$$

3. The above procedure defines the joint distribution $$p_\mathbf{\theta}(\mathbf{x},\mathbf{z})=p_\mathbf{\theta}(\mathbf{z})p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})$$

4. Marginal likelihood $$p_\mathbf{\theta}(\mathbf{x})=\int p_\mathbf{\theta}(\mathbf{x},\mathbf{z})\,d\mathbf{z}=\int p_\mathbf{\theta}(\mathbf{z})p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})\,d\mathbf{z}=\mathbb{E}_{\mathbf{z}\sim p_\mathbf{\theta}(\mathbf{z})}[p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})]$$
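
As a concrete illustration, here is a minimal Python sketch of ancestral sampling (steps 1 and 2) and of the Monte Carlo estimate of the marginal likelihood suggested by the last equation. The specific model (a 1-D standard normal prior and a Gaussian likelihood whose mean is $\tanh(z)$) is an assumption chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative toy latent variable model (all choices here are assumptions):
#   z ~ N(0, 1)                     -- prior p(z)
#   x | z ~ N(tanh(z), 0.5**2)      -- likelihood p(x|z) with a fixed "decoder" tanh
def sample_joint(n):
    z = rng.normal(0.0, 1.0, size=n)        # step 1: z ~ p(z)
    x = rng.normal(np.tanh(z), 0.5)         # step 2: x ~ p(x|z)
    return z, x

def marginal_likelihood_mc(x, n_samples=100_000):
    """Monte Carlo estimate of p(x) = E_{z~p(z)}[p(x|z)]."""
    z = rng.normal(0.0, 1.0, size=n_samples)
    return norm.pdf(x, loc=np.tanh(z), scale=0.5).mean()

z, x = sample_joint(5)
print("sampled (z, x) pairs:", list(zip(z.round(2), x.round(2))))
print("MC estimate of p(x=0.3):", marginal_likelihood_mc(0.3))
```

For a 1-D latent this naive estimator is fine; for high-dimensional $\mathbf{z}$ it becomes very inefficient, which is one motivation for the variational machinery below.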



LVMs capture latent features of the data.

1. Inference: Given a sample $\mathbf{x}$, find the posterior distribution over $\mathbf{z}$

(This can be viewed as "extracting" the latent features $\mathbf{z}$ from the observed features $\mathbf{x}$.) $$p_\mathbf{\theta}(\mathbf{z}|\mathbf{x})=\frac{p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})p_\mathbf{\theta}(\mathbf{z})}{p_\mathbf{\theta}(\mathbf{x})}$$


2. Learning: Given a dataset $\mathbf{X}=\{\mathbf{x}_i\}^N_{i=1}$, find the parameters $\mathbf{\theta}$ that best explain the data

(Typically done by maximizing the marginal log-likelihood; for i.i.d. samples it factorizes over the data points) $$\max_\mathbf{\theta}\ln p_\mathbf{\theta}(\mathbf{X})=\max_\mathbf{\theta}\sum^N_{i=1}\ln p_\mathbf{\theta}(\mathbf{x}_i)$$ (often written as the average $\frac{1}{N}\sum^N_{i=1}\ln p_\mathbf{\theta}(\mathbf{x}_i)$, which has the same maximizer).

Intuitively, a large value means the model $p_\mathbf{\theta}$ can generate the data $\mathbf{X}$ well. A small worked example with a discrete latent variable is sketched below.
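
For a discrete latent variable the integral over $\mathbf{z}$ is a finite sum, so both the posterior and the marginal log-likelihood are tractable. The following sketch (Python/NumPy; the two-component Gaussian mixture and its hand-picked parameters are assumptions for illustration) shows Bayes' rule for inference and the average log-likelihood used for learning.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy LVM with a discrete latent: a 2-component Gaussian mixture.
#   z ~ Categorical(pi),  x | z=k ~ N(mu_k, sigma_k**2)
pi = np.array([0.3, 0.7])          # p(z)
mu = np.array([-2.0, 1.5])         # component means
sigma = np.array([0.5, 1.0])       # component std devs

def posterior(x):
    """Inference: p(z|x) via Bayes' rule (the responsibilities)."""
    joint = pi * norm.pdf(x, loc=mu, scale=sigma)   # p(x|z) p(z) for each z
    return joint / joint.sum()                      # divide by p(x)

def avg_log_likelihood(X):
    """Learning objective: (1/N) sum_i ln p(x_i), with p(x) = sum_z p(x|z) p(z)."""
    px = (pi * norm.pdf(X[:, None], loc=mu, scale=sigma)).sum(axis=1)
    return np.log(px).mean()

print("p(z | x=1.0) =", posterior(1.0).round(3))
X = np.array([-2.1, -1.8, 1.2, 2.0, 0.9])
print("average log-likelihood of toy dataset:", avg_log_likelihood(X).round(3))
```

For continuous, high-dimensional $\mathbf{z}$ and a neural-network likelihood this sum becomes an intractable integral, which is the situation addressed next.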

For a single sample $\mathbf{x}$, let $f(\mathbf{\theta})$ denote the marginal log-likelihood $$f(\mathbf{\theta}):=\ln p_\mathbf{\theta}(\mathbf{x})=\ln \left(\int p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})p_\mathbf{\theta}(\mathbf{z})\,d\mathbf{z}\right)$$



Maximization using Lower Bounds

We want to maximize $f(\theta)$, but $f$ and $\nabla f$ are intractable.

So we use a function $g(\theta)$ that is a lower bound on $f(\theta)$, i.e. $g(\theta)\le f(\theta)$ for all $\theta$.

Maximizing $g(\theta)$ then gives a lower bound on the optimal value of the original optimization problem $$\max_\theta f(\theta) \ge \max_\theta g(\theta)$$

Let $q(\mathbf{z})$ be an arbitrary distribution over $\mathbf{z}$. The marginal log-likelihood then decomposes as $$\ln p_\mathbf{\theta}(\mathbf{x})=\underbrace{\mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\left[\ln \frac{p_\mathbf{\theta}(\mathbf{x},\mathbf{z})}{q(\mathbf{z})}\right]}_{=:\mathcal{L}(\mathbf{\theta},q)}+\mathbb{KL}(q(\mathbf{z})\,||\,p_\mathbf{\theta}(\mathbf{z}|\mathbf{x}))$$


The second term is the Kullback-Leibler divergence, which is non-negative.

The first term is therefore a lower bound on $\ln p_\mathbf{\theta}(\mathbf{x})$: the Evidence Lower BOund (ELBO) $\mathcal{L}(\mathbf{\theta},q)$.


(If we set $q(\mathbf{z})=p_\mathbf{\theta}(\mathbf{z}|\mathbf{x})$, the KL term vanishes and the bound becomes tight, but this posterior is exactly what is intractable.)
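
This decomposition can be checked numerically on a toy model where everything is tractable. The sketch below (Python/NumPy/SciPy; the linear-Gaussian model and the particular choices of $q$ are assumptions for illustration) estimates the ELBO by Monte Carlo and compares it to the exact $\ln p_\mathbf{\theta}(\mathbf{x})$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed tractable toy model: z ~ N(0,1), x|z ~ N(z,1).
# Then p(x) = N(x|0,2) and p(z|x) = N(z | x/2, 1/2) in closed form.
x = 1.3

def elbo_mc(q_mean, q_std, n=200_000):
    """Monte Carlo estimate of E_{z~q}[ln p(x,z) - ln q(z)]."""
    z = rng.normal(q_mean, q_std, size=n)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)   # ln p(z) + ln p(x|z)
    log_q = norm.logpdf(z, q_mean, q_std)
    return (log_joint - log_q).mean()

log_px = norm.logpdf(x, 0.0, np.sqrt(2.0))          # exact ln p(x)
print("ln p(x)                     :", round(log_px, 4))
print("ELBO with a mismatched q    :", round(elbo_mc(0.0, 1.0), 4))              # strictly below ln p(x)
print("ELBO with q = true posterior:", round(elbo_mc(x / 2, np.sqrt(0.5)), 4))   # approx. equal to ln p(x)
```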


We want to maximize the ELBO over a parametric family $q_\phi(\mathbf{z})$ with variational parameters $\phi$, jointly with $\theta$ $$\max_{\theta,\phi}\mathbb{E}_{z\sim q_\phi(z)}[\ln p_\theta(x,z)-\ln q_\phi(z)]=:\max_{\theta,\phi}\mathcal{L}(\theta,\phi)$$


To optimize with gradient ascent, we need $\nabla_\theta \mathcal{L}(\theta,\phi)$ and $\nabla_\phi \mathcal{L}(\theta,\phi)$.

1. For $\nabla_\theta\mathcal{L}$, the gradient can be moved inside the expectation, so we can approximate it with Monte Carlo samples $\mathbf{z}\sim q_\phi(\mathbf{z})$.

2. For $\nabla_\phi\mathcal{L}$, this does not work directly, since $\phi$ also parametrizes the distribution we sample from. We need the reparametrization trick (see the sketch below).
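
A minimal sketch of the reparametrization trick in PyTorch, reusing the toy model from above ($z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z,1)$, both assumptions for illustration): instead of sampling $z\sim q_\phi=\mathcal{N}(\mu,\sigma^2)$ directly, we sample $\epsilon\sim\mathcal{N}(0,1)$ and set $z=\mu+\sigma\epsilon$, so gradients of the Monte Carlo ELBO flow to $\mu$ and $\sigma$.

```python
import torch

torch.manual_seed(0)
x = torch.tensor(1.3)

# Variational parameters phi = (mu, log_sigma) of q_phi(z) = N(mu, sigma^2).
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

prior = torch.distributions.Normal(0.0, 1.0)           # p(z)
def log_lik(z):                                         # ln p(x|z) with x|z ~ N(z, 1)
    return torch.distributions.Normal(z, 1.0).log_prob(x)

for step in range(2000):
    opt.zero_grad()
    sigma = log_sigma.exp()
    eps = torch.randn(256)                              # epsilon ~ N(0, 1)
    z = mu + sigma * eps                                # reparametrized sample, differentiable in phi
    q = torch.distributions.Normal(mu, sigma)
    elbo = (prior.log_prob(z) + log_lik(z) - q.log_prob(z)).mean()
    (-elbo).backward()                                  # gradient ascent on the ELBO
    opt.step()

# For this toy model the true posterior is N(x/2, 1/2), so q should approach it.
print("learned q: mu = %.3f, sigma = %.3f" % (mu.item(), log_sigma.exp().item()))
print("true posterior: mu = %.3f, sigma = %.3f" % (x.item() / 2, 0.5 ** 0.5))
```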



Mean Field Assumption

We assume that $q(\mathbf{Z})$ factorizes over the latent variables $$q(\mathbf{Z})=\prod^N_{i=1} q_i(\mathbf{z}_i)$$
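
As a small illustration (the independent 1-D Gaussian factors and their parameters are assumptions), under the mean-field factorization the log-density of $q$ is simply the sum of the per-factor log-densities:

```python
import numpy as np
from scipy.stats import norm

# Assumed mean-field variational family: q(Z) = prod_i N(z_i | m_i, s_i^2),
# i.e. one independent Gaussian factor per latent variable.
m = np.array([0.0, 1.0, -0.5])     # per-factor means
s = np.array([1.0, 0.5, 2.0])      # per-factor std devs

def log_q(Z):
    """ln q(Z) = sum_i ln q_i(z_i) under the mean-field factorization."""
    return norm.logpdf(Z, loc=m, scale=s).sum()

Z = np.array([0.2, 0.8, 0.1])
print("ln q(Z) =", round(log_q(Z), 4))
```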



VAE

Prior $p(\mathbf{z})=\mathcal{N}(\mathbf{z}|\mathbf{0},\mathbf{I})$.

Decoder $p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}|\mathbf{\mu}=f_\mathbf{\theta}(\mathbf{z}),\mathbf{I})$, where $f_\mathbf{\theta}$ is a neural network.

Encoder $q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}(\mathbf{z}|\mathbf{\mu}_\phi(\mathbf{x}),\mathbf{\Sigma}_\phi(\mathbf{x}))$, where the mean and (typically diagonal) covariance are produced by a neural network with parameters $\phi$.


Then the ELBO is $\mathcal{L}(\theta,\phi)=\mathbb{E}_{z\sim q_\phi(z|x)}[\ln p_\theta(x|z)]-\mathbb{KL}(q_\phi(z|x)||p(z))$, i.e. a reconstruction term and a KL regularizer that keeps the encoder close to the prior.
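
A minimal VAE sketch in PyTorch under the assumptions above (Gaussian decoder with identity covariance, diagonal-Gaussian encoder, closed-form KL against the standard-normal prior); the layer sizes and the random placeholder data are illustrative choices, not part of the original formulation:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        # Encoder q_phi(z|x): outputs mean and log-variance of a diagonal Gaussian.
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, z_dim)
        self.enc_logvar = nn.Linear(hidden, z_dim)
        # Decoder p_theta(x|z) = N(x | f_theta(z), I): outputs the mean f_theta(z).
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        std = (0.5 * logvar).exp()
        z = mu + std * torch.randn_like(std)            # reparametrization trick
        x_mean = self.dec(z)
        # Reconstruction term E_q[ln p(x|z)] for a unit-variance Gaussian decoder
        # (dropping the constant -D/2 * ln(2*pi)).
        rec = -0.5 * ((x - x_mean) ** 2).sum(dim=1)
        # Closed-form KL( N(mu, diag(var)) || N(0, I) ).
        kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1.0).sum(dim=1)
        return (rec - kl).mean()                        # ELBO, averaged over the batch

# Usage sketch on random placeholder data (stand-in for a real dataset):
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 784)
for step in range(10):
    opt.zero_grad()
    loss = -model(x)                                    # minimize the negative ELBO
    loss.backward()
    opt.step()
print("negative ELBO after a few steps:", loss.item())
```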