Variational Inference
Reference: Prof. Dr. Stephan Günnemann, Data Analytics and Machine Learning
Latent Variable Models
1. Generate the latent variable $\mathbf{z}$ $$\mathbf{z}\sim p_\mathbf{\theta}(\mathbf{z})$$
2. Generate the data $\mathbf{x}$ conditional on $\mathbf{z}$ $$\mathbf{x}\sim p_\mathbf{\theta} (\mathbf{x}|\mathbf{z})$$
3. The above procedure defines the joint distribution $$p_\mathbf{\theta}(\mathbf{x},\mathbf{z})=p_\mathbf{\theta}(\mathbf{z})p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})$$
4. Marginal likelihood $$p_\mathbf{\theta}(\mathbf{x})=\int p_\mathbf{\theta}(\mathbf{x},\mathbf{z})d\mathbf{z}=\int p_\mathbf{\theta}(\mathbf{z})p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})d\mathbf{z}=\mathbb{E}_{\mathbf{z}\sim p_\mathbf{\theta}(\mathbf{z})}[p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})]$$
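As a concrete illustration of steps 1–4, here is a minimal sketch (a toy example of my own, not from the lecture): a linear-Gaussian LVM with $z\sim\mathcal{N}(0,1)$ and $x|z\sim\mathcal{N}(wz+b,\sigma^2)$, plus a naive Monte Carlo estimate of the marginal likelihood; the names `w`, `b`, `sigma`, and `gaussian_pdf` are hypothetical.

```python
# Toy linear-Gaussian LVM (illustrative sketch, not the lecture's model):
# z ~ N(0, 1), x | z ~ N(w * z + b, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
w, b, sigma = 2.0, 0.5, 1.0            # hypothetical decoder parameters theta

def gaussian_pdf(v, mean, std):
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Steps 1-2: generative process, sample z from the prior, then x conditional on z.
z = rng.standard_normal()
x = rng.normal(loc=w * z + b, scale=sigma)

# Step 4: Monte Carlo estimate of p_theta(x) = E_{z ~ p(z)}[p_theta(x | z)].
z_samples = rng.standard_normal(10_000)
p_x_mc = gaussian_pdf(x, w * z_samples + b, sigma).mean()

# Sanity check: for this toy model the marginal is N(b, w^2 + sigma^2) in closed form.
p_x_exact = gaussian_pdf(x, b, np.sqrt(w ** 2 + sigma ** 2))
print(p_x_mc, p_x_exact)
```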
LVMs capture latent features of the data.
1. Inference: Given a sample $\mathbf{x}$, find the posterior distribution over $\mathbf{z}$
(This can be viewed as "extracting" the latent features $\mathbf{z}$ from the observed features $\mathbf{x}$; see the sketch after this list.) $$p_\mathbf{\theta}(\mathbf{z}|\mathbf{x})=\frac{p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})p_\mathbf{\theta}(\mathbf{z})}{p_\mathbf{\theta}(\mathbf{x})}$$
2. Learning: Given a dataset $\mathbf{X}=\{\mathbf{x}_i\}^N_{i=1}$, find the parameters $\mathbf{\theta}$ that best explain the data
(Typically done by maximizing the marginal log-likelihood, assuming i.i.d. samples) $$\max_\mathbf{\theta}\ln p_\mathbf{\theta}(\mathbf{X})=\max_\mathbf{\theta}\sum^N_{i=1}\ln p_\mathbf{\theta}(\mathbf{x}_i)$$
Intuitively, this asks for a model $p_\mathbf{\theta}$ that assigns high probability to the observed data $\mathbf{X}$, i.e. one that can generate the data well.
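For the inference task above, here is a minimal sketch in the same toy linear-Gaussian LVM (again my own example, not from the lecture): Bayes' rule $p(z|x)\propto p(x|z)p(z)$ is evaluated numerically on a grid and compared against the closed-form Gaussian posterior that this conjugate model admits; all variable names are hypothetical.

```python
# Posterior inference in the toy LVM z ~ N(0, 1), x | z ~ N(w * z + b, sigma^2)
# (illustrative sketch): Bayes' rule on a grid vs. the closed-form posterior.
import numpy as np

w, b, sigma = 2.0, 0.5, 1.0            # hypothetical model parameters theta
x = 1.8                                # one observed data point

def gaussian_pdf(v, mean, std):
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Unnormalized posterior p(x | z) p(z) on a grid, then normalize numerically.
z_grid = np.linspace(-5.0, 5.0, 2001)
unnorm = gaussian_pdf(x, w * z_grid + b, sigma) * gaussian_pdf(z_grid, 0.0, 1.0)
post_grid = unnorm / np.trapz(unnorm, z_grid)

# Closed-form posterior for this conjugate model.
post_var = sigma ** 2 / (sigma ** 2 + w ** 2)
post_mean = w * (x - b) / (sigma ** 2 + w ** 2)
post_exact = gaussian_pdf(z_grid, post_mean, np.sqrt(post_var))
print(np.max(np.abs(post_grid - post_exact)))   # should be close to 0
```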
For a single sample $\mathbf{x}$, define $$f(\mathbf{\theta}):=\ln p_\mathbf{\theta}(\mathbf{x})=\ln \left(\int p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})p_\mathbf{\theta}(\mathbf{z})d\mathbf{z}\right)$$
Maximization using Lower Bounds
We want to maximize $f(\theta)$, but $f$ and $\nabla f$ are intractable.
So we instead maximize a function $g(\theta)$ that is a lower bound on $f(\theta)$, i.e. $g(\theta)\le f(\theta)$ for all $\theta$.
Maximizing $g(\theta)$ gives us a lower bound on the optimal value of the original problem: $$\max_\theta f(\theta) \ge \max_\theta g(\theta)$$
Let $q(\mathbf{z})$ be an arbitrary distribution over $\mathbf{z}$. The marginal log-likelihood decomposes as $$\ln p_\mathbf{\theta}(\mathbf{x})=\underbrace{\mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}[\ln p_\mathbf{\theta}(\mathbf{x},\mathbf{z})-\ln q(\mathbf{z})]}_{=:\mathcal{L}(\mathbf{\theta},q)}+\mathbb{KL}(q(\mathbf{z})\,||\,p_\mathbf{\theta}(\mathbf{z}|\mathbf{x}))$$
The second term is the Kullback-Leibler divergence, which is non-negative.
The first term is therefore a lower bound on $\ln p_\mathbf{\theta}(\mathbf{x})$: the Evidence Lower BOund (ELBO) $\mathcal{L}(\mathbf{\theta},q)$.
(Setting $q(\mathbf{z})=p_\mathbf{\theta}(\mathbf{z}|\mathbf{x})$ makes the bound tight, but this posterior is intractable, so we restrict $q$ to a tractable parametric family $q_\phi$.)
We want to maximize the ELBO $$\max_{\theta,\phi}\mathbb{E}_{z\sim q_\phi(z)}[\ln p_\theta(x,z)-\ln q_\phi(z)]=:\max_{\theta,\phi}\mathcal{L}(\theta,\phi)$$
To optimize with gradient ascent, we need $\nabla_\theta \mathcal{L}(\theta,\phi)$ and $\nabla_\phi \mathcal{L}(\theta,\phi)$.
1. For $\theta$, we can approximate the expectation (and thus the gradient) with Monte Carlo samples from $q_\phi$.
2. For $\phi$, we cannot do this directly, since $\phi$ also appears in the distribution we sample from. We need the reparametrization trick (see the sketch below).
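Below is a minimal sketch of the reparametrization trick on a one-dimensional toy problem (my own example, not from the lecture): writing $z=\mu+\sigma\epsilon$ with $\epsilon\sim\mathcal{N}(0,1)$ moves $\phi=(\mu,\sigma)$ out of the sampling distribution, so the gradient of the expectation can be pushed inside and estimated by Monte Carlo. The integrand `f` and all variable names are hypothetical.

```python
# Reparametrization trick on a toy expectation (illustrative sketch):
# estimate d/d mu E_{z ~ N(mu, sigma^2)}[f(z)] by sampling eps ~ N(0, 1)
# and differentiating through z = mu + sigma * eps.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7                   # hypothetical variational parameters phi

def f(z):
    return z ** 2                      # toy integrand; E_q[f(z)] = mu^2 + sigma^2

def grad_f(z):
    return 2 * z                       # df/dz, used via the chain rule

eps = rng.standard_normal(100_000)     # parameter-free noise
z = mu + sigma * eps                   # reparametrized samples from q_phi

# d/d mu E_q[f(z)] = E_eps[f'(mu + sigma * eps) * dz/dmu], with dz/dmu = 1.
grad_mu_mc = grad_f(z).mean()
print(grad_mu_mc, 2 * mu)              # both should be close to 2 * mu = 3.0
```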
Mean Field Assumption
We assume that $q(\mathbf{Z})$ factorizes as $$q(\mathbf{Z})=\prod^N_{i=1} q_i(\mathbf{z}_i)$$
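For example (a standard choice, stated here as an illustration rather than as part of the lecture), a fully factorized Gaussian family would take $$q(\mathbf{Z})=\prod^N_{i=1}\mathcal{N}(\mathbf{z}_i\mid\mathbf{m}_i,\operatorname{diag}(\mathbf{s}_i^2)),$$ so each factor $q_i$ carries its own free variational parameters $(\mathbf{m}_i,\mathbf{s}_i)$.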
VAE
Prior $p(\mathbf{z})=\mathcal{N}(\mathbf{z}|\mathbf{0},\mathbf{I})$.
Decoder $p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}|\mathbf{\mu}=f_\mathbf{\theta}(\mathbf{z}),\mathbf{I})$, where $f_\mathbf{\theta}$ is a neural network.
Encoder $q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}(\mathbf{z}|\mathbf{\mu}_\phi(\mathbf{x}),\mathbf{\Sigma}_\phi(\mathbf{x}))$, where $\mathbf{\mu}_\phi$ and $\mathbf{\Sigma}_\phi$ (usually diagonal) are produced by a neural network.
Then the ELBO for a single sample $\mathbf{x}$ is $\mathcal{L}(\theta,\phi)=\mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}[\ln p_\theta(\mathbf{x}|\mathbf{z})]-\mathbb{KL}(q_\phi(\mathbf{z}|\mathbf{x})\,||\,p(\mathbf{z}))$: a reconstruction term minus a regularizer that keeps the encoder close to the prior.
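Putting the pieces together, here is a minimal numerical sketch of this ELBO for a single data point, assuming toy affine maps in place of the encoder and decoder networks (names such as `W_enc` and `W_dec` are hypothetical): the reconstruction term is estimated with one reparametrized sample, and the KL term between the diagonal-Gaussian encoder and the standard normal prior is computed in closed form.

```python
# One-sample ELBO for a toy Gaussian VAE (illustrative sketch with affine "networks").
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z = 4, 2
x = rng.standard_normal(d_x)           # one data point

# Encoder q_phi(z|x) = N(mu_phi(x), diag(sigma_phi(x)^2)); toy affine map.
W_enc = 0.1 * rng.standard_normal((d_z, d_x))
mu = W_enc @ x
log_var = np.full(d_z, -1.0)           # fixed log-variances for simplicity
sigma = np.exp(0.5 * log_var)

# Reparametrized sample z = mu + sigma * eps.
eps = rng.standard_normal(d_z)
z = mu + sigma * eps

# Decoder p_theta(x|z) = N(x | f_theta(z), I); toy affine map.
W_dec = 0.1 * rng.standard_normal((d_x, d_z))
x_hat = W_dec @ z

# Reconstruction term: log density of x under N(x_hat, I).
log_lik = -0.5 * np.sum((x - x_hat) ** 2) - 0.5 * d_x * np.log(2 * np.pi)

# KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form.
kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)

elbo = log_lik - kl
print(elbo)
```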