[Deep Learning Principle] 1. Pretraining

  This article is part of the Textbook Commentary Project.

Let's study effective field theory.


1.1 Gaussian Integrals

Single-variable Gaussian integrals

We define the Gaussian probability distribution with unit variance as $$p(z)\equiv \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}},$$ which is now properly normalized, i.e., $\int^\infty_{-\infty}dzp(z)=1$.


Extending to variance $K>0$ and mean $s$, the distribution becomes $$p(z)\equiv \frac{1}{\sqrt{2\pi K}}e^{-\frac{(z-s)^2}{2K}},$$ which satisfies $$\mathbb{E}[z]= s,\ \mathbb{E}[z^2]-\mathbb{E}[z]^2= K.$$


Given an observable $\mathcal{O}(z)$, its expectation value in the zero-mean case ($s=0$) is $$\mathbb{E}[\mathcal{O}(z)]\equiv \int_{-\infty}^\infty dz p(z) \mathcal{O}(z)=\frac{1}{\sqrt{2\pi K}}\int^\infty_{-\infty} dz e^{-\frac{z^2}{2K}}\mathcal{O}(z).$$


For example, the moments are the observables $$\mathbb{E}[z^M]=\frac{1}{\sqrt{2\pi K}}\int_{-\infty}^\infty dz e^{-\frac{z^2}{2K}}z^M.$$ Moments with odd exponent $M$ vanish because the integrand is odd. For even $M=2m$, we can evaluate the integral by differentiating with respect to $K$: $$I_{K,m}\equiv \int_{-\infty}^\infty dz e^{-\frac{z^2}{2K}}z^{2m}=\left(2K^2\frac{d}{dK}\right)^m\int_{-\infty}^\infty dz e^{-\frac{z^2}{2K}}=\left(2K^2\frac{d}{dK}\right)^mI_K=\left(2K^2\frac{d}{dK}\right)^m\sqrt{2\pi}K^{\frac{1}{2}}=\sqrt{2\pi}K^{\frac{2m+1}{2}}(2m-1)(2m-3)\cdots 1,$$ where $I_K\equiv \int_{-\infty}^\infty dz e^{-\frac{z^2}{2K}}=\sqrt{2\pi K}$ is the normalization integral. Finally, $$\mathbb{E}[z^{2m}]=\frac{I_{K,m}}{\sqrt{2\pi K}}=K^m (2m-1)!!,$$ which is Wick's theorem for single-variable Gaussian distributions.
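
The moment formula can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the midpoint-rule integrator, and the choice $K=1.7$ are illustrative assumptions, not part of the textbook.

```python
import math

def double_factorial(n):
    """Return n!! for odd n >= -1, using the convention (-1)!! = 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def gaussian_moment_numeric(M, K, half_width=12.0, steps=20_000):
    """Approximate E[z^M] for a zero-mean Gaussian of variance K (midpoint rule)."""
    sigma = math.sqrt(K)
    a = -half_width * sigma                 # left edge of the integration window
    h = 2 * half_width * sigma / steps      # grid spacing
    total = 0.0
    for i in range(steps):
        z = a + (i + 0.5) * h
        total += math.exp(-z * z / (2 * K)) * z ** M
    return total * h / math.sqrt(2 * math.pi * K)

def gaussian_moment_wick(M, K):
    """Closed form: zero for odd M, K^m (2m-1)!! for M = 2m."""
    if M % 2 == 1:
        return 0.0
    m = M // 2
    return K ** m * double_factorial(2 * m - 1)

if __name__ == "__main__":
    K = 1.7  # arbitrary illustrative variance
    for M in range(9):
        print(M, gaussian_moment_numeric(M, K), gaussian_moment_wick(M, K))
```

The two columns should agree up to numerical integration error, with the odd moments close to zero.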


Given a source term $J$, the partition function with source (the generating function) is $$Z_{K,J}\equiv \int_{-\infty}^\infty dz e^{-\frac{z^2}{2K}+Jz}.$$ We can evaluate $Z_{K,J}$ by completing the square in the exponent, $$-\frac{z^2}{2K}+Jz=-\frac{(z-JK)^2}{2K}+\frac{KJ^2}{2},$$ which lets us rewrite the partition function as $$Z_{K,J}=e^{\frac{KJ^2}{2}}\int^\infty_{-\infty} dz e^{-\frac{(z-JK)^2}{2K}}=e^{\frac{KJ^2}{2}}I_K=e^{\frac{KJ^2}{2}}\sqrt{2\pi K}.$$
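
As a sanity check on the completed square, here is a small sketch that integrates $Z_{K,J}$ numerically and compares it with the closed form. The integrator, the function names, and the sample values of $K$ and $J$ are assumptions made for illustration.

```python
import math

def partition_function_numeric(K, J, half_width=14.0, steps=40_000):
    """Midpoint-rule estimate of Z_{K,J} = integral of exp(-z^2/(2K) + J z)."""
    sigma = math.sqrt(K)
    center = J * K  # completing the square shifts the peak to z = JK
    a = center - half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        z = a + (i + 0.5) * h
        total += math.exp(-z * z / (2 * K) + J * z)
    return total * h

def partition_function_closed(K, J):
    """Closed form exp(K J^2 / 2) * sqrt(2 pi K)."""
    return math.exp(K * J * J / 2) * math.sqrt(2 * math.pi * K)

if __name__ == "__main__":
    K = 0.8  # arbitrary illustrative variance
    for J in (-1.5, 0.0, 0.3, 2.0):
        print(J, partition_function_numeric(K, J), partition_function_closed(K, J))
```

Centering the integration window on $z=JK$ mirrors the shift of variables used in the derivation.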


The source term is a convenient tool for calculating expectation values. For example, to calculate moments, $$I_{K,m}=\int_{-\infty}^\infty dz e^{-\frac{z^2}{2K}}z^{2m}=\left. \left[ \left(\frac{d}{dJ}\right)^{2m} \int_{-\infty}^\infty dz e^{-\frac{z^2}{2K}+Jz}\right]\right|_{J=0}=\left. \left[ \left( \frac{d}{dJ}\right)^{2m}Z_{K,J}\right] \right|_{J=0}.$$ Then, since only the $k=m$ term of the power series survives at $J=0$, $$\mathbb{E}[z^{2m}]=\frac{I_{K,m}}{\sqrt{2\pi K}}=\left. \left[ \left( \frac{d}{dJ}\right)^{2m} e^{\frac{KJ^2}{2}}\right] \right|_{J=0}=\left. \left\{ \left( \frac{d}{dJ}\right)^{2m}\left[ \sum_{k=0}^\infty \frac{1}{k!}\left(\frac{K}{2}\right)^kJ^{2k}\right]\right\}\right|_{J=0}=\left(\frac{d}{dJ}\right)^{2m}\left[ \frac{1}{m!}\left(\frac{K}{2}\right)^mJ^{2m}\right] = K^m \frac{(2m)!}{2^m m!}=K^m(2m-1)!!,$$ which is a second derivation of Wick's theorem for the single-variable Gaussian distribution.
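
The same computation can be carried out exactly on the truncated power series. The sketch below represents $e^{KJ^2/2}$ as a list of rational coefficients and differentiates it term by term; the helper names are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
from math import factorial

def series_coefficients(K, order):
    """Coefficients c[n] of J^n in exp(K J^2 / 2), truncated at J^order."""
    coeffs = [Fraction(0)] * (order + 1)
    for k in range(order // 2 + 1):
        coeffs[2 * k] = Fraction(K) ** k / (2 ** k * factorial(k))
    return coeffs

def differentiate(coeffs):
    """Derivative of a polynomial stored as a coefficient list."""
    return [n * c for n, c in enumerate(coeffs)][1:]

def moment_via_source(M, K):
    """E[z^M] = (d/dJ)^M exp(K J^2 / 2), evaluated at J = 0."""
    coeffs = series_coefficients(K, M)
    for _ in range(M):
        coeffs = differentiate(coeffs)
    return coeffs[0]  # the constant term is the value at J = 0

if __name__ == "__main__":
    K = Fraction(3, 2)  # arbitrary illustrative variance
    for M in range(9):
        print(M, moment_via_source(M, K))
```

Truncating the series at order $M$ loses nothing here, because higher-order terms still carry a positive power of $J$ after $M$ derivatives and vanish at $J=0$.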


Multivariable Gaussian integrals

For an $N$-dimensional variable $z_\mu$ with $\mu=1,\cdots,N$, the multivariable Gaussian function is defined as $$\exp \left[ -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu (K^{-1})_{\mu\nu}z_\nu\right],$$ where the variance or covariance matrix $K_{\mu\nu}$ is an $N$-by-$N$ symmetric positive definite matrix, and its inverse $(K^{-1})_{\mu\nu}$ is defined so that their matrix product gives the $N$-by-$N$ identity matrix $$\sum_{\rho=1}^N(K^{-1})_{\mu\rho}K_{\rho\nu}=\delta_{\mu\nu},$$ where the Kronecker delta $\delta_{\mu\nu}$ represents the identity matrix.


To calculate the normalization factor $$I_K\equiv \int d^Nz\exp \left[ -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu (K^{-1})_{\mu\nu}z_\nu\right]\\ =\int_{-\infty}^\infty dz_1 \int_{-\infty}^\infty dz_2 \cdots \int_{-\infty}^\infty dz_N \exp \left[ -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu (K^{-1})_{\mu\nu}z_\nu\right],$$ we diagonalize the covariance as $(OKO^T)_{\mu\nu}=\lambda_\mu \delta_{\mu\nu}$ with an orthogonal matrix $O_{\mu\nu}$, which also diagonalizes its inverse as $(OK^{-1}O^T)_{\mu\nu}=(1/\lambda_\mu)\delta_{\mu\nu}$. Then we can simplify $$\sum^N_{\mu,\nu=1}z_\mu (K^{-1})_{\mu\nu}z_\nu=\sum^N_{\mu,\rho,\sigma,\nu=1}z_\mu (O^TO)_{\mu\rho}(K^{-1})_{\rho\sigma}(O^TO)_{\sigma\nu}z_\nu\\ =\sum^N_{\mu,\nu=1}(Oz)_\mu (OK^{-1}O^T)_{\mu\nu}(Oz)_\nu=\sum_{\mu=1}^N\frac{1}{\lambda_\mu}(Oz)^2_\mu.$$ The change of variables $u_\mu\equiv (Oz)_\mu$ with an orthogonal matrix $O$ leaves the integration measure invariant, i.e., $d^Nz=d^Nu$. Finally, $$I_K= \int d^Nu\exp \left[ -\frac{1}{2}\sum^N_{\mu=1}\frac{u_\mu^2}{\lambda_\mu}\right]=\prod_{\mu=1}^N\left[ \int_{-\infty}^\infty du_\mu \exp \left( -\frac{u_\mu^2}{2\lambda_\mu}\right)\right] \\ = \prod_{\mu=1}^N\sqrt{2\pi \lambda_\mu}=\sqrt{\prod_{\mu=1}^N(2\pi \lambda_\mu)}=\sqrt{|2\pi K|},$$ where $|2\pi K|$ denotes the determinant of the matrix $2\pi K$.
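
For $N=2$ the result can be verified directly. The following sketch, which assumes a hand-picked $2\times 2$ covariance and uses hypothetical helper names, compares a brute-force grid integral, the determinant formula, and the product over eigenvalues.

```python
import math

def normalization_numeric(K, half_width=9.0, steps=400):
    """Midpoint-rule estimate of I_K for a 2x2 covariance K = [[a, b], [b, c]]."""
    (a, b), (_, c) = K
    det = a * c - b * b
    # Entries of the inverse of a symmetric 2x2 matrix.
    inv00, inv01, inv11 = c / det, -b / det, a / det
    lx, ly = half_width * math.sqrt(a), half_width * math.sqrt(c)
    hx, hy = 2 * lx / steps, 2 * ly / steps
    total = 0.0
    for i in range(steps):
        z1 = -lx + (i + 0.5) * hx
        for j in range(steps):
            z2 = -ly + (j + 0.5) * hy
            quad = inv00 * z1 * z1 + 2 * inv01 * z1 * z2 + inv11 * z2 * z2
            total += math.exp(-0.5 * quad)
    return total * hx * hy

def normalization_closed(K):
    """sqrt(|2 pi K|) = sqrt((2 pi)^2 det K) for N = 2."""
    (a, b), (_, c) = K
    return math.sqrt((2 * math.pi) ** 2 * (a * c - b * b))

def eigenvalues_2x2(K):
    """Eigenvalues of a symmetric 2x2 matrix in closed form."""
    (a, b), (_, c) = K
    mid = (a + c) / 2
    rad = math.sqrt(((a - c) / 2) ** 2 + b * b)
    return mid + rad, mid - rad

if __name__ == "__main__":
    K = ((2.0, 0.6), (0.6, 1.0))  # illustrative positive definite covariance
    lam1, lam2 = eigenvalues_2x2(K)
    print(normalization_numeric(K))
    print(normalization_closed(K))
    print(math.sqrt(2 * math.pi * lam1) * math.sqrt(2 * math.pi * lam2))
```

All three numbers should coincide, reflecting that the determinant is the product of the eigenvalues $\lambda_\mu$.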


The multivariable Gaussian probability distribution with covariance $K_{\mu\nu}$ and mean $s_\mu$ is $$p(z)=\frac{1}{\sqrt{|2\pi K|}}\exp \left[ -\frac{1}{2}\sum^N_{\mu,\nu=1}(z-s)_\mu K^{\mu\nu}(z-s)_\nu\right],$$ where $K^{\mu\nu}\equiv (K^{-1})_{\mu\nu}$.


As an example of observables, the moments of the mean-zero ($s_\mu=0$) multivariable Gaussian distribution are $$\mathbb{E}[z_{\mu_1}\cdots z_{\mu_M}]\equiv \int d^Nz p(z) z_{\mu_1}\cdots z_{\mu_M}=\frac{1}{\sqrt{|2\pi K|}}\int d^Nz\exp \left( -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu K^{\mu\nu}z_\nu\right) z_{\mu_1}\cdots z_{\mu_M}=\frac{I_{K,(\mu_1,\cdots,\mu_M)}}{I_K},$$ where $I_{K,(\mu_1,\cdots,\mu_M)}$ denotes the Gaussian integral with the insertion $z_{\mu_1}\cdots z_{\mu_M}$.


Let's construct the generating function for the integrals $I_{K,(\mu_1,\cdots,\mu_M)}$ by including a source $J^\mu$ as $$Z_{K,J}\equiv \int d^Nz \exp \left( -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu K^{\mu\nu}z_\nu + \sum_{\mu=1}^N J^\mu z_\mu \right).$$ Then the moments can be calculated from the generating function: $$\left. \left[ \frac{d}{dJ^{\mu_1}}\frac{d}{dJ^{\mu_2}}\cdots\frac{d}{dJ^{\mu_M}}Z_{K,J}\right] \right|_{J=0}=\int d^Nz \exp \left( -\frac{1}{2} \sum_{\mu,\nu=1}^N z_\mu K^{\mu\nu} z_\nu\right) z_{\mu_1}\cdots z_{\mu_M} = I_{K,(\mu_1,\cdots, \mu_M)}.$$


To evaluate the generating function $Z_{K,J}$ in closed form, we complete the square in the exponent: $$ -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu K^{\mu\nu}z_\nu + \sum_{\mu=1}^N J^\mu z_\mu \\ = -\frac{1}{2}\sum^N_{\mu,\nu=1}\left( z_\mu -\sum_{\rho=1}^N K_{\mu\rho}J^\rho\right) K^{\mu\nu}\left(z_\nu-\sum_{\lambda=1}^NK_{\nu\lambda}J^\lambda\right) + \frac{1}{2}\sum_{\mu,\nu=1}^NJ^\mu K_{\mu\nu}J^\nu\\ = -\frac{1}{2}\sum_{\mu,\nu=1}^Nw_\mu K^{\mu\nu}w_\nu+\frac{1}{2}\sum_{\mu,\nu=1}^NJ^\mu K_{\mu\nu}J^\nu,$$ where the shifted variable is $w_\mu\equiv z_\mu-\sum_{\rho=1}^N K_{\mu\rho}J^\rho$. Using this substitution, the generating function can be evaluated explicitly: $$Z_{K,J}=\exp \left( \frac{1}{2} \sum^N_{\mu,\nu=1}J^\mu K_{\mu\nu} J^\nu\right) \int d^N w \exp \left[ -\frac{1}{2} \sum^N_{\mu,\nu=1} w_\mu K^{\mu\nu} w_\nu\right] \\ = \sqrt{|2\pi K|} \exp \left( \frac{1}{2} \sum_{\mu,\nu=1}^N J^\mu K_{\mu\nu} J^\nu\right).$$


For odd $M=2m+1$, the moments vanish.


For $M=2m$, $$\mathbb{E}[z_{\mu_1}\cdots z_{\mu_{2m}}]=\frac{I_{K,(\mu_1,\cdots,\mu_{2m})}}{I_K}=\frac{1}{I_K} \left. \left[ \frac{d}{dJ^{\mu_1}}\cdots \frac{d}{dJ^{\mu_{2m}}}Z_{K,J}\right] \right|_{J=0}\\ =\frac{1}{2^mm!}\frac{d}{dJ^{\mu_1}}\frac{d}{dJ^{\mu_2}}\cdots\frac{d}{dJ^{\mu_{2m}}}\left( \sum_{\mu,\nu=1}^N J^\mu K_{\mu\nu} J^\nu \right)^m.$$


For $2m=2$, $$\mathbb{E}[z_{\mu_1}z_{\mu_2}]=K_{\mu_1\mu_2}.$$


For general $2m$, $$\mathbb{E}[z_{\mu_1}\cdots z_{\mu_{2m}}]=\sum_{\mbox{all pairings}} K_{\mu_{k_1}\mu_{k_2}}\cdots K_{\mu_{k_{2m-1}}\mu_{k_{2m}}},$$ where the sum is over all the possible distinct pairings of the $2m$ auxiliary indices under $\mu$, so that the result has $(2m-1)!!$ terms, matching the single-variable count. This is called Wick's theorem.

Each factor of the covariance $K_{\mu\nu}$ in a term of the sum is called a Wick contraction.
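
The sum over pairings is easy to enumerate by machine. Below is a minimal sketch, assuming 0-based indices and a covariance stored as a nested list; `all_pairings` and `wick_moment` are names made up for this example.

```python
def all_pairings(positions):
    """Yield every way to split an even-length list into unordered pairs."""
    if not positions:
        yield []
        return
    first, rest = positions[0], positions[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for pairing in all_pairings(remaining):
            yield [(first, partner)] + pairing

def wick_moment(K, indices):
    """E[z_{mu_1} ... z_{mu_M}] for a zero-mean Gaussian with covariance K."""
    if len(indices) % 2 == 1:
        return 0.0  # odd moments vanish
    total = 0.0
    # Pair up positions rather than index values, so repeated indices are handled.
    for pairing in all_pairings(list(range(len(indices)))):
        term = 1.0
        for p, q in pairing:
            term *= K[indices[p]][indices[q]]  # one Wick contraction
        total += term
    return total

if __name__ == "__main__":
    K = [[2.0, 0.5], [0.5, 1.0]]  # illustrative covariance
    print(sum(1 for _ in all_pairings(list(range(6)))))  # number of pairings of 6 items
    print(wick_moment(K, [0, 0, 1, 1]))
```

For four indices there are three pairings, so `wick_moment(K, [0, 0, 1, 1])` evaluates $K_{00}K_{11}+2K_{01}^2$.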


1.2 Probability, Correlation and Statistics, and All That

Given a probability distribution $p(z)$ of an $N$-dimensional random variable $z_\mu$, we can learn about its statistics by measuring functions of $z_\mu$. We'll refer to such measurable functions in a generic sense as observables and denote them as $\mathcal{O}(z)$. The expectation value of an observable $$\mathbb{E}[\mathcal{O}(z)]\equiv \int d^N z p(z)\mathcal{O}(z)$$ characterizes the mean value of the random function $\mathcal{O}(z)$.


The moments or $M$-point correlators of $z_\mu$ are given by the expectation $$\mathbb{E}[z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}]=\int d^Nz p(z) z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}.$$ In principle, knowing the $M$-point correlators of a distribution lets us compute the expectation value of any analytic observable $\mathcal{O}(z)$ via Taylor expansion $$\mathbb{E}[\mathcal{O}(z)]=\mathbb{E}\left[ \sum_{M=0}^\infty \frac{1}{M!} \sum_{\mu_1,\cdots,\mu_M=1}^N \left. \frac{\partial^M\mathcal{O}}{\partial z_{\mu_1}\cdots \partial z_{\mu_M}} \right|_{z=0} z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}\right] \\ = \sum_{M=0}^\infty \frac{1}{M!} \sum_{\mu_1,\cdots,\mu_M=1}^N \left. \frac{\partial^M\mathcal{O}}{\partial z_{\mu_1}\cdots \partial z_{\mu_M}} \right|_{z=0} \mathbb{E}[z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}].$$


For nearly-Gaussian distributions, a useful set of observables is given by what statisticians call cumulants and physicists call connected correlators.


The connected one-point correlator $$\mathbb{E}[z_\mu]|_{\mbox{connected}}\equiv \mathbb{E}[z_\mu]$$ is just the mean.


The connected two-point correlator is $$\mathbb{E}[z_\mu z_\nu]|_\mbox{connected}\equiv \mathbb{E}[z_\mu z_\nu]-\mathbb{E}[z_\mu] \mathbb{E}[z_\nu] \\ =\mathbb{E}[(z_\mu-\mathbb{E}[z_\mu])(z_\nu-\mathbb{E}[z_\nu])]\equiv \mbox{Cov}[z_\mu,z_\nu],$$ which is also known as the covariance of the distribution. The quantity $\widehat{\Delta z}_\mu\equiv z_\mu-\mathbb{E}[z_\mu]$ represents a fluctuation of the random variable around its mean.


For a Gaussian distribution, every connected correlator of more than two points vanishes. In general, the $M$-th moment is defined in terms of connected correlators of degree $1$ to $M$: $$\mathbb{E}[z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}]\equiv \left. \mathbb{E}[z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}]\right|_\mbox{connected} + \sum_\mbox{all subdivisions} \left. \mathbb{E}\left[ z_{\mu_{k_1^{[1]}}}\cdots z_{\mu_{k_{\nu_1}^{[1]}}}\right] \right|_\mbox{connected} \cdots \left. \mathbb{E}\left[ z_{\mu_{k_1^{[s]}}}\cdots z_{\mu_{k_{\nu_s}^{[s]}}}\right] \right|_\mbox{connected},$$ where the sum is over all the possible subdivisions of $M$ variables into $s>1$ clusters of sizes $(\nu_1,\cdots,\nu_s)$ as $(k_1^{[1]},\cdots,k_{\nu_1}^{[1]}),\cdots,(k_1^{[s]},\cdots,k_{\nu_s}^{[s]})$. By decomposing the $M$-th moment into a sum of products of connected correlators of degree $M$ and lower, we see that the connected $M$-point correlator corresponds to a new type of correlation that cannot be expressed by the connected correlators of a lower degree.
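
The decomposition into clusters is a sum over set partitions, which can be sketched in a few lines. The code below is an illustration under stated assumptions: indices are 0-based, the connected correlators are supplied as a hypothetical callback `connected(block)`, and the Gaussian case is used as the test distribution (mean and covariance as the only non-zero connected correlators).

```python
def set_partitions(items):
    """Yield all partitions of a list into non-empty blocks (clusters)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Either `first` forms its own block ...
        yield [[first]] + partition
        # ... or it joins one of the existing blocks.
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]

def moment_from_connected(indices, connected):
    """Rebuild E[z_{mu_1} ... z_{mu_M}] from connected correlators.

    The single-block partition is the fully connected term; the rest are the
    subdivisions into s > 1 clusters.
    """
    total = 0.0
    for partition in set_partitions(list(range(len(indices)))):
        term = 1.0
        for block in partition:
            term *= connected(tuple(indices[p] for p in block))
        total += term
    return total

def gaussian_connected(mean, K):
    """Connected correlators of a Gaussian: mean, covariance, then zero."""
    def connected(block):
        if len(block) == 1:
            return mean[block[0]]
        if len(block) == 2:
            return K[block[0]][block[1]]
        return 0.0
    return connected

if __name__ == "__main__":
    conn = gaussian_connected([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]])
    print(moment_from_connected([0, 1], conn))        # K_01 + s_0 s_1
    print(moment_from_connected([0, 0, 0, 0], conn))  # s^4 + 6 s^2 K + 3 K^2
```

With zero mean only the partitions into pairs survive, and the formula reduces to the sum over Wick pairings.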


1.3 Nearly-Gaussian Distributions

The action $S(z)$ is a function that defines a probability distribution $p(z)$ through the relation $$p(z)\propto e^{-S(z)}.$$ In the statistics literature, the action $S(z)$ is sometimes called the negative log probability. The distribution must be normalized, $$\int d^Nz p(z)=1,$$ which is achieved by dividing by the normalization factor, or partition function, $$Z\equiv \int d^Nz e^{-S(z)},$$ so that $p(z)=e^{-S(z)}/Z$.


Quadratic action and the Gaussian distribution

For the Gaussian distribution we can identify the action as $$S(z)=\frac{1}{2} \sum^N_{\mu,\nu=1} K^{\mu\nu} z_\mu z_\nu,$$ and the partition function is then $$Z=\int d^Nz e^{-S(z)}=I_K=\sqrt{|2\pi K|}.$$ This quadratic action is the simplest normalizable action and serves as a starting point for defining other distributions.


For an observable $\mathcal{O}(z)$, define a Gaussian expectation as $$\left\langle \mathcal{O}(z)\right\rangle_K \equiv \frac{1}{\sqrt{|2\pi K|}} \int \left[ \prod_{\mu=1}^N dz_\mu \right] \exp \left( -\frac{1}{2}\sum_{\mu,\nu=1}^N K^{\mu\nu} z_\mu z_\nu\right) \mathcal{O}(z).$$ In particular, with this notation we can write Wick's theorem as $$\left\langle z_{\mu_1}z_{\mu_2}\cdots z_{\mu_{2m}}\right\rangle_K=\sum_{\mbox{all pairings}} K_{\mu_{k_1}\mu_{k_2}}\cdots K_{\mu_{k_{2m-1}}\mu_{k_{2m}}}.$$ (For a Gaussian distribution, $\mathbb{E}[\cdot]$ and $\left\langle \cdot\right\rangle_K$ coincide, but for a general distribution they do not.)


Quartic action and perturbation theory

Let's find an action that represents a nearly-Gaussian distribution with a connected four-point correlator that is small but non-vanishing: $$\mathbb{E}[z_{\mu_1}z_{\mu_2}z_{\mu_3}z_{\mu_4}]|_\mbox{connected} = O(\epsilon).$$ Here we have introduced a small parameter $\epsilon \ll 1$ and indicated that the correlator should be of order $\epsilon$. For neural networks, we will later find that the role of the small parameter $\epsilon$ is played by $1/\mbox{width}$.


Consider the quartic action $$S(z)=\frac{1}{2}\sum_{\mu,\nu=1}^N K^{\mu\nu}z_\mu z_\nu + \frac{\epsilon}{4!} \sum^N_{\mu,\nu,\rho,\lambda=1} V^{\mu\nu\rho\lambda}z_\mu z_\nu z_\rho z_\lambda,$$ where the quartic coupling $\epsilon V^{\mu\nu\rho\lambda}$ is an $(N\times N\times N \times N)$-dimensional tensor that is completely symmetric in all of its four indices.


The partition function is $$Z=\int \left[ \prod_\mu dz_\mu \right] e^{-S(z)}\\ =\int \left[ \prod_\mu dz_\mu \right] \exp \left( -\frac{1}{2}\sum_{\mu,\nu} K^{\mu\nu} z_\mu z_\nu -\frac{\epsilon}{24}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1 \rho_2 \rho_3 \rho_4}z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\right) \\ =\sqrt{|2\pi K|} \left\langle \exp \left( -\frac{\epsilon}{24} \sum_{\rho_1,\cdots, \rho_4} V^{\rho_1 \rho_2 \rho_3 \rho_4}z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\right)\right\rangle_K.$$ Now we use perturbation theory, expanding to first order in $\epsilon$: $$Z=\sqrt{|2\pi K|} \left\langle 1-\frac{\epsilon}{24}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}z_{\rho_1}z_{\rho_2}z_{\rho_3}z_{\rho_4}+O(\epsilon^2)\right\rangle_K \\ =\sqrt{|2\pi K|} \left[ 1-\frac{\epsilon}{24}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}\langle z_{\rho_1}z_{\rho_2}z_{\rho_3}z_{\rho_4}\rangle_K+O(\epsilon^2)\right] \\ =\sqrt{|2\pi K|} \left[ 1-\frac{\epsilon}{24}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}(K_{\rho_1\rho_2}K_{\rho_3\rho_4}+K_{\rho_1\rho_3}K_{\rho_2\rho_4}+K_{\rho_1\rho_4}K_{\rho_2\rho_3})+O(\epsilon^2)\right] \\ =\sqrt{|2\pi K|} \left[ 1-\frac{\epsilon}{8}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}K_{\rho_1\rho_2}K_{\rho_3\rho_4}+O(\epsilon^2)\right].$$ In the last step, the three Wick pairings contribute equally because $V^{\rho_1\rho_2\rho_3\rho_4}$ is completely symmetric.
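
For a single variable ($N=1$) the first-order result reads $Z\approx\sqrt{2\pi K}\,(1-\epsilon VK^2/8)$, which can be compared against a direct numerical integral. This is a rough sketch with assumed values $K=V=1$ and a simple midpoint integrator; the function names are illustrative.

```python
import math

def quartic_partition_numeric(K, V, eps, half_width=12.0, steps=40_000):
    """Midpoint-rule estimate of Z for S(z) = z^2/(2K) + eps V z^4 / 24 (N = 1)."""
    sigma = math.sqrt(K)
    h = 2 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        z = -half_width * sigma + (i + 0.5) * h
        total += math.exp(-z * z / (2 * K) - eps * V * z ** 4 / 24)
    return total * h

def quartic_partition_first_order(K, V, eps):
    """First-order perturbative result sqrt(2 pi K) (1 - eps V K^2 / 8)."""
    return math.sqrt(2 * math.pi * K) * (1 - eps * V * K * K / 8)

if __name__ == "__main__":
    K, V = 1.0, 1.0  # illustrative values; V > 0 keeps the action normalizable
    for eps in (0.01, 0.02, 0.04):
        exact = quartic_partition_numeric(K, V, eps)
        approx = quartic_partition_first_order(K, V, eps)
        print(eps, exact, approx, exact - approx)
```

Doubling $\epsilon$ should roughly quadruple the residual, consistent with the neglected terms being $O(\epsilon^2)$.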


Using a similar calculation, we can evaluate the two-point correlator: $$\mathbb{E}[z_{\mu_1} z_{\mu_2}]=\frac{1}{Z}\int \left[ \prod_\mu dz_\mu \right] e^{-S(z)} z_{\mu_1}z_{\mu_2}\\ =K_{\mu_1\mu_2}-\frac{\epsilon}{2}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}K_{\mu_1\rho_1}K_{\mu_2\rho_2}K_{\rho_3\rho_4}+O(\epsilon^2).$$ (Here we expand the factor $1/Z$ using $1/(1-x)=1+x+O(x^2)$.)


For the connected four-point correlator, a similar but longer calculation gives $$\mathbb{E}[z_{\mu_1}z_{\mu_2}z_{\mu_3}z_{\mu_4}]|_\mbox{connected}=-\epsilon \sum_{\rho_1,\cdots,\rho_4}V^{\rho_1\rho_2\rho_3\rho_4}K_{\mu_1\rho_1}K_{\mu_2\rho_2}K_{\mu_3\rho_3}K_{\mu_4\rho_4}+O(\epsilon^2).$$
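
In the single-variable case these predictions become $\mathbb{E}[z^2]\approx K-\epsilon VK^3/2$ and $\mathbb{E}[z^4]|_\mbox{connected}=\mathbb{E}[z^4]-3\mathbb{E}[z^2]^2\approx-\epsilon VK^4$ (the odd moments vanish by symmetry). Here is a hedged numerical sketch, again with assumed values $K=V=1$ and hypothetical function names.

```python
import math

def quartic_expectation(power, K, V, eps, half_width=12.0, steps=40_000):
    """E[z^power] under p(z) proportional to exp(-z^2/(2K) - eps V z^4 / 24)."""
    sigma = math.sqrt(K)
    h = 2 * half_width * sigma / steps
    numerator = 0.0
    normalization = 0.0
    for i in range(steps):
        z = -half_width * sigma + (i + 0.5) * h
        weight = math.exp(-z * z / (2 * K) - eps * V * z ** 4 / 24)
        numerator += weight * z ** power
        normalization += weight
    return numerator / normalization  # the grid spacing cancels in the ratio

def connected_four_point(K, V, eps):
    """Fourth cumulant of the zero-mean distribution: E[z^4] - 3 E[z^2]^2."""
    m2 = quartic_expectation(2, K, V, eps)
    m4 = quartic_expectation(4, K, V, eps)
    return m4 - 3 * m2 * m2

if __name__ == "__main__":
    K, V, eps = 1.0, 1.0, 0.01  # illustrative values
    print(quartic_expectation(2, K, V, eps), K - eps * V * K ** 3 / 2)
    print(connected_four_point(K, V, eps), -eps * V * K ** 4)
```

The numerical and perturbative values should differ only by terms of order $\epsilon^2$, and the connected four-point correlator vanishes when $\epsilon=0$.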


With $\epsilon$ identified as $1/n$, where $n$ is the network width, this type of perturbative expansion is known as the $1/n$ expansion or large-$n$ expansion.

Aside: statistical independence and interactions

Two random variables $x$ and $y$ are statistically independent if their joint distribution factorizes as $p(x,y)=p(x)p(y).$


Interaction is the breakdown of statistical independence.


Nearly-Gaussian actions

A general non-Gaussian action is $$S(z)=\frac{1}{2}\sum_{\mu,\nu=1}^N K^{\mu\nu} z_\mu z_\nu + \sum_{m=2}^k \frac{1}{(2m)!} \sum_{\mu_1,\cdots,\mu_{2m}=1}^Ns^{\mu_1\cdots \mu_{2m}}z_{\mu_1}\cdots z_{\mu_{2m}}.$$


The coefficients $s^{\mu_1\cdots \mu_{2m}}$ are generally known as non-Gaussian couplings, and they control the interactions of the $z_\mu$.