[Deep Learning Principle] 1. Pretraining
This article is part of the Textbook Commentary Project.
Let's study effective field theory.
1.1 Gaussian Integrals
Single-variable Gaussian integrals
We define the Gaussian probability distribution with zero mean and unit variance as p(z)\equiv \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}, which is properly normalized: \int^\infty_{-\infty}dz\, p(z)=1.
Extending to variance K>0 and mean s, p(z)\equiv \frac{1}{\sqrt{2\pi K}}e^{-\frac{(z-s)^2}{2K}}, which satisfies \mathbb{E}[z]=s and \mathbb{E}[z^2]-\mathbb{E}[z]^2=K.
Given an observable \mathcal{O}(z), its expectation value is \mathbb{E}[\mathcal{O}(z)]\equiv \int_{-\infty}^\infty dz\, p(z)\, \mathcal{O}(z); for the mean-zero case (s=0) this reads \mathbb{E}[\mathcal{O}(z)]=\frac{1}{\sqrt{2\pi K}}\int^\infty_{-\infty} dz\, e^{-\frac{z^2}{2K}}\mathcal{O}(z).
An important class of observables is the moments: \mathbb{E}[z^M]=\frac{1}{\sqrt{2\pi K}}\int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}}z^M. (Moments with odd M vanish, since the integrand is odd.) For even M=2m we can evaluate the integral using the trick z^{2m}e^{-\frac{z^2}{2K}}=\left(2K^2\frac{d}{dK}\right)^m e^{-\frac{z^2}{2K}}: I_{K,m}\equiv \int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}}z^{2m}=\left(2K^2\frac{d}{dK}\right)^m\int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}}=\left(2K^2\frac{d}{dK}\right)^m I_K=\left(2K^2\frac{d}{dK}\right)^m\sqrt{2\pi}K^{\frac{1}{2}}=\sqrt{2\pi}K^{\frac{2m+1}{2}}(2m-1)(2m-3)\cdots 1, where I_K\equiv \int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}}=\sqrt{2\pi K}. Finally, \mathbb{E}[z^{2m}]=\frac{I_{K,m}}{\sqrt{2\pi K}}=K^m (2m-1)!!; this is Wick's theorem for the single-variable Gaussian distribution.
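As a quick numerical sanity check, we can compare this closed form against direct integration; below is a minimal sketch (the value K=1.7 and the range of m are arbitrary test choices):

```python
# Check E[z^(2m)] = K^m (2m-1)!! by direct numerical integration.
# A minimal sketch; K = 1.7 and m = 0..3 are arbitrary test values.
import numpy as np
from scipy.integrate import quad

def double_factorial(n):
    """Compute (2m-1)!!, with the convention (-1)!! = 1."""
    out = 1
    while n > 1:
        out *= n
        n -= 2
    return out

K = 1.7
for m in range(4):
    moment, _ = quad(
        lambda z: z**(2 * m) * np.exp(-z**2 / (2 * K)) / np.sqrt(2 * np.pi * K),
        -np.inf, np.inf,
    )
    print(f"m={m}: quadrature={moment:.6f}, closed form={K**m * double_factorial(2 * m - 1):.6f}")
```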
Introducing a source term J, the partition function with source (the generating function) is Z_{K,J}\equiv \int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}+Jz}. We can evaluate Z_{K,J} by completing the square in the exponent, -\frac{z^2}{2K}+Jz=-\frac{(z-JK)^2}{2K}+\frac{KJ^2}{2}, which lets us rewrite the partition function as Z_{K,J}=e^{\frac{KJ^2}{2}}\int^\infty_{-\infty} dz\, e^{-\frac{(z-JK)^2}{2K}}=e^{\frac{KJ^2}{2}}I_K=e^{\frac{KJ^2}{2}}\sqrt{2\pi K}.
The source term is a convenient tool for computing expectation values. For example, for the moments we have I_{K,m}=\int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}}z^{2m}=\left. \left[ \left(\frac{d}{dJ}\right)^{2m} \int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}+Jz}\right]\right|_{J=0}=\left. \left[ \left( \frac{d}{dJ}\right)^{2m}Z_{K,J}\right] \right|_{J=0}. Then \mathbb{E}[z^{2m}]=\frac{I_{K,m}}{\sqrt{2\pi K}}=\left. \left[ \left( \frac{d}{dJ}\right)^{2m} e^{\frac{KJ^2}{2}}\right] \right|_{J=0}=\left. \left\{ \left( \frac{d}{dJ}\right)^{2m}\left[ \sum_{k=0}^\infty \frac{1}{k!}\left(\frac{K}{2}\right)^kJ^{2k}\right]\right\}\right|_{J=0}=\left(\frac{d}{dJ}\right)^{2m}\left[ \frac{1}{m!}\left(\frac{K}{2}\right)^mJ^{2m}\right] = K^m \frac{(2m)!}{2^m m!}=K^m(2m-1)!!, since only the k=m term of the series survives both the 2m derivatives and the setting of J=0. This is a second derivation of Wick's theorem for the single-variable Gaussian distribution.
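The same derivation can be replayed symbolically, differentiating e^{\frac{KJ^2}{2}} with respect to the source and setting J=0; a minimal sketch (the range of m is an arbitrary choice):

```python
# Symbolic check: (d/dJ)^(2m) exp(K J^2 / 2) at J = 0 equals K^m (2m-1)!!.
# A minimal sketch using sympy; the range of m is an arbitrary test choice.
import sympy as sp

K, J = sp.symbols('K J', positive=True)
for m in range(1, 5):
    moment = sp.diff(sp.exp(K * J**2 / 2), J, 2 * m).subs(J, 0)
    print(f"m={m}: {sp.factor(moment)}")  # expect K, 3*K**2, 15*K**3, 105*K**4
```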
Multivariable Gaussian integrals
For an N-dimensional variable z_\mu with \mu=1,\cdots,N, the multivariable Gaussian function is defined as \exp \left[ -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu (K^{-1})_{\mu\nu}z_\nu\right], where the variance or covariance matrix K_{\mu\nu} is an N-by-N symmetric positive definite matrix, and its inverse (K^{-1})_{\mu\nu} is defined so that their matrix product gives the N-by-N identity matrix, \sum_{\rho=1}^N(K^{-1})_{\mu\rho}K_{\rho\nu}=\delta_{\mu\nu}, where the Kronecker delta \delta_{\mu\nu} represents the identity matrix.
To calculate the normalization factor I_K\equiv \int d^Nz\,\exp \left[ -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu (K^{-1})_{\mu\nu}z_\nu\right] =\int_{-\infty}^\infty dz_1 \int_{-\infty}^\infty dz_2 \cdots \int_{-\infty}^\infty dz_N \exp \left[ -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu (K^{-1})_{\mu\nu}z_\nu\right], we diagonalize K with an orthogonal matrix O_{\mu\nu}, (OKO^T)_{\mu\nu}=\lambda_\mu \delta_{\mu\nu}; equivalently, its inverse diagonalizes as (OK^{-1}O^T)_{\mu\nu}=(1/\lambda_\mu)\delta_{\mu\nu}. Then we can simplify \sum^N_{\mu,\nu=1}z_\mu (K^{-1})_{\mu\nu}z_\nu=\sum^N_{\mu,\rho,\sigma,\nu=1}z_\mu (O^TO)_{\mu\rho}(K^{-1})_{\rho\sigma}(O^TO)_{\sigma\nu}z_\nu =\sum^N_{\mu,\nu=1}(Oz)_\mu (OK^{-1}O^T)_{\mu\nu}(Oz)_\nu=\sum_{\mu=1}^N\frac{1}{\lambda_\mu}(Oz)^2_\mu. The change of variables u_\mu\equiv (Oz)_\mu with an orthogonal matrix O leaves the integration measure invariant, i.e., d^Nz=d^Nu. Finally, I_K= \int d^Nu\,\exp \left[ -\frac{1}{2}\sum^N_{\mu=1}\frac{u_\mu^2}{\lambda_\mu}\right]=\prod_{\mu=1}^N\left[ \int_{-\infty}^\infty du_\mu \exp \left( -\frac{u_\mu^2}{2\lambda_\mu}\right)\right] = \prod_{\mu=1}^N\sqrt{2\pi \lambda_\mu}=\sqrt{\prod_{\mu=1}^N(2\pi \lambda_\mu)}=\sqrt{|2\pi K|}, where |\cdot| denotes the determinant and the last equality holds because the determinant of K is the product of its eigenvalues \lambda_\mu.
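A minimal numerical sketch of that last identity, \prod_\mu (2\pi\lambda_\mu)=|2\pi K| (the random covariance below is an arbitrary test choice):

```python
# Check that prod_mu (2*pi*lambda_mu) = |2*pi*K|, so I_K = sqrt(|2*pi*K|).
# A minimal sketch; the random symmetric positive definite K is arbitrary.
import numpy as np

rng = np.random.default_rng(0)
N = 4
A = rng.normal(size=(N, N))
K = A @ A.T + N * np.eye(N)      # symmetric positive definite covariance
lam = np.linalg.eigvalsh(K)      # eigenvalues lambda_mu of K
print(np.prod(2 * np.pi * lam))          # product of 2*pi*lambda_mu
print(np.linalg.det(2 * np.pi * K))      # determinant |2*pi*K|
```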
The multivariable Gaussian probability distribution with variance K_{\mu\nu} and mean s_\mu is p(z)=\frac{1}{\sqrt{|2\pi K|}}\exp \left[ -\frac{1}{2}\sum^N_{\mu,\nu=1}(z-s)_\mu K^{\mu\nu}(z-s)_\nu\right], where we introduce the shorthand K^{\mu\nu}\equiv (K^{-1})_{\mu\nu}.
As an example of observables, consider the moments of the mean-zero (s=0) multivariable Gaussian distribution: \mathbb{E}[z_{\mu_1}\cdots z_{\mu_M}]\equiv \int d^Nz\, p(z)\, z_{\mu_1}\cdots z_{\mu_M}=\frac{1}{\sqrt{|2\pi K|}}\int d^Nz\,\exp \left( -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu K^{\mu\nu}z_\nu\right) z_{\mu_1}\cdots z_{\mu_M}=\frac{I_{K,(\mu_1,\cdots,\mu_M)}}{I_K}.
Let's construct the generating function for the integrals I_{K,(\mu_1,\cdots,\mu_M)} by including a source J^\mu: Z_{K,J}\equiv \int d^Nz \exp \left( -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu K^{\mu\nu}z_\nu + \sum_{\mu=1}^N J^\mu z_\mu \right). The moments can then be calculated from the generating function: \left. \left[ \frac{d}{dJ^{\mu_1}}\frac{d}{dJ^{\mu_2}}\cdots\frac{d}{dJ^{\mu_M}}Z_{K,J}\right] \right|_{J=0}=\int d^Nz \exp \left( -\frac{1}{2} \sum_{\mu,\nu=1}^N z_\mu K^{\mu\nu} z_\nu\right) z_{\mu_1}\cdots z_{\mu_M} = I_{K,(\mu_1,\cdots, \mu_M)}.
To evaluate the generating function Z_{K,J} in closed form, we complete the square: -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu K^{\mu\nu}z_\nu + \sum_{\mu=1}^N J^\mu z_\mu = -\frac{1}{2}\sum^N_{\mu,\nu=1}\left( z_\mu -\sum_{\rho=1}^N K_{\mu\rho}J^\rho\right) K^{\mu\nu}\left(z_\nu-\sum_{\lambda=1}^NK_{\nu\lambda}J^\lambda\right) + \frac{1}{2}\sum_{\mu,\nu=1}^NJ^\mu K_{\mu\nu}J^\nu = -\frac{1}{2}\sum_{\mu,\nu=1}^Nw_\mu K^{\mu\nu}w_\nu+\frac{1}{2}\sum_{\mu,\nu=1}^NJ^\mu K_{\mu\nu}J^\nu, where the shifted variable is w_\mu\equiv z_\mu-\sum_{\rho=1}^N K_{\mu\rho}J^\rho. Using this substitution, the generating function can be evaluated explicitly: Z_{K,J}=\exp \left( \frac{1}{2} \sum^N_{\mu,\nu=1}J^\mu K_{\mu\nu} J^\nu\right) \int d^N w \exp \left[ -\frac{1}{2} \sum^N_{\mu,\nu=1} w_\mu K^{\mu\nu} w_\nu\right] = \sqrt{|2\pi K|} \exp \left( \frac{1}{2} \sum_{\mu,\nu=1}^N J^\mu K_{\mu\nu} J^\nu\right).
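In probabilistic terms, this closed form says that the moment-generating function of a mean-zero Gaussian is \mathbb{E}\left[ e^{\sum_\mu J^\mu z_\mu}\right]=\frac{Z_{K,J}}{\sqrt{|2\pi K|}}=\exp \left( \frac{1}{2}\sum_{\mu,\nu}J^\mu K_{\mu\nu}J^\nu\right), which is easy to check by Monte Carlo; a minimal sketch (K, J, and the sample size are arbitrary test choices):

```python
# Monte Carlo check of E[exp(J . z)] = exp(J^T K J / 2) for a mean-zero
# Gaussian z with covariance K.  K, J, and the sample size are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
J = np.array([0.3, -0.2])
z = rng.multivariate_normal(mean=np.zeros(2), cov=K, size=1_000_000)
print(np.exp(z @ J).mean())       # Monte Carlo estimate of E[exp(J . z)]
print(np.exp(0.5 * J @ K @ J))    # closed-form exp(J^T K J / 2)
```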
For odd M=2m+1, the moments vanish, since the integrand is odd under z\to -z.
For even M=2m, \mathbb{E}[z_{\mu_1}\cdots z_{\mu_{2m}}]=\frac{I_{K,(\mu_1,\cdots,\mu_{2m})}}{I_K}=\frac{1}{I_K} \left. \left[ \frac{d}{dJ^{\mu_1}}\cdots \frac{d}{dJ^{\mu_{2m}}}Z_{K,J}\right] \right|_{J=0} =\frac{1}{2^mm!}\frac{d}{dJ^{\mu_1}}\frac{d}{dJ^{\mu_2}}\cdots\frac{d}{dJ^{\mu_{2m}}}\left( \sum_{\mu,\nu=1}^N J^\mu K_{\mu\nu} J^\nu \right)^m, where only the m-th term of the exponential series survives the 2m derivatives evaluated at J=0.
For 2m=2, \mathbb{E}[z_{\mu_1} z_{\mu_2}]=K_{\mu_1\mu_2}.
For general 2m, \mathbb{E}[z_{\mu_1}\cdots z_{\mu_{2m}}]=\sum_{\mbox{all pairings}} K_{\mu_{k_1}\mu_{k_2}}\cdots K_{\mu_{k_{2m-1}}\mu_{k_{2m}}}, where, to reiterate, the sum is over all the possible distinct pairings of the 2m auxiliary indices under \mu, so that the result has the (2m-1)!! terms described above. This is Wick's theorem.
Each factor of the covariance K_{\mu\nu} in a term in the sum is called a Wick contraction.
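Wick's theorem also translates directly into code: enumerate all distinct pairings of the indices and multiply the corresponding covariance factors, one Wick contraction per pair. A minimal sketch (the helper names and test values are our own choices), checked against a Monte Carlo estimate:

```python
# Wick's theorem: sum over all (2m-1)!! distinct pairings, each pair
# contributing a covariance factor K[mu_i, mu_j].  A minimal sketch.
import numpy as np

def pairings(idx):
    """Yield all distinct partitions of the list idx into pairs."""
    if not idx:
        yield []
        return
    first, rest = idx[0], idx[1:]
    for i in range(len(rest)):
        for tail in pairings(rest[:i] + rest[i+1:]):
            yield [(first, rest[i])] + tail

def wick_moment(K, mus):
    """E[z_{mu_1} ... z_{mu_2m}] for a mean-zero Gaussian with covariance K."""
    return sum(np.prod([K[a, b] for a, b in p]) for p in pairings(list(mus)))

K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
mus = (0, 1, 0, 1)  # E[z_0 z_1 z_0 z_1] = K00*K11 + 2*K01^2
print(wick_moment(K, mus))

rng = np.random.default_rng(2)
z = rng.multivariate_normal(np.zeros(2), K, size=1_000_000)
print(np.prod(z[:, list(mus)], axis=1).mean())  # Monte Carlo estimate
```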
1.2 Probability, Correlation and Statistics, and All That
Given a probability distribution p(z) of an N-dimensional random variable z_\mu, we can learn about its statistics by measuring functions of z_\mu. We'll refer to such measurable functions in a generic sense as observables and denote them as \mathcal{O}(z). The expectation value of an observable \mathbb{E}[\mathcal{O}(z)]\equiv \int d^N z p(z)\mathcal{O}(z) characterizes the mean value of the random function \mathcal{O}(z).
The moments or M-point correlators of z_\mu are given by the expectation \mathbb{E}[z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}]=\int d^Nz\, p(z)\, z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}. In principle, knowing the M-point correlators of a distribution lets us compute the expectation value of any analytic observable \mathcal{O}(z) via Taylor expansion: \mathbb{E}[\mathcal{O}(z)]=\mathbb{E}\left[ \sum_{M=0}^\infty \frac{1}{M!} \sum_{\mu_1,\cdots,\mu_M=1}^N \left. \frac{\partial^M\mathcal{O}}{\partial z_{\mu_1}\cdots \partial z_{\mu_M}} \right|_{z=0} z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}\right] = \sum_{M=0}^\infty \frac{1}{M!} \sum_{\mu_1,\cdots,\mu_M=1}^N \left. \frac{\partial^M\mathcal{O}}{\partial z_{\mu_1}\cdots \partial z_{\mu_M}} \right|_{z=0} \mathbb{E}[z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}].
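As a quick worked example of this expansion: for a single mean-zero Gaussian variable with variance K, the observable \mathcal{O}(z)=\cos z gives \mathbb{E}[\cos z]=\sum_{m=0}^\infty \frac{(-1)^m}{(2m)!}\mathbb{E}[z^{2m}]=\sum_{m=0}^\infty \frac{(-1)^m}{(2m)!}K^m(2m-1)!!=\sum_{m=0}^\infty \frac{1}{m!}\left(-\frac{K}{2}\right)^m=e^{-\frac{K}{2}}, using (2m-1)!!/(2m)!=1/(2^m m!).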
For nearly-Gaussian distributions, a useful set of observables is given by what statisticians call cumulants and physicists call connected correlators.
The one-point connected correlator, \mathbb{E}[z_\mu]|_{\mbox{connected}}\equiv \mathbb{E}[z_\mu], is just the mean.
The two-point connected correlator, \mathbb{E}[z_\mu z_\nu]|_\mbox{connected}\equiv \mathbb{E}[z_\mu z_\nu]-\mathbb{E}[z_\mu] \mathbb{E}[z_\nu] =\mathbb{E}[(z_\mu-\mathbb{E}[z_\mu])(z_\nu-\mathbb{E}[z_\nu])]\equiv \mbox{Cov}[z_\mu,z_\nu], is also known as the covariance of the distribution. The quantity \Delta z_\mu\equiv z_\mu-\mathbb{E}[z_\mu] represents a fluctuation of the random variable around its mean.
For a Gaussian distribution, all connected correlators with more than two points vanish. In general, the M-th moment can be expressed in terms of connected correlators of degree 1 through M: \mathbb{E}[z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}]\equiv \left. \mathbb{E}[z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}]\right|_\mbox{connected} + \sum_\mbox{all subdivisions} \left. \mathbb{E}\left[ z_{\mu_{k_1^{[1]}}}\cdots z_{\mu_{k_{\nu_1}^{[1]}}}\right] \right|_\mbox{connected} \cdots \left. \mathbb{E}\left[ z_{\mu_{k_1^{[s]}}}\cdots z_{\mu_{k_{\nu_s}^{[s]}}}\right] \right|_\mbox{connected}, where the sum is over all the possible subdivisions of the M variables into s>1 clusters of sizes (\nu_1,\cdots,\nu_s), as (k_1^{[1]},\cdots,k_{\nu_1}^{[1]}),\cdots,(k_1^{[s]},\cdots,k_{\nu_s}^{[s]}). By decomposing the M-th moment into a sum of products of connected correlators of degree M and lower, we see that the connected M-point correlator corresponds to a new type of correlation that cannot be expressed in terms of connected correlators of lower degree.
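For example, for a mean-zero distribution (where odd-point connected correlators vanish), the fourth moment decomposes as \mathbb{E}[z_{\mu_1}z_{\mu_2}z_{\mu_3}z_{\mu_4}]=\left. \mathbb{E}[z_{\mu_1}z_{\mu_2}z_{\mu_3}z_{\mu_4}]\right|_\mbox{connected}+\mathbb{E}[z_{\mu_1}z_{\mu_2}]\mathbb{E}[z_{\mu_3}z_{\mu_4}]+\mathbb{E}[z_{\mu_1}z_{\mu_3}]\mathbb{E}[z_{\mu_2}z_{\mu_4}]+\mathbb{E}[z_{\mu_1}z_{\mu_4}]\mathbb{E}[z_{\mu_2}z_{\mu_3}]: the connected four-point correlator is exactly the part of the fourth moment not already determined by the covariance, and it vanishes for a Gaussian distribution by Wick's theorem.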
1.3 Nearly-Gaussian Distributions
The action S(z) is a function that defines a probability distribution p(z) through the relation p(z)\propto e^{-S(z)}. In the statistics literature, the action S(z) is sometimes called the negative log probability. To ensure the normalization \int d^Nz\, p(z)=1, we divide by the normalization factor or partition function Z\equiv \int d^Nz\, e^{-S(z)}, so that p(z)=\frac{e^{-S(z)}}{Z}.
Quadratic action and the Gaussian distribution
If we take the quadratic action S(z)=\frac{1}{2} \sum^N_{\mu,\nu=1} K^{\mu\nu} z_\mu z_\nu, then the partition function is Z=\int d^Nz\, e^{-S(z)}=I_K=\sqrt{|2\pi K|}, and we recover the mean-zero Gaussian distribution. This quadratic action is the simplest normalizable action and serves as a starting point for defining other distributions.
For an observable \mathcal{O}(z), define the Gaussian expectation as \left\langle \mathcal{O}(z)\right\rangle_K \equiv \frac{1}{\sqrt{|2\pi K|}} \int \left[ \prod_{\mu=1}^N dz_\mu \right] \exp \left( -\frac{1}{2}\sum_{\mu,\nu=1}^N K^{\mu\nu} z_\mu z_\nu\right) \mathcal{O}(z). In particular, with this notation we can write Wick's theorem as \left\langle z_{\mu_1}z_{\mu_2}\cdots z_{\mu_{2m}}\right\rangle_K=\sum_{\mbox{all pairings}} K_{\mu_{k_1}\mu_{k_2}}\cdots K_{\mu_{k_{2m-1}}\mu_{k_{2m}}}. (For the Gaussian distribution, \mathbb{E}[\cdot] and \left\langle \cdot\right\rangle_K coincide; for general distributions they do not.)
Quartic action and perturbation theory
Let's find an action that represents a nearly-Gaussian distribution with a connected four-point correlator that is small but non-vanishing: \mathbb{E}[z_{\mu_1}z_{\mu_2}z_{\mu_3}z_{\mu_4}]|_\mbox{connected} = O(\epsilon). Here we have introduced a small parameter \epsilon \ll 1 and indicated that the correlator should be of order \epsilon. For neural networks, we will later find that the role of the small parameter \epsilon is played by 1/\mbox{width}.
Consider the quartic action S(z)=\frac{1}{2}\sum_{\mu,\nu=1}^N K^{\mu\nu}z_\mu z_\nu + \frac{\epsilon}{4!} \sum^N_{\mu,\nu,\rho,\lambda=1} V^{\mu\nu\rho\lambda}z_\mu z_\nu z_\rho z_\lambda, where the quartic coupling \epsilon V^{\mu\nu\rho\lambda} is an (N\times N\times N \times N)-dimensional tensor that is completely symmetric in all four of its indices.
The partition function is Z=\int \left[ \prod_\mu dz_\mu \right] e^{-S(z)} =\int \left[ \prod_\mu dz_\mu \right] \exp \left( -\frac{1}{2}\sum_{\mu,\nu} K^{\mu\nu} z_\mu z_\nu -\frac{\epsilon}{24}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1 \rho_2 \rho_3 \rho_4}z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\right) =\sqrt{|2\pi K|} \left\langle \exp \left( -\frac{\epsilon}{24} \sum_{\rho_1,\cdots, \rho_4} V^{\rho_1 \rho_2 \rho_3 \rho_4}z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\right)\right\rangle_K. Now we expand perturbatively to first order in \epsilon: Z=\sqrt{|2\pi K|} \left\langle 1-\frac{\epsilon}{24}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}z_{\rho_1}z_{\rho_2}z_{\rho_3}z_{\rho_4}+O(\epsilon^2)\right\rangle_K=\sqrt{|2\pi K|} \left[ 1-\frac{\epsilon}{24}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}\langle z_{\rho_1}z_{\rho_2}z_{\rho_3}z_{\rho_4}\rangle_K+O(\epsilon^2)\right]=\sqrt{|2\pi K|} \left[ 1-\frac{\epsilon}{24}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}(K_{\rho_1\rho_2}K_{\rho_3\rho_4}+K_{\rho_1\rho_3}K_{\rho_2\rho_4}+K_{\rho_1\rho_4}K_{\rho_2\rho_3})+O(\epsilon^2)\right]=\sqrt{|2\pi K|} \left[ 1-\frac{\epsilon}{8}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}K_{\rho_1\rho_2}K_{\rho_3\rho_4}+O(\epsilon^2)\right], where in the last step we used the complete symmetry of V^{\rho_1\rho_2\rho_3\rho_4} to combine the three Wick contractions into a single term.
Using a similar calculation, we can evaluate the two-point correlator: \mathbb{E}[z_{\mu_1} z_{\mu_2}]=\frac{1}{Z}\int \left[ \prod_\mu dz_\mu \right] e^{-S(z)} z_{\mu_1}z_{\mu_2} =K_{\mu_1\mu_2}-\frac{\epsilon}{2}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}K_{\mu_1\rho_1}K_{\mu_2\rho_2}K_{\rho_3\rho_4}+O(\epsilon^2). (Here we also expanded the factor 1/Z using 1/(1-x)=1+x+O(x^2).)
For the four-point correlator, after a similar calculation we get \mathbb{E}[z_{\mu_1}z_{\mu_2}z_{\mu_3}z_{\mu_4}]|_\mbox{connected}=-\epsilon \sum_{\rho_1,\cdots,\rho_4}V^{\rho_1\rho_2\rho_3\rho_4}K_{\mu_1\rho_1}K_{\mu_2\rho_2}K_{\mu_3\rho_3}K_{\mu_4\rho_4}+O(\epsilon^2), which is indeed of order \epsilon as desired.
This type of expansion is known as the 1/n expansion or large-n expansion.
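All three first-order results can be sanity-checked numerically in the simplest case N=1, where the sums collapse to single terms; a minimal sketch (K, V, and \epsilon are arbitrary small test values):

```python
# N = 1 check of quartic perturbation theory: S(z) = z^2/(2K) + (eps/24) V z^4.
# To first order in eps we expect
#   Z      = sqrt(2 pi K) (1 - (eps/8) V K^2) + O(eps^2),
#   E[z^2] = K - (eps/2) V K^3 + O(eps^2),
#   E[z^4]_connected = E[z^4] - 3 E[z^2]^2 = -eps V K^4 + O(eps^2).
import numpy as np
from scipy.integrate import quad

K, V, eps = 1.3, 0.7, 1e-3

def S(z):
    return z**2 / (2 * K) + (eps / 24) * V * z**4

Z, _ = quad(lambda z: np.exp(-S(z)), -np.inf, np.inf)
m2, _ = quad(lambda z: z**2 * np.exp(-S(z)), -np.inf, np.inf)
m4, _ = quad(lambda z: z**4 * np.exp(-S(z)), -np.inf, np.inf)
m2, m4 = m2 / Z, m4 / Z

print(Z, np.sqrt(2 * np.pi * K) * (1 - eps * V * K**2 / 8))   # partition function
print(m2, K - eps * V * K**3 / 2)                             # two-point correlator
print(m4 - 3 * m2**2, -eps * V * K**4)                        # connected four-point
```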
Aside: statistical independence and interactions
Two random variables x and y are statistically independent if their joint distribution factorizes as p(x,y)=p(x)p(y).
An interaction is the breakdown of statistical independence.
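To see this in the action language: if the action separates as S(x,y)=S_x(x)+S_y(y), then p(x,y)\propto e^{-S_x(x)}e^{-S_y(y)} factorizes and the variables are statistically independent. A non-separable term, e.g. a quartic cross-coupling \frac{\epsilon}{4}x^2 y^2 (an illustrative choice), cannot be split into a function of x plus a function of y, so it correlates x and y; in the quartic action above, couplings V^{\mu\nu\rho\lambda} with mixed indices play exactly this role.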
Nearly-Gaussian actions
A general non-Gaussian action can be written as S(z)=\frac{1}{2}\sum_{\mu,\nu=1}^N K^{\mu\nu} z_\mu z_\nu + \sum_{m=2}^k \frac{1}{(2m)!} \sum_{\mu_1,\cdots,\mu_{2m}=1}^N s^{\mu_1\cdots \mu_{2m}}z_{\mu_1}\cdots z_{\mu_{2m}}.
The coefficients s^{\mu_1\cdots \mu_{2m}} are generally known as non-Gaussian couplings, and they control the interactions of the z_\mu.