[Deep Learning Principle] 1. Pretraining
This article is part of the Textbook Commentary Project.
Let's study effective field theory.
1.1 Gaussian Integrals
Single-variable Gaussian integrals
We define the Gaussian probability distribution with zero mean and unit variance as p(z)\equiv \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}, which is properly normalized: \int^\infty_{-\infty}dz\, p(z)=1.
Extending to variance K>0 and mean s, p(z)\equiv \frac{1}{\sqrt{2\pi K}}e^{-\frac{(z-s)^2}{2K}}, which satisfies \mathbb{E}[z]=s and \mathbb{E}[z^2]-\mathbb{E}[z]^2=K.
Given an observable \mathcal{O}(z), its expectation value is \mathbb{E}[\mathcal{O}(z)]\equiv \int_{-\infty}^\infty dz\, p(z)\, \mathcal{O}(z); for the mean-zero case (s=0) this reads \mathbb{E}[\mathcal{O}(z)]=\frac{1}{\sqrt{2\pi K}}\int^\infty_{-\infty} dz\, e^{-\frac{z^2}{2K}}\mathcal{O}(z).
An important class of observables is the moments: \mathbb{E}[z^M]=\frac{1}{\sqrt{2\pi K}}\int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}}z^M. (Moments with odd M vanish, since the integrand is odd.) For even M=2m we can evaluate the integral using the trick z^{2m}e^{-\frac{z^2}{2K}}=\left(2K^2\frac{d}{dK}\right)^m e^{-\frac{z^2}{2K}}: I_{K,m}\equiv \int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}}z^{2m}=\left(2K^2\frac{d}{dK}\right)^m\int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}}=\left(2K^2\frac{d}{dK}\right)^m I_K=\left(2K^2\frac{d}{dK}\right)^m\sqrt{2\pi}K^{\frac{1}{2}}=\sqrt{2\pi}K^{\frac{2m+1}{2}}(2m-1)(2m-3)\cdots 1, where I_K\equiv \int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}}=\sqrt{2\pi K}. Finally, \mathbb{E}[z^{2m}]=\frac{I_{K,m}}{\sqrt{2\pi K}}=K^m (2m-1)!!; this is Wick's theorem for the single-variable Gaussian distribution.
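As a quick numerical sanity check, we can compare this closed form against direct integration; below is a minimal sketch (the value K=1.7 and the range of m are arbitrary test choices):

```python
# Check E[z^(2m)] = K^m (2m-1)!! by direct numerical integration.
# A minimal sketch; K = 1.7 and m = 0..3 are arbitrary test values.
import numpy as np
from scipy.integrate import quad

def double_factorial(n):
    """Compute (2m-1)!!, with the convention (-1)!! = 1."""
    out = 1
    while n > 1:
        out *= n
        n -= 2
    return out

K = 1.7
for m in range(4):
    moment, _ = quad(
        lambda z: z**(2 * m) * np.exp(-z**2 / (2 * K)) / np.sqrt(2 * np.pi * K),
        -np.inf, np.inf,
    )
    print(f"m={m}: quadrature={moment:.6f}, closed form={K**m * double_factorial(2 * m - 1):.6f}")
```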
Introducing a source term J, the partition function with source (the generating function) is Z_{K,J}\equiv \int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}+Jz}. We can evaluate Z_{K,J} by completing the square in the exponent, -\frac{z^2}{2K}+Jz=-\frac{(z-JK)^2}{2K}+\frac{KJ^2}{2}, which lets us rewrite the partition function as Z_{K,J}=e^{\frac{KJ^2}{2}}\int^\infty_{-\infty} dz\, e^{-\frac{(z-JK)^2}{2K}}=e^{\frac{KJ^2}{2}}I_K=e^{\frac{KJ^2}{2}}\sqrt{2\pi K}.
The source term is a convenient tool for computing expectation values. For example, for the moments we have I_{K,m}=\int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}}z^{2m}=\left. \left[ \left(\frac{d}{dJ}\right)^{2m} \int_{-\infty}^\infty dz\, e^{-\frac{z^2}{2K}+Jz}\right]\right|_{J=0}=\left. \left[ \left( \frac{d}{dJ}\right)^{2m}Z_{K,J}\right] \right|_{J=0}. Then \mathbb{E}[z^{2m}]=\frac{I_{K,m}}{\sqrt{2\pi K}}=\left. \left[ \left( \frac{d}{dJ}\right)^{2m} e^{\frac{KJ^2}{2}}\right] \right|_{J=0}=\left. \left\{ \left( \frac{d}{dJ}\right)^{2m}\left[ \sum_{k=0}^\infty \frac{1}{k!}\left(\frac{K}{2}\right)^kJ^{2k}\right]\right\}\right|_{J=0}=\left(\frac{d}{dJ}\right)^{2m}\left[ \frac{1}{m!}\left(\frac{K}{2}\right)^mJ^{2m}\right] = K^m \frac{(2m)!}{2^m m!}=K^m(2m-1)!!, since only the k=m term of the series survives both the 2m derivatives and the setting of J=0. This is a second derivation of Wick's theorem for the single-variable Gaussian distribution.
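The same derivation can be replayed symbolically, differentiating e^{\frac{KJ^2}{2}} with respect to the source and setting J=0; a minimal sketch (the range of m is an arbitrary choice):

```python
# Symbolic check: (d/dJ)^(2m) exp(K J^2 / 2) at J = 0 equals K^m (2m-1)!!.
# A minimal sketch using sympy; the range of m is an arbitrary test choice.
import sympy as sp

K, J = sp.symbols('K J', positive=True)
for m in range(1, 5):
    moment = sp.diff(sp.exp(K * J**2 / 2), J, 2 * m).subs(J, 0)
    print(f"m={m}: {sp.factor(moment)}")  # expect K, 3*K**2, 15*K**3, 105*K**4
```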
Multivariable Gaussian integrals
For an N-dimensional variable z_\mu with \mu=1,\cdots,N, the multivariable Gaussian function is defined as \exp \left[ -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu (K^{-1})_{\mu\nu}z_\nu\right], where the variance or covariance matrix K_{\mu\nu} is an N-by-N symmetric positive definite matrix, and its inverse (K^{-1})_{\mu\nu} is defined so that their matrix product gives the N-by-N identity matrix, \sum_{\rho=1}^N(K^{-1})_{\mu\rho}K_{\rho\nu}=\delta_{\mu\nu}, where the Kronecker delta \delta_{\mu\nu} represents the identity matrix.
To calculate the normalization factor I_K\equiv \int d^Nz\,\exp \left[ -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu (K^{-1})_{\mu\nu}z_\nu\right] =\int_{-\infty}^\infty dz_1 \int_{-\infty}^\infty dz_2 \cdots \int_{-\infty}^\infty dz_N \exp \left[ -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu (K^{-1})_{\mu\nu}z_\nu\right], we diagonalize K with an orthogonal matrix O_{\mu\nu}, (OKO^T)_{\mu\nu}=\lambda_\mu \delta_{\mu\nu}; equivalently, its inverse diagonalizes as (OK^{-1}O^T)_{\mu\nu}=(1/\lambda_\mu)\delta_{\mu\nu}. Then we can simplify \sum^N_{\mu,\nu=1}z_\mu (K^{-1})_{\mu\nu}z_\nu=\sum^N_{\mu,\rho,\sigma,\nu=1}z_\mu (O^TO)_{\mu\rho}(K^{-1})_{\rho\sigma}(O^TO)_{\sigma\nu}z_\nu =\sum^N_{\mu,\nu=1}(Oz)_\mu (OK^{-1}O^T)_{\mu\nu}(Oz)_\nu=\sum_{\mu=1}^N\frac{1}{\lambda_\mu}(Oz)^2_\mu. The change of variables u_\mu\equiv (Oz)_\mu with an orthogonal matrix O leaves the integration measure invariant, i.e., d^Nz=d^Nu. Finally, I_K= \int d^Nu\,\exp \left[ -\frac{1}{2}\sum^N_{\mu=1}\frac{u_\mu^2}{\lambda_\mu}\right]=\prod_{\mu=1}^N\left[ \int_{-\infty}^\infty du_\mu \exp \left( -\frac{u_\mu^2}{2\lambda_\mu}\right)\right] = \prod_{\mu=1}^N\sqrt{2\pi \lambda_\mu}=\sqrt{\prod_{\mu=1}^N(2\pi \lambda_\mu)}=\sqrt{|2\pi K|}, where |\cdot| denotes the determinant and the last equality holds because the determinant of K is the product of its eigenvalues \lambda_\mu.
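A minimal numerical sketch of that last identity, \prod_\mu (2\pi\lambda_\mu)=|2\pi K| (the random covariance below is an arbitrary test choice):

```python
# Check that prod_mu (2*pi*lambda_mu) = |2*pi*K|, so I_K = sqrt(|2*pi*K|).
# A minimal sketch; the random symmetric positive definite K is arbitrary.
import numpy as np

rng = np.random.default_rng(0)
N = 4
A = rng.normal(size=(N, N))
K = A @ A.T + N * np.eye(N)      # symmetric positive definite covariance
lam = np.linalg.eigvalsh(K)      # eigenvalues lambda_mu of K
print(np.prod(2 * np.pi * lam))          # product of 2*pi*lambda_mu
print(np.linalg.det(2 * np.pi * K))      # determinant |2*pi*K|
```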
The multivariable Gaussian probability distribution with variance K_{\mu\nu} and mean s_\mu is p(z)=\frac{1}{\sqrt{|2\pi K|}}\exp \left[ -\frac{1}{2}\sum^N_{\mu,\nu=1}(z-s)_\mu K^{\mu\nu}(z-s)_\nu\right], where we introduce the shorthand K^{\mu\nu}\equiv (K^{-1})_{\mu\nu}.
As an example of observables, consider the moments of the mean-zero (s=0) multivariable Gaussian distribution: \mathbb{E}[z_{\mu_1}\cdots z_{\mu_M}]\equiv \int d^Nz\, p(z)\, z_{\mu_1}\cdots z_{\mu_M}=\frac{1}{\sqrt{|2\pi K|}}\int d^Nz\,\exp \left( -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu K^{\mu\nu}z_\nu\right) z_{\mu_1}\cdots z_{\mu_M}=\frac{I_{K,(\mu_1,\cdots,\mu_M)}}{I_K}.
Let's construct the generating function for the integrals I_{K,(\mu_1,\cdots,\mu_M)} by including a source J^\mu: Z_{K,J}\equiv \int d^Nz \exp \left( -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu K^{\mu\nu}z_\nu + \sum_{\mu=1}^N J^\mu z_\mu \right). The moments can then be calculated from the generating function: \left. \left[ \frac{d}{dJ^{\mu_1}}\frac{d}{dJ^{\mu_2}}\cdots\frac{d}{dJ^{\mu_M}}Z_{K,J}\right] \right|_{J=0}=\int d^Nz \exp \left( -\frac{1}{2} \sum_{\mu,\nu=1}^N z_\mu K^{\mu\nu} z_\nu\right) z_{\mu_1}\cdots z_{\mu_M} = I_{K,(\mu_1,\cdots, \mu_M)}.
To evaluate the generating function Z_{K,J} in closed form, we complete the square: -\frac{1}{2}\sum^N_{\mu,\nu=1}z_\mu K^{\mu\nu}z_\nu + \sum_{\mu=1}^N J^\mu z_\mu = -\frac{1}{2}\sum^N_{\mu,\nu=1}\left( z_\mu -\sum_{\rho=1}^N K_{\mu\rho}J^\rho\right) K^{\mu\nu}\left(z_\nu-\sum_{\lambda=1}^NK_{\nu\lambda}J^\lambda\right) + \frac{1}{2}\sum_{\mu,\nu=1}^NJ^\mu K_{\mu\nu}J^\nu = -\frac{1}{2}\sum_{\mu,\nu=1}^Nw_\mu K^{\mu\nu}w_\nu+\frac{1}{2}\sum_{\mu,\nu=1}^NJ^\mu K_{\mu\nu}J^\nu, where the shifted variable is w_\mu\equiv z_\mu-\sum_{\rho=1}^N K_{\mu\rho}J^\rho. Using this substitution, the generating function can be evaluated explicitly: Z_{K,J}=\exp \left( \frac{1}{2} \sum^N_{\mu,\nu=1}J^\mu K_{\mu\nu} J^\nu\right) \int d^N w \exp \left[ -\frac{1}{2} \sum^N_{\mu,\nu=1} w_\mu K^{\mu\nu} w_\nu\right] = \sqrt{|2\pi K|} \exp \left( \frac{1}{2} \sum_{\mu,\nu=1}^N J^\mu K_{\mu\nu} J^\nu\right).
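In probabilistic terms, this closed form says that the moment-generating function of a mean-zero Gaussian is \mathbb{E}\left[ e^{\sum_\mu J^\mu z_\mu}\right]=\frac{Z_{K,J}}{\sqrt{|2\pi K|}}=\exp \left( \frac{1}{2}\sum_{\mu,\nu}J^\mu K_{\mu\nu}J^\nu\right), which is easy to check by Monte Carlo; a minimal sketch (K, J, and the sample size are arbitrary test choices):

```python
# Monte Carlo check of E[exp(J . z)] = exp(J^T K J / 2) for a mean-zero
# Gaussian z with covariance K.  K, J, and the sample size are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
J = np.array([0.3, -0.2])
z = rng.multivariate_normal(mean=np.zeros(2), cov=K, size=1_000_000)
print(np.exp(z @ J).mean())       # Monte Carlo estimate of E[exp(J . z)]
print(np.exp(0.5 * J @ K @ J))    # closed-form exp(J^T K J / 2)
```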
For odd M=2m+1, the moments vanish, since the integrand is odd under z\to -z.
For even M=2m, \mathbb{E}[z_{\mu_1}\cdots z_{\mu_{2m}}]=\frac{I_{K,(\mu_1,\cdots,\mu_{2m})}}{I_K}=\frac{1}{I_K} \left. \left[ \frac{d}{dJ^{\mu_1}}\cdots \frac{d}{dJ^{\mu_{2m}}}Z_{K,J}\right] \right|_{J=0} =\frac{1}{2^mm!}\frac{d}{dJ^{\mu_1}}\frac{d}{dJ^{\mu_2}}\cdots\frac{d}{dJ^{\mu_{2m}}}\left( \sum_{\mu,\nu=1}^N J^\mu K_{\mu\nu} J^\nu \right)^m, where only the m-th term of the exponential series survives the 2m derivatives evaluated at J=0.
For 2m=2, \mathbb{E}[z_{\mu_1} z_{\mu_2}]=K_{\mu_1\mu_2}.
For general 2m, \mathbb{E}[z_{\mu_1}\cdots z_{\mu_{2m}}]=\sum_{\mbox{all pairings}} K_{\mu_{k_1}\mu_{k_2}}\cdots K_{\mu_{k_{2m-1}}\mu_{k_{2m}}}, where, to reiterate, the sum is over all the possible distinct pairings of the 2m auxiliary indices under \mu, so that the result has the (2m-1)!! terms described above. This is Wick's theorem.
Each factor of the covariance K_{\mu\nu} in a term in the sum is called a Wick contraction.
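Wick's theorem also translates directly into code: enumerate all distinct pairings of the indices and multiply the corresponding covariance factors, one Wick contraction per pair. A minimal sketch (the helper names and test values are our own choices), checked against a Monte Carlo estimate:

```python
# Wick's theorem: sum over all (2m-1)!! distinct pairings, each pair
# contributing a covariance factor K[mu_i, mu_j].  A minimal sketch.
import numpy as np

def pairings(idx):
    """Yield all distinct partitions of the list idx into pairs."""
    if not idx:
        yield []
        return
    first, rest = idx[0], idx[1:]
    for i in range(len(rest)):
        for tail in pairings(rest[:i] + rest[i+1:]):
            yield [(first, rest[i])] + tail

def wick_moment(K, mus):
    """E[z_{mu_1} ... z_{mu_2m}] for a mean-zero Gaussian with covariance K."""
    return sum(np.prod([K[a, b] for a, b in p]) for p in pairings(list(mus)))

K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
mus = (0, 1, 0, 1)  # E[z_0 z_1 z_0 z_1] = K00*K11 + 2*K01^2
print(wick_moment(K, mus))

rng = np.random.default_rng(2)
z = rng.multivariate_normal(np.zeros(2), K, size=1_000_000)
print(np.prod(z[:, list(mus)], axis=1).mean())  # Monte Carlo estimate
```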
1.2 Probability, Correlation and Statistics, and All That
Given a probability distribution p(z) of an N-dimensional random variable z_\mu, we can learn about its statistics by measuring functions of z_\mu. We'll refer to such measurable functions in a generic sense as observables and denote them as \mathcal{O}(z). The expectation value of an observable \mathbb{E}[\mathcal{O}(z)]\equiv \int d^N z p(z)\mathcal{O}(z) characterizes the mean value of the random function \mathcal{O}(z).
The moments or M-point correlators of z_\mu are given by the expectation \mathbb{E}[z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}]=\int d^Nz\, p(z)\, z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}. In principle, knowing the M-point correlators of a distribution lets us compute the expectation value of any analytic observable \mathcal{O}(z) via Taylor expansion: \mathbb{E}[\mathcal{O}(z)]=\mathbb{E}\left[ \sum_{M=0}^\infty \frac{1}{M!} \sum_{\mu_1,\cdots,\mu_M=1}^N \left. \frac{\partial^M\mathcal{O}}{\partial z_{\mu_1}\cdots \partial z_{\mu_M}} \right|_{z=0} z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}\right] = \sum_{M=0}^\infty \frac{1}{M!} \sum_{\mu_1,\cdots,\mu_M=1}^N \left. \frac{\partial^M\mathcal{O}}{\partial z_{\mu_1}\cdots \partial z_{\mu_M}} \right|_{z=0} \mathbb{E}[z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}].
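As a quick worked example of this expansion: for a single mean-zero Gaussian variable with variance K, the observable \mathcal{O}(z)=\cos z gives \mathbb{E}[\cos z]=\sum_{m=0}^\infty \frac{(-1)^m}{(2m)!}\mathbb{E}[z^{2m}]=\sum_{m=0}^\infty \frac{(-1)^m}{(2m)!}K^m(2m-1)!!=\sum_{m=0}^\infty \frac{1}{m!}\left(-\frac{K}{2}\right)^m=e^{-\frac{K}{2}}, using (2m-1)!!/(2m)!=1/(2^m m!).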
For nearly-Gaussian distributions, a useful set of observables is given by what statisticians call cumulants and physicists call connected correlators.
The one-point connected correlator, \mathbb{E}[z_\mu]|_{\mbox{connected}}\equiv \mathbb{E}[z_\mu], is just the mean.
The two-point connected correlator, \mathbb{E}[z_\mu z_\nu]|_\mbox{connected}\equiv \mathbb{E}[z_\mu z_\nu]-\mathbb{E}[z_\mu] \mathbb{E}[z_\nu] =\mathbb{E}[(z_\mu-\mathbb{E}[z_\mu])(z_\nu-\mathbb{E}[z_\nu])]\equiv \mbox{Cov}[z_\mu,z_\nu], is also known as the covariance of the distribution. The quantity \Delta z_\mu\equiv z_\mu-\mathbb{E}[z_\mu] represents a fluctuation of the random variable around its mean.
For a Gaussian distribution, all connected correlators with more than two points vanish. In general, the M-th moment can be expressed in terms of connected correlators of degree 1 through M: \mathbb{E}[z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}]\equiv \left. \mathbb{E}[z_{\mu_1}z_{\mu_2}\cdots z_{\mu_M}]\right|_\mbox{connected} + \sum_\mbox{all subdivisions} \left. \mathbb{E}\left[ z_{\mu_{k_1^{[1]}}}\cdots z_{\mu_{k_{\nu_1}^{[1]}}}\right] \right|_\mbox{connected} \cdots \left. \mathbb{E}\left[ z_{\mu_{k_1^{[s]}}}\cdots z_{\mu_{k_{\nu_s}^{[s]}}}\right] \right|_\mbox{connected}, where the sum is over all the possible subdivisions of the M variables into s>1 clusters of sizes (\nu_1,\cdots,\nu_s), as (k_1^{[1]},\cdots,k_{\nu_1}^{[1]}),\cdots,(k_1^{[s]},\cdots,k_{\nu_s}^{[s]}). By decomposing the M-th moment into a sum of products of connected correlators of degree M and lower, we see that the connected M-point correlator corresponds to a new type of correlation that cannot be expressed in terms of connected correlators of lower degree.
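For example, for a mean-zero distribution (where odd-point connected correlators vanish), the fourth moment decomposes as \mathbb{E}[z_{\mu_1}z_{\mu_2}z_{\mu_3}z_{\mu_4}]=\left. \mathbb{E}[z_{\mu_1}z_{\mu_2}z_{\mu_3}z_{\mu_4}]\right|_\mbox{connected}+\mathbb{E}[z_{\mu_1}z_{\mu_2}]\mathbb{E}[z_{\mu_3}z_{\mu_4}]+\mathbb{E}[z_{\mu_1}z_{\mu_3}]\mathbb{E}[z_{\mu_2}z_{\mu_4}]+\mathbb{E}[z_{\mu_1}z_{\mu_4}]\mathbb{E}[z_{\mu_2}z_{\mu_3}]: the connected four-point correlator is exactly the part of the fourth moment not already determined by the covariance, and it vanishes for a Gaussian distribution by Wick's theorem.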
1.3 Nearly-Gaussian Distributions
The action S(z) is a function that defines a probability distribution p(z) through the relation p(z)\propto e^{-S(z)}. In the statistics literature, the action S(z) is sometimes called the negative log probability. To ensure the normalization \int d^Nz\, p(z)=1, we divide by the normalization factor or partition function Z\equiv \int d^Nz\, e^{-S(z)}, so that p(z)=\frac{e^{-S(z)}}{Z}.
Quadratic action and the Gaussian distribution
If we take the quadratic action S(z)=\frac{1}{2} \sum^N_{\mu,\nu=1} K^{\mu\nu} z_\mu z_\nu, then the partition function is Z=\int d^Nz\, e^{-S(z)}=I_K=\sqrt{|2\pi K|}, and we recover the mean-zero Gaussian distribution. This quadratic action is the simplest normalizable action and serves as a starting point for defining other distributions.
For an observable \mathcal{O}(z), define the Gaussian expectation as \left\langle \mathcal{O}(z)\right\rangle_K \equiv \frac{1}{\sqrt{|2\pi K|}} \int \left[ \prod_{\mu=1}^N dz_\mu \right] \exp \left( -\frac{1}{2}\sum_{\mu,\nu=1}^N K^{\mu\nu} z_\mu z_\nu\right) \mathcal{O}(z). In particular, with this notation we can write Wick's theorem as \left\langle z_{\mu_1}z_{\mu_2}\cdots z_{\mu_{2m}}\right\rangle_K=\sum_{\mbox{all pairings}} K_{\mu_{k_1}\mu_{k_2}}\cdots K_{\mu_{k_{2m-1}}\mu_{k_{2m}}}. (For the Gaussian distribution, \mathbb{E}[\cdot] and \left\langle \cdot\right\rangle_K coincide; for general distributions they do not.)
Quartic action and perturbation theory
Let's find an action that represents a nearly-Gaussian distribution with a connected four-point correlator that is small but non-vanishing: \mathbb{E}[z_{\mu_1}z_{\mu_2}z_{\mu_3}z_{\mu_4}]|_\mbox{connected} = O(\epsilon). Here we have introduced a small parameter \epsilon \ll 1 and indicated that the correlator should be of order \epsilon. For neural networks, we will later find that the role of the small parameter \epsilon is played by 1/\mbox{width}.
Consider the quartic action S(z)=\frac{1}{2}\sum_{\mu,\nu=1}^N K^{\mu\nu}z_\mu z_\nu + \frac{\epsilon}{4!} \sum^N_{\mu,\nu,\rho,\lambda=1} V^{\mu\nu\rho\lambda}z_\mu z_\nu z_\rho z_\lambda, where the quartic coupling \epsilon V^{\mu\nu\rho\lambda} is an (N\times N\times N \times N)-dimensional tensor that is completely symmetric in all four of its indices.
The partition function is Z=\int \left[ \prod_\mu dz_\mu \right] e^{-S(z)} =\int \left[ \prod_\mu dz_\mu \right] \exp \left( -\frac{1}{2}\sum_{\mu,\nu} K^{\mu\nu} z_\mu z_\nu -\frac{\epsilon}{24}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1 \rho_2 \rho_3 \rho_4}z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\right) =\sqrt{|2\pi K|} \left\langle \exp \left( -\frac{\epsilon}{24} \sum_{\rho_1,\cdots, \rho_4} V^{\rho_1 \rho_2 \rho_3 \rho_4}z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\right)\right\rangle_K. Now we expand perturbatively to first order in \epsilon: Z=\sqrt{|2\pi K|} \left\langle 1-\frac{\epsilon}{24}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}z_{\rho_1}z_{\rho_2}z_{\rho_3}z_{\rho_4}+O(\epsilon^2)\right\rangle_K=\sqrt{|2\pi K|} \left[ 1-\frac{\epsilon}{24}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}\langle z_{\rho_1}z_{\rho_2}z_{\rho_3}z_{\rho_4}\rangle_K+O(\epsilon^2)\right]=\sqrt{|2\pi K|} \left[ 1-\frac{\epsilon}{24}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}(K_{\rho_1\rho_2}K_{\rho_3\rho_4}+K_{\rho_1\rho_3}K_{\rho_2\rho_4}+K_{\rho_1\rho_4}K_{\rho_2\rho_3})+O(\epsilon^2)\right]=\sqrt{|2\pi K|} \left[ 1-\frac{\epsilon}{8}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}K_{\rho_1\rho_2}K_{\rho_3\rho_4}+O(\epsilon^2)\right], where in the last step we used the complete symmetry of V^{\rho_1\rho_2\rho_3\rho_4} to combine the three Wick contractions into a single term.
Using a similar calculation, we can evaluate the two-point correlator: \mathbb{E}[z_{\mu_1} z_{\mu_2}]=\frac{1}{Z}\int \left[ \prod_\mu dz_\mu \right] e^{-S(z)} z_{\mu_1}z_{\mu_2} =K_{\mu_1\mu_2}-\frac{\epsilon}{2}\sum_{\rho_1,\cdots,\rho_4} V^{\rho_1\rho_2\rho_3\rho_4}K_{\mu_1\rho_1}K_{\mu_2\rho_2}K_{\rho_3\rho_4}+O(\epsilon^2). (Here we also expanded the factor 1/Z using 1/(1-x)=1+x+O(x^2).)
For the four-point correlator, after a similar calculation we get \mathbb{E}[z_{\mu_1}z_{\mu_2}z_{\mu_3}z_{\mu_4}]|_\mbox{connected}=-\epsilon \sum_{\rho_1,\cdots,\rho_4}V^{\rho_1\rho_2\rho_3\rho_4}K_{\mu_1\rho_1}K_{\mu_2\rho_2}K_{\mu_3\rho_3}K_{\mu_4\rho_4}+O(\epsilon^2), which is indeed of order \epsilon as desired.
This type of expansion is known as the 1/n expansion or large-n expansion.
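All three first-order results can be sanity-checked numerically in the simplest case N=1, where the sums collapse to single terms; a minimal sketch (K, V, and \epsilon are arbitrary small test values):

```python
# N = 1 check of quartic perturbation theory: S(z) = z^2/(2K) + (eps/24) V z^4.
# To first order in eps we expect
#   Z      = sqrt(2 pi K) (1 - (eps/8) V K^2) + O(eps^2),
#   E[z^2] = K - (eps/2) V K^3 + O(eps^2),
#   E[z^4]_connected = E[z^4] - 3 E[z^2]^2 = -eps V K^4 + O(eps^2).
import numpy as np
from scipy.integrate import quad

K, V, eps = 1.3, 0.7, 1e-3

def S(z):
    return z**2 / (2 * K) + (eps / 24) * V * z**4

Z, _ = quad(lambda z: np.exp(-S(z)), -np.inf, np.inf)
m2, _ = quad(lambda z: z**2 * np.exp(-S(z)), -np.inf, np.inf)
m4, _ = quad(lambda z: z**4 * np.exp(-S(z)), -np.inf, np.inf)
m2, m4 = m2 / Z, m4 / Z

print(Z, np.sqrt(2 * np.pi * K) * (1 - eps * V * K**2 / 8))   # partition function
print(m2, K - eps * V * K**3 / 2)                             # two-point correlator
print(m4 - 3 * m2**2, -eps * V * K**4)                        # connected four-point
```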
Aside: statistical independence and interactions
Two random variables x and y are statistically independent if their joint distribution factorizes as p(x,y)=p(x)p(y).
An interaction is the breakdown of statistical independence.
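To see this in the action language: if the action separates as S(x,y)=S_x(x)+S_y(y), then p(x,y)\propto e^{-S_x(x)}e^{-S_y(y)} factorizes and the variables are statistically independent. A non-separable term, e.g. a quartic cross-coupling \frac{\epsilon}{4}x^2 y^2 (an illustrative choice), cannot be split into a function of x plus a function of y, so it correlates x and y; in the quartic action above, couplings V^{\mu\nu\rho\lambda} with mixed indices play exactly this role.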
Nearly-Gaussian actions
A general non-Gaussian action can be written as S(z)=\frac{1}{2}\sum_{\mu,\nu=1}^N K^{\mu\nu} z_\mu z_\nu + \sum_{m=2}^k \frac{1}{(2m)!} \sum_{\mu_1,\cdots,\mu_{2m}=1}^N s^{\mu_1\cdots \mu_{2m}}z_{\mu_1}\cdots z_{\mu_{2m}}.
The coefficients s^{\mu_1\cdots \mu_{2m}} are generally known as non-Gaussian couplings, and they control the interactions of the z_\mu.