[Deep Learning Theory] 4. RG Flow of Preactivations

  This article is part of the Textbook Commentary Project.


4.1 First Layer: Good-Old Gaussian

Given a dataset $$\begin{align}\mathcal{D}=\{x_{i;\alpha}\}_{i=1,\cdots,n_0;\alpha=1,\cdots,N_\mathcal{D}}\end{align}$$ containing $N_\mathcal{D}$ inputs, each an $n_0$-dimensional vector, the preactivations in the first layer are given by $$\begin{align}z_{i;\alpha}^{(1)}\equiv z_i^{(1)}(x_\alpha)=b_i^{(1)}+\sum_{j=1}^{n_0}W_{ij}^{(1)}x_{j;\alpha},\quad \mbox{for}\quad i=1,\cdots,n_1.\end{align}$$ At initialization the biases $b^{(1)}$ and weights $W^{(1)}$ are independently distributed according to mean-zero Gaussian distributions with variances $$\begin{align}\mathbb{E}\left[ b^{(1)}_ib^{(1)}_j\right]=\delta_{ij}C_b^{(1)},\\ \mathbb{E}\left[ W^{(1)}_{i_1j_1}W^{(1)}_{i_2j_2}\right] = \delta_{i_1i_2}\delta_{j_1j_2}\frac{C_W^{(1)}}{n_0}.\end{align}$$ The first-layer preactivations $z^{(1)}=z_{i;\alpha}^{(1)}$ form an $(n_1N_\mathcal{D})$-dimensional vector, and we are interested in its distribution at initialization, $$\begin{align}p\left( z^{(1)}|\mathcal{D}\right) =p \left( z^{(1)}(x_1),\cdots ,z^{(1)}(x_{N_\mathcal{D}}) \right) .\end{align}$$

Now, let us compute the distribution of the first-layer preactivations at initialization.
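Before the two derivations, a minimal NumPy sketch may help fix conventions; all shapes and variance values below are illustrative assumptions, not from the book. It samples the parameters according to (3) and (4) and computes the first-layer preactivations (2):

```python
import numpy as np

rng = np.random.default_rng(0)

n0, n1, N_D = 10, 3, 5    # input dim, first-layer width, number of samples (assumed)
C_b, C_W = 1.0, 1.0       # bias and weight variances (assumed values)

x = rng.normal(size=(N_D, n0))                          # dataset: rows are inputs x_alpha

# sample parameters at initialization, per (3) and (4)
b = rng.normal(0.0, np.sqrt(C_b), size=n1)              # b_i ~ N(0, C_b)
W = rng.normal(0.0, np.sqrt(C_W / n0), size=(n1, n0))   # W_ij ~ N(0, C_W / n0)

# first-layer preactivations z_{i;alpha} = b_i + sum_j W_ij x_{j;alpha}, per (2)
z1 = b[:, None] + W @ x.T                               # shape (n1, N_D)
```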


Wick this way: combinatorial derivation via correlators

The first derivation involves direct application of Wick contractions to compute correlators of the first-layer distribution (5). Starting with the one-point correlator, simply inserting the definition of the first-layer preactivations (2) gives $$\begin{align}\mathbb{E}\left[ z^{(1)}_{i;\alpha}\right] = \mathbb{E}\left[ b^{(1)}_i + \sum_{j=1}^{n_0} W^{(1)}_{ij} x_{j;\alpha}\right] =0,\end{align}$$ since $\mathbb{E}\left[ b_i^{(1)}\right]=\mathbb{E}\left[ W^{(1)}_{ij}\right]=0$. In fact, all the odd-point correlators of $p\left( z^{(1)}|\mathcal{D}\right)$ vanish: each $z^{(1)}$ is linear in the parameters, so an odd product of preactivations expands into terms containing an odd total number of biases and weights. In every such term either the biases or the weights appear an odd number of times, leaving at least one factor unpaired under Wick contractions, and the mean-zero Gaussian expectation vanishes.


Next for the two-point correlator, again inserting the definition (2), we see $$\begin{align}\mathbb{E}\left[ z^{(1)}_{i_1;\alpha_1}z^{(1)}_{i_2;\alpha_2}\right] =\mathbb{E}\left[ \left( b^{(1)}_{i_1}+\sum_{j_1=1}^{n_0}W^{(1)}_{i_1j_1}x_{j_1;\alpha_1}\right)\left( b^{(1)}_{i_2}+\sum_{j_2=1}^{n_0}W^{(1)}_{i_2j_2}x_{j_2;\alpha_2}\right)\right]\nonumber\\ =\delta_{i_1i_2}\left( C^{(1)}_b+C^{(1)}_W\frac{1}{n_0}\sum_{j=1}^{n_0}x_{j;\alpha_1}x_{j;\alpha_2}\right) = \delta_{i_1i_2}G^{(1)}_{\alpha_1\alpha_2},\end{align}$$ where to get to the second line we Wick-contracted the biases and weights using (3) and (4). We also introduced the first-layer metric $$\begin{align}G_{\alpha_1\alpha_2}^{(1)}\equiv C_b^{(1)}+C_W^{(1)}\frac{1}{n_0}\sum_{j=1}^{n_0}x_{j;\alpha_1}x_{j;\alpha_2},\end{align}$$ which is a function of the two samples, $G_{\alpha_1\alpha_2}^{(1)}=G^{(1)}(x_{\alpha_1},x_{\alpha_2})$, and represents the two-point correlation of preactivations in the first layer between different samples.
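Continuing the sketch above, we can check (7) numerically: the first-layer metric (8) is one line of NumPy, and averaging $z_{i;\alpha_1}z_{i;\alpha_2}$ over many re-initializations reproduces it up to Monte Carlo error.

```python
# first-layer metric (8): G1[a1, a2] = C_b + (C_W / n0) * x_a1 . x_a2
G1 = C_b + (C_W / n0) * (x @ x.T)                       # shape (N_D, N_D)

# Monte Carlo estimate of E[z_{0;a1} z_{0;a2}] over re-initializations
trials = 200_000
b = rng.normal(0.0, np.sqrt(C_b), size=(trials, n1))
W = rng.normal(0.0, np.sqrt(C_W / n0), size=(trials, n1, n0))
z = b[:, :, None] + W @ x.T                             # shape (trials, n1, N_D)

emp = np.einsum('ta,tb->ab', z[:, 0, :], z[:, 0, :]) / trials
print(np.max(np.abs(emp - G1)))                         # error ~ 1/sqrt(trials)
```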


The higher-point correlators can be obtained similarly. For instance, the full four-point correlator can be obtained by inserting the definition (2) four times and Wick-contracting the biases and weights, yielding $$\begin{align}\mathbb{E}\left[ z^{(1)}_{i_1;\alpha_1}z^{(1)}_{i_2;\alpha_2}z^{(1)}_{i_3;\alpha_3}z^{(1)}_{i_4;\alpha_4}\right]=\delta_{i_1i_2}\delta_{i_3i_4}G^{(1)}_{\alpha_1\alpha_2}G^{(1)}_{\alpha_3\alpha_4}+\delta_{i_1i_3}\delta_{i_2i_4}G^{(1)}_{\alpha_1\alpha_3}G^{(1)}_{\alpha_2\alpha_4}+\delta_{i_1i_4}\delta_{i_2i_3}G^{(1)}_{\alpha_1\alpha_4}G^{(1)}_{\alpha_2\alpha_3}=\mathbb{E}\left[ z^{(1)}_{i_1;\alpha_1}z^{(1)}_{i_2;\alpha_2}\right]\mathbb{E}\left[ z^{(1)}_{i_3;\alpha_3}z^{(1)}_{i_4;\alpha_4}\right]+\mathbb{E}\left[ z^{(1)}_{i_1;\alpha_1}z^{(1)}_{i_3;\alpha_3}\right]\mathbb{E}\left[ z^{(1)}_{i_2;\alpha_2}z^{(1)}_{i_4;\alpha_4}\right]+\mathbb{E}\left[ z^{(1)}_{i_1;\alpha_1}z^{(1)}_{i_4;\alpha_4}\right]\mathbb{E}\left[ z^{(1)}_{i_2;\alpha_2}z^{(1)}_{i_3;\alpha_3}\right].\end{align}$$ Note that the end result is the same as Wick-contracting the $z^{(1)}$'s with the variance given by (7). The connected four-point correlator therefore vanishes, $$\begin{align}\left. \mathbb{E}\left[ z^{(1)}_{i_1;\alpha_1}z^{(1)}_{i_2;\alpha_2}z^{(1)}_{i_3;\alpha_3}z^{(1)}_{i_4;\alpha_4}\right]\right|_\mbox{connected}=0,\end{align}$$ and likewise all the connected higher-point correlators vanish, as expected for a Gaussian distribution.
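The same samples give a quick check of the Wick decomposition (9); for instance, with neuron indices $(i_1,i_2,i_3,i_4)=(0,0,1,1)$ only the $\delta_{i_1i_2}\delta_{i_3i_4}$ pairing survives:

```python
# check (9) for neurons (0,0,1,1) and samples (0,1,2,3): only the first
# Wick pairing survives, so E[z z z z] ~ G1[0,1] * G1[2,3]
lhs = np.mean(z[:, 0, 0] * z[:, 0, 1] * z[:, 1, 2] * z[:, 1, 3])
print(lhs, G1[0, 1] * G1[2, 3])                         # agree up to MC error
```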


The first-layer action is then determined by the inverse of this variance, a matrix $\delta_{i_1i_2}G^{\alpha_1\alpha_2}_{(1)}$ that satisfies $$\begin{align}\sum_{j=1}^{n_1}\sum_{\beta\in \mathcal{D}}\left( \delta_{i_1j}G^{\alpha_1\beta}_{(1)}\right)\left( \delta_{ji_2}G^{(1)}_{\beta\alpha_2}\right)=\delta_{i_1i_2}\delta^{\alpha_1}_{\alpha_2},\end{align}$$ with the inverse of the first-layer metric $G^{(1)}_{\alpha_1\alpha_2}$ denoted as $G^{\alpha_1\alpha_2}_{(1)}$ and defined by $$\begin{align}\sum_{\beta\in\mathcal{D}}G^{\alpha_1\beta}_{(1)}G^{(1)}_{\beta\alpha_2}=\delta^{\alpha_1}_{\alpha_2}.\end{align}$$ With this notation, the Gaussian distribution for the first-layer preactivations is expressed as $$\begin{align}p\left( z^{(1)}|\mathcal{D}\right)=\frac{1}{Z}e^{-S\left( z^{(1)}\right)},\end{align}$$ with the quadratic action $$\begin{align}S\left(z^{(1)}\right)=\frac{1}{2}\sum_{i=1}^{n_1}\sum_{\alpha_1,\alpha_2\in \mathcal{D}}G^{\alpha_1\alpha_2}_{(1)}z^{(1)}_{i;\alpha_1}z^{(1)}_{i;\alpha_2},\end{align}$$ and the partition function $$\begin{align}Z=\int \left[ \prod_{i;\alpha} dz^{(1)}_{i;\alpha}\right] e^{-S\left( z^{(1)}\right)}=\left| 2\pi G^{(1)}\right|^{\frac{n_1}{2}},\end{align}$$ where $\left| 2\pi G^{(1)}\right|$ is the determinant of the $N_\mathcal{D}$-by-$N_\mathcal{D}$ matrix $2\pi G^{(1)}_{\alpha_1\alpha_2}$ and, whenever we write out a determinant involving the metric, it will always be that of the metric and not of the inverse metric.
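In code, the content of (13)-(15) is simply that each neuron of $z^{(1)}$ is an independent $N_\mathcal{D}$-dimensional Gaussian with covariance $G^{(1)}$; a sketch (continuing with `G1` from above) of sampling it directly and evaluating the quadratic action (14):

```python
# sample z^{(1)} directly from the Gaussian (13): each neuron is an independent
# N_D-dimensional Gaussian over sample indices with covariance G1
L = np.linalg.cholesky(G1)
z1_direct = (L @ rng.normal(size=(N_D, n1))).T          # shape (n1, N_D)

# quadratic action (14) for this draw, using the inverse metric
G1_inv = np.linalg.inv(G1)
S = 0.5 * np.einsum('ia,ab,ib->', z1_direct, G1_inv, z1_direct)
```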


Hubbard-Stratonovich this way: algebraic derivation via action

Let's start with the formal expression for the preactivation distribution (2.33) worked out in the  last chapter $$\begin{align}p\left( z|\mathcal{D}\right) = \int \left[ \prod_i db_i p(b_i) \right] \left[ \prod^{i,j} dW_{ij}p(W_{ij})\right]  \prod_{i,\alpha} \delta \left( z_{i;\alpha}-b_i-\sum_j W_{ij}x_{j;\alpha} \right).\end{align}$$


Then we import a neat trick from theoretical physics called the Hubbard-Stratonovich transformation. Specifically, using the following integral representation of the Dirac delta function (2.32) $$\begin{align}\delta (z-a)=\int \frac{d\Lambda}{2\pi}e^{i\Lambda (z-a)}\end{align}$$ for each constraint and also plugging in explicit expressions for the Gaussian distributions over the parameters, we obtain $$\begin{align}p\left( z|\mathcal{D}\right) = \int \left[ \prod_i \frac{db_i}{\sqrt{2\pi C_b}}\right] \left[ \prod_{i,j} \frac{dW_{ij}}{\sqrt{2\pi C_W /n_0}}\right] \left[ \prod_{i,\alpha}\frac{d\Lambda_i^\alpha}{2\pi}\right] \times \exp \left[ -\sum_i \frac{b_i^2}{2C_b}-n_0\sum_{i,j} \frac{W_{ij}^2}{2C_W}+i\sum_{i,\alpha} \Lambda_i^\alpha \left( z_{i;\alpha} -b_i -\sum_j W_{ij}x_{j;\alpha}\right)\right].\end{align}$$ Completing the square in the exponential, we can see that the action is quadratic in the model parameters $$\begin{align} -\sum_i \frac{b_i^2}{2C_b}-n_0\sum_{i,j} \frac{W_{ij}^2}{2C_W}+i\sum_{i,\alpha} \Lambda_i^\alpha \left( z_{i;\alpha} -b_i -\sum_j W_{ij}x_{j;\alpha}\right)=-\frac{1}{2C_b} \sum_i \left( b_i+iC_b\sum_\alpha \Lambda_i^\alpha\right)^2-\frac{C_b}{2}\sum_i\left(\sum_\alpha \Lambda_i^\alpha \right)^2-\frac{n_0}{2C_W} \sum_{i,j} \left( W_{ij}+i\frac{C_W}{n_0} \sum_\alpha \Lambda_i^\alpha x_{j;\alpha}\right)^2 -\frac{C_W}{2n_0}\sum_{i,j}\left( \sum_\alpha \Lambda_i^\alpha x_{j;\alpha}\right)^2 + i\sum_{i,\alpha} \Lambda_i^\alpha z_{i;\alpha}.\end{align}$$ The biases and weights can be integrated out, yielding an integral representation for the first-layer distribution $p\left(z^{(1)}|\mathcal{D}\right)$ as $$\begin{align} \int\left[ \prod_{i,\alpha} \frac{d\Lambda_i^\alpha}{2\pi} \right] \exp \left[ -\frac{1}{2}\sum_{i,\alpha_1,\alpha_2} \Lambda_i^{\alpha_1}\Lambda_i^{\alpha_2}\left( C_b+C_W \sum_j \frac{x_{j;\alpha_1}x_{j;\alpha_2}}{n_0}\right)+i\sum_{i,\alpha} \Lambda_i^\alpha z_{i;\alpha}\right].\end{align}$$


Note that the inverse variance for the Hubbard-Stratonovich variables $\Lambda_i^\alpha$ is just the first-layer metric (8) we introduced in the Wick-contraction derivation, $$\begin{align}C_b^{(1)}+C_W^{(1)}\sum_j \frac{x_{j;\alpha_1}x_{j;\alpha_2}}{n_0}=G^{(1)}_{\alpha_1\alpha_2}.\end{align}$$ Completing the square in $\Lambda_i^\alpha$, the exponential becomes $$\begin{align}-\frac{1}{2}\sum_{i;\alpha_1,\alpha_2}\left[ G^{(1)}_{\alpha_1\alpha_2} \left( \Lambda_i^{\alpha_1}-i\sum_{\beta_1}G^{\alpha_1\beta_1}_{(1)}z^{(1)}_{i;\beta_1}\right) \left( \Lambda_i^{\alpha_2}-i\sum_{\beta_2}G^{\alpha_2\beta_2}_{(1)}z^{(1)}_{i;\beta_2}\right) + G_{(1)}^{\alpha_1\alpha_2}z^{(1)}_{i;\alpha_1}z^{(1)}_{i;\alpha_2}\right],\end{align}$$ which finally lets us integrate out the Hubbard-Stratonovich variables $\Lambda_i^\alpha$ and recover our previous result $$\begin{align} p\left( z^{(1)}|\mathcal{D}\right) =\frac{1}{\left| 2\pi G^{(1)}\right|^\frac{n_1}{2}}\exp \left( -\frac{1}{2}\sum_{i=1}^{n_1} \sum_{\alpha_1,\alpha_2\in \mathcal{D}} G^{\alpha_1\alpha_2}_{(1)}z^{(1)}_{i;\alpha_1}z^{(1)}_{i;\alpha_2}\right).\end{align}$$


Gaussian action in action

We start by computing the expectation of two activations on the same neuron, $\mathbb{E}\left[ \sigma\left( z^{(1)}_{i_1;\alpha_1}\right) \sigma\left( z^{(1)}_{i_1;\alpha_2}\right)\right]$, and then the expectation of four activations, $\mathbb{E}\left[ \sigma\left( z^{(1)}_{i_1;\alpha_1}\right) \sigma\left( z^{(1)}_{i_1;\alpha_2}\right) \sigma\left( z^{(1)}_{i_2;\alpha_3}\right) \sigma\left( z^{(1)}_{i_2;\alpha_4}\right)\right]$, either with all four on the same neuron $i_1=i_2$ or with each pair on two separate neurons $i_1\ne i_2$.


Let's start with the two-point correlator of activations. Using the definition of the expectation and inserting the action representation of the distribution (23), we get $$\begin{align}\mathbb{E}\left[ \sigma\left( z^{(1)}_{i_1;\alpha_1}\right) \sigma\left( z^{(1)}_{i_1;\alpha_2}\right)\right]=\int \left[ \prod_{i=1}^{n_1} \frac{\prod_{\alpha\in\mathcal{D}}dz_{i;\alpha}}{\sqrt{\left| 2\pi G^{(1)}\right|}}\right] \exp \left( -\frac{1}{2}\sum_{j=1}^{n_1}\sum_{\beta_1,\beta_2\in\mathcal{D}}G^{\beta_1\beta_2}_{(1)}z_{j;\beta_1}z_{j;\beta_2}\right) \sigma(z_{i_1;\alpha_1})\sigma(z_{i_1;\alpha_2})= \left\{ \prod_{i\ne i_1}\int \left[ \frac{\prod_{\alpha\in\mathcal{D}}dz_{i;\alpha}}{\sqrt{\left| 2\pi G^{(1)}\right|}}\right] \exp \left( -\frac{1}{2}\sum_{\beta_1,\beta_2\in\mathcal{D}}G^{\beta_1\beta_2}_{(1)}z_{i;\beta_1}z_{i;\beta_2}\right)\right\} \times \int \left[ \frac{\prod_{\alpha\in\mathcal{D}}dz_{i_1;\alpha}}{\sqrt{\left| 2\pi G^{(1)}\right|}}\right] \exp \left( -\frac{1}{2}\sum_{\beta_1,\beta_2\in\mathcal{D}}G^{\beta_1\beta_2}_{(1)}z_{i_1;\beta_1}z_{i_1;\beta_2}\right) \sigma(z_{i_1;\alpha_1})\sigma(z_{i_1;\alpha_2})=\{1\}\times \int \left[ \frac{\prod_{\alpha\in\mathcal{D}}dz_{i_1;\alpha}}{\sqrt{\left| 2\pi G^{(1)}\right|}}\right] \exp \left( -\frac{1}{2}\sum_{\beta_1,\beta_2\in\mathcal{D}}G^{\beta_1\beta_2}_{(1)}z_{i_1;\beta_1}z_{i_1;\beta_2}\right) \sigma(z_{i_1;\alpha_1})\sigma(z_{i_1;\alpha_2})\equiv \left\langle \sigma(z_{\alpha_1})\sigma(z_{\alpha_2})\right\rangle_{G^{(1)}}.\end{align}$$ In the final line we introduced the notation $$\begin{align} \langle F(z_{\alpha_1},\cdots,z_{\alpha_m})\rangle_g  \equiv \int \left[ \frac{\prod_{\alpha\in \mathcal{D}} dz_{\alpha}}{\sqrt{\left| 2\pi g\right| }}\right] \exp \left( -\frac{1}{2} \sum_{\beta_1,\beta_2\in \mathcal{D}} g^{\beta_1\beta_2}z_{\beta_1}z_{\beta_2}\right)F(z_{\alpha_1},\cdots,z_{\alpha_m}),\end{align}$$ to denote the Gaussian expectation, with variance $g$, of an arbitrary function $F(z_{\alpha_1},\cdots,z_{\alpha_m})$ of variables carrying sample indices only. Introducing further the simplifying notation $$\begin{align}\sigma_\alpha\equiv\sigma(z_\alpha),\end{align}$$ the result of the computation above can be succinctly summarized as $$\begin{align}\mathbb{E}\left[ \sigma\left( z^{(1)}_{i_1;\alpha_1}\right) \sigma\left( z^{(1)}_{i_1;\alpha_2}\right)\right] =\langle \sigma_{\alpha_1}\sigma_{\alpha_2}\rangle_{G^{(1)}}.\end{align}$$
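Such Gaussian expectations rarely have closed forms for a generic activation, but a Monte Carlo sketch of the bracket (25) is short; it reuses `G1` and `rng` from the earlier snippets, and tanh is an arbitrary choice for $\sigma$:

```python
def gauss_pair_expect(g, a1, a2, sigma=np.tanh, n_samples=200_000, rng=rng):
    """Monte Carlo estimate of <sigma_{a1} sigma_{a2}>_g, the bracket (25)."""
    L = np.linalg.cholesky(g)
    z = rng.normal(size=(n_samples, g.shape[0])) @ L.T  # rows ~ N(0, g)
    return np.mean(sigma(z[:, a1]) * sigma(z[:, a2]))

print(gauss_pair_expect(G1, 0, 1))
```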


For four activations on the same neuron $i_1=i_2$, we have by the exact same manipulations $$\begin{align} \mathbb{E}\left[ \sigma\left( z^{(1)}_{i_1;\alpha_1}\right) \sigma\left( z^{(1)}_{i_1;\alpha_2}\right) \sigma\left( z^{(1)}_{i_1;\alpha_3}\right) \sigma\left( z^{(1)}_{i_1;\alpha_4}\right)\right]=\left\langle \sigma_{\alpha_1}\sigma_{\alpha_2}\sigma_{\alpha_3}\sigma_{\alpha_4}\right\rangle_{G^{(1)}},\end{align}$$ and for each pair on two different neurons $i_1\ne i_2$, we have $$\begin{align}\mathbb{E}\left[ \sigma\left( z^{(1)}_{i_1;\alpha_1}\right) \sigma\left( z^{(1)}_{i_1;\alpha_2}\right) \sigma\left( z^{(1)}_{i_2;\alpha_3}\right) \sigma\left( z^{(1)}_{i_2;\alpha_4}\right)\right]=\left\{ \prod_{i\notin \{i_1,i_2\}}\int \left[ \frac{\prod_{\alpha\in\mathcal{D}}dz_{i;\alpha}}{\sqrt{\left| 2\pi G^{(1)}\right|}}\right] \exp \left( -\frac{1}{2}\sum_{\beta_1,\beta_2\in\mathcal{D}}G^{\beta_1\beta_2}_{(1)}z_{i;\beta_1}z_{i;\beta_2}\right)\right\} \times \int \left[ \frac{\prod_{\alpha\in\mathcal{D}}dz_{i_1;\alpha}}{\sqrt{\left| 2\pi G^{(1)}\right|}}\right] \exp \left( -\frac{1}{2}\sum_{\beta_1,\beta_2\in\mathcal{D}}G^{\beta_1\beta_2}_{(1)}z_{i_1;\beta_1}z_{i_1;\beta_2}\right) \sigma(z_{i_1;\alpha_1})\sigma(z_{i_1;\alpha_2}) \times \int \left[ \frac{\prod_{\alpha\in\mathcal{D}}dz_{i_2;\alpha}}{\sqrt{\left| 2\pi G^{(1)}\right|}}\right] \exp \left( -\frac{1}{2}\sum_{\beta_1,\beta_2\in\mathcal{D}}G^{\beta_1\beta_2}_{(1)}z_{i_2;\beta_1}z_{i_2;\beta_2}\right) \sigma(z_{i_2;\alpha_3})\sigma(z_{i_2;\alpha_4})=\left\langle \sigma_{\alpha_1}\sigma_{\alpha_2}\right\rangle_{G^{(1)}}\left\langle \sigma_{\alpha_3}\sigma_{\alpha_4}\right\rangle_{G^{(1)}}.\end{align}$$ This illustrates the fact that neurons are independent, and thus there is no interaction among different neurons in the first layer. In deeper layers, the preactivation distributions become nearly-Gaussian, and things will be a bit more complicated.


4.2 Second Layer: Genesis of Non-Gaussianity

The preactivations in the second layer of an MLP are given by $$\begin{align} z^{(2)}_{i;\alpha}\equiv z^{(2)}_i(x_\alpha)=b^{(2)}_i+\sum^{n_1}_{j=1}W^{(2)}_{ij}\sigma^{(1)}_{j;\alpha},\quad \mbox{for }i=1,\cdots,n_2,\end{align}$$ with the first-layer activations denoted as $$\begin{align}\sigma^{(1)}_{i;\alpha}\equiv\sigma\left( z^{(1)}_{i;\alpha}\right),\end{align}$$ and the biases $b^{(2)}$ and weights $W^{(2)}$ sampled from mean-zero Gaussian distributions.


The joint distribution of preactivations in the first and second layers can be factorized as $$\begin{align}p\left( z^{(2)},z^{(1)}|\mathcal{D}\right) =p\left( z^{(2)}|z^{(1)}\right) p\left( z^{(1)}|\mathcal{D}\right).\end{align}$$ Here the first-layer marginal distribution $p\left(z^{(1)}|\mathcal{D}\right)$ was evaluated in the last section, 4.1, to be a Gaussian distribution (23) with the variance given in terms of the first-layer metric $G^{(1)}_{\alpha_1\alpha_2}$. As for the conditional distribution, we know it can be expressed as $$\begin{align}p\left( z^{(2)}|z^{(1)}\right) = \int \left[ \prod_i db_i^{(2)}p\left( b^{(2)}_i\right) \right] \left[ \prod_{i,j} dW^{(2)}_{ij}p\left( W^{(2)}_{ij}\right)\right] \prod_{i,\alpha} \delta \left( z^{(2)}_{i;\alpha}-b^{(2)}_i-\sum_j W^{(2)}_{ij}\sigma^{(1)}_{j;\alpha}\right),\end{align}$$ from the formal expression (2.34) for the preactivation distribution conditioned on the activations in the previous layer. The marginal distribution of the second-layer preactivations can then be obtained by marginalizing over, or integrating out, the first-layer preactivations as $$\begin{align}p\left( z^{(2)}|\mathcal{D}\right) = \int \left[ \prod_{i,\alpha}dz^{(1)}_{i;\alpha}\right] p\left( z^{(2)}|z^{(1)}\right)p\left( z^{(1)}|\mathcal{D}\right).\end{align}$$ To evaluate this expression for the marginal distribution $p\left( z^{(2)}|\mathcal{D}\right)$, first we’ll discuss how to treat the conditional distribution $p\left( z^{(2)}|z^{(1)}\right)$, and then we’ll explain how to integrate over the first-layer preactivations $z^{(1)}$ governed by the Gaussian distribution (23).


Second-layer conditional distribution

This is almost the same as for the first layer, with the replacement $x_{j;\alpha}\rightarrow \sigma^{(1)}_{j;\alpha}$, giving $$\begin{align} p\left( z^{(2)}|z^{(1)}\right) =\frac{1}{\sqrt{\left| 2\pi \hat{G}^{(2)}\right|^{n_2}}}\exp \left( -\frac{1}{2}\sum_{i=1}^{n_2} \sum_{\alpha_1,\alpha_2\in \mathcal{D}} \hat{G}^{\alpha_1\alpha_2}_{(2)}z^{(2)}_{i;\alpha_1}z^{(2)}_{i;\alpha_2}\right),\end{align}$$ where we have defined the stochastic second-layer metric $$\begin{align}\hat{G}^{(2)}_{\alpha_1\alpha_2}\equiv C^{(2)}_b+C^{(2)}_W\frac{1}{n_1}\sum^{n_1}_{j=1}\sigma^{(1)}_{j;\alpha_1}\sigma^{(1)}_{j;\alpha_2},\end{align}$$ with a hat to emphasize that it is a random variable that depends on the stochastic variable $z^{(1)}$ through $\sigma^{(1)}\equiv \sigma\left( z^{(1)}\right)$. The stochastic second-layer metric fluctuates around the mean second-layer metric $$\begin{align}G^{(2)}_{\alpha_1\alpha_2}\equiv \mathbb{E}\left[ \hat{G}^{(2)}_{\alpha_1\alpha_2}\right] =C^{(2)}_b+C^{(2)}_W\frac{1}{n_1}\sum_{j=1}^{n_1}\mathbb{E}\left[ \sigma^{(1)}_{j;\alpha_1}\sigma^{(1)}_{j;\alpha_2}\right]=C^{(2)}_b+C^{(2)}_W\left\langle \sigma_{\alpha_1}\sigma_{\alpha_2}\right\rangle_{G^{(1)}},\end{align}$$ where in the last step we recalled the result (27) for evaluating the two-point correlator of the first-layer activations on the same neuron.


Around the mean, we define the fluctuation of the second-layer metric as $$\begin{align}\hat{\Delta G}_{\alpha_1\alpha_2}^{(2)}\equiv \hat{G}^{(2)}_{\alpha_1\alpha_2}-G^{(2)}_{\alpha_1\alpha_2}=C^{(2)}_W\frac{1}{n_1}\sum^{n_1}_{j=1} \left( \sigma^{(1)}_{j;\alpha_1}\sigma^{(1)}_{j;\alpha_2}-\left\langle \sigma_{\alpha_1}\sigma_{\alpha_2} \right\rangle_{G^{(1)}} \right) ,\end{align}$$ which by construction has mean zero when averaged over the first-layer preactivations, $$\begin{align}\mathbb{E}\left[ \hat{\Delta G}^{(2)}_{\alpha_1\alpha_2}\right]=0.\end{align}$$ The typical size of the fluctuations is given by its two-point correlator. Recalling (27), (28), and (29), we obtain $$\begin{align}\mathbb{E}\left[ \hat{\Delta G}^{(2)}_{\alpha_1\alpha_2}\hat{\Delta G}^{(2)}_{\alpha_3\alpha_4}\right]\\ =\left( \frac{C^{(2)}_W}{n_1}\right)^2\sum^{n_1}_{j,k=1}\mathbb{E}\left[ \left( \sigma^{(1)}_{j;\alpha_1}\sigma^{(1)}_{j;\alpha_2}-\mathbb{E}\left[ \sigma^{(1)}_{j;\alpha_1}\sigma^{(1)}_{j;\alpha_2} \right]\right) \left( \sigma^{(1)}_{k;\alpha_3}\sigma^{(1)}_{k;\alpha_4}-\mathbb{E}\left[ \sigma^{(1)}_{k;\alpha_3}\sigma^{(1)}_{k;\alpha_4} \right]\right) \right] \nonumber\\ =\left( \frac{C^{(2)}_W}{n_1}\right)^2\sum^{n_1}_{j=1}\left\{ \mathbb{E}\left[  \sigma^{(1)}_{j;\alpha_1}\sigma^{(1)}_{j;\alpha_2} \sigma^{(1)}_{j;\alpha_3}\sigma^{(1)}_{j;\alpha_4} \right]-\mathbb{E}\left[  \sigma^{(1)}_{j;\alpha_1}\sigma^{(1)}_{j;\alpha_2}\right] \mathbb{E}\left[ \sigma^{(1)}_{j;\alpha_3}\sigma^{(1)}_{j;\alpha_4} \right] \right\} \nonumber\\ =\frac{1}{n_1}\left( C^{(2)}_W\right)^2 \left[\langle \sigma_{\alpha_1}\sigma_{\alpha_2}\sigma_{\alpha_3}\sigma_{\alpha_4}\rangle_{G^{(1)}}-\langle \sigma_{\alpha_1}\sigma_{\alpha_2}\rangle_{G^{(1)}} \langle \sigma_{\alpha_3}\sigma_{\alpha_4}\rangle_{G^{(1)}}\right] \nonumber\\ \equiv \frac{1}{n_1}V^{(2)}_{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}, \nonumber\end{align}$$ where to get to the second line we used the independence of different neurons in the first layer (the $j\ne k$ cross terms vanish), and at the end we defined the second-layer four-point vertex $V^{(2)}_{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}=V^{(2)}(x_{\alpha_1},x_{\alpha_2};x_{\alpha_3},x_{\alpha_4})$, which depends on four input data points and is symmetric under exchanges of sample indices $\alpha_1\leftrightarrow \alpha_2$, $\alpha_3 \leftrightarrow \alpha_4$, and $(\alpha_1,\alpha_2)\leftrightarrow (\alpha_3,\alpha_4)$. We see that when $n_1\gg 1$ the metric fluctuations are suppressed, and the second-layer distribution becomes Gaussian in the strict infinite-width limit.
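Numerically, this suppression is easy to see; a sketch continuing with `G1` and `rng` from the first-layer snippets (the width $n_1$, layer-two variances, and tanh activation are illustrative assumptions): draw $z^{(1)}$, form the stochastic metric (36) per draw, and check that its fluctuations have covariance $V^{(2)}/n_1$.

```python
n1, C_b2, C_W2 = 500, 1.0, 1.0            # first-layer width, layer-2 variances (assumed)
N_D, draws = G1.shape[0], 4_000

L = np.linalg.cholesky(G1)
z1 = rng.normal(size=(draws, n1, N_D)) @ L.T     # neurons i.i.d. N(0, G1)
s = np.tanh(z1)

# stochastic metric (36) for each draw, its mean (37), and its fluctuation (38)
Ghat2 = C_b2 + (C_W2 / n1) * np.einsum('tja,tjb->tab', s, s)
G2 = Ghat2.mean(axis=0)
dG = Ghat2 - G2

# covariance of the fluctuations, which per (39) is V^{(2)} / n1
cov = np.einsum('tab,tcd->abcd', dG, dG) / draws
V2 = n1 * cov                                    # estimate of the four-point vertex
```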


Wick Wick Wick: combinatorial derivation

The two-point correlator of the second-layer preactivations is given by $$\begin{align}\mathbb{E}\left[ z^{(2)}_{i_1;\alpha_1}z^{(2)}_{i_2;\alpha_2}\right] =\delta_{i_1i_2}\mathbb{E}\left[ \hat{G}^{(2)}_{\alpha_1\alpha_2}\right] =\delta_{i_1i_2}G^{(2)}_{\alpha_1\alpha_2}=\delta_{i_1i_2}\left( C^{(2)}_b+C^{(2)}_W\langle \sigma_{\alpha_1}\sigma_{\alpha_2}\rangle_{G^{(1)}}\right),\end{align}$$ where we first used (35) to perform a single Wick contraction and then inserted (37) for the mean of the stochastic metric.


Similarly, the full four-point function can be evaluated as $$\begin{align} \mathbb{E}\left[ z^{(2)}_{i_1;\alpha_1}z^{(2)}_{i_2;\alpha_2}z^{(2)}_{i_3;\alpha_3}z^{(2)}_{i_4;\alpha_4}\right] \nonumber\\ = \delta_{i_1i_2}\delta_{i_3i_4}\mathbb{E}\left[ \hat{G}^{(2)}_{\alpha_1\alpha_2}\hat{G}^{(2)}_{\alpha_3\alpha_4}\right] + \delta_{i_1i_3}\delta_{i_2i_4}\mathbb{E}\left[ \hat{G}^{(2)}_{\alpha_1\alpha_3}\hat{G}^{(2)}_{\alpha_2\alpha_4}\right] +\delta_{i_1i_4}\delta_{i_2i_3}\mathbb{E}\left[ \hat{G}^{(2)}_{\alpha_1\alpha_4}\hat{G}^{(2)}_{\alpha_2\alpha_3}\right]\nonumber\\ =\delta_{i_1i_2}\delta_{i_3i_4}G^{(2)}_{\alpha_1\alpha_2}G^{(2)}_{\alpha_3\alpha_4}+\delta_{i_1i_3}\delta_{i_2i_4}G^{(2)}_{\alpha_1\alpha_3}G^{(2)}_{\alpha_2\alpha_4}+\delta_{i_1i_4}\delta_{i_2i_3}G^{(2)}_{\alpha_1\alpha_4}G^{(2)}_{\alpha_2\alpha_3} +\frac{1}{n_1}\left[ \delta_{i_1i_2}\delta_{i_3i_4}V^{(2)}_{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}+\delta_{i_1i_3}\delta_{i_2i_4}V^{(2)}_{(\alpha_1\alpha_3)(\alpha_2\alpha_4)} +\delta_{i_1i_4}\delta_{i_2i_3}V^{(2)}_{(\alpha_1\alpha_4)(\alpha_2\alpha_3)}\right],\end{align}$$ where in the first line we made three Wick contractions of the four second-layer preactivations $z^{(2)}$'s using the Gaussian distribution (35), and in the second line we recalled (39) and (40). Note that the stochastic metric $\hat{G}^{(2)}$ is a function of the first-layer preactivations $z^{(1)}$, so the remaining expectations are over the first-layer distribution. Subtracting the Wick pairings of two-point correlators, we get the connected four-point correlator $$\begin{align} \mathbb{E}\left.\left[ z^{(2)}_{i_1;\alpha_1}z^{(2)}_{i_2;\alpha_2}z^{(2)}_{i_3;\alpha_3}z^{(2)}_{i_4;\alpha_4}\right]\right|_\mbox{connected} \nonumber\\ =\frac{1}{n_1}\left[ \delta_{i_1i_2}\delta_{i_3i_4}V^{(2)}_{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}+\delta_{i_1i_3}\delta_{i_2i_4}V^{(2)}_{(\alpha_1\alpha_3)(\alpha_2\alpha_4)} +\delta_{i_1i_4}\delta_{i_2i_3}V^{(2)}_{(\alpha_1\alpha_4)(\alpha_2\alpha_3)}\right].\end{align}$$ This connected second-layer four-point correlator is finite at finite width and controls the near-Gaussianity of the second-layer preactivation distribution: $p\left( z^{(2)}|\mathcal{D}\right)$ is in general non-Gaussian, but it becomes more and more Gaussian as $n_1 \gg 1$ grows.
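The $1/n_1$ suppression in (43) can be checked directly. A self-contained sketch for a single input (so all sample indices coincide), with the bias set to zero for simplicity and tanh as the assumed activation; for all neuron indices equal, (43) predicts the connected correlator $3V^{(2)}/n_1$:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, C_W, K1 = 4, 1.0, 1.0       # small width; single-input first-layer metric (assumed)
N = 1_000_000

z1 = rng.normal(0.0, np.sqrt(K1), size=(N, n1))
s = np.tanh(z1)
W2 = rng.normal(0.0, np.sqrt(C_W / n1), size=(N, n1))
z2 = np.einsum('tj,tj->t', W2, s)               # second layer, zero bias

# connected four-point correlator with all neuron indices equal: E[z^4] - 3 E[z^2]^2
connected = np.mean(z2**4) - 3 * np.mean(z2**2) ** 2

# prediction from (43): 3 V^{(2)} / n1 with V^{(2)} = C_W^2 (<s^4> - <s^2>^2)
V2 = C_W**2 * (np.mean(s**4) - np.mean(s**2) ** 2)
print(connected, 3 * V2 / n1)                   # agree at the few-percent level
```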


Now we look for an action that generates the correlators (41) and (43). We posit a quartic correction on top of the quadratic action. Let's start with a quartic action for an $(nN_\mathcal{D})$-dimensional random variable $z$, $$\begin{align}S[z]=\frac{1}{2}\sum_{\alpha_1,\alpha_2\in \mathcal{D}}g^{\alpha_1\alpha_2}\sum^n_{i=1}z_{i;\alpha_1}z_{i;\alpha_2}-\frac{1}{8}\sum_{\alpha_1,\cdots,\alpha_4\in \mathcal{D}}v^{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}\sum^n_{i_1,i_2=1}z_{i_1;\alpha_1}z_{i_1;\alpha_2}z_{i_2;\alpha_3}z_{i_2;\alpha_4},\end{align}$$ with undetermined couplings $g$ and $v$. We will treat the quartic coupling $v$ perturbatively, an assumption that we will justify shortly by relating the quartic coupling $v$ to the $1/n_1$-suppressed connected four-point correlator. Note that by construction the quartic coupling $v^{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}$ has the same symmetry structure as the four-point vertex $V^{(2)}_{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}$ with respect to the sample indices.


Before proceeding further, it is convenient to introduce some notation. In (25), we defined $\langle F(z_{\alpha_1},\cdots,z_{\alpha_m})\rangle_g$ for the average of an arbitrary function $F$ over a Gaussian distribution with variance $g$, where the preactivation variables $z_\alpha$ carry sample indices only. In addition, we here define $$\begin{align} \langle\langle F(z_{i_1;\alpha_1},\cdots,z_{i_m;\alpha_m})\rangle\rangle_g \nonumber\\ \equiv \int \left[ \prod^n_{i=1}\frac{\prod_{\alpha\in \mathcal{D}} dz_{i;\alpha}}{\sqrt{\left| 2\pi g\right| }}\right] \exp \left( -\frac{1}{2} \sum^n_{j=1}\sum_{\beta_1,\beta_2\in \mathcal{D}} g^{\beta_1\beta_2}z_{j;\beta_1}z_{j;\beta_2}\right)F(z_{i_1;\alpha_1},\cdots,z_{i_m;\alpha_m}),\end{align}$$ which now includes neural indices.


With this notation in hand, the expectation of an arbitrary function $F(z_{i_1;\alpha_1},\cdots,z_{i_m;\alpha_m})$ against a distribution with the quartic action can be rewritten in terms of Gaussian expectations, enabling the perturbative expansion in the coupling $v$ as $$\begin{align}\mathbb{E}[F(z_{i_1;\alpha_1},\cdots,z_{i_m;\alpha_m})]=\frac{\int \left[ \prod_{i,\alpha}dz_{i;\alpha}\right] e^{-S(z)}F(z_{i_1;\alpha_1},\cdots,z_{i_m;\alpha_m})}{\int \left[ \prod_{i,\alpha}dz_{i;\alpha}\right] e^{-S(z)}}\\ =\frac{\langle\langle \exp \left\{ \frac{1}{8}\sum_{\beta_1,\cdots,\beta_4\in\mathcal{D}}v^{(\beta_1\beta_2)(\beta_3\beta_4)}\sum^n_{j_1,j_2=1}z_{j_1;\beta_1}z_{j_1;\beta_2}z_{j_2;\beta_3}z_{j_2;\beta_4}\right\} F(z_{i_1;\alpha_1},\cdots,z_{i_m;\alpha_m}) \rangle\rangle_g}{\langle\langle \exp \left\{ \frac{1}{8}\sum_{\beta_1,\cdots,\beta_4\in\mathcal{D}}v^{(\beta_1\beta_2)(\beta_3\beta_4)}\sum^n_{j_1,j_2=1}z_{j_1;\beta_1}z_{j_1;\beta_2}z_{j_2;\beta_3}z_{j_2;\beta_4}\right\} \rangle\rangle_g} \nonumber\\ =\langle\langle F(z_{i_1;\alpha_1},\cdots,z_{i_m;\alpha_m})\rangle\rangle_g + \frac{1}{8}\sum_{\beta_1,\cdots,\beta_4\in\mathcal{D}}v^{(\beta_1\beta_2)(\beta_3\beta_4)}\sum^n_{j_1,j_2=1}\left[ \langle\langle z_{j_1;\beta_1}z_{j_1;\beta_2}z_{j_2;\beta_3}z_{j_2;\beta_4}F(z_{i_1;\alpha_1},\cdots,z_{i_m;\alpha_m})\rangle\rangle_g \nonumber\\ - \langle\langle z_{j_1;\beta_1}z_{j_1;\beta_2}z_{j_2;\beta_3}z_{j_2;\beta_4}\rangle\rangle_g\langle\langle F(z_{i_1;\alpha_1},\cdots,z_{i_m;\alpha_m})\rangle\rangle_g\right] + O\left( v^2\right),\end{align}$$ where in the last step we expanded both the numerator and the denominator to first order in the quartic coupling $v$.


With this in mind, let's consider some particular choices for $F$. Starting with the two-point correlator, we get $$\begin{align} \mathbb{E}[z_{i_1;\alpha_1}z_{i_2;\alpha_2}]=\delta_{i_1i_2}\left[ g_{\alpha_1\alpha_2}+\frac{1}{2}\sum_{\beta_1,\cdots,\beta_4\in \mathcal{D}}v^{(\beta_1\beta_2)(\beta_3\beta_4)}(ng_{\alpha_1\beta_1}g_{\alpha_2\beta_2}g_{\beta_3\beta_4}+2g_{\alpha_1\beta_1}g_{\alpha_2\beta_3}g_{\beta_2\beta_4})\right] +O\left( v^2\right).\end{align}$$ Similarly, we find that the connected four-point correlator evaluates to $$\begin{align}\mathbb{E}\left. [z_{i_1;\alpha_1}z_{i_2;\alpha_2}z_{i_3;\alpha_3}z_{i_4;\alpha_4}]\right|_\mbox{connected}\\ \equiv \mathbb{E} [z_{i_1;\alpha_1}z_{i_2;\alpha_2}z_{i_3;\alpha_3}z_{i_4;\alpha_4}] -\mathbb{E}[z_{i_1;\alpha_1}z_{i_2;\alpha_2}]\mathbb{E}[z_{i_3;\alpha_3}z_{i_4;\alpha_4}]  \nonumber\\ -\mathbb{E}[z_{i_1;\alpha_1}z_{i_3;\alpha_3}]\mathbb{E}[z_{i_2;\alpha_2}z_{i_4;\alpha_4}]  -\mathbb{E}[z_{i_1;\alpha_1}z_{i_4;\alpha_4}]\mathbb{E}[z_{i_2;\alpha_2}z_{i_3;\alpha_3}] \nonumber\\ = \delta_{i_1i_2}\delta_{i_3i_4}\sum_{\beta_1,\cdots,\beta_4\in\mathcal{D}}v^{(\beta_1\beta_2)(\beta_3\beta_4)}g_{\alpha_1\beta_1}g_{\alpha_2\beta_2}g_{\alpha_3\beta_3}g_{\alpha_4\beta_4} \nonumber\\ +\delta_{i_1i_3}\delta_{i_2i_4}\sum_{\beta_1,\cdots,\beta_4\in\mathcal{D}}v^{(\beta_1\beta_3)(\beta_2\beta_4)}g_{\alpha_1\beta_1}g_{\alpha_3\beta_3}g_{\alpha_2\beta_2}g_{\alpha_4\beta_4} \nonumber\\ +\delta_{i_1i_4}\delta_{i_2i_3}\sum_{\beta_1,\cdots,\beta_4\in\mathcal{D}}v^{(\beta_1\beta_4)(\beta_2\beta_3)}g_{\alpha_1\beta_1}g_{\alpha_4\beta_4}g_{\alpha_2\beta_2}g_{\alpha_3\beta_3}.\nonumber\end{align}$$ Comparing these expressions, (47) and (48), with the correlators in the second layer, (41) and (43), we find that setting $$\begin{align} g^{\alpha_1\alpha_2}=G^{\alpha_1\alpha_2}_{(2)}+O\left( \frac{1}{n_1}\right),\\ v^{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}=\frac{1}{n_1}V^{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}_{(2)}+O\left(\frac{1}{n_1^2}\right),\end{align}$$ reproduces the second-layer preactivation correlators to leading order in $1/n_1$, with the marginal distribution $$\begin{align}p\left( z^{(2)}|\mathcal{D}\right) =\frac{1}{Z}e^{-S\left( z^{(2)}\right)}\end{align}$$ and the quartic action (44). Here for convenience we have defined a version of the four-point vertex with indices raised by the inverse of the second-layer mean metric $$\begin{align} V^{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}_{(2)}\equiv \sum_{\beta_1,\cdots,\beta_4}G^{\alpha_1\beta_1}_{(2)}G^{\alpha_2\beta_2}_{(2)}G^{\alpha_3\beta_3}_{(2)}G^{\alpha_4\beta_4}_{(2)}V^{(2)}_{(\beta_1\beta_2)(\beta_3\beta_4)}.\end{align}$$ Note that the quartic coupling $v$ is $O(1/n_1)$, justifying our perturbative treatment of the coupling for wide networks. Note also that the couplings $g$ and $v$ are data dependent: through them, the effective strength of the interaction between neurons depends on the inputs to the network.


Schwinger-Dyson this way: algebraic derivation



Nearly-Gaussian action in action



4.3 Deeper Layers: Accumulation of Non-Gaussianity


Recursion

The derivation proceeds just as for the second layer, yielding recursions that relate the metric and four-point vertex of layer $l+1$ to those of layer $l$. With each layer, the non-Gaussian interactions among neurons are amplified and accumulate.

Action

$$p\left( z^{(l)}|\mathcal{D}\right) = \frac{e^{-S\left(z^{(l)}\right)}}{Z(l)},$$

$$Z(l)\equiv \int \left[ \prod_{i,\alpha} dz^{(l)}_{i;\alpha}\right] e^{-S\left(z^{(l)}\right)},$$

$$\begin{align}S\left[z^{(l)}\right]=\frac{1}{2}\sum_{\alpha_1,\alpha_2\in \mathcal{D}}g^{\alpha_1\alpha_2}_{(l)}\sum^{n_l}_{i=1}z^{(l)}_{i;\alpha_1}z^{(l)}_{i;\alpha_2}-\frac{1}{8}\sum_{\alpha_1,\cdots,\alpha_4\in \mathcal{D}}v^{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}_{(l)}\sum^{n_l}_{i_1,i_2=1}z^{(l)}_{i_1;\alpha_1}z^{(l)}_{i_1;\alpha_2}z^{(l)}_{i_2;\alpha_3}z^{(l)}_{i_2;\alpha_4}+\cdots\end{align}$$


Here, the coefficients $g^{\alpha_1\alpha_2}_{(l)}$, $v^{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}_{(l)}$, and the implied additional terms in the expansion are data-dependent couplings that together govern the interactions of the neural preactivations and are simply related to the correlators of preactivations $z^{(l)}$. In particular, in 4.2 we gave two derivations for the relations between quadratic and quartic couplings on the one hand and two-point and four-point correlators on the other hand. The same argument applies for an arbitrary layer $l$, and so we have

$$g^{\alpha_1\alpha_2}_{(l)}=G^{\alpha_1\alpha_2}_{(l)}+O(v,\cdots),$$

$$v^{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}_{(l)}=\frac{1}{n_{l-1}}V^{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}_{(l)}+O(v^2,\cdots),$$ with

$$V^{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}_{(l)}\equiv \sum_{\beta_1,\cdots,\beta_4}G^{\alpha_1\beta_1}_{(l)}G^{\alpha_2\beta_2}_{(l)}G^{\alpha_3\beta_3}_{(l)}G^{\alpha_4\beta_4}_{(l)}V^{(l)}_{(\beta_1\beta_2)(\beta_3\beta_4)}.$$


Large-width expansion

We work in the regime where all hidden-layer widths are large and of the same order, $$n_1,n_2,\cdots,n_{L-1}\sim n\gg 1.$$


The exact recursion for the mean metric reads $$G^{(l+1)}_{\alpha_1\alpha_2}=C^{(l+1)}_b+C^{(l+1)}_W\frac{1}{n_l}\sum^{n_l}_{j=1}\mathbb{E}\left[ \sigma^{(l)}_{j;\alpha_1}\sigma^{(l)}_{j;\alpha_2}\right].$$

At leading order in $1/n$, the expectation can be evaluated with the Gaussian distribution of layer $l$, giving $$G^{(l+1)}_{\alpha_1\alpha_2}=C^{(l+1)}_b+C^{(l+1)}_W\langle \sigma_{\alpha_1}\sigma_{\alpha_2}\rangle_{G^{(l)}}+O\left(\frac{1}{n}\right).$$


Similarly, the recursion for the four-point vertex starts from $$\frac{1}{n_l}V^{(l+1)}_{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}=\left(\frac{C^{(l+1)}_W}{n_l}\right)^2\sum^{n_l}_{j,k=1}\left\{ \mathbb{E}\left[ \sigma^{(l)}_{j;\alpha_1}\sigma^{(l)}_{j;\alpha_2}\sigma^{(l)}_{k;\alpha_3}\sigma^{(l)}_{k;\alpha_4}\right] - \mathbb{E}\left[ \sigma^{(l)}_{j;\alpha_1}\sigma^{(l)}_{j;\alpha_2}\right] \mathbb{E}\left[\sigma^{(l)}_{k;\alpha_3}\sigma^{(l)}_{k;\alpha_4}\right]\right\}.$$


For the diagonal terms $j=k$, the leading-order Gaussian expectation gives $$\mathbb{E}\left[ \sigma^{(l)}_{j;\alpha_1}\sigma^{(l)}_{j;\alpha_2}\sigma^{(l)}_{j;\alpha_3}\sigma^{(l)}_{j;\alpha_4}\right] - \mathbb{E}\left[ \sigma^{(l)}_{j;\alpha_1}\sigma^{(l)}_{j;\alpha_2}\right] \mathbb{E}\left[\sigma^{(l)}_{j;\alpha_3}\sigma^{(l)}_{j;\alpha_4}\right]\\ =\langle\sigma_{\alpha_1}\sigma_{\alpha_2}\sigma_{\alpha_3}\sigma_{\alpha_4}\rangle_{G^{(l)}}-\langle\sigma_{\alpha_1}\sigma_{\alpha_2}\rangle_{G^{(l)}}\langle\sigma_{\alpha_3}\sigma_{\alpha_4}\rangle_{G^{(l)}}+O\left( \frac{1}{n}\right),$$


while for the off-diagonal terms $j\ne k$, the leading contribution comes from the layer-$l$ four-point vertex, $$\mathbb{E}\left[ \sigma^{(l)}_{j;\alpha_1}\sigma^{(l)}_{j;\alpha_2}\sigma^{(l)}_{k;\alpha_3}\sigma^{(l)}_{k;\alpha_4}\right] - \mathbb{E}\left[ \sigma^{(l)}_{j;\alpha_1}\sigma^{(l)}_{j;\alpha_2}\right] \mathbb{E}\left[\sigma^{(l)}_{k;\alpha_3}\sigma^{(l)}_{k;\alpha_4}\right]\\ =\frac{1}{4n_{l-1}}\sum_{\beta_1,\cdots,\beta_4\in \mathcal{D}}V^{(\beta_1\beta_2)(\beta_3\beta_4)}_{(l)}\langle\sigma_{\alpha_1}\sigma_{\alpha_2}(z_{\beta_1}z_{\beta_2}-g_{\beta_1\beta_2})\rangle_{G^{(l)}}\langle\sigma_{\alpha_3}\sigma_{\alpha_4}(z_{\beta_3}z_{\beta_4}-g_{\beta_3\beta_4})\rangle_{G^{(l)}} \\ +O\left( \frac{1}{n^2}\right).$$


Combining the two contributions, we arrive at $$\frac{1}{n_l}V^{(l+1)}_{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}=\frac{1}{n_l}\left( C^{(l+1)}_W\right)^2 [\langle\sigma_{\alpha_1}\sigma_{\alpha_2}\sigma_{\alpha_3}\sigma_{\alpha_4}\rangle_{G^{(l)}}-\langle\sigma_{\alpha_1}\sigma_{\alpha_2}\rangle_{G^{(l)}}\langle\sigma_{\alpha_3}\sigma_{\alpha_4}\rangle_{G^{(l)}}]+\frac{1}{n_{l-1}}\frac{\left(C^{(l+1)}_W\right)^2}{4}\sum_{\beta_1,\cdots,\beta_4\in \mathcal{D}}V^{(\beta_1\beta_2)(\beta_3\beta_4)}_{(l)}\langle\sigma_{\alpha_1}\sigma_{\alpha_2}(z_{\beta_1}z_{\beta_2}-g_{\beta_1\beta_2})\rangle_{G^{(l)}}\langle\sigma_{\alpha_3}\sigma_{\alpha_4}(z_{\beta_3}z_{\beta_4}-g_{\beta_3\beta_4})\rangle_{G^{(l)}} \\ +O\left( \frac{1}{n^2}\right),$$


so that the four-point vertex stays suppressed by one power of the width, $$\frac{1}{n_l}V^{(l+1)}=O\left(\frac{1}{n}\right).$$


4.4 Marginalization Rules

In the past sections, at each step in the recursions we marginalized over all the preactivations in a given layer. This section collects two remarks on other sorts of partial marginalizations we can perform, rather than integrating out an entire layer. In particular, we’ll discuss marginalization over a subset of the $N_\mathcal{D}$ samples in the dataset $\mathcal{D}$ and marginalization over a subset of neurons in a layer. 


Loosely speaking, these marginalizations let us focus on specific input data and neurons of interest. More precisely, let's consider evaluating the expectation of a function $F(z_{I;\mathcal{A}})=F\left( \{z_{i;\alpha}\}_{i\in I;\alpha\in \mathcal{A}}\right)$ that depends on a subsample $\mathcal{A}\subset\mathcal{D}$ and a subset of neurons $I\subset \{1,\cdots,n_l\}\equiv \mathcal{N}$ in a layer $l$, where with a slight abuse of notation we put the set dependence into subscripts. We then have $$\begin{align}\mathbb{E}[F(z_{I;\mathcal{A}})]=\int \left[ \prod_{i\in\mathcal{N}}\prod_{\alpha\in\mathcal{D}}dz_{i;\alpha}\right] F(z_{I;\mathcal{A}}) p\left( z_{\mathcal{N};\mathcal{D}}|\mathcal{D}\right) =\int \left[ \prod_{i\in I}\prod_{\alpha\in\mathcal{A}}dz_{i;\alpha}\right] F(z_{I;\mathcal{A}})\left\{ \int \left[ \prod_{(j;\beta)\in [\mathcal{N}\times \mathcal{D}-I\times \mathcal{A}]}dz_{j;\beta}\right] p\left( z_{\mathcal{N};\mathcal{D}} |\mathcal{D}\right) \right\}= \int \left[ \prod_{i\in I}\prod_{\alpha\in\mathcal{A}}dz_{i;\alpha}\right] F(z_{I;\mathcal{A}}) p\left( z_{I;\mathcal{A}}|\mathcal{A}\right),\end{align}$$ where the last equality is just the marginalization over the spectator variables that do not enter into the observable of interest and, in a sense, defines the subsampled and subneuroned distribution as $$\begin{align}p\left( z_{I;\mathcal{A}}|\mathcal{A}\right) \equiv \int \left[ \prod_{(j;\beta)\in [\mathcal{N}\times \mathcal{D}-I\times \mathcal{A}]}dz_{j;\beta}\right] p\left( z_{\mathcal{N};\mathcal{D}} |\mathcal{D}\right). \end{align}$$ In words, in evaluating the expectation of the function $F(z_{I;\mathcal{A}})$, the full distribution $p\left( z_{\mathcal{N};\mathcal{D}} |\mathcal{D}\right)$ can simply be restricted to that of the subsample $\mathcal{A}$ and subneurons $I$, i.e., $p\left( z_{I;\mathcal{A}} |\mathcal{A}\right)$. We call this property a marginalization rule.
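For a Gaussian layer the marginalization rule is especially concrete: restricting to a subsample $\mathcal{A}$ just restricts the metric to the corresponding rows and columns; it is the couplings (the inverse metric), not the correlators, that get adjusted. A self-contained toy check, with all numbers illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 10))
G = 0.5 + X @ X.T / 10            # toy first-layer metric, N_D = 5 (assumed values)
A = [0, 2]                        # subsample of interest

z = rng.multivariate_normal(np.zeros(5), G, size=100_000)
emp = np.cov(z[:, A].T)           # empirical covariance of the kept samples
print(np.max(np.abs(emp - G[np.ix_(A, A)])))    # matches the restricted metric
```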


Marginalization over samples


Marginalization over neurons

The second corollary involves integrating out a subset of neurons in a layer. Prudent readers might have worried that the quartic term in the $l$-th-layer action, $$\begin{align}-\frac{1}{8}\sum^{n_l}_{i_1,i_2=1}\sum_{\alpha_1,\cdots,\alpha_4\in\mathcal{D}}v^{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}_{(l)}z^{(l)}_{i_1;\alpha_1}z^{(l)}_{i_1;\alpha_2}z^{(l)}_{i_2;\alpha_3}z^{(l)}_{i_2;\alpha_4},\end{align}$$ seems to naively scale like $\sim n^2_l/n_{l-1}=O(n)$, since there are two sums over $n_l$, and we know from (82) that the coupling $v_{(l)}$ scales like $\sim 1/n_{l-1}$. Similarly, the quadratic term, $$\begin{align} \frac{1}{2}\sum^{n_l}_{i=1}\sum_{\alpha_1,\alpha_2\in\mathcal{D}}g^{\alpha_1\alpha_2}_{(l)}z^{(l)}_{i;\alpha_1}z^{(l)}_{i;\alpha_2},\end{align}$$ has a single sum over $n_l$ and so seems naively $O(n)$ as well. This would imply that the quartic term isn’t perturbatively suppressed in comparison to the quadratic term, naively calling our perturbative approach into question.



Running couplings with partial marginalizations

In focusing our attention on only a subset of samples or neurons, the data-dependent couplings of the action need to be adjusted. Since this running of the couplings is instructive and will be necessary for later computations, let us illustrate here how the quadratic coupling $g^{\alpha_1\alpha_2}_{(l),m_l}$ depends on the number of neurons $m_l$ in the action.


For simplicity in our illustration, let us specialize to a single input $x$ and drop all the sample indices. Then, denote the distribution over $m_l$ neurons as $$\begin{align}p\left( z^{(l)}_1,\cdots,z^{(l)}_{m_l}\right) \propto e^{-S\left( z^{(l)}_1,\cdots,z^{(l)}_{m_l}\right)}=\exp \left[ -\frac{g_{(l),m_l}}{2}\sum^{m_l}_{j=1}z^{(l)}_jz^{(l)}_j+\frac{v_{(l)}}{8}\sum^{m_l}_{j_1,j_2=1}z^{(l)}_{j_1}z^{(l)}_{j_1}z^{(l)}_{j_2}z^{(l)}_{j_2}\right],\end{align}$$ which is expressed by the same action we've already been using (80), though now the dependence of the quadratic coupling on $m_l$ is made explicit. We'll now see in two ways how the quadratic coupling $g_{(l),m_l}$ runs with $m_l$.


The first way is to begin with the action for $n_l$ neurons and formally integrate out $(n_l-m_l)$ neurons. Without loss of generality, let's integrate out the last $(n_l-m_l)$ neurons, leaving the first $m_l$ neurons labeled as $1,\cdots,m_l$. Using the marginalization rule (93), we see that $$\begin{align}e^{-S\left( z^{(l)}_1,\cdots,z^{(l)}_{m_l}\right)} \propto p\left( z^{(l)}_1,\cdots,z^{(l)}_{m_l}\right) =\int dz^{(l)}_{m_l+1}\cdots dz^{(l)}_{n_l}p\left( z^{(l)}_1,\cdots,z^{(l)}_{n_l}\right) \propto \int dz^{(l)}_{m_l+1}\cdots dz^{(l)}_{n_l} \exp \left[ -\frac{g_{(l),n_l}}{2}\sum^{n_l}_{i=1}z^{(l)}_iz^{(l)}_i+\frac{v_{(l)}}{8}\sum^{n_l}_{i_1,i_2=1}z^{(l)}_{i_1}z^{(l)}_{i_1}z^{(l)}_{i_2}z^{(l)}_{i_2}\right],\end{align}$$ throughout which we neglected normalization factors that are irrelevant if we’re just interested in the running of the coupling. Next, we can separate out the dependence on the $m_l$ neurons, perturbatively expand the integrand in the quartic coupling, and finally integrate out the last $(n_l-m_l)$ neurons by computing a few simple Gaussian integrals: $$p\left( z^{(l)}_1,\cdots,z^{(l)}_{m_l}\right)\propto \exp \left[ -\frac{g_{(l),n_l}}{2}\sum^{m_l}_{j=1}z^{(l)}_jz^{(l)}_j+\frac{v_{(l)}}{8}\sum^{m_l}_{j_1,j_2=1}z^{(l)}_{j_1}z^{(l)}_{j_1}z^{(l)}_{j_2}z^{(l)}_{j_2}\right] \times \int dz^{(l)}_{m_l+1}\cdots dz^{(l)}_{n_l} \exp \left[ -\frac{g_{(l),n_l}}{2}\sum^{n_l}_{k=m_l+1}z^{(l)}_kz^{(l)}_k\right]\times \left[ 1+\frac{2v_{(l)}}{8}\sum^{m_l}_{j=1}\sum^{n_l}_{k=m_l+1}z^{(l)}_{j}z^{(l)}_{j}z^{(l)}_{k}z^{(l)}_{k}+\frac{v_{(l)}}{8}\sum^{n_l}_{k_1,k_2=m_l+1}z^{(l)}_{k_1}z^{(l)}_{k_1}z^{(l)}_{k_2}z^{(l)}_{k_2}+O\left( v^2\right)\right] $$ $$= \exp\left[  -\frac{g_{(l),n_l}}{2}\sum^{m_l}_{j=1}z^{(l)}_jz^{(l)}_j+\frac{v_{(l)}}{8}\sum^{m_l}_{j_1,j_2=1}z^{(l)}_{j_1}z^{(l)}_{j_1}z^{(l)}_{j_2}z^{(l)}_{j_2}\right]\times \left\{ 1+ \frac{(n_l-m_l)}{4}\frac{v_{(l)}}{g_{(l),n_l}}\left( \sum^{m_l}_{j=1}z^{(l)}_jz^{(l)}_j\right)+\frac{v_{(l)}}{8g^2_{(l),n_l}}\left[ (n_l-m_l)^2+2(n_l-m_l)\right] +O\left( v^2\right)\right\}.$$ Finally, resumming the correction proportional to $\sum^{m_l}_{j=1}z^{(l)}_jz^{(l)}_j$ back into the exponential, ignoring the overall proportionality factor, and comparing with the action for $m_l$ neurons (97), we find $$\begin{align}g_{(l),m_l}=g_{(l),n_l}-\frac{(n_l-m_l)}{2}\frac{v_{(l)}}{g_{(l),n_l}}+O\left( v^2\right)\end{align}$$ as the running equation for the quadratic coupling.


The second way to see the coupling run, and to find a solution of the running equation (100), is to compute the single-input metric $G^{(l)}\equiv \mathbb{E}\left[ z^{(l)}_iz^{(l)}_i\right]$ directly using the $m_l$-neuron action (97). We've already computed this in (47) using the quartic action for multiple inputs. Specializing to a single input, considering an action of $m_l$ neurons, and being explicit about the dependence of the quadratic coupling on the number of neurons, we get $$\begin{align}G^{(l)}=\left[ \frac{1}{g_{(l),m_l}}+\frac{(m_l+2)}{2}\frac{v_{(l)}}{g^3_{(l),m_l}}\right] +O\left( v^2\right).\end{align}$$ Solving this equation for $g_{(l),m_l}$ by perturbatively expanding in $v_{(l)}$, we find $$\begin{align} \frac{1}{g_{(l),m_l}}=G^{(l)}-\frac{(m_l+2)}{2}\frac{V^{(l)}}{n_{l-1}G^{(l)}}+O\left( \frac{1}{n^2}\right),\end{align}$$ where we have also plugged in $$\begin{align}v_{(l)}=\frac{V^{(l)}}{n_{l-1}\left(G^{(l)}\right)^4}+O\left( \frac{1}{n^2}\right),\end{align}$$ using (82) and (83) to relate the quartic coupling to the four-point vertex, again specializing to a single input. Now, it's easy to check that this expression (102) solves the running equation (100).
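That check is quick to automate. A sympy sketch under the same conventions, with the quartic coupling $v\equiv v_{(l)}$ as the small parameter and $V^{(l)}/n_{l-1}$ traded for $vG^4$ via (103); expanding both sides of (100) shows the difference is $O(v^2)$:

```python
import sympy as sp

G, v, m, n = sp.symbols('G v m n', positive=True)

def g(k):
    # 1/g_{(l),k} = G - (k+2)/2 * v * G^3, i.e. (102) with V^{(l)}/n_{l-1} = v G^4
    return 1 / (G - (k + 2) / 2 * v * G**3)

# running equation (100): g_m = g_n - (n - m)/2 * v / g_n + O(v^2)
diff = g(m) - (g(n) - (n - m) / 2 * v / g(n))
print(sp.series(diff, v, 0, 2))   # prints O(v**2): the relation holds at this order
```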


The key step in this alternative derivation is realizing that observables without any neural indices, such as $G^{(l)}$, should not depend on which version of the action we use to compute them. Interpreted another way, this running of the coupling means that for different numbers of neurons in a layer $l$ – e.g. $m_l$ and $n_l$ – we need different quadratic couplings – in this case $g_{(l),m_l}$ and $g_{(l),n_l}$ – in order to get the correct value of an $l$-th-layer observable such as $G^{(l)}$. If you’re ever in doubt, it’s always safest to express an observable of interest in terms of the metric $G^{(l)}$ and the four-point vertex $V^{(l)}$ rather than the couplings.



4.5 Subleading Corrections



4.6 RG Flow and RG Flow

The goal of this chapter was to find the marginal distribution of preactivations $p\left( z^{(l)}|\mathcal{D}\right)$ in a given layer $l$ in terms of an effective action with data-dependent couplings. These couplings change (run) from layer to layer, and the running is determined via recursions, which in turn determine how the distribution of preactivations changes with depth. Equivalently, these recursions tell us how correlators of preactivations evolve from layer to layer.

Expressing the two-point correlator of preactivations in terms of the kernel $K^{(l)}_{\alpha_1\alpha_2}$ as $$\mathbb{E}\left[ z^{(l)}_{i_1;\alpha_1}z^{(l)}_{i_2;\alpha_2}\right]=\delta_{i_1i_2}G^{(l)}_{\alpha_1\alpha_2}=\delta_{i_1i_2}\left[ K^{(l)}_{\alpha_1\alpha_2}+O\left( \frac{1}{n}\right)\right],$$ and expressing the four-point connected correlator in terms of the four-point vertex $V^{(l)}_{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}$ as $$ \mathbb{E}\left.\left[ z^{(l)}_{i_1;\alpha_1}z^{(l)}_{i_2;\alpha_2}z^{(l)}_{i_3;\alpha_3}z^{(l)}_{i_4;\alpha_4}\right]\right|_\mbox{connected} \nonumber\\ =\frac{1}{n_{l-1}}\left[ \delta_{i_1i_2}\delta_{i_3i_4}V^{(l)}_{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}+\delta_{i_1i_3}\delta_{i_2i_4}V^{(l)}_{(\alpha_1\alpha_3)(\alpha_2\alpha_4)} +\delta_{i_1i_4}\delta_{i_2i_3}V^{(l)}_{(\alpha_1\alpha_4)(\alpha_2\alpha_3)}\right],$$ the running of these correlators is given by the recursions $$K^{(l+1)}_{\alpha_1\alpha_2}=C^{(l+1)}_b+C^{(l+1)}_W\left\langle \sigma_{\alpha_1}\sigma_{\alpha_2}\right\rangle_{K^{(l)}},$$

$$V^{(l+1)}_{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}=\left( C^{(l+1)}_W\right)^2[\langle\sigma_{\alpha_1}\sigma_{\alpha_2}\sigma_{\alpha_3}\sigma_{\alpha_4}\rangle_{K^{(l)}}-\langle\sigma_{\alpha_1}\sigma_{\alpha_2}\rangle_{K^{(l)}}\langle\sigma_{\alpha_3}\sigma_{\alpha_4}\rangle_{K^{(l)}}]+\frac{1}{4}\frac{n_l}{n_{l-1}}\left(C^{(l+1)}_W\right)^2\sum_{\beta_1,\cdots,\beta_4\in \mathcal{D}}V^{(\beta_1\beta_2)(\beta_3\beta_4)}_{(l)}\langle\sigma_{\alpha_1}\sigma_{\alpha_2}(z_{\beta_1}z_{\beta_2}-K^{(l)}_{\beta_1\beta_2})\rangle_{K^{(l)}}\langle\sigma_{\alpha_3}\sigma_{\alpha_4}(z_{\beta_3}z_{\beta_4}-K^{(l)}_{\beta_3\beta_4})\rangle_{K^{(l)}} \\ +O\left( \frac{1}{n}\right),$$

$$V^{(\alpha_1\alpha_2)(\alpha_3\alpha_4)}_{(l)}\equiv \sum_{\beta_1,\cdots,\beta_4}G^{\alpha_1\beta_1}_{(l)}G^{\alpha_2\beta_2}_{(l)}G^{\alpha_3\beta_3}_{(l)}G^{\alpha_4\beta_4}_{(l)}V^{(l)}_{(\beta_1\beta_2)(\beta_3\beta_4)}=\sum_{\beta_1,\cdots,\beta_4}K^{\alpha_1\beta_1}_{(l)}K^{\alpha_2\beta_2}_{(l)}K^{\alpha_3\beta_3}_{(l)}K^{\alpha_4\beta_4}_{(l)}V^{(l)}_{(\beta_1\beta_2)(\beta_3\beta_4)}+O\left( \frac{1}{n}\right).$$ These recursions dictate how the statistics of preactivations flow with depth.
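At leading order, the kernel recursion (118) is something we can actually iterate numerically. A minimal sketch for a single input (so $K^{(l)}$ is a scalar), with tanh as the assumed activation, Gauss-Hermite quadrature for the Gaussian expectation, and illustrative values of $C_b$ and $C_W$:

```python
import numpy as np

def next_kernel(K, C_b, C_W, sigma=np.tanh, n_quad=64):
    """One step of the single-input kernel recursion (118):
    K^{(l+1)} = C_b + C_W <sigma(z)^2>_{K^{(l)}} with z ~ N(0, K^{(l)})."""
    t, w = np.polynomial.hermite_e.hermegauss(n_quad)   # nodes/weights for e^{-t^2/2}
    expect = np.sum(w * sigma(np.sqrt(K) * t) ** 2) / np.sqrt(2 * np.pi)
    return C_b + C_W * expect

K = 1.0
for l in range(1, 11):          # flow the kernel through ten layers
    K = next_kernel(K, C_b=0.0, C_W=2.0)
    print(l, K)
```

Varying $C_W$ in this sketch shows the kernel either decaying, blowing up, or flowing to a fixed point, which is exactly the depth-flow behavior these recursions encode.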


[In physics] The degrees of freedom are represented by a field $\phi(x)$ that may take different values as a function of the spacetime coordinate $x$. First, one divides $\phi(x)$ into fine-grained variables $\phi^+$ consisting of high-frequency modes and coarse-grained variables $\phi^-$ consisting of low-frequency modes, such that the field decomposes as $\phi(x)=\phi^+(x)+\phi^-(x)$. The full distribution is governed by the full action $$\begin{align} S_{full}(\phi)=S(\phi^+)+S(\phi^-)+S_I(\phi^+,\phi^-),\end{align}$$ where in particular the last term describes the interactions between these two sets of modes. In order to obtain an effective description in terms of only the coarse-grained variables $\phi^-$, we can integrate out (i.e., marginalize over) the fine-grained variables $\phi^+$ as $$\begin{align}e^{-S_{eff}(\phi^-)}=\int d\phi^+ e^{-S_{full}(\phi)},\end{align}$$ and obtain an effective action $S_{eff}(\phi^-)$, providing an effective theory for the observables of experimental interest. In practice, this marginalization is carried out scale by scale, dividing up the field as $\phi=\phi^{(1)}+\cdots+\phi^{(L)}$ from microscopic modes $\phi^{(1)}$ all the way to macroscopic modes $\phi^{(L)}=\phi^-$, and then integrating out the variables $\phi^{(1)},\cdots,\phi^{(L-1)}$ in sequence. Tracking the flow of couplings in the effective action through this marginalization yields the beta functions, and solving these differential equations up to the scale of interest gives an effective description of observables at that scale.


This is precisely what we have been doing in this chapter for neural networks. The full field $\phi$ is analogous to the collection of all the preactivations $\left\{ z^{(1)},\cdots,z^{(L)}\right\}$. Their distribution is governed by the full joint distribution of preactivations $$\begin{align}p\left( z^{(1)},\cdots,z^{(L)}|\mathcal{D}\right)=p\left( z^{(L)}|z^{(L-1)}\right)\cdots p\left( z^{(2)}|z^{(1)}\right) p\left( z^{(1)}|\mathcal{D}\right),\end{align}$$ with the full action $$\begin{align}S_{full}\left( z^{(1)},\cdots,z^{(L)}\right)\equiv \sum^L_{l=1}S_M\left( z^{(l)}\right) +\sum^{L-1}_{l=1} S_I\left( z^{(l+1)}|z^{(l)}\right).\end{align}$$ Here, the full action is decomposed into the mean quadratic action for the variables $z^{(l)}$ $$\begin{align}S_M\left( z^{(l)}\right)=\frac{1}{2}\sum^{n_l}_{i=1} \sum_{\alpha_1,\alpha_2\in \mathcal{D}} G^{\alpha_1\alpha_2}_{(l)}z^{(l)}_{i;\alpha_1}z^{(l)}_{i;\alpha_2}\end{align}$$ in terms of the mean metric $G^{(l)}$, (72), and the interaction between neighboring layers $$\begin{align} S_I\left( z^{(l+1)}|z^{(l)}\right) =\frac{1}{2}\sum^{n_{l+1}}_{i=1}\sum_{\alpha_1,\alpha_2\in\mathcal{D}}\left[ \hat{G}^{\alpha_1\alpha_2}_{(l+1)}\left( z^{(l)}\right) -G^{\alpha_1\alpha_2}_{(l+1)}\right] z^{(l+1)}_{i;\alpha_1}z^{(l+1)}_{i;\alpha_2}.\end{align}$$ Here we emphasized that the stochastic metric $\hat{G}^{(l+1)}_{\alpha_1\alpha_2}$ is a function of $z^{(l)}$, and the induced coupling of $z^{(l)}$ with $z^{(l+1)}$ is what leads to the interlayer interactions.


Now, if all we care about are observables that depend only on the outputs of the network - which includes a very important observable... the output! - then this full description is too cumbersome. In order to obtain an effective description of the distribution of outputs $z^{(L)}$, we can marginalize over all the features $\left\{ z^{(1)},\cdots,z^{(L-1)}\right\}$ as $$\begin{align}e^{-S_{eff}\left( z^{(L)}\right)}=\int \left[ \prod^{L-1}_{l=1}dz^{(l)}\right] e^{-S_{full}\left( z^{(1)},\cdots,z^{(L)}\right)},\end{align}$$ just as we integrated out the fine-grained modes $\phi^+$ in (122) to get the effective description in terms of the coarse-grained modes $\phi^-$. And, just like in the field theory example, rather than carrying out this marginalization all at once, we proceeded sequentially, integrating out the preactivations layer by layer. This resulted in the recursion relations (118) and (119), and in solving these recursion relations up to the depth of interest, we get an effective description of the neural-network output at that depth.


In the language of RG flow, couplings that grow with the flow are called relevant, and those that shrink are called irrelevant. With the recursions above in hand, we can determine whether the four-point vertex $V^{(l)}$ is relevant or irrelevant.