[Deep Learning Theory] 3. Effective Theory of Deep Linear Networks at Initialization

  This article is part of the Textbook Commentary Project.


We introduce and then solve a toy model of deep learning, the deep linear network.


3.1 Deep Linear Networks

A deep linear network iteratively transforms an input $x_{i;\alpha}$ through a sequence of simple linear transformations $$z^{(l+1)}_{i;\alpha}=b^{(l+1)}_i+\sum^{n_l}_{j=1}W_{ij}^{(l+1)}z_{j;\alpha}^{(l)},$$ with $z^{(0)}_{i;\alpha}\equiv x_{i;\alpha}$ and $z^{(l)}_{i;\alpha}\equiv z_i^{(l)}(x_\alpha)$. Since the linear activation function is the identity function, $\sigma(z)=z$, there's no distinction here between preactivations and activations.


In this chapter, we'll simplify matters a bit by turning off all the biases, $b_i^{(l)}=0$, so that the preactivations in layer $l$ are simply given by repeated matrix multiplication of weight matrices as $$z^{(l)}_{i;\alpha}=\sum_{j_0=1}^{n_0}\sum_{j_1=1}^{n_1}\cdots \sum_{j_{l-1}=1}^{n_{l-1}} W^{(l)}_{ij_{l-1}}W^{(l-1)}_{j_{l-1}j_{l-2}}\cdots W^{(1)}_{j_1j_0}x_{j_0;\alpha}\equiv \sum_{j=1}^{n_0}\mathcal{W}^{(l)}_{ij}x_{j;\alpha}.$$ Here we have introduced an $n_l$-by-$n_0$ matrix $$\mathcal{W}^{(l)}_{ij}=\sum_{j_1=1}^{n_1}\cdots\sum_{j_{l-1}=1}^{n_{l-1}}W_{ij_{l-1}}^{(l)}W_{j_{l-1}j_{l-2}}^{(l-1)}\cdots W_{j_1j}^{(1)},$$ which highlights the fact that the preactivation at the $l$-th layer is simply a linear transformation of the input. Additionally, let us set $C^{(l)}_W\equiv C_W$ so that the order-one part of the weight variance is layer independent. The initialization distribution over the weights is then characterized by the expectations $$\mathbb{E}\left[ W_{ij}^{(l)}\right] =0,\quad \mathbb{E}\left[ W^{(l)}_{i_1j_1}W^{(l)}_{i_2j_2}\right] =\delta_{i_1i_2}\delta_{j_1j_2}\frac{C_W}{n_{l-1}}.$$
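
As a concrete illustration (my own sketch, not from the text), here is a minimal NumPy implementation of a deep linear network at initialization; the widths, seed, and helper names like `init_weights` and `forward` are arbitrary choices:

```python
import numpy as np

def init_weights(widths, C_W, rng):
    """Draw each W^(l) i.i.d. Gaussian with mean 0 and variance C_W / n_{l-1}; biases are zero."""
    return [rng.normal(0.0, np.sqrt(C_W / widths[l]), size=(widths[l + 1], widths[l]))
            for l in range(len(widths) - 1)]

def forward(weights, x):
    """Iterate z^(l+1) = W^(l+1) z^(l) starting from z^(0) = x; return all preactivations."""
    zs = [x]
    for W in weights:
        zs.append(W @ zs[-1])
    return zs

rng = np.random.default_rng(0)
widths = [10, 50, 50, 50, 10]        # n_0, n_1, ..., n_L (illustrative choices)
weights = init_weights(widths, C_W=1.0, rng=rng)
x = rng.normal(size=widths[0])       # a single input x_alpha
zs = forward(weights, x)
print([z.shape for z in zs])         # one preactivation vector per layer
```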


Even though each weight matrix $W_{ij}^{(l)}$ has a Gaussian distribution, the product matrix $\mathcal{W}^{(l)}_{ij}$ is non-Gaussian.


We are going to compute the nontrivial distribution $$p\left( z^{(l)}|\mathcal{D}\right) \equiv p\left( z^{(l)}(x_1),\cdots, z^{(l)}(x_{N_\mathcal{D}})\right),$$ of the preactivations $z_{i;\alpha}^{(l)}\equiv z_i^{(l)}(x_\alpha)$.


Let's compute the mean of the preactivation $z^{(l)}_{i;\alpha}$. Since the weight matrices are mutually independent and have zero mean, $$\mathbb{E}\left[ z^{(l)}_{i;\alpha}\right] = \sum_{j_0=1}^{n_0}\sum_{j_1=1}^{n_1}\cdots \sum_{j_{l-1}=1}^{n_{l-1}} \mathbb{E}\left[ W^{(l)}_{ij_{l-1}}W^{(l-1)}_{j_{l-1}j_{l-2}}\cdots W^{(1)}_{j_1j_0}x_{j_0;\alpha}\right] \\ =\sum_{j_0=1}^{n_0}\sum_{j_1=1}^{n_1}\cdots \sum_{j_{l-1}=1}^{n_{l-1}} \mathbb{E}\left[ W^{(l)}_{ij_{l-1}}\right] \mathbb{E}\left[ W^{(l-1)}_{j_{l-1}j_{l-2}}\right] \cdots \mathbb{E} \left[W^{(1)}_{j_1j_0}\right]x_{j_0;\alpha}=0, $$ and by the same reasoning all odd-point correlators vanish.



3.2 Criticality

Let's calculate the two-point correlator $\mathbb{E}\left[ z^{(l)}_{i_1;\alpha_1}z^{(l)}_{i_2;\alpha_2}\right]$.


Math: recursion for the two-point correlator

Writing the first-layer preactivations in terms of the inputs as $$z^{(1)}_{i;\alpha}=\sum^{n_0}_{j=1} W^{(1)}_{ij}x_{j;\alpha},$$ we can express the two-point correlator as $$\mathbb{E}\left[ z^{(1)}_{i_1;\alpha_1}z^{(1)}_{i_2;\alpha_2}\right]=\sum_{j_1,j_2=1}^{n_0}\mathbb{E}\left[ W^{(1)}_{i_1j_1}x_{j_1;\alpha_1}W^{(1)}_{i_2j_2}x_{j_2;\alpha_2}\right]\\ =\sum_{j_1,j_2=1}^{n_0}\mathbb{E}\left[ W^{(1)}_{i_1j_1}W^{(1)}_{i_2j_2}\right] x_{j_1;\alpha_1}x_{j_2;\alpha_2}\\ =\sum_{j_1,j_2=1}^{n_0}\frac{C_W}{n_0}\delta_{i_1i_2}\delta_{j_1j_2}x_{j_1;\alpha_1}x_{j_2;\alpha_2}=\delta_{i_1i_2}C_W\frac{1}{n_0}\sum_{j=1}^{n_0}x_{j;\alpha_1}x_{j;\alpha_2}\\ = \delta_{i_1i_2}C_WG^{(0)}_{\alpha_1\alpha_2},$$ where to go from the second line to the third line we Wick-contracted the two weights and inserted the variance, and where we have introduced the notation $$G_{\alpha_1\alpha_2}^{(0)}\equiv \frac{1}{n_0}\sum^{n_0}_{i=1} x_{i;\alpha_1}x_{i;\alpha_2},$$ for the inner product of two inputs.


Next, let's extend this to the next layer: $$\mathbb{E}\left[ z^{(l+1)}_{i_1;\alpha_1}z^{(l+1)}_{i_2;\alpha_2}\right]=\sum_{j_1,j_2=1}^{n_l}\mathbb{E}\left[ W^{(l+1)}_{i_1j_1}W^{(l+1)}_{i_2j_2}z^{(l)}_{j_1;\alpha_1}z^{(l)}_{j_2;\alpha_2}\right]\\ =\sum_{j_1,j_2=1}^{n_l}\mathbb{E}\left[ W^{(l+1)}_{i_1j_1}W^{(l+1)}_{i_2j_2}\right] \mathbb{E}\left[ z^{(l)}_{j_1;\alpha_1}z^{(l)}_{j_2;\alpha_2}\right]\\ =\delta_{i_1i_2}C_W\frac{1}{n_l}\sum_{j=1}^{n_l}\mathbb{E}\left[ z^{(l)}_{j;\alpha_1}z^{(l)}_{j;\alpha_2}\right]\\ \equiv \delta_{i_1i_2}G^{(l+1)}_{\alpha_1\alpha_2},$$ where $$G^{(l)}_{\alpha_1\alpha_2}\equiv\frac{1}{n_l}\sum_{j=1}^{n_l}\mathbb{E}\left[ z^{(l)}_{j;\alpha_1}z^{(l)}_{j;\alpha_2}\right].$$


Finally, we can summarize this as the recursion and its solution $$G^{(l+1)}_{\alpha_1\alpha_2}=C_WG_{\alpha_1\alpha_2}^{(l)},\quad G^{(l)}_{\alpha_1\alpha_2}=(C_W)^lG_{\alpha_1\alpha_2}^{(0)}.$$
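
A quick Monte Carlo check of this solution (my own sketch; the widths and sample counts are arbitrary): averaging $\frac{1}{n_l}\sum_j z^{(l)}_j z^{(l)}_j$ over many weight draws for a single input should reproduce $C_W^l\,G^{(0)}$ layer by layer.

```python
import numpy as np

def two_point_G(widths, C_W, x, n_samples=5000, seed=0):
    """Monte Carlo estimate of G^(l) = (1/n_l) sum_j E[z_j^(l) z_j^(l)] for a single input x."""
    rng = np.random.default_rng(seed)
    G = np.zeros(len(widths))
    G[0] = x @ x / widths[0]                       # G^(0), the input norm
    acc = np.zeros(len(widths) - 1)
    for _ in range(n_samples):
        z = x
        for l in range(1, len(widths)):
            W = rng.normal(0.0, np.sqrt(C_W / widths[l - 1]),
                           size=(widths[l], widths[l - 1]))
            z = W @ z
            acc[l - 1] += z @ z / widths[l]
    G[1:] = acc / n_samples
    return G

widths = [10, 30, 30, 30, 30]
x = np.random.default_rng(1).normal(size=widths[0])
C_W = 1.0
G = two_point_G(widths, C_W, x)
for l, g in enumerate(G):
    print(l, round(g, 4), round(C_W**l * G[0], 4))   # estimate vs exact C_W^l G^(0)
```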


Physics: criticality

1. If $C_W>1$, the covariance blows up exponentially, quickly being driven to a fixed point $G^\star_{\alpha_1\alpha_2}=\infty$ for all pairs of inputs and leading to a divergent network output.  (Trivial fixed point)

2. If $C_W<1$, the covariance exponentially decays to a fixed point $G^\star_{\alpha_1\alpha_2}=0$ for all pairs of inputs, quickly curtailing any data dependence in the network output. (Trivial fixed point)

3. If $C_W=1$, the full covariance of the input data is preserved: $G^{(l)}_{\alpha_1\alpha_2}=G^{(0)}_{\alpha_1\alpha_2}\equiv G^\star_{\alpha_1\alpha_2}$. (Nontrivial fixed point)

When we fine-tune the initialization hyperparameters of a network so that the covariance avoids exponential behavior, we'll call them critical initialization hyperparameters.


Note that the diagonal part of the covariance at the output layer $L$ estimates the typical magnitude of the output for a given input $x_{i;\alpha}$ $$G^{(L)}_{\alpha\alpha} = \mathbb{E}\left[ \frac{1}{n_L}\sum^{n_L}_{j=1}\left( z_{j;\alpha}^{(L)}\right)^2\right].$$ 
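
To see the three regimes concretely, here is a small sketch (my own, with illustrative widths and depths) that estimates this output-layer magnitude $G^{(L)}_{\alpha\alpha}$ for a sub-critical, critical, and super-critical choice of $C_W$:

```python
import numpy as np

def output_magnitude(C_W, n0=10, n=50, L=40, n_samples=200, seed=3):
    """Estimate G^(L)_{aa} = E[(1/n_L) sum_j (z_j^(L))^2] for one fixed input."""
    rng = np.random.default_rng(seed)
    x = np.ones(n0)                              # fixed input with G^(0)_{aa} = 1
    widths = [n0] + [n] * L
    total = 0.0
    for _ in range(n_samples):
        z = x
        for l in range(1, len(widths)):
            W = rng.normal(0.0, np.sqrt(C_W / widths[l - 1]),
                           size=(widths[l], widths[l - 1]))
            z = W @ z
        total += z @ z / widths[-1]
    return total / n_samples

for C_W in (0.9, 1.0, 1.1):
    print(C_W, output_magnitude(C_W))            # roughly C_W**40: decay, constant, blow-up
```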



3.3 Fluctuations

How does Gaussianity (or non-Gaussianity) show up in the correlators? Let's look at the four-point correlator.


In this section and the next, to simplify the algebra we'll focus on correlators of preactivations that are evaluated only on a single input $x_\alpha=x$. This is sufficient to qualitatively highlight the importance of the higher-point correlators while letting us avoid the interference of some annoying technical manipulations. Accordingly, in these sections we will drop the sample indices on preactivations and denote the covariance as $$G^{(l)}_2\equiv G^{(l)}_{\alpha\alpha}=G^{(l)}(x,x).$$


Math: recursion for the four-point correlator

The full four-point correlator in the first layer is $$\mathbb{E}\left[ z^{(1)}_{i_1}z^{(1)}_{i_2}z^{(1)}_{i_3}z^{(1)}_{i_4}\right] \\ =\sum_{j_1,j_2,j_3,j_4=1}^{n_0} \mathbb{E}\left[ W^{(1)}_{i_1j_1}W^{(1)}_{i_2j_2}W^{(1)}_{i_3j_3}W^{(1)}_{i_4j_4}\right]x_{j_1}x_{j_2}x_{j_3}x_{j_4}\\ =\frac{C_W^2}{n_0^2}\sum_{j_1,j_2,j_3,j_4=1}^{n_0} (\delta_{i_1i_2}\delta_{j_1j_2}\delta_{i_3i_4}\delta_{j_3j_4}+\delta_{i_1i_3}\delta_{j_1j_3}\delta_{i_2i_4}\delta_{j_2j_4}+\delta_{i_1i_4}\delta_{j_1j_4}\delta_{i_2i_3}\delta_{j_2j_3})x_{j_1}x_{j_2}x_{j_3}x_{j_4}\\ = C^2_W (\delta_{i_1i_2}\delta_{i_3i_4}+\delta_{i_1i_3}\delta_{i_2i_4}+\delta_{i_1i_4}\delta_{i_2i_3})\left( G_2^{(0)}\right)^2,$$ where to go from line two to line three, we made the three distinct pairings for the two Wick contractions of the four weights, and then used the weight variance to evaluate each contraction. To get to the final line, we evaluated the sums over the $j$ indices and then substituted our definition of the inner product, which for a single input simply reads $$G_2^{(0)}=\frac{1}{n_0}\sum_{j=1}^{n_0}x_jx_j.$$ Comparing with the two-point correlator, deep linear networks appear to be simply Gaussian after a single layer, at least at the level of the four-point correlator.


This Gaussianity does not hold in deeper layers. To see that, let's derive and solve a recursion for the four-point correlator. Beginning with the iteration equation with zero bias, we find $$\mathbb{E}\left[ z^{(l+1)}_{i_1}z^{(l+1)}_{i_2}z^{(l+1)}_{i_3}z^{(l+1)}_{i_4}\right] \\ =\sum_{j_1,j_2,j_3,j_4=1}^{n_l} \mathbb{E}\left[ W^{(l+1)}_{i_1j_1}W^{(l+1)}_{i_2j_2}W^{(l+1)}_{i_3j_3}W^{(l+1)}_{i_4j_4}\right]\mathbb{E}\left[ z^{(l)}_{j_1}z^{(l)}_{j_2}z^{(l)}_{j_3}z^{(l)}_{j_4}\right]\\ =\frac{C_W^2}{n_l^2}\sum_{j_1,j_2,j_3,j_4=1}^{n_l} (\delta_{i_1i_2}\delta_{j_1j_2}\delta_{i_3i_4}\delta_{j_3j_4}+\delta_{i_1i_3}\delta_{j_1j_3}\delta_{i_2i_4}\delta_{j_2j_4}+\delta_{i_1i_4}\delta_{j_1j_4}\delta_{i_2i_3}\delta_{j_2j_3}) \,\mathbb{E}\left[ z^{(l)}_{j_1}z^{(l)}_{j_2}z^{(l)}_{j_3}z^{(l)}_{j_4}\right]\\ = C^2_W (\delta_{i_1i_2}\delta_{i_3i_4}+\delta_{i_1i_3}\delta_{i_2i_4}+\delta_{i_1i_4}\delta_{i_2i_3})\frac{1}{n_l^2}\sum_{j,k=1}^{n_l}\mathbb{E}\left[ z^{(l)}_{j}z^{(l)}_{j}z^{(l)}_{k}z^{(l)}_{k}\right].$$ Since the full four-point correlator is proportional to the factor $(\delta_{i_1i_2}\delta_{i_3i_4}+\delta_{i_1i_3}\delta_{i_2i_4}+\delta_{i_1i_4}\delta_{i_2i_3})$, we can define $$\mathbb{E}\left[ z^{(l)}_{i_1}z^{(l)}_{i_2}z^{(l)}_{i_3}z^{(l)}_{i_4}\right]\equiv(\delta_{i_1i_2}\delta_{i_3i_4}+\delta_{i_1i_3}\delta_{i_2i_4}+\delta_{i_1i_4}\delta_{i_2i_3})G^{(l)}_4,$$ which for the first layer gives $$G_4^{(1)}=C_W^2\left( G^{(0)}_2\right)^2.$$ For a general layer we use the recursion: evaluating $$\frac{1}{n_l^2}\sum_{j,k=1}^{n_l}\mathbb{E}\left[ z^{(l)}_{j}z^{(l)}_{j}z^{(l)}_{k}z^{(l)}_{k}\right]=\frac{1}{n_l^2}\sum_{j,k=1}^{n_l}(\delta_{jj}\delta_{kk}+\delta_{jk}\delta_{jk}+\delta_{jk}\delta_{kj})G_4^{(l)}= \left( 1+\frac{2}{n_l}\right)G_4^{(l)},$$ we obtain $$G_4^{(l+1)}=C_W^2\left( 1+\frac{2}{n_l}\right)G_4^{(l)},\\ G_4^{(l)}=C_W^{2l}\left[ \prod^{l-1}_{l'=1}\left( 1+\frac{2}{n_{l'}}\right)\right]\left(G^{(0)}_2\right)^2=\left[ \prod^{l-1}_{l'=1}\left( 1+\frac{2}{n_{l'}}\right)\right]\left(G^{(l)}_2\right)^2.$$
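
The four-point recursion can also be checked numerically. The following sketch (my own; widths, depth, and sample counts are arbitrary) estimates $G^{(L)}_4$ from $\mathbb{E}[z_j z_j z_k z_k]$ with $j\ne k$, for which only the $\delta_{jj}\delta_{kk}$ term of the decomposition survives, and compares it with the exact solution above.

```python
import numpy as np

def estimate_G4(widths, C_W, n_samples=50000, seed=4):
    """Monte Carlo estimate of G_4 at the output layer via E[z_j z_j z_k z_k] for j != k."""
    rng = np.random.default_rng(seed)
    x = np.ones(widths[0])                        # single input with G_2^(0) = 1
    nL = widths[-1]
    est = 0.0
    for _ in range(n_samples):
        z = x
        for l in range(1, len(widths)):
            W = rng.normal(0.0, np.sqrt(C_W / widths[l - 1]),
                           size=(widths[l], widths[l - 1]))
            z = W @ z
        sq = z * z
        # average of z_j^2 z_k^2 over distinct pairs (j, k)
        est += (sq.sum() ** 2 - (sq * sq).sum()) / (nL * (nL - 1))
    return est / n_samples

n, L = 20, 8
widths = [10] + [n] * L
C_W = 1.0
exact = C_W ** (2 * L) * (1 + 2 / n) ** (L - 1)   # times (G_2^(0))^2 = 1
print(estimate_G4(widths, C_W), exact)
```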


Physics: large-$n$ expansion, non-Gaussianities, interactions, and fluctuations

In the limit $n_l\rightarrow \infty$, $$G_4^{(l)}=\left( G_2^{(l)}\right)^2,$$ and the full four-point correlator becomes $$\mathbb{E}\left[ z^{(l)}_{i_1}z^{(l)}_{i_2}z^{(l)}_{i_3}z^{(l)}_{i_4}\right]=(\delta_{i_1i_2}\delta_{i_3i_4}+\delta_{i_1i_3}\delta_{i_2i_4}+\delta_{i_1i_4}\delta_{i_2i_3})\left(G^{(l)}_2\right)^2,$$ which is exactly what Wick's theorem gives for a Gaussian distribution with variance $G^{(l)}_2$.


To see how this limit is approached, let $n_1=n_2=\cdots=n_{L-1}\equiv n$. Then we see $$G^{(l)}_4-\left( G^{(l)}_2\right)^2=\left[ \left( 1+\frac{2}{n}\right)^{l-1}-1\right] \left( G^{(l)}_2\right)^2=\frac{2(l-1)}{n}\left( G_2^{(l)}\right)^2+O\left( \frac{1}{n^2}\right).$$ In particular, at criticality where $G^{(l)}_2$ is constant, this leading correction scales inversely with the width and proportionally with the depth. The depth-to-width ratio is thus an emergent scale, and it has several interpretations:

1. The connected four-point correlator is $$\mathbb{E}\left[ z^{(l)}_{i_1}z^{(l)}_{i_2}z^{(l)}_{i_3}z^{(l)}_{i_4}\right]_\mbox{connected}=(\delta_{i_1i_2}\delta_{i_3i_4}+\delta_{i_1i_3}\delta_{i_2i_4}+\delta_{i_1i_4}\delta_{i_2i_3})\left[G^{(l)}_4-\left(G^{(l)}_2\right)^2\right],$$ so the non-Gaussianity grows with the depth $l$ and is suppressed by the width $n$.

2. A nonzero connected four-point correlator signals a breakdown of statistical independence between neurons. Consider the connected four-point correlator with $i_1=i_2=j$ and $i_3=i_4=k$ for $j\ne k$: $$\mathbb{E}\left[ \left( z^{(l)}_jz^{(l)}_j-G^{(l)}_2\right)\left(z^{(l)}_kz^{(l)}_k-G^{(l)}_2\right)\right] =G^{(l)}_4-\left( G^{(l)}_2\right)^2,\ \mbox{for}\ j\ne k.$$ This shows that the deviation of $z^{(l)}_jz^{(l)}_j$ from its mean value $\mathbb{E}\left[ z^{(l)}_jz^{(l)}_j\right]=G^{(l)}_2$ on a particular neuron $j$ is correlated with the same deviation from the mean on a different neuron $k$. We can thus interpret the emergent scale as the strength of the interactions between distinct neurons.

3. Some observables that are deterministic in the infinite-width limit start to fluctuate at finite width. For example, consider $$\mathcal{O}^{(l)}\equiv \mathcal{O}\left( z^{(l)}\right)\equiv \frac{1}{n}\sum^n_{j=1}z^{(l)}_jz^{(l)}_j,\ \mbox{for}\ l<L,$$ which captures the average magnitude of the preactivations over all the different neurons in a hidden layer $l$ for a given instantiation of the network weights. Its mean over different realizations of the weights is given by the expectation $$\mathbb{E}\left[ \mathcal{O}^{(l)}\right] = \frac{1}{n}\sum^n_{j=1}\mathbb{E}\left[ z^{(l)}_jz^{(l)}_j\right] =G^{(l)}_2,$$ and the magnitude of this observable's fluctuation from instantiation to instantiation is measured by its variance $$\mathbb{E}\left[ \left( \mathcal{O}^{(l)}-\mathbb{E}\left[ \mathcal{O}^{(l)}\right]\right)^2\right] = \frac{1}{n^2}\sum_{j,k=1}^n \mathbb{E}\left[ z^{(l)}_jz^{(l)}_jz^{(l)}_kz^{(l)}_k\right] - \left( G^{(l)}_2\right)^2=\frac{1}{n^2}\sum_{j,k=1}^n(\delta_{jj}\delta_{kk}+\delta_{jk}\delta_{jk}+\delta_{jk}\delta_{kj})G_4^{(l)}-\left(G_2^{(l)}\right)^2=\frac{2}{n}G_4^{(l)}+\left[ G_4^{(l)}-\left(G_2^{(l)}\right)^2\right]=\frac{2l}{n}\left(G_2^{(l)}\right)^2+O\left( \frac{1}{n^2}\right).$$ So $\mathcal{O}^{(l)}$ is deterministic at infinite width, but at finite width it fluctuates from realization to realization with a variance set by the depth-to-width ratio.


All these finite-width effects are proportional to the depth-to-width ratio of the network.
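
The last point is easy to check numerically. The sketch below (my own, with illustrative sizes) estimates the variance of $\mathcal{O}$ at the deepest hidden layer over many weight draws; at criticality with $G^{(0)}_2=1$ it should match the exact finite-width value $(1+2/n)^{L}-1$, whose leading order is $2L/n$.

```python
import numpy as np

def variance_of_O(n0=10, n=50, L=10, C_W=1.0, n_samples=10000, seed=5):
    """Variance over weight draws of O = (1/n) sum_j (z_j^(L))^2, for one fixed input."""
    rng = np.random.default_rng(seed)
    x = np.ones(n0)                               # G_2^(0) = 1
    widths = [n0] + [n] * L
    vals = np.empty(n_samples)
    for s in range(n_samples):
        z = x
        for l in range(1, len(widths)):
            W = rng.normal(0.0, np.sqrt(C_W / widths[l - 1]),
                           size=(widths[l], widths[l - 1]))
            z = W @ z
        vals[s] = z @ z / n
    return vals.var()

n, L = 50, 10
print(variance_of_O(n=n, L=L),        # Monte Carlo estimate
      (1 + 2 / n) ** L - 1,           # exact finite-width value at criticality
      2 * L / n)                      # leading-order prediction 2L/n
```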



3.4 Chaos

Math: recursions for six-point and higher-point correlators

Starting with the first layer, let's compute a general $2m$-point full correlator. Writing the first-layer preactivations in terms of the input, $$\mathbb{E}\left[ z^{(1)}_{i_1}z^{(1)}_{i_2}\cdots z^{(1)}_{i_{2m-1}}z^{(1)}_{i_{2m}}\right] = \sum_{j_1,\cdots,j_{2m}=1}^{n_0}\mathbb{E}\left[ W^{(1)}_{i_1j_1}W^{(1)}_{i_2j_2}\cdots W^{(1)}_{i_{2m-1}j_{2m-1}}W^{(1)}_{i_{2m}j_{2m}}\right] x_{j_1}x_{j_2}\cdots x_{j_{2m-1}}x_{j_{2m}}=\left( \sum_{\mbox{all pairings}} \delta_{i_{k_1}i_{k_2}}\cdots \delta_{i_{k_{2m-1}}i_{k_{2m}}}\right) C_W^m\left( G_2^{(0)}\right)^m=\left( \sum_{\mbox{all pairings}} \delta_{i_{k_1}i_{k_2}}\cdots \delta_{i_{k_{2m-1}}i_{k_{2m}}}\right) \left( G_2^{(1)}\right)^m,$$ where we used Wick's theorem and summed over all possible pairings of the $2m$ auxiliary indices $k_1,\cdots,k_{2m}$, resulting in $(2m-1)!!$ distinct terms. This means the preactivation distribution for the first layer is completely Gaussian, governed by the quadratic action $$S\left( z^{(1)}\right) =\frac{1}{2G^{(1)}_2}\sum^{n_1}_{i=1} z^{(1)}_iz^{(1)}_i.$$


Before presenting a recursion for general $2m$-point correlators, let's work it out for the six-point correlator. $$\mathbb{E}\left[ z^{(l+1)}_{i_1}z^{(l+1)}_{i_2}z^{(l+1)}_{i_3}z^{(l+1)}_{i_4} z^{(l+1)}_{i_5}z^{(l+1)}_{i_6}\right]\\ = \sum_{j_1,j_2,j_3,j_4,j_5,j_6=1}^{n_l}\mathbb{E}\left[ W^{(l+1)}_{i_1j_1}W^{(l+1)}_{i_2j_2}W^{(l+1)}_{i_3j_3}W^{(l+1)}_{i_4j_4} W^{(l+1)}_{i_5j_5}W^{(l+1)}_{i_6j_6}\right] \mathbb{E}\left[ z^{(l)}_{j_1}z^{(l)}_{j_2}z^{(l)}_{j_3}z^{(l)}_{j_4}z^{(l)}_{j_5}z^{(l)}_{j_6}\right] \\ =C^3_W\Big( \delta_{i_1i_2}\delta_{i_3i_4}\delta_{i_5i_6}+\delta_{i_1i_3}\delta_{i_2i_4}\delta_{i_5i_6}+\delta_{i_1i_4}\delta_{i_2i_3}\delta_{i_5i_6}\\ +\delta_{i_1i_2}\delta_{i_3i_5}\delta_{i_4i_6}+\delta_{i_1i_3}\delta_{i_2i_5}\delta_{i_4i_6}+\delta_{i_1i_5}\delta_{i_2i_3}\delta_{i_4i_6}\\ +\delta_{i_1i_2}\delta_{i_5i_4}\delta_{i_3i_6}+\delta_{i_1i_5}\delta_{i_2i_4}\delta_{i_3i_6}+\delta_{i_1i_4}\delta_{i_2i_5}\delta_{i_3i_6}\\ +\delta_{i_1i_5}\delta_{i_3i_4}\delta_{i_2i_6}+\delta_{i_1i_3}\delta_{i_5i_4}\delta_{i_2i_6}+\delta_{i_1i_4}\delta_{i_5i_3}\delta_{i_2i_6}\\ +\delta_{i_5i_2}\delta_{i_3i_4}\delta_{i_1i_6}+\delta_{i_5i_3}\delta_{i_2i_4}\delta_{i_1i_6}+\delta_{i_5i_4}\delta_{i_2i_3}\delta_{i_1i_6}\Big) \frac{1}{n_l^3}\sum^{n_l}_{i,j,k=1} \mathbb{E}\left[ z^{(l)}_iz^{(l)}_iz^{(l)}_jz^{(l)}_jz^{(l)}_kz^{(l)}_k\right].$$ This suggests a decomposition of the six-point correlator as $$\mathbb{E}\left[ z^{(l)}_{i_1}z^{(l)}_{i_2}z^{(l)}_{i_3}z^{(l)}_{i_4}z^{(l)}_{i_5}z^{(l)}_{i_6}\right]\equiv (\delta_{i_1i_2}\delta_{i_3i_4}\delta_{i_5i_6}+\cdots + \delta_{i_5i_4}\delta_{i_2i_3}\delta_{i_1i_6})G^{(l)}_6,$$ with the same fifteen pairings. Substituting this decomposition back, to find a recursion for $G_6^{(l)}$ we need to perform the sum $$\frac{1}{n_l^3}\sum_{i,j,k=1}^{n_l}\mathbb{E}\left[ z^{(l)}_iz^{(l)}_iz^{(l)}_jz^{(l)}_jz^{(l)}_kz^{(l)}_k\right].$$ There are three types of terms in the sum. There is one term that looks like this $$\frac{1}{n_l^3}\sum_{i,j,k=1}^{n_l} \delta_{ii}\delta_{jj}\delta_{kk}=1,$$ six terms that look like this $$\frac{1}{n_l^3}\sum_{i,j,k=1}^{n_l} \delta_{ij}\delta_{ji}\delta_{kk}=\frac{1}{n_l},$$ and eight terms that look like this $$\frac{1}{n_l^3}\sum_{i,j,k=1}^{n_l} \delta_{ij}\delta_{jk}\delta_{ki}=\frac{1}{n_l^2}.$$ Putting all these terms together, we find a recursion for the layer dependence of the full six-point correlator $$G^{(l+1)}_6=C_W^3\left( 1+\frac{6}{n_l}+\frac{8}{n_l^2}\right) G_6^{(l)},$$ which has a simple solution $$G^{(l)}_6=C_W^{3l}\left[ \prod_{l'=1}^{l-1}\left( 1+\frac{6}{n_{l'}}+\frac{8}{n_{l'}^2}\right)\right] \left( G_2^{(0)}\right)^3=\left[\prod_{l'=1}^{l-1}\left( 1+\frac{6}{n_{l'}}+\frac{8}{n_{l'}^2}\right)\right] \left( G_2^{(l)}\right)^3.$$
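
As a sanity check on the counting (my own brute-force sketch), we can enumerate all 15 pairings of six indices and sum the corresponding Kronecker-delta products over the index pattern $(i,i,j,j,k,k)$; dividing by $n^3$ should reproduce $1+6/n+8/n^2$ exactly.

```python
import itertools

def pairings(elems):
    """Yield all ways to partition elems into unordered pairs (15 pairings for 6 elements)."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for i, partner in enumerate(rest):
        for tail in pairings(rest[:i] + rest[i + 1:]):
            yield [(first, partner)] + tail

n = 7                                      # any width; the identity is exact
count = 0
for i, j, k in itertools.product(range(n), repeat=3):
    idx = (i, i, j, j, k, k)               # index pattern of z_i z_i z_j z_j z_k z_k
    for pairing in pairings(list(range(6))):
        count += all(idx[a] == idx[b] for a, b in pairing)
print(count / n**3, 1 + 6 / n + 8 / n**2)  # both equal 1 + 6/n + 8/n^2
```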


Similarly, we can decompose an arbitrary $2m$-point correlator as $$\mathbb{E}\left[ z^{(l)}_{i_1}z^{(l)}_{i_2}\cdots z^{(l)}_{i_{2m-1}}z^{(l)}_{i_{2m}}\right] = \left( \sum_{\mbox{all pairings}} \delta_{i_{k_1}i_{k_2}}\cdots \delta_{i_{k_{2m-1}}i_{k_{2m}}}\right) G^{(l)}_{2m},$$ and use a similar set of manipulations to show that the layer dependence $G^{(l)}_{2m}$ obeys a recursion $$G^{(l+1)}_{2m}=c_{2m}(n_l)C^m_WG_{2m}^{(l)},$$ with the combinatorial factor $c_{2m}(n)$ given by $$c_{2m}(n)=\left( 1+\frac{2}{n}\right) \left( 1+\frac{4}{n}\right) \cdots \left( 1+\frac{2m-2}{n}\right) =\frac{(\frac{n}{2}-1+m)!}{(\frac{n}{2}-1)!}\left( \frac{2}{n}\right)^m.$$ Finally, $$G_{2m}^{(l)}=\left[ \prod_{l'=1}^{l-1} c_{2m}(n_{l'})\right] \left( G_2^{(l)}\right)^m.$$
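
A quick check of the combinatorial factor (my own sketch): the product form of $c_{2m}(n)$ and the factorial closed form (written with gamma functions so that non-even $n$ also works) agree.

```python
import math

def c2m_product(m, n):
    """c_{2m}(n) = (1 + 2/n)(1 + 4/n) ... (1 + (2m-2)/n)."""
    out = 1.0
    for s in range(1, m):
        out *= 1 + 2 * s / n
    return out

def c2m_closed(m, n):
    """Closed form (n/2 - 1 + m)! / (n/2 - 1)! * (2/n)^m, via gamma functions."""
    return math.gamma(n / 2 + m) / math.gamma(n / 2) * (2 / n) ** m

for m in (1, 2, 3, 4):
    for n in (4, 10, 100):
        assert abs(c2m_product(m, n) - c2m_closed(m, n)) < 1e-9
print("product and closed-form expressions for c_{2m}(n) agree")
```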


Physics: breakdown of perturbation theory and the emergence of chaos

Let's set $n_1=n_2=\cdots=n_{L-1}\equiv n$, and focus only on the output distribution $p\left( z^{(L)}|x\right)$.


In the infinite-width limit $n\rightarrow \infty$, $$\lim_{n\rightarrow \infty}c_{2m}(n)=1,$$ so $$G^{(L)}_{2m}=\left( G^{(L)}_2\right)^m,$$ and the output distribution $p\left( z^{(L)}|x\right)$ is Gaussian.


In the infinite-depth limit $L\rightarrow \infty$ at fixed width, all $c_{2m}(n)>1$, so the higher-point correlators with $2m>2$ blow up exponentially as $$G^{(L)}_{2m}=\left[ c_{2m}(n)\right]^{L-1}\left( G^{(L)}_2\right)^m.$$ The output distribution is then far from Gaussian; it is chaotic.


And these limits do not commute: $$\lim_{n\rightarrow \infty} \lim_{L\rightarrow \infty} G^{(L)}_{2m}\ne \lim_{L\rightarrow \infty} \lim_{n\rightarrow \infty} G^{(L)}_{2m}.$$ However, we can construct an interpolating solution by sending both the width and depth to infinity, $n,L\rightarrow \infty$, while keeping their ratio fixed: $$r\equiv \frac{L}{n}.$$ Expanding the combinatorial factors as $$c_{2m}(n)=1+\frac{1}{n}\left( \sum_{s=1}^{m-1} 2s\right) +O\left( \frac{1}{n^2}\right) = 1+\frac{m(m-1)}{n}+O\left( \frac{1}{n^2}\right),$$ and then using the well-known formula for the exponential $$\lim_{L\rightarrow \infty} \left[ 1+\frac{a}{L}+O\left( \frac{1}{L^2}\right)\right]^L=e^a,$$ we can construct a limiting value for any correlator at a given value of $m$ and fixed aspect ratio $r$: $$G^{(L)}_{2m}\rightarrow e^{m(m-1)r}\left( G^{(L)}_2\right)^m.$$ This means that as $r\rightarrow 0$ the output distribution becomes Gaussian, while as $r\rightarrow \infty$ it becomes chaotic.
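
Numerically, the interpolation is easy to see (my own sketch): at fixed aspect ratio $r=L/n$, the factor $[c_{2m}(n)]^{L-1}$ converges to $e^{m(m-1)r}$ as $n$ and $L$ grow together.

```python
import math

def c2m(m, n):
    """Combinatorial factor c_{2m}(n) = (1 + 2/n)(1 + 4/n) ... (1 + (2m-2)/n)."""
    out = 1.0
    for s in range(1, m):
        out *= 1 + 2 * s / n
    return out

r, m = 0.1, 3                                   # aspect ratio L/n and correlator order 2m = 6
for n in (100, 1000, 10000):
    L = int(r * n)
    ratio = c2m(m, n) ** (L - 1)                # G_{2m}^(L) / (G_2^(L))^m for equal hidden widths
    print(n, round(ratio, 4), round(math.exp(m * (m - 1) * r), 4))
```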


At criticality, where $G^{(L)}_2=G^{(0)}_2$, the connected four-point correlator is $$G^{(L)}_4-\left( G_2^{(L)}\right)^2=\left(e^{2r}-1\right) \left( G_2^{(0)}\right)^2=2r\left( G_2^{(0)}\right)^2+O(r^2),$$ and the connected six-point correlator is given by $$G^{(L)}_6-3G^{(L)}_2G^{(L)}_4+2\left( G^{(L)}_2\right)^3=\left( e^{6r}-3e^{2r}+2\right) \left( G^{(0)}_2\right)^3=12r^2\left( G^{(0)}_2\right)^3+O(r^3).$$ So the higher-point connected correlators are suppressed by higher powers of $r$ when $r$ is small.
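
The small-$r$ expansions above can be verified symbolically; here is a short sketch using sympy (my own choice of tool):

```python
import sympy as sp

r = sp.symbols('r')
# Connected four-point combination at criticality: e^{2r} - 1 = 2r + ...
print(sp.series(sp.exp(2 * r) - 1, r, 0, 3))
# Connected six-point combination: e^{6r} - 3 e^{2r} + 2 = 12 r^2 + ...
print(sp.series(sp.exp(6 * r) - 3 * sp.exp(2 * r) + 2, r, 0, 4))
```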