[Deep Learning Theory] 2. Neural Networks
This article is part of the Textbook Commentary Project.
2.1 Function Approximation
The preactivation $z_i$ of a neuron is a linear aggregation of incoming signals $s_j$, where each signal is weighted by $W_{ij}$ and the sum is biased by $b_i$: $$z_i(s)=b_i+\sum^{n_{in}}_{j=1} W_{ij} s_j \quad\mbox{for}\quad i=1,\cdots,n_{out}.$$
Each neuron then fires or not according to the weighted and biased evidence, i.e. according to the value of the preactivation $z_i$, and produces an activation $$\sigma_i\equiv \sigma(z_i).$$ The scalar-valued function $\sigma(z)$ is called the activation function and acts independently on each component of the preactivation vector.
Taken together, these $n_{out}$ neurons form a layer, which takes in the $n_{in}$-dimensional vector of signals $s_j$ and outputs the $n_{out}$-dimensional vector of activations $\sigma_i$. With this collective perspective, a layer is parametrized by a vector of biases $b_i$ and a matrix of weights $W_{ij}$, where $i=1,\cdots,n_{out}$ and $j=1,\cdots,n_{in}$, together with a fixed activation function $\sigma(z)$.
The organization of the neurons and their pattern of connections is called the neural network architecture. The architecture obtained by stacking layers of many neurons is called the multilayer perceptron (MLP).
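As a rough illustration of the layer just described, the following sketch computes the preactivations and activations for a tiny layer. The function name `layer`, the choice of $\tanh$ as $\sigma$, and the numerical values are assumptions made only for this example.

```python
import numpy as np

def layer(s, W, b, sigma=np.tanh):
    """Return (preactivations z, activations sigma(z)) for one layer.

    s has shape (n_in,), W has shape (n_out, n_in), b has shape (n_out,).
    """
    z = b + W @ s        # z_i = b_i + sum_j W_ij s_j
    return z, sigma(z)   # sigma acts on each component independently

# Hypothetical numbers with n_in = 3 and n_out = 2.
s = np.array([1.0, 2.0, 3.0])
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5]])
b = np.array([0.0, -3.0])

z, a = layer(s, W, b)
print(z)  # preactivations: [-2.  0.]
print(a)  # activations: tanh applied to each preactivation
```

The layer is fully specified by `W`, `b`, and the fixed function `sigma`, matching the parametrization in the text.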
The MLP is recursively defined through the following iteration equations $$z_i^{(1)}(x_\alpha)\equiv b_i^{(1)} + \sum^{n_0}_{j=1}W^{(1)}_{ij}x_{j;\alpha},\quad \mbox{for}\quad i=1,\cdots,n_1,\\ z_i^{(l+1)}(x_\alpha)\equiv b_i^{(l+1)} + \sum^{n_l}_{j=1}W^{(l+1)}_{ij}\sigma\left( z^{(l)}_j (x_\alpha)\right),\quad \mbox{for}\quad i=1,\cdots,n_{l+1};\ l=1,\cdots, L-1,$$ which describe a network with $L$ layers of neurons, where $L$ is the depth and each layer $l$ is composed of $n_l$ neurons, $n_l$ being its width. The depth and hidden-layer widths are variable architecture hyperparameters that define the shape of the network. The final-layer preactivations computed by the network, $$f(x;\theta)=z^{(L)}(x),$$ serve as the function approximator, with the model parameters $\theta_\mu$ being the union of the biases and weights from all the layers.
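The recursion can be sketched in a few lines. This is a minimal sketch, not a reference implementation: the name `mlp_forward`, the widths, the $\tanh$ activation, and the random parameter values are all assumptions chosen for illustration.

```python
import numpy as np

def mlp_forward(x, biases, weights, sigma=np.tanh):
    """Final-layer preactivations z^(L)(x) of an MLP.

    biases[k] has shape (n_{k+1},) and weights[k] has shape (n_{k+1}, n_k),
    so list position k holds the parameters of layer l = k + 1.
    """
    z = biases[0] + weights[0] @ x            # first layer acts on the input
    for b, W in zip(biases[1:], weights[1:]):
        z = b + W @ sigma(z)                  # deeper layers act on activations
    return z

widths = [3, 4, 4, 2]                         # n_0, n_1, n_2, n_3 (so L = 3)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(widths[:-1], widths[1:])]
biases = [rng.normal(size=n_out) for n_out in widths[1:]]

f = mlp_forward(np.ones(3), biases, weights)
print(f.shape)  # (2,)
```

Note that no activation is applied after the last layer, since the function approximator is defined by the final-layer preactivations.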
2.2 Activation Functions
Perceptron
$$\sigma(z)=\begin{cases}1,\quad z\ge 0,\\ 0,\quad z<0\end{cases}$$
The perceptron has historical significance, but is never used in deep neural networks.
Sigmoid
$$\sigma(z)=\frac{1}{1+e^{-z}}=\frac{1}{2}+\frac{1}{2}\tanh \left( \frac{z}{2}\right)$$
Tanh
$$\sigma(z)=\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}=\frac{e^{2z}-1}{e^{2z}+1}$$
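The two identities above, the sigmoid written as a shifted and rescaled $\tanh$ and the $\tanh$ written in terms of $e^{2z}$, can be checked numerically. The grid of test points is an arbitrary choice for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)

# Sigmoid as a shifted, rescaled tanh.
shifted_tanh = 0.5 + 0.5 * np.tanh(z / 2.0)

# tanh written with e^{2z} only.
tanh_from_exp = (np.exp(2.0 * z) - 1.0) / (np.exp(2.0 * z) + 1.0)

print(np.allclose(sigmoid(z), shifted_tanh))   # True
print(np.allclose(np.tanh(z), tanh_from_exp))  # True
```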
Sin
$$\sigma(z)=\sin (z)$$
Scale-invariant: linear, ReLU, and leaky ReLU
A scale-invariant activation function is any activation function that satisfies $$\sigma(\lambda z)=\lambda \sigma (z),$$ for any positive rescaling $\lambda$. This condition is met by any activation function of the form $$\sigma(z)=\begin{cases} a_+ z,\quad z\ge 0,\\ a_-z,\quad z<0.\end{cases}$$
The class of scale-invariant activation functions includes the linear ($a_+=a_-=a$), ReLU (Rectified Linear Unit) ($a_+=1,\ a_-=0$), and leaky ReLU ($a_+=1,\ a_-=a$) activation functions.
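A quick numerical check of the scale-invariance condition may help. In this sketch the helper names and the leaky slope `a=0.01` are assumptions. It also shows that $\tanh$ fails the condition, and that the ReLU fails it for a negative rescaling, which is why the condition is stated for positive $\lambda$ only.

```python
import numpy as np

def piecewise_linear(z, a_plus, a_minus):
    """sigma(z) = a_plus * z for z >= 0 and a_minus * z for z < 0."""
    return np.where(z >= 0, a_plus * z, a_minus * z)

def relu(z):
    return piecewise_linear(z, 1.0, 0.0)

def leaky_relu(z, a=0.01):  # the slope a is an arbitrary illustrative value
    return piecewise_linear(z, 1.0, a)

z = np.linspace(-3.0, 3.0, 13)
lam = 2.5  # any positive rescaling

relu_ok = np.allclose(relu(lam * z), lam * relu(z))
leaky_ok = np.allclose(leaky_relu(lam * z), lam * leaky_relu(z))
tanh_ok = np.allclose(np.tanh(lam * z), lam * np.tanh(z))
relu_negative_ok = np.allclose(relu(-lam * z), -lam * relu(z))

print(relu_ok, leaky_ok, tanh_ok, relu_negative_ok)  # True True False False
```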
ReLU-like: softplus, SWISH, and GELU
The softplus activation function $$\sigma(z)=\log (1+e^z),$$ behaves linearly, $\sigma(z)\approx z$, for a large positive argument $z \gg 1$ and vanishes exponentially for a large negative argument, $\sigma(z)\approx e^{-|z|}$ for $z\ll -1$.
The SWISH activation function is defined as $$\sigma(z)=\frac{z}{1+e^{-z}},$$ which approximates the ReLU: it behaves as $\sigma(z)\approx z$ for $z\gg 1$ and as $\sigma(z)\approx 0$ for $z\ll -1$.
The GELU (Gaussian Error Linear Unit) activation function is a lot like the SWISH. It is given by the expression $$\sigma(z)=\left[ \frac{1}{2}+\frac{1}{2}\mbox{erf} \left( \frac{z}{\sqrt{2}}\right)\right] \times z,$$ where the error function $\mbox{erf}(z)$ is given by $$\mbox{erf}(z)\equiv \frac{2}{\sqrt{\pi}}\int^z_0 dt\, e^{-t^2},$$ which is a partial integration of the Gaussian function.
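The three ReLU-like functions can be compared against the ReLU far from the origin. This is a small sketch using only NumPy and the standard library's `math.erf`; the function names and the test points $z=\pm 10$ are assumptions for illustration.

```python
import math
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softplus(z):
    return np.logaddexp(0.0, z)  # log(1 + e^z), evaluated stably

def swish(z):
    return z / (1.0 + np.exp(-z))

def gelu(z):
    erf = np.vectorize(math.erf)  # apply the scalar erf element-wise
    return (0.5 + 0.5 * erf(z / math.sqrt(2.0))) * z

z_far = np.array([-10.0, 10.0])
for name, fn in [("softplus", softplus), ("swish", swish), ("gelu", gelu)]:
    # Largest deviation from the ReLU at z = -10 and z = +10.
    print(name, np.max(np.abs(fn(z_far) - relu(z_far))))
```

All three deviations are small at these points, while near $z=0$ the functions differ visibly from the ReLU; for instance the softplus equals $\log 2$ at $z=0$ rather than $0$.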
2.3 Ensembles
Initialization distribution of biases and weights
To train a neural network effectively, we need a good initialization distribution. For example, let's initialize MLPs with biases and weights drawn independently from zero-mean Gaussian distributions with variances $$\mathbb{E}\left[ b^{(l)}_{i_1}b^{(l)}_{i_2}\right] = \delta_{i_1i_2}C_b^{(l)},\\ \mathbb{E}\left[ W^{(l)}_{i_1j_1}W^{(l)}_{i_2j_2}\right] = \delta_{i_1i_2}\delta_{j_1j_2}\frac{C^{(l)}_W}{n_{l-1}},$$ respectively. Explicitly, the functional forms of these Gaussian distributions are given by $$p\left( b_i^{(l)}\right) = \frac{1}{\sqrt{2\pi C_b^{(l)}}} \exp \left[ -\frac{1}{2C_b^{(l)}} \left( b_i^{(l)}\right)^2\right],\\ p\left( W_{ij}^{(l)}\right) = \sqrt{\frac{n_{l-1}}{2\pi C_W^{(l)}}} \exp \left[ -\frac{n_{l-1}}{2C_W^{(l)}} \left( W_{ij}^{(l)}\right)^2\right].$$ The set of bias variances $\left\{ C^{(1)}_b,\cdots,C^{(L)}_b\right\}$ and the set of rescaled weight variances $\left\{ C^{(1)}_W,\cdots, C^{(L)}_W\right\}$ are called initialization hyperparameters.
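Sampling from this initialization distribution is straightforward. The sketch below draws one layer's parameters and checks the empirical variances against $C_b$ and $C_W/n_{l-1}$. The function name `init_layer` and the particular widths and hyperparameter values are assumptions for this example.

```python
import numpy as np

def init_layer(n_in, n_out, C_b, C_W, rng):
    """Sample biases with variance C_b and weights with variance C_W / n_in."""
    b = rng.normal(0.0, np.sqrt(C_b), size=n_out)
    W = rng.normal(0.0, np.sqrt(C_W / n_in), size=(n_out, n_in))
    return b, W

rng = np.random.default_rng(0)
n_in, n_out = 200, 500      # plays the role of n_{l-1} and n_l
C_b, C_W = 0.5, 2.0         # hypothetical initialization hyperparameters

b, W = init_layer(n_in, n_out, C_b, C_W, rng)

print(b.var())              # close to C_b
print(W.var() * n_in)       # rescaled weight variance, close to C_W
```

Dividing the weight variance by the fan-in $n_{l-1}$ keeps the sum over $n_{l-1}$ incoming signals of order one as the width grows, which is why $C_W$ is called the rescaled variance.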
Induced distributions
Given a dataset $\mathcal{D}=\{x_{i;\alpha}\}$ consisting of $N_\mathcal{D}$ input data, an MLP with model parameters $\theta_\mu =\left\{ b_i^{(l)},W_{ij}^{(l)}\right\}$ evaluated on $\mathcal{D}$ outputs an array of $n_L\times N_\mathcal{D}$ numbers $$f_i(x_\alpha;\theta)=z_i^{(L)}(x_\alpha)\equiv z_{i;\alpha}^{(L)},$$ indexed by both neural indices $i=1,\cdots,n_L$ and sample indices $\alpha=1,\cdots,N_\mathcal{D}$. The initialization distribution over the parameters therefore induces a distribution on the network outputs.
This output distribution $p\left( z^{(L)}|\mathcal{D}\right)$ controls the statistics of network outputs at the point of initialization. In theory, we need to calculate the following gigantic integral over all the model parameters $$p\left( z^{(L)} | \mathcal{D}\right) = \int \left[ \prod^P_{\mu=1} d\theta_\mu \right] p \left( z^{(L)}| \theta,\mathcal{D}\right) p(\theta).$$ The conditional distribution $p\left( z^{(L)}| \theta,\mathcal{D}\right)$ in the integrand is deterministic, since the parameters and the inputs fix the outputs completely, but it is not yet clear how to write such a distribution down.
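Although the integral is hard to evaluate analytically, drawing samples from the output distribution is easy: sample the parameters, then run the network. The sketch below builds such an ensemble. It assumes the same $C_b$ and $C_W$ in every layer, a $\tanh$ activation, and small hypothetical widths; none of these choices comes from the text.

```python
import numpy as np

def sample_params(widths, C_b, C_W, rng):
    """Draw one set of biases and weights from the zero-mean Gaussian initialization."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        b = rng.normal(0.0, np.sqrt(C_b), size=(n_out, 1))
        W = rng.normal(0.0, np.sqrt(C_W / n_in), size=(n_out, n_in))
        params.append((b, W))
    return params

def forward(X, params, sigma=np.tanh):
    """Final-layer preactivations for all inputs; X[j, alpha] has shape (n_0, N_D)."""
    b, W = params[0]
    z = b + W @ X
    for b, W in params[1:]:
        z = b + W @ sigma(z)
    return z  # shape (n_L, N_D)

rng = np.random.default_rng(0)
widths = [3, 16, 16, 2]            # n_0, n_1, n_2, n_L
X = rng.normal(size=(3, 5))        # a made-up dataset with N_D = 5 inputs

# Each ensemble member is one draw of theta, giving one n_L x N_D output array.
outputs = np.stack([forward(X, sample_params(widths, 0.1, 1.0, rng))
                    for _ in range(2000)])

print(outputs.shape)                        # (2000, 2, 5)
print(np.abs(outputs.mean(axis=0)).max())   # ensemble mean of every output, near zero
```

The ensemble mean is close to zero because the final-layer biases and weights have zero mean and are independent of the earlier layers.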
Deterministic distributions and the Dirac delta function
What kind of a probability distribution is deterministic? Let's abstractly denote such a distribution as $p(z|s)=\delta (z|s)$, which is intended to encode the deterministic relationship $z=s$. What properties should this distribution have? First, the mean of $z$ should be $s$ $$\mathbb{E}[z]=\int dz \delta(z|s)z\equiv s.$$ Second, the variance should vanish, since this is a deterministic relationship. In other words, $$\mathbb{E}[z^2]-(\mathbb{E}[z])^2=\left[ \int dz \delta(z|s)z^2\right] -s^2\equiv 0,$$ or equivalently, $$\int dz \delta (z|s)z^2 \equiv s^2.$$ In fact, this determinism implies an even stronger condition. In particular, the expectation of any function $f(z)$ of $z$ should evaluate to $f(s)$: $$\mathbb{E}[f(z)]=\int dz \delta(z|s)f(z)\equiv f(s),$$ which is the defining property of the Dirac delta function.
The Dirac delta function can also be viewed as a normalized Gaussian distribution with mean $s$ in the limit where the variance $K$ goes to zero: $$\delta(z|s)\equiv \lim_{K\rightarrow +0} \frac{1}{\sqrt{2\pi K}} e^{-\frac{(z-s)^2}{2K}}.$$
The limit should always be taken after integrating the distribution against some function. With that understood, we can insert a factor of $1$ on the right-hand side as $$\delta(z|s)=\lim_{K\rightarrow +0} \frac{1}{\sqrt{2\pi K}} e^{-\frac{(z-s)^2}{2K}}\left\{ \frac{1}{\sqrt{2\pi /K}}\int^\infty_{-\infty} d\Lambda \exp \left[ -\frac{K}{2} \left( \Lambda - \frac{i(z-s)}{K}\right)^2\right]\right\}\\ =\lim_{K\rightarrow +0} \frac{1}{2\pi} \int^\infty_{-\infty} d\Lambda \exp \left[ -\frac{1}{2}K\Lambda^2 + i\Lambda (z-s)\right],$$ where in the curly brackets we inserted an integral over a dummy variable $\Lambda$ of a normalized Gaussian with variance $1/K$ and imaginary mean $i(z-s)/K$, and on the second line we simply combined the exponentials. Taking the limit $K\rightarrow +0$, we find an integral representation of the Dirac delta function $$\delta(z|s)=\frac{1}{2\pi}\int^\infty_{-\infty} d\Lambda e^{i\Lambda (z-s)}\equiv \delta(z-s).$$
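The Gaussian limit can be watched numerically by integrating a test function against Gaussians of shrinking variance. This sketch uses $f(z)=\cos z$ and $s=0.7$, both arbitrary choices, and a plain Riemann sum on a grid that spans ten standard deviations on either side of $s$. For this particular $f$ the Gaussian integral is known in closed form, $\cos(s)\,e^{-K/2}$, which makes the approach to $f(s)$ explicit.

```python
import numpy as np

def smeared_expectation(f, s, K, n=20001, half_width=10.0):
    """Integrate f(z) against a normalized Gaussian with mean s and variance K."""
    sd = np.sqrt(K)
    z = np.linspace(s - half_width * sd, s + half_width * sd, n)
    dz = z[1] - z[0]
    gauss = np.exp(-(z - s) ** 2 / (2.0 * K)) / np.sqrt(2.0 * np.pi * K)
    return np.sum(gauss * f(z)) * dz  # simple Riemann sum

s = 0.7
variances = [1.0, 1e-1, 1e-2, 1e-4]

# Distance between the smeared expectation and f(s) for each variance K.
errors = [abs(smeared_expectation(np.cos, s, K) - np.cos(s)) for K in variances]
print(errors)  # shrinks toward zero as K decreases
```

The integral is computed first and the limit $K\rightarrow +0$ is taken afterwards, which is the order of operations the definition requires.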
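Before the limit is taken, the two sides of the manipulation above should agree at any finite $K$: the $\Lambda$ integral with the $-\frac{1}{2}K\Lambda^2$ regulator reproduces the Gaussian of variance $K$. The sketch below checks this numerically. The values of $K$, $s$, and $z$, the cutoff of twelve standard deviations in $\Lambda$, and the function names are assumptions made for the check.

```python
import numpy as np

def gaussian_side(z, s, K):
    """Normalized Gaussian in z with mean s and variance K."""
    return np.exp(-(z - s) ** 2 / (2.0 * K)) / np.sqrt(2.0 * np.pi * K)

def fourier_side(z, s, K, n=200001, cutoff=12.0):
    """(1 / 2 pi) * integral over Lambda of exp(-K Lambda^2 / 2 + i Lambda (z - s))."""
    lam_max = cutoff / np.sqrt(K)          # the integrand has width 1 / sqrt(K)
    lam = np.linspace(-lam_max, lam_max, n)
    dlam = lam[1] - lam[0]
    integrand = np.exp(-0.5 * K * lam ** 2 + 1j * lam * (z - s))
    return np.sum(integrand) * dlam / (2.0 * np.pi)

K, s = 0.5, 0.3
for z in [-1.0, 0.3, 2.0]:
    value = fourier_side(z, s, K)
    # Real part matches the Gaussian; imaginary part cancels by symmetry.
    print(z, value.real, gaussian_side(z, s, K), abs(value.imag))
```

Without the regulator the $\Lambda$ integral does not converge in the ordinary sense, which is another way to see why the limit is taken only at the end.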
Induced distributions, redux
Now we can express the output distribution more concretely. To start, for a one-layer network of depth $L=1$, the distribution of the first-layer output is given by $$p\left( z^{(1)}|\mathcal{D}\right) = \int \left[ \prod^{n_1}_{i=1}db_i^{(1)} p\left( b_i^{(1)}\right)\right] \left[ \prod^{n_1}_{i=1}\prod^{n_0}_{j=1} dW^{(1)}_{ij}p\left( W_{ij}^{(1)}\right)\right] \times \left[ \prod^{n_1}_{i=1}\prod_{\alpha\in \mathcal{D}}\delta \left( z^{(1)}_{i;\alpha}-b^{(1)}_i-\sum^{n_0}_{j=1}W^{(1)}_{ij}x_{j;\alpha}\right)\right].$$ Here, we needed $n_1\times N_\mathcal{D}$ Dirac delta functions, one for each component of $z_{i;\alpha}^{(1)}$. For a general layer, the conditional distribution of the preactivations given those of the previous layer is $$p\left( z^{(l+1)}|z^{(l)}\right) = \int \left[ \prod^{n_{l+1}}_{i=1}db_i^{(l+1)} p\left( b_i^{(l+1)}\right)\right] \left[ \prod^{n_{l+1}}_{i=1}\prod^{n_l}_{j=1} dW^{(l+1)}_{ij}p\left( W_{ij}^{(l+1)}\right)\right] \times \left[ \prod^{n_{l+1}}_{i=1}\prod_{\alpha\in \mathcal{D}}\delta \left( z^{(l+1)}_{i;\alpha}-b^{(l+1)}_i-\sum^{n_l}_{j=1}W^{(l+1)}_{ij}\sigma\left(z^{(l)}_{j;\alpha}\right) \right)\right].$$
More generally, for any parametrized model with output $z^{out}_{i;\alpha}\equiv f_i(x_\alpha;\theta)$ for $i=1,\cdots, n_{out}$ and with the model parameters $\theta_\mu$ distributed according to $p(\theta)$, the output distribution can be written using the Dirac delta function as $$p\left( z^{out}|\mathcal{D}\right) = \int \left[ \prod^P_{\mu=1} d\theta_\mu \right] p(\theta) \left[ \prod^{n_{out}}_{i=1}\prod_{\alpha\in \mathcal{D}}\delta\left( z^{out}_{i;\alpha}-f_i(x_\alpha;\theta)\right)\right].$$
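In practice, the delta functions say that an expectation over outputs equals an expectation over parameters with the outputs replaced by $f_i(x_\alpha;\theta)$, so it can be estimated by sampling $\theta$ and evaluating the model. The sketch below does this for the first-layer formula. As a check that is not stated in the text but follows directly from the linearity of the first layer and the initialization covariances, the two-point function should be $\mathbb{E}\left[z^{(1)}_{i;\alpha}z^{(1)}_{i;\beta}\right]=C_b^{(1)}+\frac{C_W^{(1)}}{n_0}\sum_j x_{j;\alpha}x_{j;\beta}$. The inputs, widths, and hyperparameter values are made up for the example.

```python
import numpy as np

def sample_first_layer(X, n1, C_b, C_W, n_samples, rng):
    """Sample z^(1) = b + W x for many parameter draws; X[j, alpha] has shape (n_0, N_D)."""
    n0 = X.shape[0]
    b = rng.normal(0.0, np.sqrt(C_b), size=(n_samples, n1, 1))
    W = rng.normal(0.0, np.sqrt(C_W / n0), size=(n_samples, n1, n0))
    return b + W @ X  # shape (n_samples, n1, N_D)

# Hypothetical dataset with n_0 = 4 features and N_D = 3 inputs (columns).
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
n1, C_b, C_W = 3, 0.5, 2.0
rng = np.random.default_rng(0)

z = sample_first_layer(X, n1, C_b, C_W, n_samples=20000, rng=rng)

# Monte Carlo estimate of E[z_{i;alpha} z_{i;beta}], averaged over draws and neurons.
empirical = np.einsum("sia,sib->ab", z, z) / (z.shape[0] * z.shape[1])
predicted = C_b + (C_W / X.shape[0]) * (X.T @ X)

print(np.round(empirical, 2))
print(predicted)
```

The same sampling recipe applies to any parametrized model: only the line that evaluates the model on the sampled parameters changes.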