[Deep Learning Principle] 0. Initialization

 This article is part of the Textbook Commentary Project.


Modern AI development is driven by deep learning. Deep learning is a successful approach built on an imitation of the human brain: the neural network. It trains on large datasets and can perform a wide range of tasks remarkably well.


0. Prerequisite

1) Deep Learning

Gradient Descent, Back Propagation

2) (Statistical) Field Theory

Propagator, Perturbation Theory, Renormalization Group Flow, Effective Field Theory

3) Regression Analysis

Bayesian Statistics, Gaussian Process



1. An Effective Theory Approach

We know the microscopic description of a neural network (the training algorithm).

We don't know the macroscopic description: why neural networks train well.

Effective field theory is the process of finding a simple (emergent) theory for a complicated system.



2. The Theoretical Minimum

Deep Neural Network

A neural network is a parameterized function $$\begin{align}f(x;\theta)\end{align}$$ where $x$ is the input to the function and $\theta$ is a vector of a large number of parameters controlling the shape of the function. The algorithm for tuning the vector $\theta$ is as follows:

1) We initialize the network by randomly sampling the parameter vector $\theta$ from a computationally simple probability distribution, $$\begin{align}p(\theta).\end{align}$$ (We will see later why the choice of initialization distribution $p(\theta)$ matters.)

2) We adjust the parameter vector as $\theta\rightarrow \theta^\star$, such that the resulting network function $f(x;\theta^\star)$ is as close as possible to a desired target function $f(x)$: $$\begin{align}f(x;\theta^\star)\approx f(x).\end{align}$$ This is called function approximation. (See the Universal Approximation Theorem.) To find these tunings $\theta^\star$, we fit the network function $f(x;\theta)$ to training data, consisting of many pairs of the form $(x,f(x))$ observed from the desired - but only partially observable - target function $f(x)$. Overall, making these adjustments to the parameters is called training, and the particular procedure used to tune them is called a learning algorithm.
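The two steps above can be sketched concretely. Below is a minimal NumPy illustration: a toy one-hidden-layer tanh network is initialized from Gaussians (step 1) and then fit to samples of a target function by plain gradient descent (step 2). The architecture, initialization scales, target function, and learning rate are all illustrative assumptions, not prescribed by the text.

```python
# Minimal sketch (illustrative choices throughout): initialize theta from a
# simple p(theta), then tune it by gradient descent on training pairs.
import numpy as np

rng = np.random.default_rng(0)
n = 32  # hidden width

# Step 1: sample theta = (W, b, v) from a computationally simple p(theta).
W = rng.normal(size=n)
b = rng.normal(size=n)
v = rng.normal(scale=1.0 / np.sqrt(n), size=n)

def f(x):
    """Network function f(x; theta) for a batch of scalar inputs x."""
    return np.tanh(np.outer(x, W) + b) @ v

# Step 2: fit to training pairs (x, f(x)) drawn from a target function.
target = np.sin
xs = np.linspace(-2.0, 2.0, 50)
ys = target(xs)

init_err = np.max(np.abs(f(xs) - ys))

lr = 0.05
for _ in range(2000):
    H = np.tanh(np.outer(xs, W) + b)               # hidden activations, (50, n)
    err = H @ v - ys                               # residuals on the training set
    gv = err @ H / len(xs)                         # gradient of the loss w.r.t. v
    gW = ((err * xs) @ (1 - H**2)) * v / len(xs)   # gradient w.r.t. W
    gb = (err @ (1 - H**2)) * v / len(xs)          # gradient w.r.t. b
    W -= lr * gW; b -= lr * gb; v -= lr * gv

final_err = np.max(np.abs(f(xs) - ys))
print(init_err, final_err)  # training shrinks the fit error
```

After training, $f(x;\theta^\star)$ approximates the target much better than the randomly initialized $f(x;\theta)$ did.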


Our goal is to understand this trained network function: $$\begin{align}f(x;\theta^\star).\end{align}$$ This means:

1) Understand macroscopic behavior from microscopic description of the network.

2) Understand how the function approximation (3) works and evaluate how $f(x;\theta^\star)$ uses the training data $(x,f(x))$ in its approximation of $f(x)$.


Problems with the Taylor Expansion: Infinite Degree

One approach is to Taylor-expand our trained network function $f(x;\theta^\star)$ around the initialized value of the parameters $\theta$, treating $\theta$ as a single variable for simplicity: $$\begin{align}f(x;\theta^\star)=f(x;\theta)+(\theta^\star-\theta)\frac{df}{d\theta}+\frac{1}{2}(\theta^\star-\theta)^2\frac{d^2f}{d\theta^2}+\cdots,\end{align}$$ where $f(x;\theta)$ and its derivatives on the right-hand side are all evaluated at the initialized value of the parameters. This approach has three problems:

1) We need to know all derivatives $$\begin{align}f,\frac{df}{d\theta},\frac{d^2f}{d\theta^2},\cdots.\end{align}$$

2) Each time we initialize $\theta$, we get a different function $f(x;\theta)$. Initialization induces a distribution over the network function and its derivatives (expanding the randomness), and we need to determine the mapping $$\begin{align}p(\theta)\rightarrow p\left( f,\frac{df}{d\theta},\frac{d^2f}{d\theta^2},\cdots\right).\end{align}$$

3) We have to determine the tuned parameters from all of these variables: $$\begin{align}\theta^\star\equiv [\theta^\star]\left( \theta,f,\frac{df}{d\theta},\frac{d^2f}{d\theta^2},\cdots;\mbox{ learning algorithm; training data}\right) .\end{align}$$

If we could solve these three problems, we could find a distribution over trained network functions, $$\begin{align}p(f^\star)\equiv p\left( f(x;\theta^\star) | \mbox{learning algorithm; training data}\right) ,\end{align}$$ now conditioned in a simple way on the learning algorithm and the data we used for training.
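The single-variable Taylor expansion above can be checked numerically on a hypothetical toy function whose $\theta$-derivatives are known in closed form. This is only a sanity check of the expansion itself (problem 1: all derivatives must be known), not a model of a real network.

```python
# Sanity check of the Taylor expansion around the initialization theta,
# using the hypothetical toy function f(x; theta) = sin(theta * x).
import math

def deriv(x, theta, k):
    """k-th derivative of sin(theta*x) with respect to theta."""
    return x**k * math.sin(theta * x + k * math.pi / 2)

x, theta, theta_star = 1.3, 0.2, 1.0  # input, init, "trained" parameter
exact = math.sin(theta_star * x)      # f(x; theta_star) evaluated directly

# Partial sum of the expansion around theta, keeping 10 derivative terms.
partial = 0.0
for k in range(10):
    partial += (theta_star - theta)**k / math.factorial(k) * deriv(x, theta, k)

print(abs(partial - exact))  # small: the series reconstructs f(x; theta_star)
```

For a real network none of these $\theta$-derivatives is known in closed form, which is exactly problem 1 above.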


A Principle of Sparsity: Gaussian Distribution

The defining scales of a deep neural network architecture are its width $n$ and its depth $L$. To grow the network, we either increase its width $n$ holding its depth $L$ fixed, or increase its depth $L$ holding its width $n$ fixed. First consider the limit $$\begin{align}\lim_{n\rightarrow \infty} p(f^\star),\end{align}$$ and study an idealized neural network in this limit. This is known as the infinite-width limit of the network. In this limit, the three problems become simple:

1) The curvature (Hessian) of the local structure vanishes as $n\rightarrow \infty$. All the higher derivative terms $d^kf/d\theta^k$ for $k\ge 2$ effectively vanish, so we only need to consider $$\begin{align} f, \frac{df}{d\theta}.\end{align}$$ The network is effectively linearized. Later, we will call this the neural tangent kernel (NTK).

2) $$\begin{align}\lim_{n\rightarrow \infty} p\left( f,\frac{df}{d\theta},\frac{d^2f}{d\theta^2},\cdots\right)=p(f)p\left( \frac{df}{d\theta}\right),\end{align}$$

3) $$\begin{align}\lim_{n\rightarrow \infty} \theta^\star = [\theta^\star]\left( \theta,f,\frac{df}{d\theta};\mbox{training data}\right) .\end{align}$$
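The linearization claim in 1) can be probed numerically: under an NTK-style parameterization, a trained wide network stays much closer to its first-order Taylor expansion around initialization than a narrow one. Everything below (the toy one-hidden-layer architecture, the $1/\sqrt{n}$ output scaling, the data, and the learning rate) is an illustrative assumption, not the book's setup.

```python
# Illustrative sketch: compare a trained toy network to its linearization
# f(theta0) + (theta - theta0) . grad f(theta0) around initialization.
import numpy as np

def linearization_error(n, seed, steps=500, lr=0.2):
    rng = np.random.default_rng(seed)
    W, b, v = rng.normal(size=(3, n))  # NTK-style: all parameters O(1)

    def f_and_grad(x, W, b, v):
        h = np.tanh(W * x + b)
        out = h @ v / np.sqrt(n)       # 1/sqrt(n) output scaling
        dW = v * (1 - h**2) * x / np.sqrt(n)
        db = v * (1 - h**2) / np.sqrt(n)
        dv = h / np.sqrt(n)
        return out, np.concatenate([dW, db, dv])

    xs = np.array([-1.0, 0.0, 1.0])    # tiny illustrative training set
    ys = np.array([0.5, -0.5, 0.5])
    x_test = 0.3

    f0, g0 = f_and_grad(x_test, W, b, v)   # linearization data at init
    theta0 = np.concatenate([W, b, v])

    for _ in range(steps):                 # plain full-batch gradient descent
        grad = np.zeros(3 * n)
        for x, y in zip(xs, ys):
            fx, gx = f_and_grad(x, W, b, v)
            grad += (fx - y) * gx / len(xs)
        theta = np.concatenate([W, b, v]) - lr * grad
        W, b, v = theta[:n], theta[n:2 * n], theta[2 * n:]

    f_trained, _ = f_and_grad(x_test, W, b, v)
    f_linear = f0 + g0 @ (np.concatenate([W, b, v]) - theta0)
    return abs(f_trained - f_linear)

err_narrow = np.mean([linearization_error(8, s) for s in range(10)])
err_wide = np.mean([linearization_error(512, s) for s in range(10)])
print(err_narrow, err_wide)  # the wide network tracks its linearization better
```

The gap between the trained function and its linearized prediction shrinks as the width grows, which is what makes the infinite-width network effectively linear.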

As a result, the trained distribution (10) is a simple Gaussian distribution with a nonzero mean. However, this is not a good model, because in this limit the parameters $\theta$ cannot really be tuned (trained); we will see this later. (This is the same sense in which a free field theory is not interesting.)
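The Gaussian character of the infinite-width limit can already be seen at initialization. Below is a Monte-Carlo sketch on an illustrative toy ensemble (one-hidden-layer tanh networks with an assumed $1/\sqrt{n}$ output-weight scale): the output's excess kurtosis, which is zero for an exact Gaussian, is clearly nonzero at small width and nearly vanishes at large width.

```python
# Monte-Carlo probe of Gaussianity at initialization for a toy ensemble
# (illustrative architecture and scalings, not the book's specific setup).
import numpy as np

rng = np.random.default_rng(0)

def sample_outputs(n, num_nets, x=1.0):
    """Outputs f(x; theta) of num_nets independently initialized networks."""
    W = rng.normal(size=(num_nets, n))
    b = rng.normal(size=(num_nets, n))
    v = rng.normal(scale=1.0 / np.sqrt(n), size=(num_nets, n))
    return np.sum(v * np.tanh(W * x + b), axis=1)

def excess_kurtosis(z):
    """Fourth-cumulant probe of non-Gaussianity; 0 for a Gaussian."""
    z = z - z.mean()
    return np.mean(z**4) / np.mean(z**2) ** 2 - 3.0

k_narrow = excess_kurtosis(sample_outputs(n=2, num_nets=50_000))
k_wide = excess_kurtosis(sample_outputs(n=256, num_nets=50_000))
print(k_narrow, k_wide)  # the wide-network value sits much closer to 0
```

This is the central-limit mechanism behind the infinite-width limit: each output neuron sums many independent contributions, so its distribution approaches a Gaussian.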



Interacting (Perturbative) Theory: Nearly-Gaussian Distribution

The problem with this limit is that the fine details at each neuron are washed out by the infinite number of incoming signals. We have to restore the interactions between neurons that are present in realistic finite-width networks.


Use the $1/n$ expansion, treating the inverse layer width $\epsilon\equiv 1/n$ as our small expansion parameter, $\epsilon\ll 1$: $$\begin{align} p(f^\star)\equiv p^{\{0\}}(f^\star)+\frac{p^{\{1\}}(f^\star)}{n}+\frac{p^{\{2\}}(f^\star)}{n^2}+\cdots,\end{align}$$ where $p^{\{0\}}(f^\star)\equiv \lim_{n\rightarrow \infty} p(f^\star)$ is the infinite-width limit we discussed above, (10), and the $p^{\{k\}}(f^\star)$ for $k\ge 1$ give a series of corrections to this limit.


In this book, we'll in particular compute the first such correction, truncating the expansion as $$\begin{align} p(f^\star)\equiv p^{\{0\}}(f^\star)+\frac{p^{\{1\}}(f^\star)}{n}+O\left(\frac{1}{n^2}\right).\end{align}$$
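The size of the first correction can be probed numerically. In a toy ensemble of randomly initialized one-hidden-layer tanh networks (an illustrative assumption, as before), the output's excess kurtosis, the leading non-Gaussianity, falls off roughly like $1/n$: doubling the width roughly halves it, consistent with truncating after the $p^{\{1\}}/n$ term.

```python
# Monte-Carlo probe of the leading 1/n correction in a toy ensemble of
# randomly initialized one-hidden-layer tanh networks (illustrative setup).
import numpy as np

rng = np.random.default_rng(1)

def excess_kurtosis_at_width(n, num_nets=400_000, x=1.0):
    W = rng.normal(size=(num_nets, n))
    b = rng.normal(size=(num_nets, n))
    v = rng.normal(scale=1.0 / np.sqrt(n), size=(num_nets, n))
    z = np.sum(v * np.tanh(W * x + b), axis=1)       # network outputs at x
    z = z - z.mean()
    return np.mean(z**4) / np.mean(z**2) ** 2 - 3.0  # 0 for a Gaussian

k4 = excess_kurtosis_at_width(4)
k8 = excess_kurtosis_at_width(8)
print(k4, k8)  # roughly a factor of 2 apart
```

The first-order truncation is a good description precisely because the $O(1/n^2)$ terms are suppressed by another factor of the width.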

This model is still simple:

1) All higher derivative terms $d^kf/d\theta^k$ for $k\ge 4$ will effectively give contributions of the order $1/n^2$ or smaller, meaning that to capture the leading contributions of order $1/n$, we only need to keep track of four terms: $$\begin{align} f,\frac{df}{d\theta},\frac{d^2f}{d\theta^2},\frac{d^3f}{d\theta^3}.\end{align}$$

2) $$\begin{align}p\left( f,\frac{df}{d\theta},\frac{d^2f}{d\theta^2},\frac{d^3f}{d\theta^3}\right)\end{align}$$

3) $$\begin{align}\theta^\star = [\theta^\star]\left( \theta,f,\frac{df}{d\theta},\frac{d^2f}{d\theta^2},\frac{d^3f}{d\theta^3};\mbox{training data}\right) .\end{align}$$



Reference

The Principles of Deep Learning Theory

Image - DALL·E