Universal Approximation Theorem
Deep learning is the most universal data processing device humans can imagine.
The theoretical basis of deep learning is the universal approximation theorem, and today we will learn about it in detail.
There are several conditions for the depth and width of artificial neural networks, here we will look at limited depth and single hidden layer.
Universal Approximation Theorem: Let $C(\mathbf{X},\mathbb{R}^m)$ denote the set of continuous functions from a subset $\mathbf{X}$ of a Euclidiean $\mathbb{R}^n$ space to a Euclidean space $\mathbb{R}^m$. Let $\sigma \in C(\mathbb{R},\mathbb{R})$. Note that $(\sigma \circ x)_i = \sigma (x_i)$, so $\sigma \circ x$ denotes $\sigma$ applied to each component of $x$.
Then $\sigma$ is not polynomial if and only is for every $n \in \mathbb{N}$, $m \in \mathbb{N}$, compact $\mathbf{K} \subseteq \mathbb{R}^n$, $f \in C(\mathbf{K}, \mathbb{R}^m)$, $\epsilon > 0$ there exists $k \in \mathbb{N}$, $A \in \mathbb{R}^{k\times n}$, $b \in \mathbb{R}^k$, $C \in \mathbb{R}^{m \times k}$ such that $\sup_{x \in \mathbf{K}} \lVert f(x) - g(x)\rVert < \epsilon$ where $g(x) = C \cdot (\sigma \circ (A \cdot x + b))$
Proof: When prove case $m=1$, higher dimension is obiouse because uniform convergence in $\mathbb{R}^n$ is same as uniform convergence in each coordinate.
Let $F_\sigma$ be the set of all one-hidden-layer neural networks constructed with $\sigma$. Let $C_0(\mathbb{R}^d,\mathbb{R})$ be the set of all $C(\mathbb{R}^d,\mathbb{R})$ with compact support.
If the function is a polynomial of degree $d$, then $R_\sigma$ is contained in the closed subspace of all polynomials of degree $d$, so tis closure is also contained in it, which is not all of $C_0(\mathbb{R}^d,\mathbb{R})$.
Otherwise, we show that $F_\sigma$'s closure if all of $C_0(\mathbb{R}^d,\mathbb{R})$. Suppose we can construct arbitrarily good approximations of the ramp function $r(x)=\begin{cases} -1 & \mbox{if } x < -1 \\ x & \mbox{if } \left\vert x \right\vert \le 1 \\ 1 & \mbox{if } x > 1 \end{cases}$ then is can be combined to construct arbitrary compactly-supported continuous function to arbitrary precision. It remains to approximate the ramp function.
Activation function $\sigma$ can approximate ramp function. For example, ReLU can do it like $r(x)=\sigma(x) - \sigma(1-x)$.
Ramp function can make bump function, $r(\frac{x-a}{b-a})-r(\frac{d-x}{d-c})$.
This approximate dirac delta function, so it can also approximate any kind of function.
(We can manipulate function locally!!)
Reference
Approximation by Superpositions of a Sigmoidal Function (1989), by G. Cybenko
Multilayer feedforward networks are universal approximators (1989), by Kurt Hornik
Wikipidia - Universal approximation theorem
Stack Exchange - Where can I find the proof of the universal approximation theorem?
Old proof (use functional analysis):
https://mcneela.github.io/machine_learning/2017/03/21/Universal-Approximation-Theorem.html
https://numerics.ovgu.de/teaching/psnn/2122/handout_ahmet.pdf