PDE NN
[2/26]
There are 3 important steps in computational sciences:
(i) problem definition $\rightarrow$ define idealization model.
(ii) mathematical modeling $\rightarrow$ introduce simplifying assumption
(iii) computer simulation $\rightarrow$ numerical solution by FDM, FEM, FVM.
Conservation Laws: integral and differential forms
The governing equations of continuum mechanics representing the kinematic and mechanical behavior of general bodies are define to as "conservation laws"
The general principle behind the derivation of conservation laws is the rate of change of $u(x,t)$ within a volume $V$. plus the flux of $u$ through the boundary $A$ is equal to the rate of production of $u$ (called $s(u,x,t)$)
(integral form) $$\begin{align}\frac{\partial}{\partial t}\int_V u(x,t)dV+\int_A f(u)\cdot n dA-\int_V S(u,x,t) dV\end{align}$$ For a fixed volume, we can apply Gauss' theorem. $$\int_V \nabla\cdot fdV=\int_A f\cdot ndA$$ to obtain
$$\begin{align}\int_V \left(\frac{\partial u}{\partial t}+\nabla\cdot f(u)-S\right) dV=0\end{align}$$
$$\begin{align}\Rightarrow\frac{\partial u}{\partial t}+\nabla\cdot f(u)-S=0\end{align}$$ (strong form differential form)
$$\begin{align}\Rightarrow\int_V\left(\frac{\partial u}{\partial t}+\nabla\cdot f(u)-S\right)w(x)dV=0\end{align}$$ ($w(x)$ is weight function)
If this equation is satisfied for any weight function $w(t)$, then this eq. is equivalent to (3). Also, by the integration by parts (I.B.P.)
$$\begin{align}\int_V \left[ \left( \frac{\partial u}{\partial t}-S\right) w(x) -f(u)\cdot \nabla w(x)\right] dV+\int_A f\cdot nw(x) dA=0\end{align}$$ (weak formulation)
Model equation and their classification
For the moment, we will restrict ourselves to ID case. $$\Omega=\{(x,t):0\le x\le 1,\ 0\le t\le T\}$$
$$\begin{align}\mathcal{L}(u)=\frac{\partial u}{\partial t}+\frac{\partial}{\partial x}\left( au-b\frac{\partial u}{\partial x}\right) -\gamma u=s\end{align}$$ ($a$: convection $b$: diffusion $c$: reaction) with appropriate boundary conditions (B.C.) at $x=0,1$.
$\mathcal{L}(u)$: linear differential operator.
This equation can be recast in the form (3) with $f(u)=au-\frac{\partial u}{\partial x}$, $S(u)=s+\gamma u$.
It is linear if the coefficient $a,b,\gamma$ and $s$ are function of $x$ and $t$, and non-linear if any of them depends on the solution $u$.
Remark. Elliptic, parabolic, hyperbolic PDEs (2D case) $$\begin{align}a(x_1,x_2)\frac{\partial^2 u}{\partial x_1^2}+b(x_1,x_3)\frac{\partial^2 u}{\partial x_1\partial x_2}+c(x_1,x_2)\frac{\partial^2 u}{\partial x_2^2}+d(x_1,x_2)\frac{\partial u}{\partial x_1}+e(x_1,x_2)\frac{\partial u}{\partial x_2}+f(x_1,x_2)u=g(x_1,x_2)\end{align}$$ then, this PDE is called
elliptic if $b^2-4ac<0$ $\rightarrow\ u_{x_1x_1}+u_{x_2x_2}=0$
hyperbolic if $b^2-4ac>0$ $\rightarrow\, u_{tt}-u_{x_1x_2}-u_{x_2x_2}=0$
parabolic if $b^2-4ac=0$ $\rightarrow\ u_t-u_{x_1x_2}-u_{x_2x_2}=0$
[2/28]
$\bullet$ Elliptic Equations: Helmholtz equations
$$\begin{align}(bu_x)_x+\gamma u=0\end{align}$$
$\bullet$ Parabolic Equations: Heat equations
$$\begin{align}u_t-(bu_x)_x=0\end{align}$$
$\bullet$ Hyperbolic Equations: linear advection equations
$$\begin{align}u_t+au_x=0\end{align}$$
Solution looks like $$\begin{align}u(x,t)=u_0(x-at)\end{align}$$
Numerical Scheme
Here, we introduce F.D.M. and F.E.M. For simplicity, we consider the case of a function of one variable $u(x)$.
Given a set at points $x_i$, $1\le i\le N$, the numerical solution that we are seeking is represented by a discrete set of function value $\{u_1,\cdots,u_N\}$ that approximate $u$ at these points.
Finite difference method
Consider $$\begin{align} u_x|_i=u_x(x_i)=\lim_{\Delta x\rightarrow 0}\frac{u(x_i+\Delta x)-u(x_i)}{\Delta x}=\lim_{\Delta x\rightarrow 0}\frac{u(x_i)-u(x_i-\Delta x)}{\Delta x}\nonumber\\ =\lim_{\Delta x\rightarrow 0}\frac{u(x_i+\Delta x)-u(x_i-\Delta x)}{2\Delta x}\end{align}$$ If $\Delta x$ is small but finute, $$u_x|_i\approx \frac{u_{i+1}-u_i}{\Delta x}\approx \frac{u_i-u_{i-1}}{\Delta x}\approx \frac{u_{i+1}-u_{i-1}}{2\Delta x}$$ (each called 'forward', 'backword', 'central'.)
An approximation to $u_{i+1}$ using $n+1$ terms is given by $$\begin{align}u_{i+1}=u_i+u_x|_i\Delta x+u_{xx}|_i\frac{\Delta x^2}{2}+\cdots +\left. \frac{d^n u}{dx^n}\right|_i \frac{\Delta x^n}{n!}+\frac{d^{n+1}u}{dx^{n+1}}(x^*)\frac{\Delta x^{n+1}}{(n+1)!}\end{align}$$ (last term is remainder, $x_i\le x^*\le x_{i+1}$.)
Consider the case of the forward difference approximation and us (12) with $n=1$ to obtain $$\frac{u_{i+1}-u_i}{\Delta x}=u_x|_i+\frac{\Delta x}{2}u_{xx}(x^*)$$
$$\Rightarrow u_x|_i=\frac{u_{i+1}-u_i}{\Delta x}+\epsilon_i,\quad \epsilon_i=-\frac{\Delta x}{2}u_{xx}(x^*)$$ and $\epsilon_i$ is referred to as the truncation error and is defined as the difference between the exact value and its numberical approximation.
The order of a FDM approximation is defined as the power $p$ s.t. $\lim_{\Delta x\rightarrow 0}\left( \frac{\epsilon_i}{\Delta x^p}\right)=\gamma\ne 0$ where $\gamma$ is a finite value.
This is often written as $\epsilon_i=O(\Delta x^p)$.
(e.g.) The forward approx., we have $\epsilon_i=O(\Delta x)$, first order.
(backworkd): $$u_x|_i=\frac{u_i-u_{i-1}}{\Delta x}+\frac{\Delta x}{2}u_{xx}x(^*)\quad \Rightarrow \epsilon_i=O(\Delta x)$$
(central): $$u_x|_i=\frac{u_{i+1}-u_{i-1}}{2\Delta x}-\frac{\Delta x^2}{12}u_{xxx}x(^{**})\quad \Rightarrow \epsilon_i=O(\Delta x^2)$$
where $x_{i-1}\le x^*\le x_i$ and $x_{i-1}\le x^{**}\le x_{i+1}$
A second-oder accarate F.D. approximation can be written as $$u_x|_i\approx \frac{\alpha u_{i+1}+\beta u_i+\gamma u_{i-1}}{\Delta x}$$
Using Taylor around $x_i$, we can write $$\begin{align}u_{i+1}=u_i+\Delta x\, u_x|_i+\frac{\Delta x^2}{2}u_{xx}|_i+\frac{\Delta x^3}{6}u_{xxx}|_i+\cdots\end{align}$$ $$\begin{align}u_{i-1}=u_i-\Delta x\, u_x|_i+\frac{\Delta x^2}{2}u_{xx}|_i-\frac{\Delta x^3}{6}u_{xxx}|_i+\cdots\end{align}$$ $$\begin{align}\Rightarrow \frac{\alpha u_{i+1}+\beta u_i+\gamma u_{i-1}}{\Delta x}=(\alpha+\beta+\gamma)\frac{1}{\Delta x}u_i+(\alpha-\gamma)u_x|_i+(\alpha+\gamma)\frac{\Delta x}{2}u_{xx}|_i+(\alpha-\gamma)\frac{\Delta x^2}{6}u_{xxx}|_i+(\alpha+\gamma)\frac{\Delta x^3}{12}u_{xxxx}|_i+O(\Delta x^4)\end{align}$$
We need 3 indep. condition to calculate $\alpha,\beta,\gamma$. We need to consider all of these scenarios
(i) $u=$const $\Rightarrow$ l.h.s. of (15)=0 $\Rightarrow (\alpha+\beta+\gamma)=0$
(ii) $u=$linear $\Rightarrow$ we must have $\alpha-\gamma=1$ to match $u_x|_i$
(iii) wile the first two conditions are necessary, we are free to choose the 3rd condition. $\Rightarrow$ the coefficient $(\Delta x/2)u_{xx}|_i=0$ to improve the accuracy $\Rightarrow \alpha+\gamma=0$.
All in all, $\alpha=\frac{1}{2}$, $\beta=0$, $\gamma=-\frac{1}{2}$. $\Rightarrow$ recover the 2nd-order accurate central form.
How about highr-order derivatives?
$$\frac{\alpha u_{i+1}+\beta u_i+\gamma u_{i-1}}{\Delta x^2}=(\alpha+\beta+\gamma)\frac{1}{\Delta x^2}u_i+(\alpha-\gamma)\frac{1}{\Delta x}u_x|_i+(\alpha+\gamma)\frac{1}{2}u_{xx}|_i+(\alpha-\gamma)\frac{\Delta x}{6}u_{xxx}|_i+O(\Delta x^4)$$
$$\Rightarrow \alpha +\beta+\gamma=0,\ \alpha-\gamma=0,\ \alpha+\gamma=2$$ no $u,u_x$ term, the r.h.s. approximates the l.h.s. as $\Delta x\rightarrow 0$.
$$\Rightarrow \alpha=\gamma=1,\ \beta=-2$$
$$u_{xx}|_i=\frac{u_{i+1}-2u_i+u_{i-1}}{\Delta x^2}+O(\Delta x^2)$$
Finite difference solution of PDEs
Solve $u_{xx}=s(x)$ in the region $\Omega=\{x:0\le x\le 1\}$
$\Delta x=\frac{1}{N-1}$ or $x_i=i-\frac{1}{N-1}$
we consider B.C. $$u(0)=\alpha_1,\quad u(1)=\alpha_2$$
$$\frac{u_{i+1}-2u_i+u_{i-1}}{\Delta x^2}=s_i \Delta x^2,\quad i=2,\cdots, N-1$$ Then we can write $N-2$ eqs with 2 conditions $$\begin{pmatrix}1&0&0&0&\cdots& 0\\ 1&-2&1&0&\cdots& 0\\ 0&1&-2&1&\cdots &0\\ 0&0&1&-2&\cdot&0\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&0&\cdots&1\end{pmatrix}\begin{pmatrix}u_1\\ u_2\\ u_3\\ u_4\\ \vdots\\ u_N\end{pmatrix}=\begin{pmatrix} \alpha_1\\ \Delta x^2s_2\\ \Delta x^2s_3\\ \Delta x^3s_3\\ \vdots\\ \alpha_2\end{pmatrix}$$
[3/5]
Time integration
Consider $$u_t-bu_{xx}=s(x)\quad \mbox{in} \Omega=\{ (x,t):0\le x\le 1,\ 0\le t\le T\}$$ $$u(x,0)=u_0(x)$$ $$u(0,t)=\alpha_1(t),\quad u(1,t)=\alpha_2(t)$$
We first conside the "method of lines". We assign to our mesh a mesh a set of values that are functions of time $$u_i(t)=u(x_i,t)$$ $$\frac{du_i}{dt}=\frac{b}{\Delta x^2}\{ u_{i-1}(t)-2u_i(t)+u_{i+1}(t)\} +s_i,\quad i=2,\cdots,N-1$$ with $u_1=\alpha_1(t)$, $u_N=\alpha_2(t)$. Then, $$\frac{d}{dt}\begin{pmatrix}u_2\\ u_3\\ \vdots\\ u_{N-1}\end{pmatrix}=\frac{b}{\Delta x^2}\begin{pmatrix}-2&1&0&\cdots& 0\\ 1&-2&1&\cdots &0\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&-2\end{pmatrix}\begin{pmatrix}u_2\\ \vdots\\ u_{N-1}\end{pmatrix}+\begin{pmatrix}s_2+\frac{b\alpha_1(t)}{\Delta x^2}\\ s_3\\ \vdots\\ s_{N-2}\\ s_{N-1}+\frac{b\alpha_2(t)}{\Delta x^2}\end{pmatrix}$$
$$\Leftrightarrow \quad \frac{du}{dt}(t)=Au(t)+s(t)$$
semi-discrete form of method of lines.
Then, apply RK-type
For instance, $$\frac{u^{n+1}_i-u^n_i}{\Delta t}=\frac{b}{\Delta x^2}(u^n_{i-1}-2u^n_i+u^n_{i+1})$$ (explicit)
$$\frac{u^{n+1}_i-u^n_i}{\Delta t}=\frac{b}{\Delta x^2}(u^{n+1}_{i-1}-2u^{n+1}_i+u^n={n+1}_{i+1})$$ (implicit)
Finite element approximation
Consider the strong form of the PDE: $\mathcal{L}(u)=s$ then $$\int^1_0\mathcal{L}(u)w(k)dx=\int^1_0 s w(x)dx,$$ where $w(x)$ weight functioin.
$\Omega=\{x:0\le x\le 1\}$ into $N-1$ element $\Omega_i=\{x:x_{i-1}\le x\le x_i\}$ and $$u_h(x,t)=\sum^N_{i=1}u_i(t)N_i(x)$$ where $N_i$ is the basis.
In the Galerkin FEM method, we set $w(x)=N_i(x)$. Consider the B.C. as $u(0)=\alpha$, $u_x(1)=g$, then $$\int^1_0w(x)u_{xx}dx=\int^1_0w(x)s(x)dx,$$ where $\mathcal{L}=u_{xx}$ $$\Rightarrow -\int^1_0w_xu_xdx+w(1)u_x(1)-w(0)u_x(0)=\int^1_0w(x)s(x)dx$$ Set $u(x)\approx u_h(x)=\sum^N_{j=1}u_jN_j(x)$ and $w(x)=N_i(x)$, then $$-\int^1_0\frac{dN_i}{dx}(x)\sum^N_{j=1}u_j\frac{dN_j}{dx}(x)dx=\int^1_0N_i(x)s(x)dx$$ $\rightarrow$ we have $N-1$ eqs with $N-1$ unknowns.
Set $$N_i(x)=\begin{cases}\frac{x-x_{i-1}}{\Delta x_{i-1}}& x_{i-1}<x<x_i\\ \frac{x_{i+1}-x}{\Delta x_i},& x_i<x<x_{i+1}\end{cases}$$ where $\Delta x_{i-1}=x_i-x_{i-1}$, $\Delta x_i=x_{i+1}-x_i$ (It looks like mountains) $$\frac{dN_i}{dx}(x)=\begin{cases}\frac{1}{\Delta x_{i-1}}\\-\frac{1}{\Delta x_i}\end{cases}$$
$$\Rightarrow -\int^{x_i}_{x_{i-1}}\frac{dN_i}{dx}\left( u_{i-1}\frac{dN_{i-1}}{dx}+u_i\frac{dN_i}{dx}\right)-\int^{x_{i+1}}_{x_i}\frac{dN_i}{dx}\left( u_i\frac{dN_i}{dx}+u_{i+1}\frac{dN_{i+1}}{dx}\right)dx =\int^{x_i}_{x_{i-1}}N_isdx+\int^{x_{i+1}}_{x_i}N_isdx$$ (when $2\le i\le N-1$) The l.h.s. is written as $$\frac{-u_i-i_{i-1}}{\Delta x_{i-1}}+\frac{u_{i+1}-u_i}{\Delta x_i}=\frac{\Delta x_{i-1}+\Delta x_i}{2}s_i$$ Assuming $\Delta x_{i-1}=\Delta x_i=\Delta x$ uniform grid, then $$\frac{u_{i+1}-2u_i+u_{i-1}}{(\Delta x)^2}=\Delta x s_i$$ which is same as FDM approximation.
Universal Approximation Theorem
Theorem. Let $\sigma$ be an arbitrary continuous sigmoid function. Then, the finite sum of the form $$G(x)=\sum^N_{j=1}\alpha_j\sigma(w^T_jx+\theta_j),\quad w_j\in\mathbb{R}^n,\ \alpha_j,\theta_j\in\mathbb{R}$$ are dense in $C(I_n)$. ($I_n=[0,1]^n$)
To show this we need sure definitions and lemmas.
Definition. A function $\sigma:\mathbb{R}\rightarrow [0,1]$ is called sigmoid if $\sigma\in C(\mathbb{R})$ and $\lim_{x\rightarrow -\infty}\sigma(x)=0$ $\lim_{x\rightarrow \infty}\sigma(x)=1$
Remark. A measure $\mu$ can be regarded as a system of beliefs used to assess information, so $d\mu(x)=\mu(dx)$ represents an evaluetion of the information cast into $x$. Consequently, the integral $\in f(x)d\mu(x)$ represents the evaluation of the function $f(x)$ under the system of belief $\mu$.
Definition. Let $\mu\in M(I_r)$ (Baire measure). A function $\sigma$ is called discriminatory for the measure $\mu$ if $$\sigma(w^Tx+\theta)d\mu(x)=0,\quad \forall w\in\mathbb{R}^n,\ \forall \theta\in\mathbb{R}$$ $$\Rightarrow \mu=0$$
Note. If the evalution of the neuron output $\sigma(w^Tx+\theta)$ over all possible inputs $x$ under $\mu$ vanishes for any $\theta$ and $w$, then $\mu$ must vanish. We now state two lemmas with proof:
Lemma 1. Any continuous sigmoid function is dicriminatory for all measrue $\mu\in M(I_n)$
Lemma 2. Let $U$ be a linear subspace of a normed linaer space $X$ and consider $x_0\in X$ s.t. $\mbox{dist}(x_0,U)\ge \delta$ for some $\delta \le 0$, i.e. $$\lVert x_0-u\rVert \ge \delta,\quad \forall u\in U.$$ Then, $\exists$ a bdd linear functional $L$ on $X$ s.t.
(i) $\lVert L\rVert \le 1$
(ii) $L(u)=0,\quad \forall u\in U$ i.e. $L|_U=0$
(iii) $L(x_0)=\delta$.
pf) Skip.
The Lemma 2 can be reformulated using the lauguage of dense subspace. Recall that a subspace $U$ of $X$ is dense in $X$ with respect to the norm $\lVert\cdot\rVert$ if for any element $x\in X$, there are elements $u\in U$ as close as possible to $x$. Equivalently, $\forall x\in X$, $\exists$ a seq. $u_n\in U$ s.t., $u_n\rightarrow x$ as $n\rightarrow \infty$; that is, $\forall x\in X$, $\forall \epsilon\ge 0$, $\forall \epsilon u\in U$, s.t., $\lVert u-x\rVert <\epsilon$. Hence, the subspace $U$ is not dense in $X$ can be described as: there are elements $x_0\in X$ s.t., no element $n\in U$ are close enough to $x_0$: that is $\epsilon \delta >0$ s.t. $\forall u\in U$, we have $\lVert u-x_0\rVert\ge \delta$.
Lemma 3 (reformulation of Lemma). Let $U$ be a linaer, non-dense subset of a normed linear space $X$. Then, $\exists$ a bounded linear functional $L$ on $X$ s.t. $$L\ne 0\mbox{ on }X\mbox{ and } L|_U=0.$$$
Lemma 4. Let $U$ be a linear, non-dense subspace of $C( I_n)$. Then, $\exists$ a measure $\mu\in M(I_n)$ s.t. $$\int_{I_n}h\, d\mu=0,\quad \forall h\in U.$$
pf) Considering $X=C(I_n)$ in Lemma 3, $\exists$ a bdd linear functional $$L:C(I_n)\rightarrow \mathbb{R}$$ s.t., $L\ne 0$ on $C(I_n)$ and $L|_U=0$. Apply the represenatation theorem of linear bdd functionals on $C(I_n)$, $\exists$ a measure $\mu\in M(I_n)$ s.t. $$L(f)=\int_{I_n}fd\mu,\quad \forall f\in C(I_n).$$
Theorem (repsentation theorem). $F$: bdd, linear function on $C(k)\Rightarrow \exists \mu$ s.t. $F(f)=\int_K f(x)d\mu(x)$ when $f\in C(k)$.
In particular, for any $h\in U$, we obtain $$L(h)=\int_{I_n}hd\mu=0,$$ which is the desired results.
Remark. $L\ne 0$ $\Rightarrow \mu\nu0$.
Lemma 5. Let $\sigma$ be any continuous discriminatory function. Then, the finite sum of the form $$G(x)=\sum^N_{j=1}\alpha_j \sigma(w^T_jx+\theta_j),\quad w_j\in\mathbb{R}^n,\ \alpha_j,\theta_j\in\mathbb{R}$$ are dense in $C(I_n)$.
pf) Since $\sigma$ is continuous, if follow that $$U=\{G:G(x)=\sum^N_{j=1}\alpha_j\sigma(w^T_jx+\theta_j)\}$$ is a linear subspace of $C(I_n)$. We continue the proof by contradiction. Assume that $U$ is not dense in $C(I_n)$. By Lemma 4, $\exists$ a measure $\mu\in M(I_n)$ s.t. $$\int_{I_n}hd\mu=0\quad \forall h\in U.$$ This also can be written as $$\sum^N_{j=1}\alpha_j\int_{I_n}\sigma(w^T_jx+\theta_j)d\mu=0,\quad \forall w_j\in\mathbb{R}^n,\ \alpha_j,\theta_j\in\mathbb{R}.$$ By choosing convenient coefficient, $\alpha_j$, we obtain $$\int_{I_n}\sigma(w^Tx+\theta)d\mu=0,\quad \forall w\in\mathbb{R}^n,\ \theta\in \mathbb{R}.$$ Using that $\sigma$ is discriminatory, yield $\mu=0$, which is a contradiction.
This result states that $\forall f\in C(I_n)$ and $\forall \epsilon>0$, $\exists G(x)$ of the previous form s.t., $|G(k)-f(x)|<\epsilon$.
Theorem (Cybewko, 1989), Let $\sigma$ be an arbitrary continuous sigmoid function. Then, $$G(x)=\sum^N_{j=1}\alpha_j\sigma(w^T_jx+\theta_j)$$ where $w_j\in\mathbb{R}^n$, $\alpha_j,\theta_j\in\mathbb{R}$ are dense in $C(I_n)$.
pf) By Lemma 1, any continuous sigmoid function is discriminatory. Then, applying Lemma 5 yields the desired result.
Neural Network
1. Logistic regression
Binary classification.
Given $x$ ($x\in\mathbb{R}^{n_x}$), want $\hat{y}=P(y=1|x)$
Parameter: $w\in\mathbb{R}^{n_x}$, $b\in\mathbb{R}$.
output $\hat{y}=w^Tx+b$ ($z=w^Tx+b$) but we want $0\le \hat{y}\le 1$ here
$\Rightarrow \hat{y}=\sigma(w^Tx+b)$, $\sigma$: sigmoid function $\sigma(z)=\frac{1}{1+e^{-z}}$.
2. Gradient descent
[3/7]
[3/11&13]
Homework 1
Problem 1 (Universal Approximation Theorem)
Let's consider learning square integrable functions $f\in L^2(I_n)$. Let's start with the definition of the discriminatory:
Definition. The activation function $\sigma$ is called discriminatory in $L^2$ sense if:
(i) $0\le \sigma \le 1$;
(ii) if $g\in L^2(I_n)$ such that $$\int_{I_n}\sigma(\omega^Tx+\theta)g(x)dx=0,\quad \forall \omega\in\mathbb{R}^n,\ \theta\in\mathbb{R},$$ then $g=0$ almost everywhere.
Prove that following lemma: Let $\sigma$ be discriminatory function in $L^2$ sense. Then, the finite sums of the form $$G(x)=\sum^N_{j=1}\alpha_j\sigma(\omega^T_jx+\theta_j),\quad \forall \omega_j\in\mathbb{R}^n,\ \theta_j,\alpha_j\in\mathbb{R}$$ are dense in $L^2(I_n)$.
Sol)
If $G(x)$ is not dense, then there exists function $f\in L^2(I_n)$ such that for all finite sums $G(x)$, $$|f-G|_{L^2}\geq \epsilon$$ for some $\epsilon>0$.
Then since $L^2(I_n)$ is a Hilbert space, the Riesz Representation Theorem tells us that for ebery continuous linear functional $L$ on $L^2(I_n)$, there exists a unique function $h\in L^2(I_n)$ such that $L(g)=\langle g,h\rangle$ for all $g\in L^2(I_n)$.
The function $f$ is not approximated by any $G(x)$, must be orthogonal to the subspace spanned by all possible finite sums of $\sigma(w^Tx+\theta)$, which is a contradiction if we can show this subspace is dense.
Let's define a linear functional $L$ by $L(G)=\langle f,G\rangle$ for all $G$ of the form given in the Lemma. By our assumption, $L(G)$ cannot be zero for all $G$, otherwise $f$ would be approximated by such sums $G(x)$.
Since $\sigma$ is discriminatory, for $L(G)=0$ to hold for all $G$, it must be that $f=0$ almost everywhere, which is a contradiction to our assumption of $f$.
Finally, the finite sums $G(x)$ must be dense in $L^2(I_n)$ because of contradiction.
Problem 2
We have the following lemma without proof:
Lemma. Let $g\in L^2(I_n)$ such that $\int_\mathcal{H}g(x)dx=0$, for any half-space $\mathcal{H}:=\{x:\omega^Tx+\theta>0\}\cup I_n$. Then $g=0$ almost everywhere.
We note that by choosing a convenient value for the parameter $\theta$, the upper half-space may become the entire hypercube. Then, $g$ considered before has zero integral $\int_{I_n}g(x)dx=0$. Now, show that any function $g\in L^2(I_n)$ can be approximated by the output of a one-hidden layer perceptron where the activation function $\sigma(x)$ is the Heaviside step function: $$\sigma(x)=\begin{cases}1,&x\ge 0,\\ 0,& x<0.\end{cases}$$
Sol)
Problem 3: Least-squares derivatives.
Let $X_1,\cdots,X_N\in\mathbb{R}^\rho$ and $Y_1,\cdots,Y_N\in\mathbb{R}$. Define $$X=\begin{bmatrix}X^T_1\\ \vdots\\ X^T_N\end{bmatrix}\in\mathbb{R}^{N\times p},\quad Y=\begin{bmatrix}Y_1\\ \vdots \\ Y_N\end{bmatrix}\in\mathbb{R}^N.$$ Let $$l_i(\theta)=\frac{1}{2}(X^T_i\theta-Y_i)^2\ \mbox{for }i=1,\cdots,N,\quad \mathcal{L}(\theta)=\frac{1}{2}\lVert X\theta-Y\rVert^2.$$ Show (a) $\nabla_\theta l_i(\theta)=(X^T\theta-Y_i)X_i$ and (b) $\nabla_\theta\mathcal{L}(\theta)=X^T(X\theta-Y)$.
Hint. For part (a), start by computing $\frac{\partial}{\partial\theta_j}l_i(\theta)$. For part (b), use the fact that $$Mv=\sum^N_{i=1}M_{:i}v_i\in\mathbb{R}^p$$ for any $M\in\mathbb{R}^{p\times N}$, $v\in\mathbb{R}^N$, where $M_{:i}$ is the $i$th column of $M$ for $i=1,\cdots,N$.
Sol)
(a)
$l_i(\theta)=\frac{1}{2}$ where $u=X^T_i\theta-Y_i$.
The derivative of $u^2$ with respect to $u$ is $2u$, and the derivative of $u$ with respect to $\theta$ is $X_i$ by matrix differentiation.
Apply the chain rule: $$\frac{\partial}{\partial \theta}l_i(\theta)=\frac{\partial l_i}{\partial u}\cdot \frac{\partial u}{\partial u_i}=2\cdot \frac{1}{2}(X^T_i\theta-Y_i)\cdot X_i=(X_i^T-Y_i)X_i.$$ Thus, $\nabla_{\theta}l_i(\theta)=(X_i^T\theta-Y_i)X_i$.
(b)
Using the fact that for any matrix $M\in\mathbb{R}^{p\times N}$ and vector $v\in\mathbb{R}^N$, $Mv=\sum_{i=1}^NM_{\cdot i}v_i$, where $M_{\cdot i}$ is the $i$the column of $M$.
We can write $\mathcal{L}(\theta)$ as sum of square terms: $$\mathcal{L}(\theta)=\frac{1}{2}\sum^N_{i=1}(X^T_i\theta-Y_i)^2.$$
The gradient of $\mathcal{L}(\theta)$ with respect to $\theta$ is the sum of the gradients of each squared term: $$\nabla_\theta\mathcal{L}(\theta)=\sum^N_{i=1}\nabla_\theta\left( \frac{1}{2}(X^T_i\theta-Y_i)^2\right)=\sum^N_{i=1}(X^T_i\theta-Y_i)X_i.$$
Expressing this in matrix form, we get: $$\nabla_\theta\mathcal{L}(\theta)=X^T(X\theta-Y).$$
Problem 4
The following is famous Stone-Weierstrass Theorem:
(Stone-Weierstrass Theorem). Let $K$ be a compact set in $\mathbb{R}^n$ and $\mathcal{A}$ an algebra of continuous real-valued functions on $K$ satisfying the properties:
(i) $\mathcal{A}$ separates the points of $k$, i.e. if for any distinct $x,y\in K$, there is an $f$ in $\mathcal{A}$ such that $f(x)\ne f(y)$;
(ii) $\mathcal{A}$ contains the constant functions.
Then, $\mathcal{A}$ is a dense subset of $C(K)$.
Using this theorem without proof, show the following:
Let $\varphi:\mathbb{R}\rightarrow\mathbb{R}$ be any continuous nonconstant activation function. Then the following function $$G(x)=\sum^M_{k=1}\beta_k\prod^{N_k}_{j=1}\varphi(\omega^T_{jk}+\theta_{jk}),\\ \omega_{jk}\in\mathbb{R}^n,\ \beta_j,\theta_{jk}\in\mathbb{R},\ x\in I_n,\ M,N_k=1,\cdots,$$ are dense in $C(I_n)$.
Hint: Consider the set $\mathcal{U}$ given by the finite sums of products of the type (1). Without proof, we can say $\mathcal{U}$ an algebra of real continous functions on $I_n$.
Sol)
Since $\varphi$ is nonconstant, for any two distince points $x,y\in I_n$, we can find parameters $w,\theta$ such that $\varphi(w^Tx+\theta)\neq \varphi(w^Ty+\theta)$. Therefore, functions of the form $G(x)$ can separate points in $I_n$.
The constant function can be represented as $G(x)=c$ by choosing $\beta_k=c$, $N_k=1$, and $\varphi(w_{jk}^Tx+\theta_{jk})=1$ since $\varphi$ is continuous and nonconstant, so there exists some input that maps to 1.
If $G_1(x)$ and $G_2(x)$ are in $\mathcal{U}$, then their sum $G_1(x)+G_2(x)$ is also in $\mathcal{U}$, and so is their product $G_1(x)\cdot G_2(x)$, since the product of sums of products is again a sum of products.
For any scalar $c$ and function $G(x)\in \mathcal{U}$, the function $cG(x)$ is also in $\mathcal{U}$.
Then $\mathcal{U}$ satisfies the conditions of the Stone-Weirstrass theorem, it is dense in $C(I_n)$.
[Homework 2]
Problem 2
(a) Consider the 2-layer neural network $$f_\theta(x)=u^T\sigma(ax+b)=\sum^p_{j=1}u_j\sigma(a_jx+b_j),$$ where $x\in \mathbb{R}$ and $a,b,u\in \mathbb{R}^p$. Let $\sigma$ be the ReLU activation function. Using the data $X_1,\cdots,X_N\in\mathbb{R}$ and labels $Y_1,\cdots,Y_N\in\mathcal{Y}$, we train the neural network by solving $$\mbox{minimize}_{\theta\in\mathbb{R}^{3p}} \quad \frac{1}{N}\sum^N_{i=1}l(f_\theta(X_i),Y_i)$$ with SGD. We assume $l(x,y)$ is differentiable in $x$. Assume the $j$-th ReLU output is "dead" at initialization in the sense that $a^0_jX_i+b^0_j<0$ for all $i=1,\cdots,N$. Show that $j$-th ReLU output remains dead throughout the training.
(b) The leaky ReLU activation function is defined as $$\sigma(z)=\begin{cases}z & \mbox{for}\ z\ge 0\\ \alpha z& \mbox{otherwize,}\end{cases}$$ where $\alpha$ is a fized parameter ($\alpha$ is not trained) often set to $\alpha=0.01$. Show that leaky ReLU, instead of ReLU, is used in the previous problem, the gradient no longer exactly vanishes.
Sol)
This is well known as "dead ReLU" problem.
(a) The gradient of ReLU is $$\nabla \sigma=\begin{cases}0& (x<0)\\ 1 &(x>0)\end{cases}$$ Since the $j$-th ReLU output is "dead" ($a_jx_i+b_j\le 0$ for all $i$, and hence the gradient w.r.t. $a_j$ and $b_j$ will be $0$). During training with Stochastic Gradient Descent, parameters are updated based on the gradient of the loss function w.r.t. the parameters. Since the output of the dead ReLU is $0$ for all inputs, the gradient of the loss function w.r.t. $a_j$ and $b_j$ will remain 0 throughout training. Thus, $a_j$ and $b_j$ will not update during training, which means that the $j$-th ReLU output remains dead.
(b) The gradient of leaky ReLu is $$\nabla \sigma=\begin{cases}\alpha& (x<0)\\ 1 &(x>0)\end{cases}$$ Like before, $j$-th leaky ReLU output is not zero, but $\alpha\cdot (a_jx_i+b_j)$. This means that during the training with SGD, these parameters will still be updated even if the initial output for these units is less than zero.
Problem 3
Let $\sigma:\mathbb{R}\rightarrow \mathbb{R}$ be a differentiable activation function and consider the following multi-layer perceptron $$y_L=A_L y_{L-1}+b_L\\ y_{L-1}=\sigma(A_{L-1}y_{L-2}+b_{L-1})\\ \vdots\\ y_2=\sigma(A_2y_1+b_2)\\ y_1=\sigma(A_1x+b_1),$$ where $x\in \mathbb{R}^{n_0}$, $A_l\in \mathbb{R}^{n_l\times n_{l-1}}$, $b_l\in \mathbb{R}^{n_l}$, and $n_L=1$. (To clarify, $\sigma$ is applied element-wise.) For notational convenience, define $y_0=x$.
Show $$\frac{\partial y_L}{\partial b_L}=1,\quad \frac{\partial y_L}{\partial y_{L-1}}=A_L,\\ \frac{\partial y_l}{\partial b_l}=\mbox{diag}(\sigma'(A_ly_{l-1}+b_l)),\quad \mbox{for } l=1,\cdots, L-1\\ \frac{\partial y_l}{\partial y_{l-1}}=\mbox{diag}(\sigma'(A_ly_{l-1}+b_l))A_l,\quad \mbox{for }l=2,\cdots,L-1,$$ where $\frac{\partial y_l}{\partial b_l}\in\mathbb{R}^{n_l\times n_l}$ and $\frac{\partial y_l}{\partial y_{l-1}}\in\mathbb{R}^{n_l\times n_{l-1}}$ are Jacobian matrices. (For any $v\in\mathbb{R}^k$, we define $\mbox{diag}(v)$ to be the $k\times k$ diagonal matrix with $v_1,\cdots,v_k$ as its diagonal entries.) hint: compute first $$\left( \frac{\partial y_l}{\partial b_l}\right)_{ij}=\frac{\partial (y_l)_i}{\partial (b_l)_j},\quad \left(\frac{\partial y_l}{\partial y_{l-1}}\right)_{ij}=\frac{\partial (y_l)_i}{\partial (y_{l-1})_j}.$$
Sol)
From $$y_L=A_L y_{L-1}+b_L$$, we can get $$\frac{\partial y_L}{\partial b_L}=1,\quad \frac{\partial y_L}{\partial y_{L-1}}=A_L$$ Next, $$\left( \frac{\partial y_l}{\partial b_l}\right)_{ij}=\frac{\partial (y_l)_i}{\partial (b_l)_j}=\frac{\partial}{\partial (b_l)_j}\sigma(A_l y_{l-1}+b_l)=\delta^i_j \sigma'(A_ly_{l-1}+b_l)$$ which is same as "diag". Finally, $$\left(\frac{\partial y_l}{\partial y_{l-1}}\right)_{ij}=\frac{\partial (y_l)_i}{\partial (y_{l-1})_j}=\frac{\partial}{\partial (y_{l-1})_j}\sigma(A_ly_{l-1}+b_l)_i=\delta^i_k\sigma'(A_ly_{l-1}+b_l)(A_l)_{kj}$$ which is same as "diag" times $A_l$.