Dynamic Programming (Bellman's PDE)



 Let's consider $$\begin{cases} \dot{\mathbf{x}}(s)=\mathbf{f}(\mathbf{x}(s),\mathbf{\alpha}(s))& (t\le s \le T)\\ \mathbf{x}(t)=x\end{cases}$$ with $$P_{x,t}[\mathbf{\alpha}(\cdot)]=\int_t^T r(\mathbf{x}(s),\mathbf{\alpha}(s))ds + g(\mathbf{x}(T))$$


Consider this problem for all choices of starting times $0\le t \le T$ and all initial points $x\in \mathbb{R}^n$.

Definition: For $x\in \mathbb{R}^n$, $0\le t \le T$, define the value function $v(x, t)$ to be the greatest payoff possible if we start at $x\in \mathbb{R}^n$ at time $t$. In other words, $$v(x, t) := \sup_{\mathbf{\alpha}(\cdot)\in \mathcal{A}} P_{x,t} [\mathbf{\alpha}(\cdot)]\quad (x \in \mathbb{R}^n,\ 0\le t \le T)$$ Notice then that $$v(x,T)=g(x)$$

Definition: The PDE Hamiltonian is $$H(x,p):= \max_{a\in A} H(x,p,a)=\max_{a\in A} \{ \mathbf{f}(x,a)\cdot p+ r(x,a)\}$$ where $x, p\in \mathbb{R}^n$.
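As a quick sanity check of this definition (toy data of my own, matching Exercise 1(a) below): with scalar dynamics $f(x,a)=2a$, running payoff $r\equiv 0$, and control set $A=[-1,1]$, the maximum is attained at $a=\mathrm{sign}(p)$ and $H(x,p)=2|p|$. A brute-force sketch over a discretized control set:

```python
import numpy as np

# Toy data (my own choice): f(x, a) = 2a, r = 0, A = [-1, 1];
# then H(x, p) = max_{|a|<=1} 2*a*p = 2*|p|.
def hamiltonian(x, p, f, r, a_grid):
    """Brute-force H(x,p) = max_{a in A} { f(x,a)*p + r(x,a) } over a grid on A."""
    return max(f(x, a) * p + r(x, a) for a in a_grid)

f = lambda x, a: 2.0 * a
r = lambda x, a: 0.0
a_grid = np.linspace(-1.0, 1.0, 201)   # includes the extreme points -1 and 1

for p in (-3.0, 0.5, 2.0):
    assert abs(hamiltonian(0.0, p, f, r, a_grid) - 2.0 * abs(p)) < 1e-12
```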

Theorem (Hamilton-Jacobi-Bellman Equation): Assume that the value function $v$ is a $C^1$ function of the variables $(x, t)$. Then $v$ solves the nonlinear partial differential equation $$v_t(x, t) + H(x,\nabla_x v)=0\quad (x\in \mathbb{R}^n,\ 0\le t < T)$$ with the terminal condition $$v(x,T)=g(x)\quad (x\in \mathbb{R}^n)$$

Proof)
1. Let $x\in\mathbb{R}^n$, $0\le t < T$ and let $h>0$ be given. As always $$\mathcal{A}=\{\mathbf{\alpha}(\cdot): [0,\infty)\rightarrow A\mbox{ measurable}\}$$
Pick any parameter $a\in A$ and use the constant control $$\mathbf{\alpha}(\cdot)\equiv a$$ for times $t\le s \le t+h$. The dynamics then arrive at the point $\mathbf{x}(t+h)$, where $t+h<T$. Suppose now that at time $t+h$ we switch to an optimal control and use it for the remaining times $t+h\le s \le T$.
What is the payoff of this procedure? Now for times $t\le s \le t+h$, we have $$\begin{cases} \dot{\mathbf{x}}(s)=\mathbf{f}(\mathbf{x}(s),a)\\ \mathbf{x}(t)=x\end{cases}$$ The payoff for this time period is $\int_t^{t+h} r(\mathbf{x}(s),a)ds$. Furthermore, the payoff incurred from time $t+h$ to $T$ is $v(\mathbf{x}(t+h),t+h)$, according to the definition of the value function $v$. Hence the total payoff is $$\int_t^{t+h} r(\mathbf{x}(s),a)ds + v(\mathbf{x}(t+h),t+h)$$ But the greatest payoff possible starting from $(x,t)$ is $v(x, t)$. Therefore $$v(x,t)\ge \int_t^{t+h} r(\mathbf{x}(s),a)ds + v(\mathbf{x}(t+h),t+h)$$

2. We now want to convert this inequality into a differential form. So we rearrange the formula above and divide by $h>0$: $$\frac{v(\mathbf{x}(t+h),t+h)-v(x,t)}{h}+\frac{1}{h}\int_t^{t+h} r(\mathbf{x}(s),a)ds \le 0$$ Let $h\rightarrow 0$: $$v_t(x,t)+\nabla_x v(\mathbf{x}(t),t)\cdot \dot{\mathbf{x}}(t)+r(\mathbf{x}(t),a)\le 0$$ But $\mathbf{x}(\cdot)$ solves the ODE $$\begin{cases} \dot{\mathbf{x}}(s)=\mathbf{f}(\mathbf{x}(s),a)& (t\le s \le t+h)\\ \mathbf{x}(t)=x\end{cases}$$ Employ this above to discover: $$v_t(x,t)+\mathbf{f}(x,a)\cdot \nabla_x v(x,t) + r(x,a)\le 0$$ This inequality holds for all control parameters $a\in A$, and consequently $$\max_{a\in A} \{ v_t(x,t) + \mathbf{f}(x,a)\cdot \nabla_x v(x,t) + r(x,a)\}\le 0$$

3. We next demonstrate that in fact the maximum above equals zero. To see this, suppose $\mathbf{\alpha}^*(\cdot)$, $\mathbf{x}^*(\cdot)$ were optimal for the problem above. Let us utilize the optimal control $\mathbf{\alpha}^*(\cdot)$ for $t\le s \le t+h$. The payoff is $$\int_t^{t+h} r(\mathbf{x}^*(s),\mathbf{\alpha}^*(s))ds$$ and the remaining payoff is $v(\mathbf{x}^*(t+h),t+h)$. By optimality, the total payoff equals the value: $$\int_t^{t+h} r(\mathbf{x}^*(s),\mathbf{\alpha}^*(s))ds + v(\mathbf{x}^*(t+h),t+h)=v(x,t)$$ Rearrange and divide by $h$: $$\frac{v(\mathbf{x}^*(t+h),t+h)-v(x,t)}{h} + \frac{1}{h}\int_t^{t+h} r(\mathbf{x}^*(s), \mathbf{\alpha}^*(s))ds=0$$ Let $h\rightarrow 0$ and suppose $\mathbf{\alpha}^*(t)=a^*\in A$. Then $$v_t(x,t)+\nabla_x v(x,t)\cdot \dot{\mathbf{x}}^*(t)+r(x,a^*)=0$$ and therefore $$v_t(x,t)+\mathbf{f}(x,a^*)\cdot \nabla_x v(x,t) + r(x, a^*)=0$$ for some parameter value $a^*\in A$. This proves the HJB equation.


The Dynamic Programming Method

1. Solve the Hamilton-Jacobi-Bellman equation, and thereby compute the value function $v$.
2. Use the value function $v$ and the Hamilton-Jacobi-Bellman PDE to design an optimal feedback control $\mathbf{\alpha}^*(\cdot)$, as follows. Define for each point $x\in \mathbb{R}^n$ and each time $0\le t \le T$, $$\mathbf{\alpha}(x, t) = a\in A$$ to be a parameter value where the maximum in HJB is attained. In other words, we select $\mathbf{\alpha}(x,t)$ so that $$v_t(x, t)+\mathbf{f}(x, \mathbf{\alpha}(x,t))\cdot \nabla_x v(x, t) + r(x, \mathbf{\alpha}(x, t))=0$$
3. Solve the following ODE, assuming $\mathbf{\alpha}(\cdot,t)$ is sufficiently regular to let us do so: $$\begin{cases}\dot{\mathbf{x}}^*(s)=\mathbf{f}(\mathbf{x}^*(s),\mathbf{\alpha}(\mathbf{x}^*(s),s))& (t\le s \le T)\\ \mathbf{x}^*(t)=x\end{cases}$$
Finally, define the feedback control $$\mathbf{\alpha}^*(s):=\mathbf{\alpha}(\mathbf{x}^*(s),s)$$
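The dynamic programming principle behind these steps can also be run directly on a grid. A minimal numerical sketch (toy problem of my own choosing, assumed for illustration): take $\dot{x}=a$, $|a|\le 1$, $r\equiv 0$, $g(x)=-x^2$, $T=1$, whose value function is $v(x,t)=-\big(\max(|x|-(1-t),0)\big)^2$. Backward induction uses the update $v(x,t)\approx\max_{a}\{r(x,a)h+v(x+\mathbf{f}(x,a)h,\,t+h)\}$:

```python
import numpy as np

# Toy problem (my own choice, for illustration): dot(x) = a, |a| <= 1,
# r = 0, g(x) = -x^2 on [0, 1].  Exact value: v(x,t) = -max(|x| - (1-t), 0)^2.
T, nt = 1.0, 200
h = T / nt
xs = np.linspace(-3.0, 3.0, 601)
actions = (-1.0, 0.0, 1.0)

v = -xs**2                       # terminal condition v(x, T) = g(x)
for _ in range(nt):              # march backward in time from T to 0
    # dynamic programming update: v(x, t) = max_a [ r*h + v(x + a*h, t + h) ]
    v = np.max([np.interp(xs + a * h, xs, v) for a in actions], axis=0)

exact = -np.maximum(np.abs(xs) - T, 0.0) ** 2    # exact v(x, 0)
assert np.max(np.abs(v - exact)) < 0.05
```

The maximizing action at each grid point is the discrete analogue of the feedback law $\mathbf{\alpha}(x,t)$ from step 2.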


Theorem (Verification of Optimality): The control $\mathbf{\alpha}^*(\cdot)$ defined by the construction $\mathbf{\alpha}^*(s):=\mathbf{\alpha}(\mathbf{x}^*(s),s)$ is optimal. 
Proof)

The method of Characteristics: Assume $H:\mathbb{R}^n\times \mathbb{R}^n\rightarrow \mathbb{R}$ and consider this initial-value problem for the Hamilton-Jacobi equation: $$ \begin{cases}u_t(x,t)+H(x,\nabla_x u(x,t))=0 & (x\in \mathbb{R}^n,0<t<T)\\ u(x,0)=g(x)\end{cases}$$ 

A basic idea in PDE theory is to introduce some ordinary differential equations,  the solution of which lets us compute the solution $u$. In particular, we want to find  a curve $\mathbf{x}(\cdot)$ along which we can, in principle at least, compute $u(x,t)$.


Theorem (Costates and Gradients): Assume $\mathbf{\alpha}^*(\cdot)$, $\mathbf{x}^*(\cdot)$ solve the control problem. If the value function $v$ is $C^2$, then the costate $\mathbf{p}^*(\cdot)$ occurring in the Maximum Principle is given by $$\mathbf{p}^*(s)=\nabla_x v(\mathbf{x}^*(s),s)\quad (t\le s \le T)$$
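This relation can be checked numerically on Exercise 3 below (the formulas are taken from that exercise's solution; the integration scheme and sample numbers are my own): there $\mathbf{f}(x,a)=xa$, $r=-(xa)^2$, $g=-x^2$, $v(x,t)=-x^2/(2-t)$, and $\alpha^*(s)=1/(s-2)$. The costate, integrated backward from $\mathbf{p}^*(T)=\nabla g(\mathbf{x}^*(T))$, should match $\nabla_x v$ at the start of the trajectory:

```python
import numpy as np

# Setup from Exercise 3 below: f(x,a) = x*a, r = -(x*a)^2, g(x) = -x^2,
# value v(x,t) = -x^2/(2-t), optimal control a*(s) = 1/(s-2).
# Costate ODE (Maximum Principle, payoff form):
#   pdot = -H_x = -(p*a - 2*x*a^2),   p*(T) = grad g(x*(T)) = -2*x*(T)
t0, T, x0 = 0.0, 1.0, 1.0
n = 10_000
dt = (T - t0) / n

xs = np.empty(n + 1)              # forward Euler for the optimal state
xs[0] = x0
for i in range(n):
    s = t0 + i * dt
    xs[i + 1] = xs[i] + xs[i] * (1.0 / (s - 2.0)) * dt

p = -2.0 * xs[-1]                 # terminal costate p*(T)
for i in range(n, 0, -1):         # backward Euler for the costate
    s = t0 + i * dt
    a = 1.0 / (s - 2.0)
    p += (p * a - 2.0 * xs[i] * a * a) * dt   # step p_i -> p_{i-1}

grad_v = -2.0 * x0 / (2.0 - t0)   # grad_x v(x0, t0)
assert abs(p - grad_v) < 1e-6
```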


Exercises

1. This problem consists of two questions.
(a) Determine which of the following partial differential equations (i)-(iv) corresponds to the following optimal control problem $$\min_{u} x(1)^2\quad \mbox{s.t.}\quad \begin{cases} \dot{x}=2u,& x(0)=x_0\\ \left| u \right| \le 1\end{cases}$$
(i) $-V_t=-2V_x \mbox{sign} (V_x),\ V(1,x)=x^2$
(ii) $-V_t=-2V_x \mbox{sign} (V_x),\ V(1,x)=2x$
(iii) $-V_t=-2V_x,\ V(1,x)=x^2$
(iv) $-V_t=-\sin(V_x)(1+2V_x),\ V(1,x)=x^2$
(b) Consider the optimal control problem $$J(x_0)=\min_u \int_0^1 f_{01}(x,u)dt+\int_1^2 f_{02}(x,u)dt$$
$$\mbox{s.t.}\quad \begin{cases}\dot{x}=f_1(x,u),& 0\le t \le 1, & x(0)=x_0\\ \dot{x}=f_2(x,u),& 1\le t \le 2 \end{cases}$$
Below are two attempts at solving the problem. They cannot both be correct (and may both be wrong). Find and explain the error(s) in the reasoning. Is either of the two attempts correct?
Attempt 1
$$\begin{matrix} J^*(x_0) &=& \min_{u_1} \int_0^1 f_{01} (x_1,u_1) dt & +\min_{x_1(1)}\min_u \int_1^2 f_{02}(x_2,u_2)dt\\ & & \mbox{s.t. } \dot{x}_1=f_1(x_1,u_1),\ x_1(0)=x_0 &\mbox{s.t. } \dot{x}_2=f_2(x_2,u_2),\ x_2(1)=x_1(1)\\ &=& \min_{x_1(1)} J_1^*(x_0) + J_2^*(x_1(1))& &\end{matrix}$$
Attempt 2:
$$\begin{matrix} J^*(x_0) &=& \min_u \{ \int_0^1 f_{01} (x,u) dt & +\min_u \int_1^2 f_{02} (x,u) dt\} \\ & & \mbox{s.t. } \dot{x}=f_1(x,u),\ x(0)=x_0 & \mbox{s.t. } \dot{x}=f_2(x,u),\\ &=& \min_u\{ \int_0^1 f_{01} (x,u) dt + J_2^*(x(1))\} & \\ & & \mbox{s.t. } \dot{x}=f_1(x,u),\ x(0)=x_0 & \end{matrix}$$
where $$ \begin{matrix} J_k^*(x_0) &=& \min_u \int_{k-1}^k f_{0k} (x_k,u) dt \\ & &  \mbox{s.t. } \dot{x}_k = f_k (x_k,u),\ x_k(k-1)=x_0\end{matrix}$$
Answer
(a) Let $\alpha=u$; the payoff is $$P[\alpha(\cdot)]=-\int_0^1 0\,dt - x(1)^2$$
Then $f(x,a)=2a$, $g=-x(1)^2$, $r(x,a)=0$ and therefore $$H(x,p,a)=f(x,a)p +r(x,a)=2ap$$
Use the Hamilton-Jacobi-Bellman equation, $$v_t(x,t)+\max_{a\in A} \{ f(x,a) \nabla_x v(x,t)+r(x,a)\}=0$$ then $$v_t(x,t)+\max_{\left| a \right| \le 1} \{ 2a v_x(x,t)\}=v_t(x,t)+2v_x \,\mbox{sign}(v_x)=0$$ 
and $$v(x,1)=g(x) = -x^2$$
The answer is (i). The sign looks flipped only because here we maximize the payoff $v=-V$, while (i) is stated for the cost $V$: substituting $v=-V$ into $v_t+2v_x\,\mbox{sign}(v_x)=0$ gives $-V_t=-2V_x\,\mbox{sign}(V_x)$ with $V(1,x)=x^2$, which is exactly (i).
(b) Attempt 2 is correct; Attempt 1 is not. In Attempt 1 the first-stage cost $\int_0^1 f_{01}$ is minimized on its own, ignoring how the resulting terminal state $x_1(1)$ affects the second-stage cost; the last line then treats $x_1(1)$ as a free variable even though it is determined by the choice of $u_1$ (and $J_1^*(x_0)$ does not depend on $x_1(1)$ at all, so the outer minimization is vacuous). Attempt 2 embeds the cost-to-go $J_2^*(x(1))$ into the first-stage minimization, so the whole trajectory is optimized jointly.
That is the principle of optimality.
$\blacksquare$



2. Consider the following value function (cost-to-go function) $$V(t_0,x_0)=\max_{u\in \mathbb{R}} \int_{t_0}^T \sqrt{u(s)}ds\quad \mbox{s.t. } \dot{x}(t)=\beta x(t)-u(t),\ x(t_0)=x_0$$ where $\beta > 0$. Verify that $V(t,x)=f(t)\sqrt{x}$, where $$f(t)=\sqrt{\frac{e^{\beta(T-t)}-1}{\beta}}$$
Comment: Note that the value function is only defined on $[0,T]\times \mathbb{R}_+$, where $\mathbb{R}_+=(0,\infty)$ (i.e. $V:[0,T]\times \mathbb{R}_+ \rightarrow \mathbb{R}_+$). The theorems presented in the course are also valid when the domain is restricted to such a set.
Answer
Let $f(t, x(t),u(t))=\beta x(t)-u(t)$, $r(t,x(t),u(t))=\sqrt{u(t)}$, $g(x(T))=0$ , then, $$\begin{cases} V_t(x,t)+\max_u \{ (\beta x(t) -u(t)) V_x(x,t) + \sqrt{u(t)}\}=0 \\ V(x,T)=0\end{cases}$$
And use $V(t,x)=f(t)\sqrt{x}$, then $$\begin{cases}  f_t(t)\sqrt{x}+\max_u \{ (\beta x(t)-u(t)) \frac{1}{2\sqrt{x}}f(t)+\sqrt{u(t)}\}=0\\ f(T)=0\end{cases}$$
To attain the maximum, set the $u$-derivative to zero for each $t$: $$-\frac{f(t)}{2\sqrt{x}}+\frac{1}{2\sqrt{u}}=0,\ \sqrt{u}=\frac{\sqrt{x}}{f(t)}$$
The equation becomes $$\dot{f}(t)\sqrt{x}+\left(\beta x-\frac{x}{f(t)^2}\right)\frac{1}{2\sqrt{x}} f(t)+\frac{\sqrt{x}}{f(t)}=0$$
Dividing by $\sqrt{x}$: $$\dot{f}(t) +\frac{\beta}{2} f(t) + \frac{1}{2f(t)} =0,\quad \mbox{i.e. } \dot{f}(t)=-\frac{\beta}{2}f(t)-\frac{1}{2f(t)}$$
With help from WolframAlpha, $$f(t)=\pm \frac{\sqrt{e^{\beta(T-t)}-1}}{\sqrt{\beta}}$$
Since the value function is positive on the stated domain, take the positive root; the value function becomes $$V(x,t)=\sqrt{x}\, \frac{\sqrt{e^{\beta(T-t)}-1}}{\sqrt{\beta}}$$
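A numerical sanity check (sample value $\beta=0.5$ is my own choice; the check is not part of the exercise): the claimed $f$ should satisfy $\dot{f}+\frac{\beta}{2}f+\frac{1}{2f}=0$ with $f(T)=0$, so we verify the ODE residual away from the singular endpoint $t=T$:

```python
import numpy as np

beta, T = 0.5, 1.0                 # sample parameters (my choice)
t = np.linspace(0.0, 0.9, 9001)    # stay away from t = T, where f -> 0
f = np.sqrt((np.exp(beta * (T - t)) - 1.0) / beta)

fdot = np.gradient(f, t, edge_order=2)   # finite-difference derivative
residual = fdot + 0.5 * beta * f + 0.5 / f
assert np.max(np.abs(residual)) < 1e-4   # f solves fdot + (beta/2) f + 1/(2f) = 0
```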
$\blacksquare$



3. Consider the nonlinear optimal control problem $$\min x(1)^2 + \int_0^1 (x(t)u(t))^2dt\quad \mbox{subj. to } \dot{x}(t)=x(t)u(t),\ x(0)=1$$
Solve the problem using dynamic programming.
Hint: Use the ansatz $V(t,x)=p(t)x^2$.
Answer
Step 1:
Let $\alpha=u$; the payoff is $$P[\alpha(\cdot)]=-\int_0^1x^2a^2 dt -x(1)^2$$
Then $f(x,a)=xa$, $g=-x(1)^2$, $r(x,a)=-x^2a^2$ and therefore $$H(x,v_x,a)=f(x,a)v_x+r(x,a)=xav_x-x^2a^2$$
Use the Hamilton-Jacobi-Bellman equation: $$v_t(x,t)+\max_{a\in \mathbb{R}} \{xav_x-x^2a^2\}=0$$
At each time $t$ the control value $\alpha(t)$ must be selected to maximize $xav_x-x^2a^2$; setting the $a$-derivative to zero, $$xv_x-2x^2a=0,\ \alpha(t)=\frac{v_x}{2x}$$
The PDE then becomes $$v_t+\frac{1}{4}v_x^2=0$$
Use the hint $V(t,x)=p(t)x^2$: $$\dot{p}(t)x^2+p(t)^2x^2=0,\ p(t)=\frac{1}{t-C},\ V(t,x)=\frac{x^2}{t-C}$$
The terminal condition is $$V(1,x)=-x^2$$
Thus $C=2$, and finally $$V(t,x)=-\frac{x^2}{2-t}$$
Step 2: 
The HJB equation becomes $$-\frac{x^2}{(2-t)^2} - x\alpha(x,t)\, \frac{2x}{2-t} - x^2\alpha(x,t)^2=0$$
The control becomes $$\alpha(x(t),t)=\frac{1}{t-2}$$
The ODE becomes $$\begin{cases}\dot{\mathbf{x}}^*(s)=\frac{\mathbf{x}^*(s)}{s-2}\\ \mathbf{x}^*(t)=x\end{cases}$$
(We do not need to solve this, since $\alpha$ has no $x$ dependence.)
Finally, the feedback control is $$\mathbf{\alpha}^*(s):=\mathbf{\alpha}(\mathbf{x}^*(s),s)=\frac{1}{s-2}$$
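As a sanity check (my own simulation, not required by the exercise): integrating $\dot{x}=xu$ with the feedback $u(s)=\frac{1}{s-2}$ from $x(0)=1$ gives $x(s)=(2-s)/2$, so $xu\equiv-\frac{1}{2}$ and the total cost is $x(1)^2+\int_0^1(xu)^2dt=\frac14+\frac14=\frac12=-V(0,1)$:

```python
n = 10_000
dt = 1.0 / n
x, cost = 1.0, 0.0
for i in range(n):                 # forward Euler on dot(x) = x*u
    t = i * dt
    u = 1.0 / (t - 2.0)            # feedback control from the HJB solution
    cost += (x * u) ** 2 * dt      # running cost (x*u)^2
    x += x * u * dt

cost += x ** 2                     # terminal cost x(1)^2
assert abs(x - 0.5) < 1e-6         # exact trajectory: x(s) = (2 - s)/2
assert abs(cost - 0.5) < 1e-6      # total cost matches -V(0, 1) = 1/2
```

(Euler is exact here up to rounding, because the true trajectory is linear in $s$.)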
$\blacksquare$



5. The purpose of this problem is to investigate continuous-time dynamic programming applied to optimal control problems with discounted cost, and to apply it to an investment problem. 
(a) Consider the optimal control problem $$\min_u e^{-\alpha T} \mathbf{\Phi}(x(T)) + \int_0^T e^{-\alpha t} f_0 (t, x(t), u(t))dt \quad \mbox{subj.to. } \dot{x}(t)=f(t,x(t),u(t)),\ x(t_0)=x_0$$
Let $V(x,t)=e^{-\alpha t} W(x,t)$. Use the Hamilton-Jacobi-Bellman equation (HJBE) to derive a new HJBE for $W(t,x)$ (the discounted-cost HJBE).
(b) Solve the following investment problem $$\max_u \int_0^T e^{-\alpha t} \sqrt{u(t)} dt \quad \mbox{subj. to } \dot{x}(t)=\beta x(t) - u(t),\ x(0) = x_0$$ where $x_0$ is the initial amount of savings, $u$ is the rate of expenditure, and $\beta$ is the interest rate on the savings.
Hint: You may use problem (a) with $W(t,x)=w(t)\sqrt{x}$, where $w(t)$ is a function to be determined.
Answer
(a)
HJBE is $$\begin{cases}V_t(x,t) + \max_u \{f(t, x(t), u(t)) V_x(x,t) + r(t, x(t), u(t))\}=0\\ V(x,T)=g(x(T))\end{cases}$$
Set $f=f$, $g=-e^{-\alpha T} \mathbf{\Phi} (x)$, $r=-e^{-\alpha t} f_0$ (converting cost minimization into payoff maximization), then $$\begin{cases} V_t(x,t)+\max_u \{ f V_x -e^{-\alpha t} f_0\}=0\\ V(x,T)=-e^{-\alpha T} \mathbf{\Phi}(x)\end{cases}$$
Let $V(x,t)=e^{-\alpha t} W(x,t)$, so that $V_t=e^{-\alpha t}(W_t-\alpha W)$ and $V_x=e^{-\alpha t}W_x$; dividing through by $e^{-\alpha t}$, the HJBE becomes $$\begin{cases} -\alpha W(x,t)+W_t(x,t) + \max_u \{f(t,x(t),u(t))W_x(x,t)-f_0(t,x(t),u(t))\}=0\\ W(x,T)=\mathbf{\Phi}(x)\end{cases}$$
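The algebra of the substitution can be spot-checked numerically. A minimal sketch (the concrete $W$, $\alpha$, and sample values of $f$, $f_0$ below are my own illustrative choices, not from the problem), comparing finite-difference derivatives of $V=e^{-\alpha t}W$ against the discounted form:

```python
import numpy as np

# Spot-check of V = e^{-alpha t} W:
#   V_t + f V_x - e^{-alpha t} f0  =  e^{-alpha t} (-alpha W + W_t + f W_x - f0)
# All concrete data below are illustrative choices, not from the problem.
alpha, h = 0.7, 1e-5
W = lambda x, t: np.sin(x) * np.exp(0.3 * t)       # arbitrary smooth W
V = lambda x, t: np.exp(-alpha * t) * W(x, t)
f, f0 = 1.3, 0.4        # values of dynamics / running cost at a frozen (x, t, u)
x, t = 1.1, 0.6         # arbitrary sample point

# central finite differences for the partial derivatives
Vt = (V(x, t + h) - V(x, t - h)) / (2 * h)
Vx = (V(x + h, t) - V(x - h, t)) / (2 * h)
Wt = (W(x, t + h) - W(x, t - h)) / (2 * h)
Wx = (W(x + h, t) - W(x - h, t)) / (2 * h)

lhs = Vt + f * Vx - np.exp(-alpha * t) * f0
rhs = np.exp(-alpha * t) * (-alpha * W(x, t) + Wt + f * Wx - f0)
assert abs(lhs - rhs) < 1e-6
```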
(b)
Step 1:
Use (a) with $f(t, x(t),u(t))=\beta x(t)-u(t)$, $f_0(t,x(t),u(t))=-\sqrt{u(t)}$, $\mathbf{\Phi}(x(T))=0$ , then, $$\begin{cases} -\alpha W(x,t)+W_t(x,t)+\max_u \{ (\beta x(t) -u(t)) W_x(x,t) + \sqrt{u(t)}\}=0 \\ W(x,T)=0\end{cases}$$
Use hint, $W(t,x)=w(t)\sqrt{x}$, then $$\begin{cases} -\alpha w(t) \sqrt{x} +w_t(t)\sqrt{x}+\max_u \{ (\beta x(t)-u(t)) \frac{1}{2\sqrt{x}}w(t)+\sqrt{u(t)}\}=0\\ w(T)=0\end{cases}$$
To attain the maximum, set the $u$-derivative to zero for each $t$: $$-\frac{w(t)}{2\sqrt{x}}+\frac{1}{2\sqrt{u}}=0,\ \sqrt{u}=\frac{\sqrt{x}}{w(t)}$$
The equation becomes $$-\alpha w(t)\sqrt{x}+\dot{w}(t)\sqrt{x}+\left(\beta x-\frac{x}{w(t)^2}\right)\frac{1}{2\sqrt{x}} w(t)+\frac{\sqrt{x}}{w(t)}=0$$
Dividing by $\sqrt{x}$: $$-\alpha w(t) +\dot{w}(t) +\frac{\beta}{2} w(t) + \frac{1}{2w(t)} =0,\quad \mbox{i.e. } \dot{w}(t)=\left(\alpha-\frac{\beta}{2}\right)w(t)-\frac{1}{2w(t)}$$
With help from WolframAlpha, $$w(t)= \frac{\sqrt{1-e^{(2\alpha-\beta)(t-T)}}}{\sqrt{2\alpha-\beta}}$$
The value function becomes $$V(x,t)= e^{-\alpha t}\sqrt{x}\,\frac{\sqrt{1-e^{(2\alpha-\beta)(t-T)}}}{\sqrt{2\alpha-\beta}}$$
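A numerical sanity check (sample values $\alpha=1$, $\beta=0.5$ are my own, chosen so that $2\alpha-\beta>0$): the claimed $w$ should satisfy $\dot{w}=(\alpha-\frac{\beta}{2})w-\frac{1}{2w}$ with $w(T)=0$, so we verify the residual away from $t=T$:

```python
import numpy as np

alpha, beta, T = 1.0, 0.5, 1.0     # sample parameters with 2*alpha > beta
t = np.linspace(0.0, 0.9, 9001)    # avoid t = T, where w -> 0
k = 2.0 * alpha - beta
w = np.sqrt((1.0 - np.exp(k * (t - T))) / k)

wdot = np.gradient(w, t, edge_order=2)            # finite-difference derivative
residual = wdot - ((alpha - 0.5 * beta) * w - 0.5 / w)
assert np.max(np.abs(residual)) < 1e-4
```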
Step 2:
The HJB equation becomes the maximized equation above.
The control becomes $$ u = \frac{x}{w^2} = \frac{x(2\alpha - \beta)}{1-e^{(2\alpha -\beta)(t-T)}}$$
The ODE becomes $$\dot{x}(t)=\beta x(t) -\frac{x(t)(2\alpha - \beta)}{1-e^{(2\alpha -\beta)(t-T)}} $$
Solve this ODE for $x^*$, then substitute $x^*(s)$ into the control expression above to obtain the feedback control $u^*(s)$.
$\blacksquare$

End!
