Solving a class of stochastic optimal control problems by physics-informed neural networks

This research was partially supported by the National Natural Science Foundation of China (12272297).

Zhe Jiao (equal contribution), Wantao Jia (equal contribution), Weiqiu Zhu

Abstract

The aim of this work is to develop a deep learning method for solving high-dimensional stochastic control problems based on the Hamilton–Jacobi–Bellman (HJB) equation and physics-informed learning. Our approach parameterizes the feedback control and the value function using a decoupled neural network with multiple outputs. We train this network with a loss function whose penalty terms enforce the HJB equation along sampled trajectories generated by the controlled system. Numerical results on several applications demonstrate that the proposed approach is efficient and applicable.

Keywords: Stochastic optimal control, High dimension, Hamilton–Jacobi–Bellman equation, Physics-informed learning

[inst1] School of Mathematics and Statistics, Northwestern Polytechnical University, Xi’an 710129, China

[inst2] MOE Key Laboratory for Complexity Science in Aerospace, Northwestern Polytechnical University, Xi’an 710129, China

[inst3] State Key Laboratory of Fluid Power and Mechatronic Systems, Department of Mechanics, Zhejiang University, Hangzhou 310027, China

1 Introduction

The range of stochastic optimal control (SOC) problems covers a variety of scientific branches such as finance [1], molecular dynamics [2], neuroscience [3] and robotics [4]. To address SOC problems, there are two prominent frameworks: Pontryagin’s maximum principle (MP) [5] and Bellman’s dynamic programming (DP) [6]. Drawing on these frameworks, many numerical methods have been developed for tackling SOC problems (cf. [7, 8] and references therein).

However, these traditional numerical methods are not applicable when the state dimension is large [9]. In recent years, there has been significant progress in leveraging deep learning (DL) to solve high-dimensional SOC problems [10, 11, 12, 13, 14]. Broadly speaking, deep neural network-based methods for SOC can be divided into two categories. The first category uses DL to solve the extended Hamiltonian system derived from the stochastic MP (cf. [15, 16, 17]). In the second category, [18, 19] reformulate the SOC problem as a Markov decision process based on DP, which is then solved by DL-based algorithms. Another direction within this category is to solve the SOC problem from the DP viewpoint via the HJB equation [20, 21, 22]. In these papers, the Feynman–Kac formula provides a probabilistic representation of the solution to the HJB equation, so that neural networks can be used to obtain the optimal policy.

Motivated by previous research, we aim to solve the SOC problem with physics-informed learning [23, 24]. The main issue in our approach is to construct a physics-informed neural network (PINN) for solving the HJB equation, which is a semilinear parabolic partial differential equation (PDE) with a terminal value condition. Since the HJB equation is posed on the whole space without boundary conditions, a PINN cannot be applied directly to compute the value function from the HJB equation. Thanks to the stochastic verification theorem (see Theorem 1 in Section 2.2), we can approximate the value function by a neural network along the trajectories of the controlled system rather than on the whole space. This is the key idea of our approach.

Our main contribution is twofold: (i) in contrast to [13], we use the controlled SDE to sample the relevant states during PINN training; (ii) we propose a simulation-free algorithm for SOC based on physics-informed learning, meaning that it does not require numerical solutions of the control problem.

The remainder of this paper is organized as follows. In Section 2, we briefly introduce the preliminaries on the SOC problem and the verification theorem on which our DL-based solver is built. This solver, called DeepHJB, is proposed in Section 3. Numerical examples in Section 4 illustrate the proposed solver on several SOC problems. Section 5 provides some conclusions.

2 Stochastic optimal control

2.1 Problem setup

Let $T>t\geqslant 0$ and let $\textbf{W}:[t,T]\times\Omega\rightarrow\mathbb{R}^{d}$ be a $d$-dimensional standard $\mathbb{F}$-Brownian motion on a filtered probability space $(\Omega,\mathcal{F},\mathbb{F},\mathbb{P})$, where $\mathbb{F}=\{\mathcal{F}_{s}\}_{t\leqslant s\leqslant T}$ is the natural filtration generated by $\textbf{W}(s)$. The quadruple $(\Omega,\mathcal{F},\mathbb{F},\mathbb{P})$ also satisfies the usual hypotheses (see Chapter 1.4 in [25]). $\mathbb{E}[\cdot]$ stands for expectation with respect to the probability measure $\mathbb{P}$.

We consider the controlled stochastic differential equation (SDE) as follows

\mathrm{d}\textbf{x}_{s}=b(s,\textbf{x}_{s},\textbf{u}(s))\,\mathrm{d}s+\sigma(s,\textbf{x}_{s},\textbf{u}(s))\,\mathrm{d}\textbf{W}(s)

(1)

with $s\in[t_{0},T]$ and the initial data $\textbf{x}_{t_{0}}=x\in\mathbb{R}^{n}$ . Here, $\textbf{x}_{s}\in\mathbb{R}^{n}$ is the state process, $\textbf{u}(s)\in\mathbb{R}^{m}$ is a control process valued in a given subset $U$ of $\mathbb{R}^{m}$ . The cost functional is given by

J(t_{0},x;\textbf{u}(s))=\mathbb{E}\left[\int_{t_{0}}^{T}\phi(s,\textbf{x}_{s},\textbf{u}(s))\,\mathrm{d}s+\psi(\textbf{x}_{T})\,\Big|\,\textbf{x}_{t_{0}}=x\right]

(2)

with the functions $\phi:[t_{0},T]\times\mathbb{R}^{n}\times\mathbb{R}^{m}\rightarrow\mathbb{R}$ and $\psi:\mathbb{R}^{n}\rightarrow\mathbb{R}$. The goal of our SOC problem is to find an admissible control (if it exists) that minimizes (2) over $\mathcal{U}$, the set of all admissible controls defined by

\mathcal{U}:=\left\{u:[t_{0},T]\times\Omega\rightarrow U\,\big|\,u(s)\in L^{2}_{\mathbb{F}}(t_{0},T;\mathbb{R}^{m})\right\}

in which $L^{2}_{\mathbb{F}}(t_{0},T;\mathbb{R}^{m})$ consists of all $\mathbb{F}$-adapted functions $u:[t_{0},T]\times\Omega\rightarrow\mathbb{R}^{m}$ satisfying $\mathbb{E}[\int_{t_{0}}^{T}|u|^{2}\mathrm{d}s]<\infty$.

In this paper, we focus on the SOC problem under the following conditions.

1. The drift term $b\in\mathbb{R}^{n}$ and the diffusion term $\sigma\in\mathbb{R}^{n\times(1+d)}$ in (1) have the following linear forms in the control

b(s,\textbf{x}_{s},\textbf{u}(s))=A(s,\textbf{x}_{s})+B(s,\textbf{x}_{s})\textbf{u}(s),

and

\sigma(s,\textbf{x}_{s},\textbf{u}(s))=[\lambda B(s,\textbf{x}_{s})\textbf{u}(s),\;C(s,\textbf{x}_{s})]

with $\lambda\geqslant 0$, $A\in\mathbb{R}^{n}$, $B\in\mathbb{R}^{n\times m}$ and $C\in\mathbb{R}^{n\times d}$.

2. The random term is $\textbf{W}_{s}=[w^{(1)}_{s},w^{(2)}_{s}]\in\mathbb{R}^{(1+d)}$, in which $w^{(1)}_{s}\in\mathbb{R}^{1}$ and $w^{(2)}_{s}\in\mathbb{R}^{d}$ are mutually independent Brownian motions.

3. The running cost in (2) is quadratic, that is,

\phi(s,\textbf{x}_{s},\textbf{u}(s))=\textbf{x}_{s}^{\top}F\textbf{x}_{s}+\frac{1}{2}\textbf{u}(s)^{\top}D\textbf{u}(s)

with the coefficients $F\in\mathbb{R}^{n\times n}$ and $D\in\mathbb{R}^{m\times m}$.

4. The terminal cost is linear

\psi(x)=\gamma\cdot x

or quadratic

\psi(x)=(x-\textbf{x}_{T})^{\top}F_{T}(x-\textbf{x}_{T})

with the coefficients $\gamma\in\mathbb{R}^{n}$ and $F_{T}\in\mathbb{R}^{n\times n}$.
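As a minimal sketch of how the four structural conditions translate into code, the block below assembles the drift, the $n\times(1+d)$ diffusion matrix and the two cost terms with NumPy. The function names and the numerical values are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np

def drift(A_val, B_val, u):
    """Drift b = A + B u (A_val in R^n, B_val in R^{n x m}, u in R^m)."""
    return A_val + B_val @ u

def diffusion(B_val, C_val, u, lam):
    """Diffusion sigma = [lam * B u, C], an n x (1 + d) matrix."""
    first_col = (lam * (B_val @ u)).reshape(-1, 1)
    return np.hstack([first_col, C_val])

def running_cost(x, u, F, D):
    """phi = x^T F x + 0.5 * u^T D u."""
    return float(x @ F @ x + 0.5 * u @ D @ u)

def terminal_cost_linear(x, gamma):
    """psi(x) = gamma . x"""
    return float(gamma @ x)

def terminal_cost_quadratic(x, x_target, F_T):
    """psi(x) = (x - x_T)^T F_T (x - x_T)."""
    diff = x - x_target
    return float(diff @ F_T @ diff)

# Hypothetical example values, chosen only to exercise the shapes.
n, m, d = 3, 2, 3
rng = np.random.default_rng(0)
x = rng.normal(size=n)
u = rng.normal(size=m)
A_val = -x                       # e.g. a linear restoring drift
B_val = rng.normal(size=(n, m))
C_val = np.eye(n)                # n x d with d = n
sigma = diffusion(B_val, C_val, u, lam=0.5)
print(sigma.shape)
```

The first column of `sigma` carries the control-multiplicative noise driven by $w^{(1)}$, and the remaining $d$ columns carry the additive noise driven by $w^{(2)}$; setting `lam=0` switches the former off.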

Under suitable assumptions (see Chapter 1 in [26]), for any $\textbf{u}(s)\in\mathcal{U}$ equation (1) has a unique solution $\textbf{x}_{s}$ and the cost function (2) is well-defined. We call $(\textbf{x}_{s},\textbf{u}(s))$ an admissible pair. Any $\textbf{u}^{\ast}(s)$ is called an optimal control if it satisfies

\textbf{u}^{\ast}(s):=\mathop{\arg\min}\limits_{\mathbf{u}(s)\in\mathcal{U}}J(t,x;\textbf{u}(s)).

The corresponding state process $\textbf{x}^{\ast}_{s}$ is called an optimal trajectory, and the state-control pair $(\textbf{x}^{\ast}_{s},\textbf{u}^{\ast}(s))$ is called an optimal pair.

2.2 Verification theorem

We define the value function as

q(t,x):=J(t,x;\textbf{u}^{\ast}(s))=\min\limits_{\mathbf{u}(s)\in\mathcal{U}}J(t,x;\textbf{u}(s)).

The following theorem describes the evolution of the value function along the optimal trajectory and at the final time. It is deduced from the stochastic verification theorem (see Theorem 5.1 in Chapter 5.5 of [26]); the detailed proof is given in Appendix A.

Figure 1: A decoupled neural network structure with multiple outputs for solving the HJB equation (6). This neural network is a type of PINN, which integrates the information from the PDE (6) into the loss function using automatic differentiation. The architectures of the decoupled hidden layers are denoted by $\mathcal{A}_{i}$, $i=1,2$, and are given in Appendix B.

Theorem 1.

An admissible pair $(\mathbf{x}_{t},\mathbf{u}(t))$ , where the feedback control $\mathbf{u}(t)$ is given by

\mathbf{u}(t)=-\tilde{D}(t,\mathbf{x}_{t})^{-1}B(t,\mathbf{x}_{t})^{\top}\nabla q(t,\mathbf{x}_{t})

(3)

with

\tilde{D}(t,\mathbf{x}_{t})=D+\lambda^{2}B(t,\mathbf{x}_{t})^{\top}\nabla^{2}q(t,\mathbf{x}_{t})B(t,\mathbf{x}_{t}),

is optimal if and only if the following HJB equation holds

\begin{split}-\partial_{t}q(t,\mathbf{x}_{t})&=H\left(t,\mathbf{x}_{t},\mathbf{u}(t),\nabla q(t,\mathbf{x}_{t}),\nabla^{2}q(t,\mathbf{x}_{t})\right)\\ &=\frac{1}{2}\mathrm{tr}\left[C(t,\mathbf{x}_{t})^{\top}\nabla^{2}q\,C(t,\mathbf{x}_{t})\right]-\frac{1}{2}(\nabla q)^{\top}\left[B(t,\mathbf{x}_{t})\tilde{D}^{-1}B(t,\mathbf{x}_{t})^{\top}\right]\nabla q\\ &\quad+A(t,\mathbf{x}_{t})\cdot\nabla q+\mathbf{x}_{t}^{\top}F\mathbf{x}_{t}\end{split}

(4)

for any $t\in[0,T)$ , and $q(T,\mathbf{x}_{T})=\psi(\mathbf{x}_{T})$ .

Here, $\partial_{t}q$ means the first-order derivative of $q$ with respect to $t$ , $\nabla q$ and $\nabla^{2}q$ respectively denote the gradient and the Hessian of $q(t,x)$ with respect to $x$ , and $\mathrm{tr}$ is the abbreviation of the trace operator.
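To make the roles of (3) and (4) concrete, the following sketch evaluates the feedback control and the right-hand side of the HJB equation at a single point, given the gradient and Hessian of a candidate value function. The helper names `feedback_control` and `hamiltonian` are hypothetical, and the sketch assumes that $\tilde{D}$ is invertible at the evaluation point.

```python
import numpy as np

def feedback_control(B, D, lam, grad_q, hess_q):
    """Feedback control (3): u = -D_tilde^{-1} B^T grad_q."""
    D_tilde = D + lam**2 * B.T @ hess_q @ B
    return -np.linalg.solve(D_tilde, B.T @ grad_q)

def hamiltonian(x, A_val, B, C, D, F, lam, grad_q, hess_q):
    """Right-hand side of (4) with the control already eliminated."""
    D_tilde = D + lam**2 * B.T @ hess_q @ B
    Bt_grad = B.T @ grad_q
    diffusion_term = 0.5 * np.trace(C.T @ hess_q @ C)
    control_term = -0.5 * Bt_grad @ np.linalg.solve(D_tilde, Bt_grad)
    return float(diffusion_term + control_term + A_val @ grad_q + x @ F @ x)

# Hypothetical 2-D evaluation point with A(x) = -x, B = C = D = I, F = 0.
x = np.array([1.0, 1.0])
grad_q = np.array([1.0, 2.0])
hess_q = np.diag([2.0, 4.0])
I2 = np.eye(2)
u = feedback_control(I2, I2, 0.0, grad_q, hess_q)
H = hamiltonian(x, -x, I2, I2, I2, np.zeros((2, 2)), 0.0, grad_q, hess_q)
print(u, H)
```

Using `np.linalg.solve` instead of forming $\tilde{D}^{-1}$ explicitly is a design choice for numerical stability; for $\lambda=0$ the control reduces to $-D^{-1}B^{\top}\nabla q$.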

3 Deep learning approach

In this section we propose our approach to seek an optimal pair that minimizes the cost functional (2) subject to (1) for initial data sampled from a probability distribution in $\mathbb{R}^{n}$ with a density denoted by $\rho$ .

We select a partition of the time interval $[0,T]$ :

0=t_{0}<t_{1}<\cdots<t_{n}<\cdots<t_{N}=T,

and denote by $\triangle t_{n}=t_{n+1}-t_{n}$ the length of the $(n+1)$-th interval of the grid and by $\triangle W_{n}=W_{t_{n+1}}-W_{t_{n}}$ the $(n+1)$-th increment of the Brownian motion. Once the control $u(t)$ is computed, the Euler–Maruyama scheme (cf. [27]) for (1) gives

x_{t_{n+1}}-x_{t_{n}}=b(t_{n},x_{t_{n}},u(t_{n}))\triangle t_{n}+\sigma(t_{n},x_{t_{n}},u(t_{n}))\triangle W_{n}

(5)

with the initial data $x_{t_{0}}=x\sim\rho$ . Using the numerical scheme (5), the path $\{(t_{n},x_{t_{n}})\}_{0\leqslant n\leqslant N}$ can be easily generated. If the value function $q(t,x_{t})$ is known and the discretization of the admissible pair $\{(x_{t_{n}},u(t_{n}))\}_{0\leqslant n\leqslant N}$ satisfies

\left\{\begin{array}{l}-\partial_{t}q(t_{n},x_{t_{n}})=H\left(t_{n},x_{t_{n}},u(t_{n}),\nabla q(t_{n},x_{t_{n}}),\nabla^{2}q(t_{n},x_{t_{n}})\right),\\ q\left(t_{N},x_{t_{N}}\right)=\psi\left(x_{t_{N}}\right),\end{array}\right.

(6)

from (4) in Theorem 1 we know $\{(x_{t_{n}},u(t_{n}))\}_{0\leqslant n\leqslant N}$ is an optimal pair.
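The path generation by scheme (5) can be sketched as follows. This is a minimal NumPy illustration under the structural conditions of Section 2.1; the function name `euler_maruyama`, the placeholder feedback control and the parameter values are assumptions made for the example.

```python
import numpy as np

def euler_maruyama(x0, t_grid, control, A_fn, B_fn, C_fn, lam, rng):
    """Simulate one path of scheme (5); returns an array of shape (N + 1, n)."""
    path = np.empty((len(t_grid), x0.shape[0]))
    path[0] = x0
    for k in range(len(t_grid) - 1):
        t, x = t_grid[k], path[k]
        dt = t_grid[k + 1] - t
        B, C = B_fn(t, x), C_fn(t, x)
        Bu = B @ control(t, x)
        # Brownian increment in R^(1+d): the first entry drives lam * B u,
        # the remaining d entries drive C.
        dW = rng.normal(scale=np.sqrt(dt), size=1 + C.shape[1])
        path[k + 1] = x + (A_fn(t, x) + Bu) * dt + lam * Bu * dW[0] + C @ dW[1:]
    return path

# Hypothetical setup: 2-D linear drift with a placeholder feedback control.
rng = np.random.default_rng(0)
t_grid = np.linspace(0.0, 1.0, 51)
path = euler_maruyama(
    x0=np.zeros(2), t_grid=t_grid,
    control=lambda t, x: -0.5 * x,
    A_fn=lambda t, x: -x,
    B_fn=lambda t, x: np.eye(2),
    C_fn=lambda t, x: np.eye(2),
    lam=0.0, rng=rng,
)
print(path.shape)
```

In the solver, the placeholder `control` would be replaced by the network output $u^{\textrm{NN}}$, so that the sampled states follow the currently learned controlled dynamics.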

Our approach parameterizes the functions $u$ and $q$ by a decoupled neural network (Figure 1), which are given by

u^{\textrm{NN}}(t_{n},x_{t_{n}};\theta_{u}),\quad q^{\textrm{NN}}(t_{n},x_{t_{% n}};\theta_{q}).

We denote by $\Theta=\{\theta_{u},\theta_{q}\}$ the weights of the neural network, which is trained by minimizing the sum of the expected losses arising from the following penalty terms.

The second-order HJB penalty terms are defined as

\mathcal{L}_{1}(\theta_{u},\theta_{q})=\left|\partial_{t}q^{\textrm{NN}}+H\left(t_{n},x_{t_{n}},u^{\textrm{NN}},\nabla q^{\textrm{NN}},\nabla^{2}q^{\textrm{NN}}\right)\right|

and

\mathcal{L}_{2}(\theta_{q})=\left|q^{\textrm{NN}}\left(t_{N},x_{t_{N}};\theta_{q}\right)-\psi\left(x_{t_{N}}\right)\right|,

where $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ are derived from (6).

Another penalty term is given as follows

\mathcal{L}_{3}(\theta_{u},\theta_{q})=\left|u^{\textrm{NN}}+\tilde{D}^{-1}B(t_{n},x_{t_{n}})^{\top}\nabla q^{\textrm{NN}}\right|

with

\tilde{D}=D+\lambda^{2}B(t_{n},x_{t_{n}})^{\top}\nabla^{2}q^{\textrm{NN}}(t_{n},x_{t_{n}};\theta_{q})B(t_{n},x_{t_{n}}),

where $\mathcal{L}_{3}$ is from (3).

Now, we can define the physics-informed learning problem

\min_{\Theta}\mathbb{E}_{x\sim\rho}\left\{\alpha_{1}\mathcal{L}_{1}+\alpha_{2}\mathcal{L}_{2}+\alpha_{3}\mathcal{L}_{3}\right\}.

The coefficients $\alpha_{1}>0$, $\alpha_{2}>0$ and $\alpha_{3}>0$ are fixed weights.
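A quick way to check the three penalty terms is to evaluate them on a problem whose solution is known in closed form. The sketch below does this for a scalar problem with $A(x)=-x$, $B=C=D=1$, $\lambda=0$, $F=0$ and linear terminal cost $\psi(x)=x$, for which the ansatz $q(t,x)=p(t)x+c(t)$ with $p(t)=e^{-(T-t)}$ and $c(t)=-(1-e^{-2(T-t)})/4$ solves (4). This closed form is worked out by hand for the illustration and is an assumption of the example, as are all function names.

```python
import math

T = 1.0  # terminal time of the scalar test problem

def p(t):
    return math.exp(-(T - t))

def q(t, x):        # candidate value function q = p(t) x + c(t)
    return p(t) * x - (1.0 - math.exp(-2.0 * (T - t))) / 4.0

def dq_dt(t, x):    # time derivative of q, computed by hand
    return p(t) * x + 0.5 * p(t) ** 2

def dq_dx(t, x):
    return p(t)

def d2q_dx2(t, x):
    return 0.0

def u_exact(t, x):  # feedback control (3) for this problem
    return -p(t)

def penalties(t, x, x_T, u):
    grad, hess = dq_dx(t, x), d2q_dx2(t, x)
    H = 0.5 * hess - 0.5 * grad ** 2 + (-x) * grad  # (4) with B = C = D = 1, F = 0
    L1 = abs(dq_dt(t, x) + H)      # HJB residual
    L2 = abs(q(T, x_T) - x_T)      # terminal condition, psi(x) = x
    L3 = abs(u(t, x) + grad)       # control consistency, D_tilde = 1 since lam = 0
    return L1, L2, L3

def total_loss(samples, u, alphas=(1.0, 1.0, 1.0)):
    """Sample mean of the weighted penalties over (t, x, x_T) triples."""
    total = 0.0
    for t, x, x_T in samples:
        total += sum(a * L for a, L in zip(alphas, penalties(t, x, x_T, u)))
    return total / len(samples)

samples = [(0.0, 0.3, -0.2), (0.5, -1.0, 0.7)]
print(total_loss(samples, u_exact))
print(total_loss(samples, lambda t, x: -0.5 * p(t)))  # perturbed control
```

The exact pair drives all three penalties to zero up to rounding, whereas a perturbed control is detected through $\mathcal{L}_{3}$. In the actual solver the derivatives of $q^{\textrm{NN}}$ would come from automatic differentiation instead of hand-coded formulas.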

Finally, we apply an SGD-type algorithm to optimize the parameters $\Theta$. The pseudo-code for implementing the above approach is given in Algorithm 1.

Algorithm 1 DeepHJB solver

Input: the initial data $\{(t_{0},x^{(i)}_{t_{0}})\}_{1\leqslant i\leqslant M}$, parameter $N$ of the partition, initial network parameters $\Theta^{(0)}=(\theta_{u}^{(0)},\theta_{q}^{(0)})$, learning rate $\eta$, maximum number of steps $K$

for $k=0$ to $K-1$ do

for $i=1$ to $M$ do

for $n=0$ to $N$ do

evaluate $q^{\textrm{NN}}(t_{n},x^{(i)}_{t_{n}};\theta_{q}^{(k)})$

evaluate $u^{\textrm{NN}}(t_{n},x^{(i)}_{t_{n}};\theta_{u}^{(k)})$

if $n+1\leqslant N$ then

$\triangle t_{n}=t_{n+1}-t_{n}$

$\triangle W^{(i)}_{n}=W^{(i)}_{t_{n+1}}-W^{(i)}_{t_{n}}$

$x^{(i)}_{t_{n+1}}=x^{(i)}_{t_{n}}+b(t_{n},x^{(i)}_{t_{n}},u^{\textrm{NN}})\triangle t_{n}+\sigma(t_{n},x^{(i)}_{t_{n}},u^{\textrm{NN}})\triangle W^{(i)}_{n}$

end if

end for

end for

$\Theta^{(k)}=(\theta_{u}^{(k)},\theta_{q}^{(k)})$

sample a random set $B_{k}\subset\{1,2,\cdots,M\}$

$\mathrm{Loss}=\frac{1}{|B_{k}|}\sum\limits_{i\in B_{k}}\left\{\alpha_{1}\mathcal{L}^{(i)}_{1}(\Theta^{(k)})+\alpha_{2}\mathcal{L}^{(i)}_{2}(\Theta^{(k)})+\alpha_{3}\mathcal{L}^{(i)}_{3}(\Theta^{(k)})\right\}$

$\Theta^{(k+1)}=\Theta^{(k)}-\eta\nabla\mathrm{Loss}$

end for

4 Numerical experiments

In this section, we apply the DeepHJB solver to some SOC problems. In the following subsections, we discuss the controlled Ornstein–Uhlenbeck (OU) dynamics and the controlled metastable dynamics, respectively. To evaluate the proposed solver, we introduce the following $L^{2}$ error as the performance metric

\mathbb{E}\left[\int_{0}^{T}\left|u^{\textrm{NN}}-u^{\ast}\right|^{2}(t,x_{t})\,\mathrm{d}t\right],

where $u^{\ast}$ is the baseline optimal control. The detailed configurations of these experiments are given in Appendix B.
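One possible way to estimate this metric numerically is a left-endpoint Riemann sum in time combined with a sample mean over simulated paths. The sketch below illustrates this; the function name `l2_control_error`, the discretization and the test controls are assumptions of the example and are not specified in the text.

```python
import numpy as np

def l2_control_error(u_nn, u_star, t_grid, paths):
    """Estimate E[ int_0^T |u_nn - u_star|^2 (t, x_t) dt ].

    paths has shape (M, N + 1, n): M sampled paths on the grid t_grid.
    """
    dts = np.diff(t_grid)
    total = 0.0
    for path in paths:
        for k, dt in enumerate(dts):
            diff = u_nn(t_grid[k], path[k]) - u_star(t_grid[k], path[k])
            total += float(diff @ diff) * dt
    return total / len(paths)

# Hypothetical check: two controls that differ by a constant offset of 0.1
# in each of the two components, evaluated on placeholder paths.
t_grid = np.linspace(0.0, 1.0, 11)
paths = np.zeros((4, len(t_grid), 2))
err = l2_control_error(lambda t, x: -x + 0.1, lambda t, x: -x, t_grid, paths)
print(err)
```

For a constant offset $c$ between the two controls the estimate equals $|c|^{2}T$, which gives a simple sanity check of an implementation before it is applied to network outputs.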

4.1 Ornstein–Uhlenbeck dynamics

We investigate the controlled system with

\begin{split}A&=-I_{n\times n}+(\xi_{ij})_{1\leqslant i,j\leqslant n},\\ B=C&=I_{n\times n}+(\xi_{ij})_{1\leqslant i,j\leqslant n},\quad\lambda=0,\end{split}

where $\xi_{ij}\sim\mathcal{N}(0,0.01)$ are sampled once at the beginning of the experiments.

For the SOC problem with linear terminal cost, we choose

F=0,\quad D=I_{n\times n},\quad\gamma=(1,\cdots,1)^{\top}.

In this situation, the optimal control can be given analytically by

u^{\ast}(t)=-B^{\top}e^{A^{\top}(T-t)}\gamma,

which has been calculated in [21]. We set the initial value to zero and the terminal time to $T=1.0$. In Figure 2, subfigure (a) compares the optimal control $u^{\textrm{NN}}$ computed by the DeepHJB solver with the baseline $u^{\ast}$, while subfigure (b) shows the evolution of the $L^{2}$ error against the iteration step. The optimal control approximated by the proposed solver coincides well with the analytical one.
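The analytical baseline $u^{\ast}(t)=-B^{\top}e^{A^{\top}(T-t)}\gamma$ can be evaluated with a few lines of NumPy. The sketch below uses a truncated Taylor series for the matrix exponential, which is a simplification adequate for the small matrix norms of this example; a library routine such as `scipy.linalg.expm` would be the more robust choice. Interpreting $0.01$ as a variance and sharing one sample $\xi$ between $A$ and $B$ are assumptions, since the text does not spell out either point.

```python
import numpy as np

def expm_taylor(M, terms=30):
    """Truncated Taylor series of the matrix exponential (small norms only)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        result = result + term
    return result

def u_star_linear(t, A, B, gamma, T=1.0):
    """Baseline control u*(t) = -B^T exp(A^T (T - t)) gamma."""
    return -B.T @ expm_taylor(A.T * (T - t)) @ gamma

n = 4
rng = np.random.default_rng(0)
xi = rng.normal(scale=0.1, size=(n, n))  # assumption: variance 0.01, shared xi
A = -np.eye(n) + xi
B = np.eye(n) + xi
gamma = np.ones(n)
print(u_star_linear(0.0, A, B, gamma))
```

Because this baseline does not depend on the state, it can be tabulated once on the time grid and compared directly with $u^{\textrm{NN}}(t_{n},x_{t_{n}})$ along any sampled path.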

Regarding the case with a quadratic terminal cost, we choose

F=\frac{1}{2}I_{n\times n},\quad D=I_{n\times n},\quad F_{T}=I_{n\times n}.

This type of problem has an analytic optimal control

\mathbf{u}^{\ast}(t,x)=-2B^{\top}P_{t}x

in which $P_{t}$ fulfills the Riccati equation

\frac{d}{dt}P_{t}+A^{\top}P_{t}+P_{t}A-2P_{t}BB^{\top}P_{t}+F=0

with $P_{T}=F_{T}$ (see [26, Chapter 6]). We choose the initial value from a pre-specified distribution ? and the terminal time $T=0.5$ . Figure 3 displays the direct comparison and $L^{2}$ error between the approximation $u^{\textrm{NN}}$ and the baseline $u^{\ast}$ of the solution to this SOC problem, which illustrates the accuracy of our DeepHJB solver.
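The baseline for the quadratic case requires $P_{t}$, which can be obtained by integrating the Riccati equation backward from $P_{T}=F_{T}$. The following sketch uses the classical fourth-order Runge–Kutta method with constant matrices $A$ and $B$; the integrator, the step count and the function names are assumptions of the example, and the text does not state how the baseline was computed.

```python
import numpy as np

def riccati_rhs(P, A, B, F):
    """dP/dt from the Riccati equation: -(A^T P + P A - 2 P B B^T P + F)."""
    return -(A.T @ P + P @ A - 2.0 * P @ B @ B.T @ P + F)

def solve_riccati(A, B, F, F_T, T, steps=200):
    """Integrate backward in time from P(T) = F_T with classical RK4."""
    h = T / steps
    P = [None] * (steps + 1)
    P[steps] = np.array(F_T, dtype=float)
    for k in range(steps, 0, -1):
        Pk = P[k]
        k1 = riccati_rhs(Pk, A, B, F)
        k2 = riccati_rhs(Pk - 0.5 * h * k1, A, B, F)
        k3 = riccati_rhs(Pk - 0.5 * h * k2, A, B, F)
        k4 = riccati_rhs(Pk - h * k3, A, B, F)
        P[k - 1] = Pk - (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return np.linspace(0.0, T, steps + 1), P

def u_star_quadratic(P_t, B, x):
    """Baseline control u*(t, x) = -2 B^T P_t x."""
    return -2.0 * B.T @ P_t @ x

# Hypothetical 3-D instance with A = -I and B = I (no random perturbation).
n = 3
A, B = -np.eye(n), np.eye(n)
t_grid, P = solve_riccati(A, B, 0.5 * np.eye(n), np.eye(n), T=0.5)
print(u_star_quadratic(P[0], B, np.ones(n)))
```

A convenient check is the scalar case $A=0$, $B=1$, $F=0$, $F_{T}=1$, where the Riccati equation reduces to $P'=2P^{2}$ with solution $P_{t}=1/(1+2(T-t))$.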

4.2 Metastable dynamics

We consider the double-well potential

\Psi(x)=\sum_{i=1}^{n}\kappa_{i}(x_{i}^{2}-1)^{2},\quad\kappa_{i}>0,

and the controlled system with

A(\mathbf{x}_{t})=-\nabla\Psi,\quad B=C=I_{n\times n},\quad\lambda=0.

The initial state in this experiment is $(-1,\cdots,-1)^{\top}$, and the terminal state is set to $(1,\cdots,1)^{\top}$. As for the cost functional, we choose

F=0,\quad D=I_{n\times n},\quad F_{T}=\mathrm{diag}\{\nu_{1},\cdots,\nu_{i},\cdots,\nu_{n}\}

and the terminal time $T=1.0$ .
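For reference, the potential, the resulting drift $A=-\nabla\Psi$ and the quadratic terminal cost of this setup can be written down directly. The sketch below is a minimal NumPy version; the function names are hypothetical, and the parameter values are those of the five-dimensional experiment described in this section.

```python
import numpy as np

def double_well(x, kappa):
    """Psi(x) = sum_i kappa_i (x_i^2 - 1)^2."""
    return float(np.sum(kappa * (x**2 - 1.0) ** 2))

def drift_A(x, kappa):
    """A(x) = -grad Psi(x), componentwise -4 kappa_i x_i (x_i^2 - 1)."""
    return -4.0 * kappa * x * (x**2 - 1.0)

def terminal_cost(x, nu, x_target):
    """psi(x) = (x - x_T)^T diag(nu) (x - x_T)."""
    diff = x - x_target
    return float(np.sum(nu * diff**2))

kappa = np.array([1.2, 1.2, 1.2, 1.0, 1.0])
nu = np.ones(5)
x_start = -np.ones(5)   # initial state, a minimum of Psi
x_target = np.ones(5)   # terminal state, the other minimum
print(drift_A(x_start, kappa), terminal_cost(x_start, nu, x_target))
```

The drift vanishes at both wells, so without control the state leaves the initial well only through noise; the terminal cost is what pushes the controlled trajectory over the barrier at $x_{i}=0$.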

Firstly, we study the one-dimensional setting, choosing $\kappa=3$ , $\nu=1$ . Figure 4 displays the approximation $u^{\textrm{NN}}$ of the optimal control computed by the DeepHJB solver and the baseline $u^{\ast}$ obtained by a finite difference method. The absolute error between them can be seen in Figure 5. It is clear that the approximation is in close agreement with the baseline. Figure 6 demonstrates the growth of the potential function from an original potential to the optimal potential.

Let us next consider the high-dimensional case, that is, $n=5$. In particular, we set $\kappa_{i}=1.2$, $\nu_{i}=1$ for $i\in\{1,2,3\}$ and $\kappa_{i}=1$, $\nu_{i}=1$ for $i\in\{4,5\}$. Figure 7 shows two components of the five-dimensional approximated optimal control $u^{\textrm{NN}}$ together with the baseline $u^{\ast}$. The good match illustrates the efficacy of our DeepHJB solver for a high-dimensional nonlinear SOC problem.

5 Conclusion

In this paper, we proposed the DeepHJB solver for finite-time-horizon SOC problems for a class of dynamical systems. Although the numerical experiments in this work demonstrate the efficacy of the solver, there is still plenty of room for development. On the theoretical side, our future research will be devoted to connecting Lyapunov analysis with the DeepHJB solver and to carrying out an error analysis of the solver. Moreover, we will also use the present solver to investigate more control problems of high-dimensional nonlinear systems in practical applications.

Appendix A Proof of Theorem 1

By Theorem 5.1 in Chapter 5.5 of [26], an admissible pair $(\mathbf{x}_{t},\mathbf{u}(t))$ is optimal if and only if it satisfies the following HJB equation

\begin{split}-\partial_{t}q(t,\mathbf{x}_{t})&=H\left(t,\mathbf{x}_{t},\mathbf{u}(t),\nabla q(t,\mathbf{x}_{t}),\nabla^{2}q(t,\mathbf{x}_{t})\right)\\ &=\frac{1}{2}\mathrm{tr}\left[\sigma(t,\mathbf{x}_{t},\mathbf{u}(t))^{\top}\nabla^{2}q(t,\mathbf{x}_{t})\sigma(t,\mathbf{x}_{t},\mathbf{u}(t))\right]\\ &\quad+b(t,\mathbf{x}_{t},\mathbf{u}(t))\cdot\nabla q(t,\mathbf{x}_{t})+\phi(t,\mathbf{x}_{t},\mathbf{u}(t))\end{split}

(7)

and

\mathbf{u}(t)=\mathop{\arg\min}\limits_{u\in\mathcal{U}}\Big\{\frac{1}{2}\mathrm{tr}\left[\sigma^{\top}\nabla^{2}q\,\sigma\right]+b\cdot\nabla q+\phi\Big\}.

Due to the specific expression of $b$ , $\sigma$ and $\phi$ , we have

\mathbf{u}(t)=\mathop{\arg\min}\limits_{u\in\mathcal{U}}\Lambda(u)

with

\begin{split}\Lambda(u):=&\;\frac{1}{2}\lambda^{2}\mathrm{tr}\left[(B(t,\mathbf{x}_{t})u(t))^{\top}\nabla^{2}q(t,\mathbf{x}_{t})B(t,\mathbf{x}_{t})u(t)\right]\\ &+B(t,\mathbf{x}_{t})u(t)\cdot\nabla q(t,\mathbf{x}_{t})+\frac{1}{2}u(t)^{\top}Du(t).\end{split}

The first-order optimality condition $\frac{d\Lambda}{du}\big|_{u=\textbf{u}}=0$ reads

\lambda^{2}B^{\top}\nabla^{2}qB\textbf{u}+B^{\top}\nabla q+D\textbf{u}=0,

which gives

\textbf{u}=-(D+\lambda^{2}B^{\top}\nabla^{2}qB)^{-1}B^{\top}\nabla q.

(8)

Plugging the expression of the optimal control (8) into equation (7), we obtain the desired equation (4).
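For completeness, the substitution step can be written out as follows. This is a sketch that uses only the structural conditions of Section 2.1 and additionally assumes that $\tilde{D}=D+\lambda^{2}B^{\top}\nabla^{2}qB$ is symmetric and invertible. With $\sigma=[\lambda B\mathbf{u},C]$, the trace term in (7) splits as

\frac{1}{2}\mathrm{tr}\left[\sigma^{\top}\nabla^{2}q\,\sigma\right]=\frac{1}{2}\lambda^{2}\mathbf{u}^{\top}B^{\top}\nabla^{2}q\,B\mathbf{u}+\frac{1}{2}\mathrm{tr}\left[C^{\top}\nabla^{2}q\,C\right],

so the terms of (7) that depend on $\mathbf{u}$ are

\frac{1}{2}\mathbf{u}^{\top}\tilde{D}\mathbf{u}+\mathbf{u}^{\top}B^{\top}\nabla q.

Inserting $\mathbf{u}=-\tilde{D}^{-1}B^{\top}\nabla q$ from (8) gives

\frac{1}{2}(\nabla q)^{\top}B\tilde{D}^{-1}B^{\top}\nabla q-(\nabla q)^{\top}B\tilde{D}^{-1}B^{\top}\nabla q=-\frac{1}{2}(\nabla q)^{\top}B\tilde{D}^{-1}B^{\top}\nabla q,

which, together with the remaining terms $\frac{1}{2}\mathrm{tr}[C^{\top}\nabla^{2}q\,C]+A\cdot\nabla q+\mathbf{x}_{t}^{\top}F\mathbf{x}_{t}$, is the right-hand side of (4).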

Appendix B Experiment configuration

We introduce the following fully connected feedforward neural network

\begin{split}z^{(1)}(x,\theta)&=W^{(1)}x+b^{(1)},\\ \bar{z}^{(l)}(x,\theta)&=\sigma(z^{(l)}(x,\theta)),\quad l=1,2,\cdots,L-1,\\ z^{(l+1)}(x,\theta)&=W^{(l+1)}\bar{z}^{(l)}(x,\theta)+b^{(l+1)},\quad l=1,2,\cdots,L-1,\end{split}

where we refer to $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ as the activation function, to $L$ as the number of layers, and to $N_{0}$, $N_{L}$, and $N_{l}$ as the number of neurons in the input, output, and $l$-th hidden layer, respectively. We denote by $\mathcal{A}=(N,\sigma)$, $N=(N_{0},N_{1},\cdots,N_{L})\in\mathbb{N}^{L+1}$, the architecture of the neural network.
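The recursion above corresponds to a standard forward pass with no activation on the output layer. The following NumPy sketch illustrates it for an architecture $\mathcal{A}=(N,\tanh)$; the random initialization and the function names are assumptions of the example and do not reflect the authors' training setup.

```python
import numpy as np

def init_params(layer_sizes, rng):
    """Create (W, b) pairs for N = (N_0, ..., N_L); the initialization is illustrative."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_out, n_in))
        params.append((W, np.zeros(n_out)))
    return params

def forward(params, x, activation=np.tanh):
    """Return z^(L): affine maps with the activation applied between layers."""
    W1, b1 = params[0]
    z = W1 @ x + b1                   # z^(1) = W^(1) x + b^(1)
    for W, b in params[1:]:
        z = W @ activation(z) + b     # z^(l+1) = W^(l+1) sigma(z^(l)) + b^(l+1)
    return z

# Value network for a one-dimensional state: the input is (t, x), the output is q.
rng = np.random.default_rng(0)
layer_sizes = (2, 128, 128, 128, 128, 1)
params = init_params(layer_sizes, rng)
print(forward(params, np.array([0.0, -1.0])).shape)
```

The input dimension is $N_{0}=1+n$ because time and state are concatenated, which is consistent with the input sizes $31$, $16$, $2$ and $6$ listed in the architectures of this appendix.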

The numerical examples use the following architectures.

Controlled OU dynamics in Section 4.1. For the linear terminal cost, the architecture is given by

\begin{split}\mathcal{A}_{1}&=((31,32,32,32,1),\tanh),\\ \mathcal{A}_{2}&=((31,32,32,32,30),\tanh),\end{split}

while for the quadratic terminal cost it is given by

\begin{split}\mathcal{A}_{1}&=((16,64,64,64,1),\tanh),\\ \mathcal{A}_{2}&=((16,64,64,64,15),\tanh).\end{split}

Controlled metastable dynamics in Section 4.2. For the one-dimensional case, the architecture is given by

\begin{split}\mathcal{A}_{1}&=((2,128,128,128,128,1),\tanh),\\ \mathcal{A}_{2}&=((2,128,128,128,128,1),\tanh),\end{split}

while for the five-dimensional case it is given by

\begin{split}\mathcal{A}_{1}&=((6,128,128,128,128,1),\tanh),\\ \mathcal{A}_{2}&=((6,128,128,128,128,5),\tanh).\end{split}

All computations are run on a single NVIDIA GeForce RTX 2080Ti GPU with 11GB memory. The code will be made publicly available at https://github.com/zhezhejiao/DeepHJB after acceptance.

References

[1] H. Pham, Continuous-time stochastic control and optimization with financial applications, Vol. 61, Springer Science & Business Media, 2009.
[2] Y. Gao, T. Li, X. Li, J.-G. Liu, Transition path theory for Langevin dynamics on manifolds: Optimal control and data-driven solver, Multiscale Modeling & Simulation 21 (1) (2023) 1–33.
[3] E. Todorov, Optimality principles in sensorimotor control, Nature Neuroscience 7 (2004) 907–915.
[4] R. Tedrake, Robotic Manipulation: Perception, Planning, and Control, Draft textbook, 2023.
[5] L. S. Pontryagin, Mathematical Theory of Optimal Processes, CRC Press, 1987.
[6] R. Bellman, Dynamic programming and stochastic control processes, Information and Control 1 (3) (1958) 228–239.
[7] H. J. Kushner, Numerical methods for stochastic control problems in continuous time, SIAM Journal on Control and Optimization 28 (5) (1990) 999–1048.
[8] Z. Jin, M. Qiu, K. Q. Tran, G. Yin, A survey of numerical solutions for stochastic control problems: Some recent progress, Numerical Algebra, Control and Optimization 12 (2) (2022) 213–253.
[9] I. Exarchos, E. A. Theodorou, Stochastic optimal control via forward and backward stochastic differential equations and importance sampling, Automatica 87 (2018) 159–165.
[10] A. Gorodetsky, S. Karaman, Y. Marzouk, Efficient high-dimensional stochastic optimal motion control using tensor-train decomposition., in: Robotics: Science and Systems, 2015.
[11] J. Han, W. E, Deep learning approximation for stochastic control problems, in: Deep Reinforcement Learning Workshop, 2016.
[12] Z. Wang, M. Pereira, T. Chen, E. Theodorou, E. Reed, Deep 2FBSDEs for systems with control multiplicative noise, arXiv:1906.04762.
[13] X. Li, D. Verma, L. Ruthotto, A neural network approach for stochastic optimal control, SIAM Journal on Scientific Computing 46 (5) (2024) C535–C556.
[14] W. Cai, S. Fang, T. Zhou, SOC-MartNet: A martingale neural network for the Hamilton-Jacobi-Bellman equation without explicit inf H in stochastic optimal controls, arXiv preprint arXiv:2405.03169.
[15] J.-P. Fouque, Z. Zhang, Deep learning methods for mean field control problems with delay, Frontiers in Applied Mathematics and Statistics 6 (2020) 11.
[16] R. Carmona, M. Laurière, Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games II: the finite horizon case, Annals of Applied Probability 32 (6) (2022) 4065–4105.
[17] S. Jin, S. Peng, Y. Peng, X. Zhang, Solving stochastic optimal control problem via stochastic maximum principle with deep learning method, Journal of Scientific Computing 93 (2022) 30.
[18] C. Huré, H. Pham, A. Bachouch, N. Langrené, Deep neural networks algorithms for stochastic control problems on finite horizon: Convergence analysis, SIAM Journal on Numerical Analysis 59 (1) (2021) 525–557.
[19] A. Bachouch, C. Huré, N. Langrené, H. Pham, Deep neural networks algorithms for stochastic control problems on finite horizon: Numerical applications, Methodology and Computing in Applied Probability 24 (1) (2022) 143–178.
[20] M. Pereira, Z. Wang, T. Chen, E. Reed, E. Theodorou, Feynman-Kac neural network architectures for stochastic control using second-order FBSDE theory, in: Proceedings of the 2nd Conference on Learning for Dynamics and Control, 2020.
[21] N. Nüsken, L. Richter, Solving high-dimensional Hamilton–Jacobi–Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space, Partial Differential Equations and Applications 2 (4) (2021) 48.
[22] M. Hua, M. Laurière, E. Vanden-Eijnden, A simulation-free deep learning approach to stochastic optimal control, arXiv preprint arXiv:2410.05163.
[23] M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics 378 (2019) 686–707.
[24] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, L. Yang, Physics-informed machine learning, Nature Reviews Physics 3 (6) (2021) 422–440.
[25] K. Chung, R. Williams, Introduction to Stochastic Integration, Springer, 2013.
[26] J. Yong, X. Y. Zhou, Stochastic Controls: Hamiltonian Systems and HJB Equations, Springer, 1999.
[27] P. E. Kloeden, E. Platen, Numerical Solution of Stochastic Differential Equations, Springer, 1999.
