Solving a class of stochastic optimal control problems by physics-informed neural networks111This research was partially supported by the National Natural Science Foundation of China (12272297).

Zhe Jiao222Equal contribution. [email protected] Wantao Jia333Equal contribution. [email protected] Weiqiu Zhu [email protected]
Abstract

The aim of this work is to develop a deep learning method for solving high-dimensional stochastic control problems based on the Hamilton–Jacobi–Bellman (HJB) equation and physics-informed learning. Our approach is to parameterize the feedback control and the value function using a decoupled neural network with multiple outputs. We train this network by using a loss function with penalty terms that enforce the HJB equation along the sampled trajectories generated by the controlled system. More significantly, numerical results on various applications are carried out to demonstrate that the proposed approach is efficient and applicable.

keywords:
Stochastic optimal control , High dimension , Hamilton–Jacobi–Bellman equation , Physics-informed learning
\affiliation

[inst1]organization=School of Mathematics and Statistics, addressline=Northwestern Polytechnical University, city=Xi’an, postcode=710129, country=China

\affiliation

[inst2]organization=MOE Key Laboratory for Complexity Science in Aerospace,addressline=Northwestern Polytechnical University, city=Xi’an, postcode=710129, country=China

\affiliation

[inst3]organization=State Key Laboratory of Fluid Power and Mechatronic Systems, Department of Mechanics,addressline=Zhejiang University, city=Hangzhou, postcode=310027, country=China

1 Introduction

The range of stochastic optimal control (SOC) problems covers a variety of scientific branches such as finance [1], molecular dynamics [2], neuroscience [3] and robotics [4]. To address SOC problems, there are two prominent frameworks: Pontryagin’s maximum principle (MP) [5] and Bellman’s dynamic programming (DP) [6]. Drawing on these frameworks, many numerical methods have been developed for tackling SOC problems (cf. [7, 8] and references therein).

However, these traditional numerical methods are not applicable when the state dimension is large [9]. In recent years, there has seen significant progress in leveraging deep learning (DL) to solve the high-dimensional SOC problems [10, 11, 12, 13, 14]. Broadly speaking, the deep neural network-based methods for SOC can be divided into two distinct categories. In the first category, it is concerned with the DL-based approach to solve the extended Hamiltonian system, which is derived from stochastic MP (cf. [15, 16, 17]). For the study of the second category, [18, 19] reformulate the SOC problem as Markov decision process based on DP, which is solved by some DL-based algorithms. Another direction of this category is to solve the SOC problem from the view of DP via HJB equation [20, 21, 22]. We need to point out that in these papers Feynman–Kac formula is the basis to probabilistically represent the solution to HJB equation so that the author can utilizes neural networks to obtain the optimal policy.

Motivated by previous research, we aim to solve the SOC problem with physics-informed learning [23, 24]. The main issue we encounter in our approach is to construct a physics-informed neural network (PINN) for solving HJB equation, which is a semilinear parabolic partial differential equation (PDE) with a terminal value condition. Since the HJB equation is defined on the whole space, without boundary condition, PINN cannot be directly used to compute the value function by solving the HJB equation. Thanks to the stochastic verification theorem (see Theorem 1 in Section 2.2), we can simulate the value function along the trajectories of the controlled system, not on the whole space, by neural network. This is the key idea of our approach.

Our main contribution is twofold: (i) In contrast to [13], we use the controlled SDE to conduct sampling on relevant states during PINN training; (ii) We propose a simulation-free algorithm for SOC by physics-informed learning, which means it dose not require numerical solutions of the control problem.

The remaining part of this paper is organized as follows. In Section 2, we briefly introduce the preliminaries about the SOC problem, and the verification theorem that is the basis to construct our DL based solver. This solver called DeepHJB is proposed in Section 3. Numerical examples in Section 4 illustrate our proposed solver to solve some SOC problems. Section 5 provides some conclusions.

2 Stochastic optimal control

2.1 Problem setup

Let T>t0𝑇𝑡0T>t\geqslant 0italic_T > italic_t ⩾ 0 and W:[t,T]×Ωd:W𝑡𝑇Ωsuperscript𝑑\textbf{W}:[t,T]\times\Omega\rightarrow\mathbb{R}^{d}W : [ italic_t , italic_T ] × roman_Ω → blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT be a d𝑑ditalic_d-dimensional standard 𝔽𝔽\mathbb{F}blackboard_F-Brownian motions on a filtered probability space (Ω,,𝔽,)Ω𝔽(\Omega,\mathcal{F},\mathbb{F},\mathbb{P})( roman_Ω , caligraphic_F , blackboard_F , blackboard_P ) where 𝔽={s}tsT𝔽subscriptsubscript𝑠𝑡𝑠𝑇\mathbb{F}=\{\mathcal{F}_{s}\}_{t\leqslant s\leqslant T}blackboard_F = { caligraphic_F start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t ⩽ italic_s ⩽ italic_T end_POSTSUBSCRIPT is the natural filtration generated by W(s)W𝑠\textbf{W}(s)W ( italic_s ). The quadruple (Ω,,𝔽,)Ω𝔽(\Omega,\mathcal{F},\mathbb{F},\mathbb{P})( roman_Ω , caligraphic_F , blackboard_F , blackboard_P ) also satisfies the usual hypotheses (see Chapter 1.4 in [25]). 𝔼[]𝔼delimited-[]\mathbb{E}[\cdot]blackboard_E [ ⋅ ] stands for expectation with respect to the probability measure \mathbb{P}blackboard_P.

We consider the controlled stochastic differential equation (SDE) as follows

dxs=b(s,xs,u(s))ds+σ(s,xs,u(s))dW(s)dsubscriptx𝑠𝑏𝑠subscriptx𝑠u𝑠d𝑠𝜎𝑠subscriptx𝑠u𝑠dW𝑠\mathrm{d}\textbf{x}_{s}=b(s,\textbf{x}_{s},\textbf{u}(s))\mathrm{d}s+\sigma(s% ,\textbf{x}_{s},\textbf{u}(s))\mathrm{d}\textbf{W}(s)roman_d x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = italic_b ( italic_s , x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , u ( italic_s ) ) roman_d italic_s + italic_σ ( italic_s , x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , u ( italic_s ) ) roman_d W ( italic_s ) (1)

with s[t0,T]𝑠subscript𝑡0𝑇s\in[t_{0},T]italic_s ∈ [ italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_T ] and the initial data xt0=xnsubscriptxsubscript𝑡0𝑥superscript𝑛\textbf{x}_{t_{0}}=x\in\mathbb{R}^{n}x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. Here, xsnsubscriptx𝑠superscript𝑛\textbf{x}_{s}\in\mathbb{R}^{n}x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is the state process, u(s)mu𝑠superscript𝑚\textbf{u}(s)\in\mathbb{R}^{m}u ( italic_s ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT is a control process valued in a given subset U𝑈Uitalic_U of msuperscript𝑚\mathbb{R}^{m}blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT. The cost functional is given by

J(t0,x;u(t))=𝔼[t0Tϕ(s,xs,u(s))ds+ψ(xT)|xt=x]𝐽subscript𝑡0𝑥u𝑡𝔼delimited-[]superscriptsubscriptsubscript𝑡0𝑇italic-ϕ𝑠subscriptx𝑠u𝑠differential-d𝑠conditional𝜓subscriptx𝑇subscriptx𝑡𝑥\displaystyle J(t_{0},x;\textbf{u}(t))=\mathbb{E}\left[\int_{t_{0}}^{T}\phi(s,% \textbf{x}_{s},\textbf{u}(s))\mathrm{d}s+\psi(\textbf{x}_{T})|\textbf{x}_{t}=x\right]italic_J ( italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_x ; u ( italic_t ) ) = blackboard_E [ ∫ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_ϕ ( italic_s , x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , u ( italic_s ) ) roman_d italic_s + italic_ψ ( x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) | x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_x ] (2)

with the functions ϕ:[t0,T]×n×m:italic-ϕsubscript𝑡0𝑇superscript𝑛superscript𝑚\phi:[t_{0},T]\times\mathbb{R}^{n}\times\mathbb{R}^{m}\rightarrow\mathbb{R}italic_ϕ : [ italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_T ] × blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT → blackboard_R and ψ:n:𝜓superscript𝑛\psi:\mathbb{R}^{n}\rightarrow\mathbb{R}italic_ψ : blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT → blackboard_R. The goal of our SOC problem is to look for an admissible control (if exists) that minimizes (2) over 𝒰𝒰\mathcal{U}caligraphic_U which is the set of all admissible controls defined by

𝒰:={u:[t0,T]×ΩU|u(s)L𝔽2(t0,T;m)}assign𝒰conditional-set𝑢subscript𝑡0𝑇Ωconditional𝑈𝑢𝑠subscriptsuperscript𝐿2𝔽subscript𝑡0𝑇superscript𝑚\mathcal{U}:=\left\{u:[t_{0},T]\times\Omega\rightarrow U|u(s)\in L^{2}_{% \mathbb{F}}(t_{0},T;\mathbb{R}^{m})\right\}caligraphic_U := { italic_u : [ italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_T ] × roman_Ω → italic_U | italic_u ( italic_s ) ∈ italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT blackboard_F end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_T ; blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ) }

in which L𝔽2(t,T;m)subscriptsuperscript𝐿2𝔽𝑡𝑇superscript𝑚L^{2}_{\mathbb{F}}(t,T;\mathbb{R}^{m})italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT blackboard_F end_POSTSUBSCRIPT ( italic_t , italic_T ; blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ) consists of all 𝔽𝔽\mathbb{F}blackboard_F-adapted functions u:[t0,T]×Ωm:𝑢subscript𝑡0𝑇Ωsuperscript𝑚u:[t_{0},T]\times\Omega\rightarrow\mathbb{R}^{m}italic_u : [ italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_T ] × roman_Ω → blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT satisfying 𝔼[t0T|u|2ds]<𝔼delimited-[]superscriptsubscriptsubscript𝑡0𝑇superscript𝑢2differential-d𝑠\mathbb{E}[\int_{t_{0}}^{T}|u|^{2}\mathrm{d}s]<\inftyblackboard_E [ ∫ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT | italic_u | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_d italic_s ] < ∞.

In this paper, we focus on the SOC problem under the following conditions.

  • 1.

    The drift term bn𝑏superscript𝑛b\in\mathbb{R}^{n}italic_b ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and the diffusion term σn×(1+d)𝜎superscript𝑛1𝑑\sigma\in\mathbb{R}^{n\times(1+d)}italic_σ ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × ( 1 + italic_d ) end_POSTSUPERSCRIPT in (1) have the following linear forms in control

    b(s,xs,u(s))=A(s,xs)+B(s,xs)u(s),𝑏𝑠subscriptx𝑠u𝑠𝐴𝑠subscriptx𝑠𝐵𝑠subscriptx𝑠u𝑠b(s,\textbf{x}_{s},\textbf{u}(s))=A(s,\textbf{x}_{s})+B(s,\textbf{x}_{s})% \textbf{u}(s),italic_b ( italic_s , x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , u ( italic_s ) ) = italic_A ( italic_s , x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) + italic_B ( italic_s , x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) u ( italic_s ) ,

    and

    σ(s,xs,u(s))=[λB(s,xs)u(s),C(s,xs)]𝜎𝑠subscriptx𝑠u𝑠𝜆𝐵𝑠subscriptx𝑠u𝑠𝐶𝑠subscriptx𝑠\sigma(s,\textbf{x}_{s},\textbf{u}(s))=[\lambda B(s,\textbf{x}_{s})\textbf{u}(% s),C(s,\textbf{x}_{s})]italic_σ ( italic_s , x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , u ( italic_s ) ) = [ italic_λ italic_B ( italic_s , x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) u ( italic_s ) , italic_C ( italic_s , x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) ]

    with λ0𝜆0\lambda\geqslant 0italic_λ ⩾ 0, An𝐴superscript𝑛A\in\mathbb{R}^{n}italic_A ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, Bn×m𝐵superscript𝑛𝑚B\in\mathbb{R}^{n\times m}italic_B ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_m end_POSTSUPERSCRIPT and Cn×d𝐶superscript𝑛𝑑C\in\mathbb{R}^{n\times d}italic_C ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_d end_POSTSUPERSCRIPT.

  • 2.

    The random term Ws=[ws(1),ws(2)](1+d)subscriptW𝑠subscriptsuperscript𝑤1𝑠subscriptsuperscript𝑤2𝑠superscript1𝑑\textbf{W}_{s}=[w^{(1)}_{s},w^{(2)}_{s}]\in\mathbb{R}^{(1+d)}W start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = [ italic_w start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , italic_w start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ] ∈ blackboard_R start_POSTSUPERSCRIPT ( 1 + italic_d ) end_POSTSUPERSCRIPT in which ws(1)1subscriptsuperscript𝑤1𝑠superscript1w^{(1)}_{s}\in\mathbb{R}^{1}italic_w start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT and ws(2)dsubscriptsuperscript𝑤2𝑠superscript𝑑w^{(2)}_{s}\in\mathbb{R}^{d}italic_w start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT are mutually independent Brownian motions.

  • 3.

    The running cost in (2) is quadratic, that is,

    ϕ(s,xs,u(s))=xsFxs+12u(s)Du(s)italic-ϕ𝑠subscriptx𝑠u𝑠superscriptsubscriptx𝑠top𝐹subscriptx𝑠12usuperscript𝑠top𝐷u𝑠\phi(s,\textbf{x}_{s},\textbf{u}(s))=\textbf{x}_{s}^{\top}F\textbf{x}_{s}+% \frac{1}{2}\textbf{u}(s)^{\top}D\textbf{u}(s)italic_ϕ ( italic_s , x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , u ( italic_s ) ) = x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_F x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT + divide start_ARG 1 end_ARG start_ARG 2 end_ARG u ( italic_s ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_D u ( italic_s )

    with the coefficients Fn×n𝐹superscript𝑛𝑛F\in\mathbb{R}^{n\times n}italic_F ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT and Dm×m𝐷superscript𝑚𝑚D\in\mathbb{R}^{m\times m}italic_D ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_m end_POSTSUPERSCRIPT.

  • 4.

    The terminal cost is linear

    ψ(x)=γx𝜓𝑥𝛾𝑥\psi(x)=\gamma\cdot xitalic_ψ ( italic_x ) = italic_γ ⋅ italic_x

    or quadratic

    ψ(x)=(xxT)FT(xxT)𝜓𝑥superscript𝑥subscriptx𝑇topsubscript𝐹𝑇𝑥subscriptx𝑇\psi(x)=(x-\textbf{x}_{T})^{\top}F_{T}(x-\textbf{x}_{T})italic_ψ ( italic_x ) = ( italic_x - x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_F start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( italic_x - x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT )

    with the coefficients γn𝛾superscript𝑛\gamma\in\mathbb{R}^{n}italic_γ ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and FTn×nsubscript𝐹𝑇superscript𝑛𝑛F_{T}\in\mathbb{R}^{n\times n}italic_F start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT.

Under suitable assumptions (see Chapter 1 in [26]), for any u(s)𝒰u𝑠𝒰\textbf{u}(s)\in\mathcal{U}u ( italic_s ) ∈ caligraphic_U equation (1) has a unique solution xssubscriptx𝑠\textbf{x}_{s}x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT and the cost function (2) is well-defined. We call (xs,u(s))subscriptx𝑠u𝑠(\textbf{x}_{s},\textbf{u}(s))( x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , u ( italic_s ) ) an admissible pair. Any u(s)superscriptu𝑠\textbf{u}^{\ast}(s)u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s ) is called an optimal control if it satisfies

u(s):=argmin𝐮(s)𝒰J(t,x;u(s)).assignsuperscriptu𝑠subscript𝐮𝑠𝒰𝐽𝑡𝑥u𝑠\textbf{u}^{\ast}(s):=\mathop{\arg\min}\limits_{\mathbf{u}(s)\in\mathcal{U}}J(% t,x;\textbf{u}(s)).u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s ) := start_BIGOP roman_arg roman_min end_BIGOP start_POSTSUBSCRIPT bold_u ( italic_s ) ∈ caligraphic_U end_POSTSUBSCRIPT italic_J ( italic_t , italic_x ; u ( italic_s ) ) .

The corresponding state process xssubscriptsuperscriptx𝑠\textbf{x}^{\ast}_{s}x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT is called an optimal trajectory and the state-control pair (xs,u(s))subscriptsuperscriptx𝑠superscriptu𝑠(\textbf{x}^{\ast}_{s},\textbf{u}^{\ast}(s))( x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s ) ) called an optimal pair.

2.2 Verification theorem

We define the value function as

q(t,x):=J(t,x;u(s))=min𝐮(s)𝒰J(t,x;u(s)).assign𝑞𝑡𝑥𝐽𝑡𝑥superscriptu𝑠subscript𝐮𝑠𝒰𝐽𝑡𝑥u𝑠q(t,x):=J(t,x;\textbf{u}^{\ast}(s))=\min\limits_{\mathbf{u}(s)\in\mathcal{U}}J% (t,x;\textbf{u}(s)).italic_q ( italic_t , italic_x ) := italic_J ( italic_t , italic_x ; u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s ) ) = roman_min start_POSTSUBSCRIPT bold_u ( italic_s ) ∈ caligraphic_U end_POSTSUBSCRIPT italic_J ( italic_t , italic_x ; u ( italic_s ) ) .

The following theorem shows the evolution of the value function along the optimal trajectory and at the the final time, which is deduced from the stochastic verification theorem (see Theorem 5.1 in Chapter 5.5 of [26]). The detailed proof is given in A.

Refer to caption
Figure 1: A decoupled neural network structure for solving HJB equation (6) with multiple outputs. This neural network is a type of PINNs which integrate the information from PDEs (6) into the loss function of a neural network using automatic differentiation. The architecture of the decoupled hidden layers are denoted by 𝒜isubscript𝒜𝑖\mathcal{A}_{i}caligraphic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, i=1,2𝑖12i=1,2italic_i = 1 , 2, which will be given in B.
Theorem 1.

An admissible pair (𝐱t,𝐮(t))subscript𝐱𝑡𝐮𝑡(\mathbf{x}_{t},\mathbf{u}(t))( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_u ( italic_t ) ), where the feedback control 𝐮(t)𝐮𝑡\mathbf{u}(t)bold_u ( italic_t ) is given by

𝐮(t)=D~(t,𝐱t)1B(t,𝐱t)q(t,𝐱t)𝐮𝑡~𝐷superscript𝑡subscript𝐱𝑡1𝐵superscript𝑡subscript𝐱𝑡top𝑞𝑡subscript𝐱𝑡\mathbf{u}(t)=-\tilde{D}(t,\mathbf{x}_{t})^{-1}B(t,\mathbf{x}_{t})^{\top}% \nabla q(t,\mathbf{x}_{t})bold_u ( italic_t ) = - over~ start_ARG italic_D end_ARG ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_B ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∇ italic_q ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) (3)

with

D~(t,𝐱t)=D+λ2B(t,𝐱t)2q(t,𝐱t)B(t,𝐱t),~𝐷𝑡subscript𝐱𝑡𝐷superscript𝜆2𝐵superscript𝑡subscript𝐱𝑡topsuperscript2𝑞𝑡subscript𝐱𝑡𝐵𝑡subscript𝐱𝑡\tilde{D}(t,\mathbf{x}_{t})=D+\lambda^{2}B(t,\mathbf{x}_{t})^{\top}\nabla^{2}q% (t,\mathbf{x}_{t})B(t,\mathbf{x}_{t}),over~ start_ARG italic_D end_ARG ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = italic_D + italic_λ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_B ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_B ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ,

is optimal if and only if the following HJB equation holds

tq(t,𝐱t)=subscript𝑡𝑞𝑡subscript𝐱𝑡absent\displaystyle-\partial_{t}q(t,\mathbf{x}_{t})=- ∂ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_q ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = H(t,𝐱t,𝐮(t),q(t,𝐱t),2q(t,𝐱t))𝐻𝑡subscript𝐱𝑡𝐮𝑡𝑞𝑡subscript𝐱𝑡superscript2𝑞𝑡subscript𝐱𝑡\displaystyle H\left(t,\mathbf{x}_{t},\mathbf{u}(t),\nabla q(t,\mathbf{x}_{t})% ,\nabla^{2}q(t,\mathbf{x}_{t})\right)italic_H ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_u ( italic_t ) , ∇ italic_q ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) (4)
=\displaystyle== 12tr[C(t,𝐱t)2qC(t,𝐱t)]12trdelimited-[]𝐶superscript𝑡subscript𝐱𝑡topsuperscript2𝑞𝐶𝑡subscript𝐱𝑡\displaystyle\frac{1}{2}\mathrm{tr}\left[C(t,\mathbf{x}_{t})^{\top}\nabla^{2}% qC(t,\mathbf{x}_{t})\right]divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_tr [ italic_C ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q italic_C ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ]
12(q)[B(t,𝐱t)D~1B(t,𝐱t)]q12superscript𝑞topdelimited-[]𝐵𝑡subscript𝐱𝑡superscript~𝐷1𝐵superscript𝑡subscript𝐱𝑡top𝑞\displaystyle-\frac{1}{2}(\nabla q)^{\top}\left[B(t,\mathbf{x}_{t})\tilde{D}^{% -1}B(t,\mathbf{x}_{t})^{\top}\right]\nabla q- divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( ∇ italic_q ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT [ italic_B ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) over~ start_ARG italic_D end_ARG start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_B ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ] ∇ italic_q
+A(t,𝐱t)q+𝐱tF𝐱t𝐴𝑡subscript𝐱𝑡𝑞superscriptsubscript𝐱𝑡top𝐹subscript𝐱𝑡\displaystyle+A(t,\mathbf{x}_{t})\cdot\nabla q+\mathbf{x}_{t}^{\top}F\mathbf{x% }_{t}+ italic_A ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ⋅ ∇ italic_q + bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_F bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT

for any t[0,T)𝑡0𝑇t\in[0,T)italic_t ∈ [ 0 , italic_T ), and q(T,𝐱T)=ψ(𝐱T)𝑞𝑇subscript𝐱𝑇𝜓subscript𝐱𝑇q(T,\mathbf{x}_{T})=\psi(\mathbf{x}_{T})italic_q ( italic_T , bold_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) = italic_ψ ( bold_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ).

Here, tqsubscript𝑡𝑞\partial_{t}q∂ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_q means the first-order derivative of q𝑞qitalic_q with respect to t𝑡titalic_t, q𝑞\nabla q∇ italic_q and 2qsuperscript2𝑞\nabla^{2}q∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q respectively denote the gradient and the Hessian of q(t,x)𝑞𝑡𝑥q(t,x)italic_q ( italic_t , italic_x ) with respect to x𝑥xitalic_x, and trtr\mathrm{tr}roman_tr is the abbreviation of the trace operator.

3 Deep learning approach

In this section we propose our approach to seek an optimal pair that minimizes the cost functional (2) subject to (1) for initial data sampled from a probability distribution in nsuperscript𝑛\mathbb{R}^{n}blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT with a density denoted by ρ𝜌\rhoitalic_ρ.

We select a partition of the time interval [0,T]0𝑇[0,T][ 0 , italic_T ]:

0=t0<t1<<tn<<tN=T,0subscript𝑡0subscript𝑡1subscript𝑡𝑛subscript𝑡𝑁𝑇0=t_{0}<t_{1}<\cdots<t_{n}<\cdots<t_{N}=T,0 = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT < italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < ⋯ < italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT < ⋯ < italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT = italic_T ,

and denote by tn=tn+1tnsubscript𝑡𝑛subscript𝑡𝑛1subscript𝑡𝑛\triangle t_{n}=t_{n+1}-t_{n}△ italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_t start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT - italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT the (i+1)𝑖1(i+1)( italic_i + 1 )th interval of the grid and Wn=Wtn+1Wtnsubscript𝑊𝑛subscript𝑊subscript𝑡𝑛1subscript𝑊subscript𝑡𝑛\triangle W_{n}=W_{t_{n+1}}-W_{t_{n}}△ italic_W start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_W start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT - italic_W start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT the (i+1)𝑖1(i+1)( italic_i + 1 )-th increment of the Brownian motion. Once the control u(t)𝑢𝑡u(t)italic_u ( italic_t ) is computed, the Euler–Maruyama scheme (cf. [27]) of (1) gives

xtn+1xtn=b(tn,xtn,u(tn))tn+σ(tn,xtn,u(tn))Wnsubscript𝑥subscript𝑡𝑛1subscript𝑥subscript𝑡𝑛𝑏subscript𝑡𝑛subscript𝑥subscript𝑡𝑛𝑢subscript𝑡𝑛subscript𝑡𝑛𝜎subscript𝑡𝑛subscript𝑥subscript𝑡𝑛𝑢subscript𝑡𝑛subscript𝑊𝑛\displaystyle x_{t_{n+1}}-x_{t_{n}}=b(t_{n},x_{t_{n}},u(t_{n}))\triangle t_{n}% +\sigma(t_{n},x_{t_{n}},u(t_{n}))\triangle W_{n}italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT = italic_b ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_u ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ) △ italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT + italic_σ ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_u ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ) △ italic_W start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT (5)

with the initial data xt0=xρsubscript𝑥subscript𝑡0𝑥similar-to𝜌x_{t_{0}}=x\sim\rhoitalic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = italic_x ∼ italic_ρ. Using the numerical scheme (5), the path {(tn,xtn)}0nNsubscriptsubscript𝑡𝑛subscript𝑥subscript𝑡𝑛0𝑛𝑁\{(t_{n},x_{t_{n}})\}_{0\leqslant n\leqslant N}{ ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT 0 ⩽ italic_n ⩽ italic_N end_POSTSUBSCRIPT can be easily generated. If the value function q(t,xt)𝑞𝑡subscript𝑥𝑡q(t,x_{t})italic_q ( italic_t , italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) is known and the discretization of the admissible pair {(xtn,u(tn))}0nNsubscriptsubscript𝑥subscript𝑡𝑛𝑢subscript𝑡𝑛0𝑛𝑁\{(x_{t_{n}},u(t_{n}))\}_{0\leqslant n\leqslant N}{ ( italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_u ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ) } start_POSTSUBSCRIPT 0 ⩽ italic_n ⩽ italic_N end_POSTSUBSCRIPT satisfies

{tq(tn,xtn)=H(tn,xtn,u(tn),q(tn,xtn),2q(tn,xtn)),q(tN,xtN)=ψ(xtN),casessubscript𝑡𝑞subscript𝑡𝑛subscript𝑥subscript𝑡𝑛𝐻subscript𝑡𝑛subscript𝑥subscript𝑡𝑛𝑢subscript𝑡𝑛𝑞subscript𝑡𝑛subscript𝑥subscript𝑡𝑛superscript2𝑞subscript𝑡𝑛subscript𝑥subscript𝑡𝑛missing-subexpression𝑞subscript𝑡𝑁subscript𝑥subscript𝑡𝑁𝜓subscript𝑥subscript𝑡𝑁missing-subexpression\left\{\begin{array}[]{ll}-\partial_{t}q(t_{n},x_{t_{n}})=H\left(t_{n},x_{t_{n% }},u(t_{n}),\nabla q(t_{n},x_{t_{n}}),\nabla^{2}q(t_{n},x_{t_{n}})\right),\\ q\left(t_{N},x_{t_{N}}\right)=\psi\left(x_{t_{N}}\right),\end{array}\right.{ start_ARRAY start_ROW start_CELL - ∂ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_q ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) = italic_H ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_u ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) , ∇ italic_q ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) , ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ) , end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL italic_q ( italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) = italic_ψ ( italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) , end_CELL start_CELL end_CELL end_ROW end_ARRAY (6)

from (4) in Theorem 1 we know {(xtn,u(tn))}0nNsubscriptsubscript𝑥subscript𝑡𝑛𝑢subscript𝑡𝑛0𝑛𝑁\{(x_{t_{n}},u(t_{n}))\}_{0\leqslant n\leqslant N}{ ( italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_u ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ) } start_POSTSUBSCRIPT 0 ⩽ italic_n ⩽ italic_N end_POSTSUBSCRIPT is an optimal pair.

Our approach parameterizes the functions u𝑢uitalic_u and q𝑞qitalic_q by a decoupled neural network (Figure 1), which are given by

uNN(tn,xtn;θu),qNN(tn,xtn;θq).superscript𝑢NNsubscript𝑡𝑛subscript𝑥subscript𝑡𝑛subscript𝜃𝑢superscript𝑞NNsubscript𝑡𝑛subscript𝑥subscript𝑡𝑛subscript𝜃𝑞u^{\textrm{NN}}(t_{n},x_{t_{n}};\theta_{u}),\quad q^{\textrm{NN}}(t_{n},x_{t_{% n}};\theta_{q}).italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ) , italic_q start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) .

We denote by Θ={θu,θq}Θsubscript𝜃𝑢subscript𝜃𝑞\Theta=\{\theta_{u},\theta_{q}\}roman_Θ = { italic_θ start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT } the weights of the neural network, which is trained by minimizing the sum of the expected losses that arises from the following penalty terms.

  • 1.

    The second-order HJB penalty terms are defined as

    1(θu,θq)=|tqNN+H(tn,xtn,uNN,qNN,2qNN)|subscript1subscript𝜃𝑢subscript𝜃𝑞subscript𝑡superscript𝑞NN𝐻subscript𝑡𝑛subscript𝑥subscript𝑡𝑛superscript𝑢NNsuperscript𝑞NNsuperscript2superscript𝑞NN\displaystyle\mathcal{L}_{1}(\theta_{u},\theta_{q})=\left|\partial_{t}q^{% \textrm{NN}}+H\left(t_{n},x_{t_{n}},u^{\textrm{NN}},\nabla q^{\textrm{NN}},% \nabla^{2}q^{\textrm{NN}}\right)\right|caligraphic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) = | ∂ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_q start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT + italic_H ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT , ∇ italic_q start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT , ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT ) |

    and

    2(θq)=subscript2subscript𝜃𝑞absent\displaystyle\mathcal{L}_{2}(\theta_{q})=caligraphic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) = |qNN(tN,xtN;θq)ψ(xtN)|,superscript𝑞NNsubscript𝑡𝑁subscript𝑥subscript𝑡𝑁subscript𝜃𝑞𝜓subscript𝑥subscript𝑡𝑁\displaystyle\left|q^{\textrm{NN}}\left(t_{N},x_{t_{N}};\theta_{q}\right)-\psi% \left(x_{t_{N}}\right)\right|,| italic_q start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUBSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) - italic_ψ ( italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) | ,

    where 1subscript1\mathcal{L}_{1}caligraphic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and 2subscript2\mathcal{L}_{2}caligraphic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are derived from (6).

  • 2.

    Another penalty term is given as follows

    3(θu,θq)=|uNN+D~1B(tn,xtn)qNN|subscript3subscript𝜃𝑢subscript𝜃𝑞superscript𝑢NNsuperscript~𝐷1superscript𝐵topsubscript𝑡𝑛subscript𝑥subscript𝑡𝑛superscript𝑞NN\mathcal{L}_{3}(\theta_{u},\theta_{q})=\left|u^{\textrm{NN}}+\tilde{D}^{-1}B^{% \top}(t_{n},x_{t_{n}})\nabla q^{\textrm{NN}}\right|caligraphic_L start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) = | italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT + over~ start_ARG italic_D end_ARG start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ∇ italic_q start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT |

    with

    D~=D+λ2B(tn,xtn)2qNN(tn,xtn;θq)B(tn,xtn),~𝐷𝐷superscript𝜆2𝐵superscriptsubscript𝑡𝑛subscript𝑥subscript𝑡𝑛topsuperscript2superscript𝑞NNsubscript𝑡𝑛subscript𝑥subscript𝑡𝑛subscript𝜃𝑞𝐵subscript𝑡𝑛subscript𝑥subscript𝑡𝑛\tilde{D}=D+\lambda^{2}B(t_{n},x_{t_{n}})^{\top}\nabla^{2}q^{\textrm{NN}}(t_{n% },x_{t_{n}};\theta_{q})B(t_{n},x_{t_{n}}),over~ start_ARG italic_D end_ARG = italic_D + italic_λ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_B ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) italic_B ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ,

    where 3subscript3\mathcal{L}_{3}caligraphic_L start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT is from (3).

Now, we can define the physics-informed learning problem

minΘ𝔼xρ{α11+α22+α33}.subscriptΘsubscript𝔼similar-to𝑥𝜌subscript𝛼1subscript1subscript𝛼2subscript2subscript𝛼3subscript3\min_{\Theta}\mathbb{E}_{x\sim\rho}\left\{\alpha_{1}\mathcal{L}_{1}+\alpha_{2}% \mathcal{L}_{2}+\alpha_{3}\mathcal{L}_{3}\right\}.roman_min start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x ∼ italic_ρ end_POSTSUBSCRIPT { italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_α start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT } .

The coefficients α1>0subscript𝛼10\alpha_{1}>0italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT > 0, α2>0subscript𝛼20\alpha_{2}>0italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT > 0 and α3>0subscript𝛼30\alpha_{3}>0italic_α start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT > 0 are supposed to be fixed.

Finally, we apply a SGD-type algorithm to optimize the parameter ΘΘ\Thetaroman_Θ. The pseudo-code for implementing the above approach is given in Algorithm 1.

Algorithm 1 DeepHJB solver

Input: the initial data {(t0,xt0(i))}1iMsubscriptsubscript𝑡0subscriptsuperscript𝑥𝑖subscript𝑡01𝑖𝑀\{(t_{0},x^{(i)}_{t_{0}})\}_{1\leqslant i\leqslant M}{ ( italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT 1 ⩽ italic_i ⩽ italic_M end_POSTSUBSCRIPT, parameter N𝑁Nitalic_N of partition, parameters θ(0)superscript𝜃0\theta^{(0)}italic_θ start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT of networks, learning rate η𝜂\etaitalic_η, max-step K𝐾Kitalic_K

For k=0𝑘0k=0italic_k = 0 to K1𝐾1K-1italic_K - 1 do

For i=1𝑖1i=1italic_i = 1 to M𝑀Mitalic_M do

For n=0𝑛0n=0italic_n = 0 to N𝑁Nitalic_N do

qNN(tn,xtn(i);θq(k))superscript𝑞NNsubscript𝑡𝑛subscriptsuperscript𝑥𝑖subscript𝑡𝑛superscriptsubscript𝜃𝑞𝑘q^{\textrm{NN}}(t_{n},x^{(i)}_{t_{n}};\theta_{q}^{(k)})italic_q start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT )

uNN(tn,xtn(i);θu(k))superscript𝑢NNsubscript𝑡𝑛subscriptsuperscript𝑥𝑖subscript𝑡𝑛superscriptsubscript𝜃𝑢𝑘u^{\textrm{NN}}(t_{n},x^{(i)}_{t_{n}};\theta_{u}^{(k)})italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT )

while n+1N𝑛1𝑁n+1\leqslant Nitalic_n + 1 ⩽ italic_N do

tn=tn+1tnsubscript𝑡𝑛subscript𝑡𝑛1subscript𝑡𝑛\triangle t_{n}=t_{n+1}-t_{n}△ italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_t start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT - italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT

Wn(i)=Wtn+1(i)Wtn(i)subscriptsuperscript𝑊𝑖𝑛subscriptsuperscript𝑊𝑖subscript𝑡𝑛1subscriptsuperscript𝑊𝑖subscript𝑡𝑛\triangle W^{(i)}_{n}=W^{(i)}_{t_{n+1}}-W^{(i)}_{t_{n}}△ italic_W start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_W start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT - italic_W start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT

xtn+1(i)=xtn(i)+b(tn,xtn(i),uNN)tn+σ(tn,xtn(i),uNN)Wn(i)subscriptsuperscript𝑥𝑖subscript𝑡𝑛1subscriptsuperscript𝑥𝑖subscript𝑡𝑛𝑏subscript𝑡𝑛subscriptsuperscript𝑥𝑖subscript𝑡𝑛superscript𝑢NNsubscript𝑡𝑛𝜎subscript𝑡𝑛subscriptsuperscript𝑥𝑖subscript𝑡𝑛superscript𝑢NNsubscriptsuperscript𝑊𝑖𝑛x^{(i)}_{t_{n+1}}=x^{(i)}_{t_{n}}+b(t_{n},x^{(i)}_{t_{n}},u^{\textrm{NN}})% \triangle t_{n}+\sigma(t_{n},x^{(i)}_{t_{n}},u^{\textrm{NN}})\triangle W^{(i)}% _{n}italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_b ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT ) △ italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT + italic_σ ( italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT ) △ italic_W start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT

end while

end for

end for

Θ(k)=(θu(k),θq(k))superscriptΘ𝑘superscriptsubscript𝜃𝑢𝑘superscriptsubscript𝜃𝑞𝑘\Theta^{(k)}=(\theta_{u}^{(k)},\theta_{q}^{(k)})roman_Θ start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = ( italic_θ start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT , italic_θ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT )

random set Bk{1,2,,M}subscript𝐵𝑘12𝑀B_{k}\subset\{1,2,\cdots,M\}italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ⊂ { 1 , 2 , ⋯ , italic_M }

Loss=1|Bk|iBk{α11(i)(Θ(k))+α22(i)(Θ(k))+α33(i)(Θ(k))}Loss1subscript𝐵𝑘subscript𝑖subscript𝐵𝑘subscript𝛼1subscriptsuperscript𝑖1superscriptΘ𝑘subscript𝛼2subscriptsuperscript𝑖2superscriptΘ𝑘subscript𝛼3subscriptsuperscript𝑖3superscriptΘ𝑘\mathrm{Loss}=\frac{1}{|B_{k}|}\sum\limits_{i\in B_{k}}\left\{\alpha_{1}% \mathcal{L}^{(i)}_{1}(\Theta^{(k)})+\alpha_{2}\mathcal{L}^{(i)}_{2}(\Theta^{(k% )})+\alpha_{3}\mathcal{L}^{(i)}_{3}(\Theta^{(k)})\right\}roman_Loss = divide start_ARG 1 end_ARG start_ARG | italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT { italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT caligraphic_L start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( roman_Θ start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ) + italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT caligraphic_L start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( roman_Θ start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ) + italic_α start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT caligraphic_L start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ( roman_Θ start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ) }

Θ(k+1)=Θ(k)ηLosssuperscriptΘ𝑘1superscriptΘ𝑘𝜂Loss\Theta^{(k+1)}=\Theta^{(k)}-\eta\nabla\mathrm{Loss}roman_Θ start_POSTSUPERSCRIPT ( italic_k + 1 ) end_POSTSUPERSCRIPT = roman_Θ start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT - italic_η ∇ roman_Loss

end for

4 Numerical experiments

In this section, we apply the DeepHJB solver to some SOC problems. In the following subsections, we discuss the controlled Ornstein–Uhlenbeck (OU) dynamics and the controlled metastable dynamics, respectively. To evaluate the proposed solver, we introduce the following L2superscript𝐿2L^{2}italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT error as the performance metric

𝔼[0T|uNNu|2(t,xt)𝑑s],𝔼delimited-[]superscriptsubscript0𝑇superscriptsuperscript𝑢NNsuperscript𝑢2𝑡subscript𝑥𝑡differential-d𝑠\mathbb{E}\left[\int_{0}^{T}|u^{\textrm{NN}}-u^{\ast}|^{2}(t,x_{t})ds\right],blackboard_E [ ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT | italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT - italic_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_t , italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_d italic_s ] ,

where usuperscript𝑢u^{\ast}italic_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the baseline optimal control. The detailed configurations of these experiments can be seen in B.

4.1 Ornstein–Uhlenbeck dynamics

We investigate the controlled system with

A=In×n+(ξij)1i,jn,B=C=In×n+(ξij)1i,jn,λ=0,\begin{split}A&=-I_{n\times n}+(\xi_{ij})_{1\leqslant i,j\leqslant n},\\ B=C&=I_{n\times n}+(\xi_{ij})_{1\leqslant i,j\leqslant n},\quad\lambda=0,\end{split}start_ROW start_CELL italic_A end_CELL start_CELL = - italic_I start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT + ( italic_ξ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT 1 ⩽ italic_i , italic_j ⩽ italic_n end_POSTSUBSCRIPT , end_CELL end_ROW start_ROW start_CELL italic_B = italic_C end_CELL start_CELL = italic_I start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT + ( italic_ξ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT 1 ⩽ italic_i , italic_j ⩽ italic_n end_POSTSUBSCRIPT , italic_λ = 0 , end_CELL end_ROW

where ξij𝒩(0,0.01)similar-tosubscript𝜉𝑖𝑗𝒩00.01\xi_{ij}\sim\mathcal{N}(0,0.01)italic_ξ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , 0.01 ) are sampled once at the beginning of the experiments.

Refer to caption
(a) Optimal control
Refer to caption
(b) L2superscript𝐿2L^{2}italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT error
Figure 2: Performance of the DeepHJB solver for 30303030 dimensional controlled OU dynamics with linear terminal cost. (a) Comparison of uNNsuperscript𝑢NNu^{\textrm{NN}}italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT and usuperscript𝑢u^{\ast}italic_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. Here, 7777 out of the 30303030 components of the optimal control are pictured. (b) L2superscript𝐿2L^{2}italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT error with respect to the iteration step.

For the SOC problem with linear terminal cost, we choose

F=0,D=In×n,γ=(1,,1).formulae-sequence𝐹0formulae-sequence𝐷subscript𝐼𝑛𝑛𝛾superscript11topF=0,\quad D=I_{n\times n},\quad\gamma=(1,\cdots,1)^{\top}.italic_F = 0 , italic_D = italic_I start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT , italic_γ = ( 1 , ⋯ , 1 ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT .

In this situation, the optimal control can be given analytically by

u(t)=BeA(Tt)γ,superscript𝑢𝑡superscript𝐵topsuperscript𝑒superscript𝐴top𝑇𝑡𝛾u^{\ast}(t)=-B^{\top}e^{A^{\top}(T-t)}\gamma,italic_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_t ) = - italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT italic_A start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_T - italic_t ) end_POSTSUPERSCRIPT italic_γ ,

which has been calculated in [21]. We set the initial value to be zero and the terminal time T=1.0𝑇1.0T=1.0italic_T = 1.0. In Figure 2, the subfigure (a) gives a visible comparison of the optimal control between uNNsuperscript𝑢NNu^{\textrm{NN}}italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT calculated by the DeepHJB solver and the baseline usuperscript𝑢u^{\ast}italic_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, while the subfigure (b) shows the evolution of the error L2superscript𝐿2L^{2}italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT against the iteration step. It can be seen that the optimal control approximated by the proposed solver well coincides with the analytical one.

Refer to caption
(a) Optimal control
Refer to caption
(b) L2superscript𝐿2L^{2}italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT error
Figure 3: Performance of the DeepHJB solver for 15151515 dimensional controlled OU dynamics with a quadratic terminal cost. (a) Comparison of uNNsuperscript𝑢NNu^{\textrm{NN}}italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT and usuperscript𝑢u^{\ast}italic_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT for varying time t𝑡titalic_t and at a fixed x=(0.5,,0.5)𝑥0.50.5x=(0.5,\cdots,0.5)italic_x = ( 0.5 , ⋯ , 0.5 ). Here, four components of the optimal control are plotted. (b) L2superscript𝐿2L^{2}italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT error with respect to the iteration step.

Regarding the case with a quadratic terminal cost, we choose

F=12In×n,D=In×n,FT=In×n.\begin{split}F=\frac{1}{2}I_{n\times n},\quad D=I_{n\times n},\quad F_{T}=I_{n% \times n}.\end{split}start_ROW start_CELL italic_F = divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_I start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT , italic_D = italic_I start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT , italic_F start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = italic_I start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT . end_CELL end_ROW

This type of problems has an analytic optimal control

𝐮(t,x)=2BPtxsuperscript𝐮𝑡𝑥2superscript𝐵topsubscript𝑃𝑡𝑥\mathbf{u}^{\ast}(t,x)=-2B^{\top}P_{t}xbold_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_t , italic_x ) = - 2 italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_x

in which Ptsubscript𝑃𝑡P_{t}italic_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT fulfills the Riccati equation

ddtPt+APt+PtA2PtBBPt+F=0𝑑𝑑𝑡subscript𝑃𝑡superscript𝐴topsubscript𝑃𝑡subscript𝑃𝑡𝐴2subscript𝑃𝑡𝐵superscript𝐵topsubscript𝑃𝑡𝐹0\frac{d}{dt}P_{t}+A^{\top}P_{t}+P_{t}A-2P_{t}BB^{\top}P_{t}+F=0divide start_ARG italic_d end_ARG start_ARG italic_d italic_t end_ARG italic_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_A start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_A - 2 italic_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_B italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_F = 0

with PT=FTsubscript𝑃𝑇subscript𝐹𝑇P_{T}=F_{T}italic_P start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = italic_F start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT (see [26, Chapter 6]). We choose the initial value from a pre-specified distribution ? and the terminal time T=0.5𝑇0.5T=0.5italic_T = 0.5. Figure 3 displays the direct comparison and L2superscript𝐿2L^{2}italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT error between the approximation uNNsuperscript𝑢NNu^{\textrm{NN}}italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT and the baseline usuperscript𝑢u^{\ast}italic_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT of the solution to this SOC problem, which illustrates the accuracy of our DeepHJB solver.

4.2 Metastable dynamics

We consider the double well

Ψ(x)=i=1nκi(xi21)2,κi>0.formulae-sequenceΨ𝑥superscriptsubscript𝑖1𝑛subscript𝜅𝑖superscriptsuperscriptsubscript𝑥𝑖212subscript𝜅𝑖0\Psi(x)=\sum_{i=1}^{n}\kappa_{i}(x_{i}^{2}-1)^{2},\quad\kappa_{i}>0.roman_Ψ ( italic_x ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_κ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_κ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT > 0 .

and the controlled system with

A(𝐱t)=Ψ,B=C=In×n,λ=0.formulae-sequenceformulae-sequence𝐴subscript𝐱𝑡Ψ𝐵𝐶subscript𝐼𝑛𝑛𝜆0A(\mathbf{x}_{t})=-\nabla\Psi,\quad B=C=I_{n\times n},\quad\lambda=0.italic_A ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = - ∇ roman_Ψ , italic_B = italic_C = italic_I start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT , italic_λ = 0 .

The initial states in this experiment are (1,,1)superscript11top(-1,\cdots,-1)^{\top}( - 1 , ⋯ , - 1 ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT, and the terminal state is set as (1,,1)superscript11top(1,\cdots,1)^{\top}( 1 , ⋯ , 1 ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT. As for the cost functional, we choose

F=0,D=In×n,FT=diag{ν1,,νi,,νn}formulae-sequence𝐹0formulae-sequence𝐷subscript𝐼𝑛𝑛subscript𝐹𝑇diagsubscript𝜈1subscript𝜈𝑖subscript𝜈𝑛F=0,\quad D=I_{n\times n},\quad F_{T}=\mathrm{diag}\{\nu_{1},\cdots,\nu_{i},% \cdots,\nu_{n}\}italic_F = 0 , italic_D = italic_I start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT , italic_F start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = roman_diag { italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_ν start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , ⋯ , italic_ν start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }

and the terminal time T=1.0𝑇1.0T=1.0italic_T = 1.0.

Refer to caption
(a) Approximation uNNsuperscript𝑢NNu^{\textrm{NN}}italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT
Refer to caption
(b) Baseline usuperscript𝑢u^{\ast}italic_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT
Figure 4: Optimal control
Refer to caption
Figure 5: Absolute error |uNNu|superscript𝑢NNsuperscript𝑢|u^{\textrm{NN}}-u^{\ast}|| italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT - italic_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT |
Refer to caption
Figure 6: Original potential and optimal potential.

Firstly, we study the one-dimensional setting, choosing κ=3𝜅3\kappa=3italic_κ = 3, ν=1𝜈1\nu=1italic_ν = 1. Figure 4 displays the approximation uNNsuperscript𝑢NNu^{\textrm{NN}}italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT of the optimal control computed by the DeepHJB solver and the baseline usuperscript𝑢u^{\ast}italic_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT obtained by a finite difference method. The absolute error between them can be seen in Figure 5. It is clear that the approximation is in close agreement with the baseline. Figure 6 demonstrates the growth of the potential function from an original potential to the optimal potential.

Let us next consider the high-dimensional case, that is, n=5𝑛5n=5italic_n = 5. In particular, we set κi=1.2subscript𝜅𝑖1.2\kappa_{i}=1.2italic_κ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1.2, νi=1subscript𝜈𝑖1\nu_{i}=1italic_ν start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 for i{1,2,3}𝑖123i\in\{1,2,3\}italic_i ∈ { 1 , 2 , 3 } and κi=1subscript𝜅𝑖1\kappa_{i}=1italic_κ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1, νi=1subscript𝜈𝑖1\nu_{i}=1italic_ν start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 for i{4,5}𝑖45i\in\{4,5\}italic_i ∈ { 4 , 5 } . As can be seen, Figure 7 shows two components of the five dimensional approximated optimal control uNNsuperscript𝑢NNu^{\textrm{NN}}italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT as well as the baseline usuperscript𝑢u^{\ast}italic_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, which indicates a good match and illustrates the efficacy of our DeepHJB solver for solving a high dimensional nonlinear SOC problem.

Refer to caption
Figure 7: Comparison of uNNsuperscript𝑢NNu^{\textrm{NN}}italic_u start_POSTSUPERSCRIPT NN end_POSTSUPERSCRIPT and usuperscript𝑢u^{\ast}italic_u start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT at different times.

5 Conclusion

In this paper, we proposed the DeepHJB solver to study the finite time horizon SOC problems for a class of dynamical systems. Although these numerical experiments in this work demonstrate the efficacy of the solver, it still has plenty of room for development. From the viewpoint of theoretical analysis, our future research will be devoted to connecting Lyapunov analysis with the DeepHJB solver, and doing error analysis for the solver. Moreover, we will also exploit the present solver to investigate more control problems of high-dimensional nonlinear systems in practical applications.

Appendix A Proof of Theorem 1

From Theorem 5.1 in Chapter 5.5 of [26], we know the fact that an admissible pair (𝐱t,𝐮(t))subscript𝐱𝑡𝐮𝑡(\mathbf{x}_{t},\mathbf{u}(t))( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_u ( italic_t ) ) is optimal is equivalent to the condition that this pair satisfies the following HJB equation

tq(t,𝐱t)=subscript𝑡𝑞𝑡subscript𝐱𝑡absent\displaystyle-\partial_{t}q(t,\mathbf{x}_{t})=- ∂ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_q ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = H(t,𝐱t,𝐮(t),q(t,𝐱t),2q(t,𝐱t))𝐻𝑡subscript𝐱𝑡𝐮𝑡𝑞𝑡subscript𝐱𝑡superscript2𝑞𝑡subscript𝐱𝑡\displaystyle H\left(t,\mathbf{x}_{t},\mathbf{u}(t),\nabla q(t,\mathbf{x}_{t})% ,\nabla^{2}q(t,\mathbf{x}_{t})\right)italic_H ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_u ( italic_t ) , ∇ italic_q ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) (7)
=\displaystyle== 12tr[σ(t,𝐱t,𝐮(t))2q(t,𝐱t)σ(t,𝐱t,𝐮(t))]12trdelimited-[]𝜎superscript𝑡subscript𝐱𝑡𝐮𝑡topsuperscript2𝑞𝑡subscript𝐱𝑡𝜎𝑡subscript𝐱𝑡𝐮𝑡\displaystyle\frac{1}{2}\mathrm{tr}\left[\sigma(t,\mathbf{x}_{t},\mathbf{u}(t)% )^{\top}\nabla^{2}q(t,\mathbf{x}_{t})\sigma(t,\mathbf{x}_{t},\mathbf{u}(t))\right]divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_tr [ italic_σ ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_u ( italic_t ) ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_σ ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_u ( italic_t ) ) ]
+b(t,𝐱t,𝐮(t))q(t,𝐱t)+ϕ(t,𝐱t,𝐮(t))𝑏𝑡subscript𝐱𝑡𝐮𝑡𝑞𝑡subscript𝐱𝑡italic-ϕ𝑡subscript𝐱𝑡𝐮𝑡\displaystyle+b(t,\mathbf{x}_{t},\mathbf{u}(t))\cdot\nabla q(t,\mathbf{x}_{t})% +\phi(t,\mathbf{x}_{t},\mathbf{u}(t))+ italic_b ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_u ( italic_t ) ) ⋅ ∇ italic_q ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + italic_ϕ ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_u ( italic_t ) )

and

𝐮(t)=argminu𝒰{12tr[σ2qσ]+bq+ϕ}.𝐮𝑡subscriptu𝒰12trdelimited-[]superscript𝜎topsuperscript2𝑞𝜎𝑏𝑞italic-ϕ\mathbf{u}(t)=\mathop{\arg\min}\limits_{\mathrm{u}\in\mathcal{U}}\Big{\{}\frac% {1}{2}\mathrm{tr}\left[\sigma^{\top}\nabla^{2}q\sigma\right]+b\cdot\nabla q+% \phi\Big{\}}.bold_u ( italic_t ) = start_BIGOP roman_arg roman_min end_BIGOP start_POSTSUBSCRIPT roman_u ∈ caligraphic_U end_POSTSUBSCRIPT { divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_tr [ italic_σ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q italic_σ ] + italic_b ⋅ ∇ italic_q + italic_ϕ } .

Due to the specific expression of b𝑏bitalic_b, σ𝜎\sigmaitalic_σ and ϕitalic-ϕ\phiitalic_ϕ, we have

𝐮(t)=argminu𝒰Λ(u)𝐮𝑡subscript𝑢𝒰Λ𝑢\mathbf{u}(t)=\mathop{\arg\min}\limits_{u\in\mathcal{U}}\Lambda(u)bold_u ( italic_t ) = start_BIGOP roman_arg roman_min end_BIGOP start_POSTSUBSCRIPT italic_u ∈ caligraphic_U end_POSTSUBSCRIPT roman_Λ ( italic_u )

with

Λ(u):=assignΛ𝑢absent\displaystyle\Lambda(u):=roman_Λ ( italic_u ) := 12λ2tr[(B(t,𝐱t)u(t))2q(t,𝐱t)B(t,𝐱t)u(t)]12superscript𝜆2trdelimited-[]superscript𝐵𝑡subscript𝐱𝑡𝑢𝑡topsuperscript2𝑞𝑡subscript𝐱𝑡𝐵𝑡subscript𝐱𝑡𝑢𝑡\displaystyle\frac{1}{2}\lambda^{2}\mathrm{tr}\left[(B(t,\mathbf{x}_{t})u(t))^% {\top}\nabla^{2}q(t,\mathbf{x}_{t})B(t,\mathbf{x}_{t})u(t)\right]divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_λ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_tr [ ( italic_B ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_u ( italic_t ) ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_B ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_u ( italic_t ) ]
+B(t,𝐱t)u(t)q(t,𝐱t)+12u(t)Du(t).𝐵𝑡subscript𝐱𝑡𝑢𝑡𝑞𝑡subscript𝐱𝑡12𝑢superscript𝑡top𝐷𝑢𝑡\displaystyle+B(t,\mathbf{x}_{t})u(t)\cdot\nabla q(t,\mathbf{x}_{t})+\frac{1}{% 2}u(t)^{\top}Du(t).+ italic_B ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_u ( italic_t ) ⋅ ∇ italic_q ( italic_t , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_u ( italic_t ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_D italic_u ( italic_t ) .

Since we have dΛdu|u=u=0evaluated-at𝑑Λ𝑑𝑢𝑢u0\frac{d\Lambda}{du}\big{|}_{u=\textbf{u}}=0divide start_ARG italic_d roman_Λ end_ARG start_ARG italic_d italic_u end_ARG | start_POSTSUBSCRIPT italic_u = u end_POSTSUBSCRIPT = 0, that is,

λ2B2qBu+Bq+Du=0,superscript𝜆2superscript𝐵topsuperscript2𝑞𝐵usuperscript𝐵top𝑞𝐷u0\lambda^{2}B^{\top}\nabla^{2}qB\textbf{u}+B^{\top}\nabla q+D\textbf{u}=0,italic_λ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q italic_B u + italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∇ italic_q + italic_D u = 0 ,

then we have

u=(D+λ2B2qB)1Bq.usuperscript𝐷superscript𝜆2superscript𝐵topsuperscript2𝑞𝐵1superscript𝐵top𝑞\textbf{u}=-(D+\lambda^{2}B^{\top}\nabla^{2}qB)^{-1}B^{\top}\nabla q.u = - ( italic_D + italic_λ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q italic_B ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∇ italic_q . (8)

Plugging the expression of the optimal control (8) into equation (7), we obtain the desired equation (4).

Appendix B Experiment configuration

We introduce the following fully connected feedforward neural network

z(1)(x,θ)superscript𝑧1𝑥𝜃\displaystyle z^{(1)}(x,\theta)italic_z start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ( italic_x , italic_θ ) =W(1)x+b(1),absentsuperscript𝑊1𝑥superscript𝑏1\displaystyle=W^{(1)}x+b^{(1)},= italic_W start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT italic_x + italic_b start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ,
z¯(l)(x,θ)superscript¯𝑧𝑙𝑥𝜃\displaystyle\bar{z}^{(l)}(x,\theta)over¯ start_ARG italic_z end_ARG start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ( italic_x , italic_θ ) =σ(z(l)(x,θ)),l=1,2,,L1,formulae-sequenceabsent𝜎superscript𝑧𝑙𝑥𝜃𝑙12𝐿1\displaystyle=\sigma(z^{(l)}(x,\theta)),\quad l=1,2,\cdots,L-1,= italic_σ ( italic_z start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ( italic_x , italic_θ ) ) , italic_l = 1 , 2 , ⋯ , italic_L - 1 ,
z(l+1)(x,θ)superscript𝑧𝑙1𝑥𝜃\displaystyle z^{(l+1)}(x,\theta)italic_z start_POSTSUPERSCRIPT ( italic_l + 1 ) end_POSTSUPERSCRIPT ( italic_x , italic_θ ) =W(l+1)z¯(l)(x,θ)+b(l+1),l=1,2,,L1,formulae-sequenceabsentsuperscript𝑊𝑙1superscript¯𝑧𝑙𝑥𝜃superscript𝑏𝑙1𝑙12𝐿1\displaystyle=W^{(l+1)}\bar{z}^{(l)}(x,\theta)+b^{(l+1)},\quad l=1,2,\cdots,L-1,= italic_W start_POSTSUPERSCRIPT ( italic_l + 1 ) end_POSTSUPERSCRIPT over¯ start_ARG italic_z end_ARG start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ( italic_x , italic_θ ) + italic_b start_POSTSUPERSCRIPT ( italic_l + 1 ) end_POSTSUPERSCRIPT , italic_l = 1 , 2 , ⋯ , italic_L - 1 ,

where we refer to σ::𝜎\sigma:\mathbb{R}\rightarrow\mathbb{R}italic_σ : blackboard_R → blackboard_R as the activation function, to L𝐿Litalic_L as the number of layers, and to N0subscript𝑁0N_{0}italic_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, NLsubscript𝑁𝐿N_{L}italic_N start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT, and Nlsubscript𝑁𝑙N_{l}italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT as he number of neurons in the input, output, and l𝑙litalic_l-th hidden layer, respectively. We denote by 𝒜=(N,σ)𝒜𝑁𝜎\mathcal{A}=(N,\sigma)caligraphic_A = ( italic_N , italic_σ ), N=(N0,N1,,NL)L+1𝑁subscript𝑁0subscript𝑁1subscript𝑁𝐿superscript𝐿1N=(N_{0},N_{1},\cdots,N_{L})\in\mathbb{N}^{L+1}italic_N = ( italic_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_N start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ) ∈ blackboard_N start_POSTSUPERSCRIPT italic_L + 1 end_POSTSUPERSCRIPT, the architecture of the neural network.

The computational framework for our numerical examples is conducted by the following architecture.

  • 1.

    Controlled OU dynamics in Section 4.1.

    • (a)

      For the linear terminal cost, the architecture is given by

      𝒜1=((31,32,32,32,1),tanh),𝒜2=((31,32,32,32,30),tanh),formulae-sequencesubscript𝒜1313232321subscript𝒜23132323230\begin{split}\mathcal{A}_{1}&=((31,32,32,32,1),\tanh),\\ \mathcal{A}_{2}&=((31,32,32,32,30),\tanh),\end{split}start_ROW start_CELL caligraphic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL start_CELL = ( ( 31 , 32 , 32 , 32 , 1 ) , roman_tanh ) , end_CELL end_ROW start_ROW start_CELL caligraphic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL start_CELL = ( ( 31 , 32 , 32 , 32 , 30 ) , roman_tanh ) , end_CELL end_ROW

      while for the quadratic terminal cost it is given by

      𝒜1=((16,64,64,64,1),tanh),𝒜2=((16,64,64,64,15),tanh).formulae-sequencesubscript𝒜1166464641subscript𝒜21664646415\begin{split}\mathcal{A}_{1}&=((16,64,64,64,1),\tanh),\\ \mathcal{A}_{2}&=((16,64,64,64,15),\tanh).\end{split}start_ROW start_CELL caligraphic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL start_CELL = ( ( 16 , 64 , 64 , 64 , 1 ) , roman_tanh ) , end_CELL end_ROW start_ROW start_CELL caligraphic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL start_CELL = ( ( 16 , 64 , 64 , 64 , 15 ) , roman_tanh ) . end_CELL end_ROW
  • 2.

    Controlled metastable dynamics in Section 4.2.

    • (a)

      For one-dimensional case, the architecture is given by

      𝒜1=((2,128,128,128,128,1),tanh),𝒜2=((2,128,128,128,128,1),tanh),formulae-sequencesubscript𝒜121281281281281subscript𝒜221281281281281\begin{split}\mathcal{A}_{1}&=((2,128,128,128,128,1),\tanh),\\ \mathcal{A}_{2}&=((2,128,128,128,128,1),\tanh),\end{split}start_ROW start_CELL caligraphic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL start_CELL = ( ( 2 , 128 , 128 , 128 , 128 , 1 ) , roman_tanh ) , end_CELL end_ROW start_ROW start_CELL caligraphic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL start_CELL = ( ( 2 , 128 , 128 , 128 , 128 , 1 ) , roman_tanh ) , end_CELL end_ROW

      while for ten-dimensional case it is given by

      𝒜1=((6,128,128,128,128,1),tanh),𝒜2=((6,128,128,128,128,5),tanh).formulae-sequencesubscript𝒜161281281281281subscript𝒜261281281281285\begin{split}\mathcal{A}_{1}&=((6,128,128,128,128,1),\tanh),\\ \mathcal{A}_{2}&=((6,128,128,128,128,5),\tanh).\end{split}start_ROW start_CELL caligraphic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL start_CELL = ( ( 6 , 128 , 128 , 128 , 128 , 1 ) , roman_tanh ) , end_CELL end_ROW start_ROW start_CELL caligraphic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL start_CELL = ( ( 6 , 128 , 128 , 128 , 128 , 5 ) , roman_tanh ) . end_CELL end_ROW

The computing device that we use for our solver includes a single NVIDIA GeForce RTX 2080Ti GPU with 11GB memory. Codes will be publicly available at https://github.com/zhezhejiao/DeepHJB after being accepted.

References

  • [1] H. Pham, Continuous-time stochastic control and optimization with financial applications, Vol. 61, Springer Science & Business Media, 2009.
  • [2] Y. Gao, T. Li, X. Li, J.-G. Liu, Transition path theory for langevin dynamics on manifolds: Optimal control and data-driven solver, Multiscale Modeling & Simulation 21 (1) (2023) 1–33.
  • [3] E. Todorov, Optimality principles in sensorimotor control, Nature Neuroscience 7 (2004) 907–915.
  • [4] T. Russ, Robotic Manipulation: Perception, Planning, and Control, Draft textbook, 2023.
  • [5] L. S. Pontrygin, Mathematical Theory of Optimal Processes, CRC Press, 1987.
  • [6] R. Bellman, Dynamic programming and stochastic control processes, Information and Control 1 (3) (1958) 228–239.
  • [7] H. J. Kushner, Numerical methods for stochastic control problems in continuous time, SIAM Journal on Control and Optimization 28 (5) (1990) 999–1048.
  • [8] Z. Jin, M. Qiu, K. Q. Tran, G. Yin, A survey of numerical solutions for stochastic control problems: Some recent progress, Numerical Algebra, Control and Optimization 12 (2) (2022) 213–253.
  • [9] I. Exarchos, E. A. Theodorou, Stochastic optimal control via forward and backward stochastic differential equations and importance sampling, Automatica 87 (2018) 159–165.
  • [10] A. Gorodetsky, S. Karaman, Y. Marzouk, Efficient high-dimensional stochastic optimal motion control using tensor-train decomposition., in: Robotics: Science and Systems, 2015.
  • [11] J. Han, W. E, Deep learning approximation for stochastic control problems, in: Deep Reinforcement Learning Workshop, 2016.
  • [12] Z. Wang, M. Pereira, T. Chen, E. Theodorou, E. Reed, Deep 2FBSDEs for systems with control multiplicative noise, arXiv:1906.04762.
  • [13] X. Li, D. Verma, L. Ruthotto, A neural network approach for stochastic optimal control, SIAM Journal on Scientific Computing 46 (5) (2024) C535–C556.
  • [14] W. Cai, S. Fang, T. Zhou, Soc-Martnet: A martingale neural network for the Hamilton-Jacobi-Bellman equation without explicit inf H in stochastic optimal controls, arXiv preprint arXiv:2405.03169.
  • [15] J.-P. Fouque, Z. Zhang, Deep learning methods for mean field control problems with delay, Frontiers in Applied Mathematics and Statistics 6 (2020) 11.
  • [16] R. Carmona, M. Lauriére, Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games II: the finite horizon case, Annals of Applied Probability 32 (6) (2022) 4065–4105.
  • [17] S. Jin, S. Peng, Y. Peng, X. Zhang, Solving stochastic optimal control problem via stochastic maximum principle with deep learning method, Journal of Scientific Computing 93 (2022) 30.
  • [18] C. Huré, H. Pham, A. Bachouch, N. Langrené, Deep neural networks algorithms for stochastic control problems on finite horizon: Convergence analysis, SIAM Journal on Numerical Analysis 59 (1) (2021) 525–557.
  • [19] A. Bachouch, C. Huré, N. Langrené, H. Pham, Deep neural networks algorithms for stochastic control problems on finite horizon: Numerical applications, Methodology and Computing in Applied Probability 24 (1) (2022) 143–178.
  • [20] M. Pereira, Z. Wang, T. Chen, E. Reed, E. Theodorou, Feynman-Kac neural network architectures for stochastic control using second-order fbsde theory, in: Proceedings of the 2nd Conference on Learning for Dynamics and Control, 2020.
  • [21] N. Nüsken, L. Richter, Solving high-dimensional Hamilton–Jacobi–Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space, Partial Differential Equations and Applications 2 (4) (2021) 48.
  • [22] M. Hua, M. Laurière, E. Vanden-Eijnden, A simulation-free deep learning approach to stochastic optimal control, arXiv preprint arXiv:2410.05163.
  • [23] M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics 378 (2019) 686–707.
  • [24] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, L. Yang, Physics-informed machine learning, Nature Reviews Physics 3 (6) (2021) 422–440.
  • [25] K. Chung, R. Williams, Introduction to Stochastic Integration, Springer, 2013.
  • [26] J. Yong, X. Y. Zhou, Stochastic Controls, Springer, 1991.
  • [27] P. E. Kloeden, E. Platen, Numerical Solution of Stochastic Differential Equations, Springer, 1999.