Q prop

  • 6.
    $$S_{t+1} \sim P(s' \mid S_t, A_t), \qquad r_{t+1} = r(S_t, A_t, S_{t+1}), \qquad A_t \sim \pi(a' \mid S_t)$$
  • 7.
    $$S_{t+1} \sim P(s' \mid S_t, A_t), \qquad r_{t+1} = r(S_t, A_t, S_{t+1}), \qquad A_t \sim \pi(a' \mid S_t)$$
    $$\pi^* = \arg\max_{\pi} E_{\pi}\Big[\sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\Big]$$
  • 8.
    $$\pi^* = \arg\max_{\pi} \underbrace{E_{\pi}\Big[\sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\Big]}_{=\,J}$$
    $$\nabla_\theta J$$
  • 9.
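    $$\nabla_\theta J = E_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_t\big]$$
    $$\nabla_\theta J = E_{s \sim \rho}\Big[\nabla_a Q^{\mu}(s, a)\big|_{a=\mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)\Big]$$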
  • 10.
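    $$\begin{aligned}
    \nabla_\theta J &= \nabla_\theta E_{\pi_\theta}\Big[\sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\Big] \\
    &= \nabla_\theta E_{s_0 \sim \rho,\, s' \sim p}\Big[\prod_{t=0} \pi_\theta(a_t \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\Big] \\
    &= E_{s_0 \sim \rho,\, s' \sim p}\Big[\nabla_\theta \prod_{t=0} \pi_\theta(a_t \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\Big] \\
    &= E_{s \sim \rho}\bigg[\prod_{t=0} \pi_\theta(a_t \mid s_t)\, \frac{\nabla_\theta \prod_{t=0} \pi_\theta(a_t \mid s_t)}{\prod_{t=0} \pi_\theta(a_t \mid s_t)} \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\bigg] \\
    &= E_{s \sim \rho}\Big[\prod_{t=0} \pi_\theta(a_t \mid s_t) \sum_{t=0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\Big] \\
    &= E_{\pi_\theta}\Big[\sum_{t=0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{\tau=t}^{\infty} \gamma^{\tau} r_{\tau}\Big]
    \end{aligned}$$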
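The last line of this derivation weights each score term by the discounted rewards from step t onward. A minimal sketch of that inner sum, assuming a finite episode; the helper name `discounted_rewards_to_go` is hypothetical, and the code keeps the $\gamma^{\tau}$ weighting as written on the slide rather than the also common $\gamma^{\tau - t}$ form:

```python
def discounted_rewards_to_go(rewards, gamma):
    """Return [sum_{tau >= t} gamma**tau * rewards[tau] for each t]."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the tail sum of the next step.
    for t in reversed(range(len(rewards))):
        running += (gamma ** t) * rewards[t]
        out[t] = running
    return out


weights = discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)
print(weights)
```

Each entry of `weights` would multiply the matching $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ term in a sampled gradient estimate.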
  • 11.
    $$\big(\nabla \log p(x)\big)\, f(x)$$
  • 12.
    $$\big(\nabla \log p(x)\big)\, f(x)$$
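The expression $(\nabla \log p(x)) f(x)$ is the per-sample term of the likelihood-ratio (score function) gradient estimator. A small Monte Carlo check, under the assumption that $p$ is a unit-variance Gaussian with mean $\theta$ and $f(x) = x^2$ (both chosen here purely for illustration), so that the exact gradient $2\theta$ is known:

```python
import random


def score_function_gradient(theta, f, n_samples=200_000, seed=0):
    """Estimate d/dtheta E_{x ~ N(theta, 1)}[f(x)] as mean(grad log p(x) * f(x))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)
        score = x - theta  # d/dtheta log N(x; theta, 1)
        total += score * f(x)
    return total / n_samples


theta = 1.5
estimate = score_function_gradient(theta, lambda x: x * x)
exact = 2 * theta  # E[x^2] = theta^2 + 1, so the derivative is 2 * theta
print(estimate, exact)
```

The estimate is unbiased but noisy, which is the motivation for variance-reduction techniques such as control variates.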
  • 14.
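    $$J = E_{s \sim \rho}\big[Q^{\mu_\theta}(s, \mu_\theta(s))\big]$$
    $$\begin{aligned}
    \nabla_\theta J &= E_{s \sim \rho}\big[\nabla_\theta Q^{\mu}(s, \mu_\theta(s))\big] \\
    &= E_{s \sim \rho}\Big[\nabla_a Q^{\mu}(s, a)\big|_{a=\mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)\Big]
    \end{aligned}$$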
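The chain-rule form of the deterministic gradient can be checked numerically on a toy problem. The sketch below assumes a one-step setting with a fixed, known critic $Q(s, a) = -(a - 2s)^2$ and a linear policy $\mu_\theta(s) = \theta s$; these functions and names are illustrative assumptions, not part of the slides:

```python
import random


def q_value(s, a):
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)


def mu(theta, s):
    return theta * s


def dmu_dtheta(theta, s):
    return s


rng = random.Random(0)
states = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]


def objective(theta):
    """Sample average of Q(s, mu_theta(s)) over the fixed batch of states."""
    return sum(q_value(s, mu(theta, s)) for s in states) / len(states)


theta = 0.5
# Chain rule: grad_a Q evaluated at a = mu_theta(s), times grad_theta mu_theta(s).
dpg_gradient = sum(
    dq_da(s, mu(theta, s)) * dmu_dtheta(theta, s) for s in states
) / len(states)

h = 1e-4
finite_difference = (objective(theta + h) - objective(theta - h)) / (2 * h)
print(dpg_gradient, finite_difference)
```

On the same batch of states the two numbers agree up to floating-point error, since the objective is quadratic in $\theta$.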
  • 21.
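    $$\bar{f}(s_t, a_t) = f(s_t, \bar{a}_t) + \nabla_a f(s_t, a)\big|_{a=\bar{a}_t}\,(a_t - \bar{a}_t)$$
    $$\begin{aligned}
    \nabla_\theta J &= E_{\rho,\pi}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(Q(s_t, a_t) - \bar{f}(s_t, a_t)\big)\Big] + E_{\rho,\pi}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \bar{f}(s_t, a_t)\Big] \\
    &= E_{\rho,\pi}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(Q(s_t, a_t) - \bar{f}(s_t, a_t)\big)\Big] + E_{\rho,\pi}\Big[\nabla_a f(s_t, a)\big|_{a=\bar{a}_t}\, \nabla_\theta \mu_\theta(s_t)\Big]
    \end{aligned}$$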
  • 23.
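    $$\nabla_\theta J = E_{\rho,\pi}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(Q(s_t, a_t) - \bar{Q}_w(s_t, a_t)\big)\Big] + E_{\rho,\pi}\Big[\nabla_a Q_w(s_t, a)\big|_{a=\bar{a}_t}\, \nabla_\theta \mu_\theta(s_t)\Big]$$
    $$\nabla_\theta J = E_{\rho,\pi}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(A(s_t, a_t) - \bar{A}_w(s_t, a_t)\big)\Big] + E_{\rho,\pi}\Big[\nabla_a Q_w(s_t, a)\big|_{a=\bar{a}_t}\, \nabla_\theta \mu_\theta(s_t)\Big]$$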
  • 24.
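    $$\nabla_\theta J = E_{\rho,\pi}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(A(s_t, a_t) - \bar{A}_w(s_t, a_t)\big)\Big] + E_{\rho,\pi}\Big[\nabla_a Q_w(s_t, a)\big|_{a=\bar{a}_t}\, \nabla_\theta \mu_\theta(s_t)\Big]$$
    $$\begin{aligned}
    \bar{A}_w &= \bar{Q}_w(s_t, a_t) - E_{\pi}\big[\bar{Q}_w(s_t, a_t)\big] \\
    &= Q_w(s_t, \mu_\theta(s_t)) + \nabla_a Q_w(s_t, a)\big|_{a=\mu_\theta(s_t)}\,(a_t - \mu_\theta(s_t)) - E_{\pi}\Big[Q_w(s_t, \mu_\theta(s_t)) + \nabla_a Q_w(s_t, a)\big|_{a=\mu_\theta(s_t)}\,(a_t - \mu_\theta(s_t))\Big] \\
    &= \nabla_a Q_w(s_t, a)\big|_{a=\mu_\theta(s_t)}\,(a_t - \mu_\theta(s_t))
    \end{aligned}$$
    $$r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$
    $$E_{\pi}[a_t] = \mu_\theta(s_t)$$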
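To see what subtracting the linearized advantage buys, here is a toy comparison of the plain likelihood-ratio estimator against the control-variate form above. Everything in it is an assumption made for illustration: a one-step, stateless problem with $Q(a) = -(a - 2)^2$, a Gaussian policy $a \sim N(\theta, \sigma^2)$ so that $\mu_\theta = \theta$ and $\nabla_\theta \mu_\theta = 1$, and a critic $Q_w$ that equals $Q$ exactly.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def gradient_samples(theta=0.0, sigma=0.5, n_samples=100_000, seed=0):
    """Per-sample gradient estimates: plain likelihood ratio vs. control-variate form."""
    rng = random.Random(seed)
    dq_at_mean = -2.0 * (theta - 2.0)              # grad_a Q_w at a = mu_theta
    expected_q = -((theta - 2.0) ** 2) - sigma ** 2  # E_pi[Q(a)] in closed form
    plain, corrected = [], []
    for _ in range(n_samples):
        a = rng.gauss(theta, sigma)
        score = (a - theta) / sigma ** 2           # grad_theta log pi_theta(a)
        advantage = -((a - 2.0) ** 2) - expected_q
        linear_advantage = dq_at_mean * (a - theta)  # first-order Taylor advantage
        plain.append(score * advantage)
        # Residual likelihood-ratio term plus the analytic term (grad_theta mu = 1).
        corrected.append(score * (advantage - linear_advantage) + dq_at_mean)
    return plain, corrected


plain, corrected = gradient_samples()
true_gradient = 4.0  # d/dtheta E[-(a - 2)^2] = -2 * (theta - 2) at theta = 0
print(mean(plain), variance(plain))
print(mean(corrected), variance(corrected))
```

Both estimators average to the same gradient, while the corrected one has a much smaller sample variance in this toy setting.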
  • 26.
    $$m^* = m - \eta\,(t - \tau)$$
    $$E[m^*] = E[m]$$
    $$\mathrm{Var}[m^*] = \mathrm{Var}[m] - 2\eta\,\mathrm{Cov}[m, t] + \eta^2\,\mathrm{Var}[t]$$
    $$\eta^* = \frac{\mathrm{Cov}[m, t]}{\mathrm{Var}[t]}$$
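A numerical illustration of these control-variate identities, assuming a target $m = e^{u}$ with $u$ uniform on $[0, 1]$ and the control $t = u$ with known mean $\tau = 0.5$ (an example chosen here, not taken from the slides). Estimating $\eta^*$ from the same samples adds a small bias that is ignored in this sketch:

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


rng = random.Random(0)
t = [rng.random() for _ in range(50_000)]  # control samples, E[t] = tau
m = [math.exp(u) for u in t]               # target samples, E[m] = e - 1
tau = 0.5

eta_star = covariance(m, t) / covariance(t, t)  # Cov[m, t] / Var[t]
m_star = [mi - eta_star * (ti - tau) for mi, ti in zip(m, t)]

print(mean(m), covariance(m, m))
print(mean(m_star), covariance(m_star, m_star))
```

The corrected samples keep the same mean while their variance drops sharply, because $e^{u}$ and $u$ are strongly correlated.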
  • 27.
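    $$\nabla_\theta J = E_{\rho,\pi}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(A(s_t, a_t) - \eta(s_t)\,\bar{A}_w(s_t, a_t)\big)\Big] + E_{\rho,\pi}\Big[\eta(s_t)\, \nabla_a Q_w(s_t, a)\big|_{a=\bar{a}_t}\, \nabla_\theta \mu_\theta(s_t)\Big]$$
    $$\mathrm{Var}\big[A - \eta \bar{A}_w\big] = \mathrm{Var}[A] - 2\eta\,\mathrm{Cov}(A, \bar{A}_w) + \eta^2\,\mathrm{Var}(\bar{A}_w)$$
    $$\eta^* = \frac{\mathrm{Cov}(A, \bar{A}_w)}{\mathrm{Var}(\bar{A}_w)}$$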