Commit f4bc8ef

Author: Ervin T

Refactor reward signals into separate class (#2144)

* Create new class (RewardSignal) that represents a reward signal.
* Add value heads for each reward signal in the PPO model.
* Make summaries agnostic to the type of reward signals, and log weighted rewards per reward signal.
* Move extrinsic and curiosity rewards into this new structure.
* Allow defining multiple reward signals in YAML file. Add documentation for this new structure.
1 parent ed1b76e commit f4bc8ef

23 files changed: +1303 -637 lines
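This excerpt only includes the config and documentation changes; the new `RewardSignal` class itself does not appear below. As a rough, hypothetical sketch of the abstraction the commit message describes (class names, method names, and signatures here are assumptions for illustration, not the repository's actual API):

```python
# Illustrative sketch only (not the actual ml-agents classes): each reward signal owns
# its own strength and gamma, and produces a per-transition reward that the trainer can
# weight, discount, and log separately.
from abc import ABC, abstractmethod
from typing import Dict

import numpy as np


class RewardSignal(ABC):
    def __init__(self, strength: float, gamma: float):
        self.strength = strength  # multiplier applied to this signal's raw reward
        self.gamma = gamma        # discount factor used for this signal's returns

    @abstractmethod
    def evaluate(self, batch: Dict[str, np.ndarray]) -> np.ndarray:
        """Return the scaled reward for every transition in the batch."""


class ExtrinsicRewardSignal(RewardSignal):
    """Wraps the reward already handed back by the environment."""

    def evaluate(self, batch: Dict[str, np.ndarray]) -> np.ndarray:
        return self.strength * batch["environment_rewards"]
```

Per the commit message, the PPO model also gains one value head per reward signal, so each signal's return is estimated separately; that part is not sketched here.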

config/trainer_config.yaml

Lines changed: 38 additions & 18 deletions
@@ -4,7 +4,6 @@ default:
     beta: 5.0e-3
     buffer_size: 10240
     epsilon: 0.2
-    gamma: 0.99
     hidden_units: 128
     lambd: 0.95
     learning_rate: 3.0e-4
@@ -17,14 +16,15 @@ default:
     sequence_length: 64
     summary_freq: 1000
     use_recurrent: false
-    use_curiosity: false
-    curiosity_strength: 0.01
-    curiosity_enc_size: 128
+    reward_signals:
+        extrinsic:
+            strength: 1.0
+            gamma: 0.99

 BananaLearning:
     normalize: false
-    batch_size: 1024
     beta: 5.0e-3
+    batch_size: 1024
     buffer_size: 10240
     max_steps: 1.0e5

@@ -93,10 +93,7 @@ GoalieLearning:
     normalize: false

 PyramidsLearning:
-    use_curiosity: true
     summary_freq: 2000
-    curiosity_strength: 0.01
-    curiosity_enc_size: 256
     time_horizon: 128
     batch_size: 128
     buffer_size: 2048
@@ -105,11 +102,16 @@ PyramidsLearning:
     beta: 1.0e-2
     max_steps: 5.0e5
     num_epoch: 3
-
+    reward_signals:
+        extrinsic:
+            strength: 1.0
+            gamma: 0.99
+        curiosity:
+            strength: 0.02
+            gamma: 0.99
+            encoding_size: 256
+
 VisualPyramidsLearning:
-    use_curiosity: true
-    curiosity_strength: 0.01
-    curiosity_enc_size: 256
     time_horizon: 128
     batch_size: 64
     buffer_size: 2024
@@ -118,6 +120,14 @@ VisualPyramidsLearning:
     beta: 1.0e-2
     max_steps: 5.0e5
     num_epoch: 3
+    reward_signals:
+        extrinsic:
+            strength: 1.0
+            gamma: 0.99
+        curiosity:
+            strength: 0.01
+            gamma: 0.99
+            encoding_size: 256

 3DBallLearning:
     normalize: true
@@ -126,7 +136,6 @@ VisualPyramidsLearning:
     summary_freq: 1000
     time_horizon: 1000
     lambd: 0.99
-    gamma: 0.995
     beta: 0.001

 3DBallHardLearning:
@@ -136,8 +145,11 @@ VisualPyramidsLearning:
     summary_freq: 1000
     time_horizon: 1000
     max_steps: 5.0e5
-    gamma: 0.995
     beta: 0.001
+    reward_signals:
+        extrinsic:
+            strength: 1.0
+            gamma: 0.995

 TennisLearning:
     normalize: true
@@ -149,35 +161,44 @@ CrawlerStaticLearning:
     time_horizon: 1000
     batch_size: 2024
     buffer_size: 20240
-    gamma: 0.995
     max_steps: 1e6
     summary_freq: 3000
     num_layers: 3
     hidden_units: 512
+    reward_signals:
+        extrinsic:
+            strength: 1.0
+            gamma: 0.995

 CrawlerDynamicLearning:
     normalize: true
     num_epoch: 3
     time_horizon: 1000
     batch_size: 2024
     buffer_size: 20240
-    gamma: 0.995
     max_steps: 1e6
     summary_freq: 3000
     num_layers: 3
     hidden_units: 512
+    reward_signals:
+        extrinsic:
+            strength: 1.0
+            gamma: 0.995

 WalkerLearning:
     normalize: true
     num_epoch: 3
     time_horizon: 1000
     batch_size: 2048
     buffer_size: 20480
-    gamma: 0.995
     max_steps: 2e6
     summary_freq: 3000
     num_layers: 3
     hidden_units: 512
+    reward_signals:
+        extrinsic:
+            strength: 1.0
+            gamma: 0.995

 ReacherLearning:
     normalize: true
@@ -196,7 +217,6 @@ HallwayLearning:
     hidden_units: 128
     memory_size: 256
     beta: 1.0e-2
-    gamma: 0.99
     num_epoch: 3
     buffer_size: 1024
     batch_size: 128
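
To make the semantics of the new keys concrete: each signal carries its own `strength` (a multiplier on its raw reward) and `gamma` (its own discount factor), as documented in `docs/Training-RewardSignals.md` below. The following is a small illustrative sketch of that weighting and discounting, not the trainer's actual return computation; the per-step rewards are made-up numbers:

```python
# Sketch of how per-signal strength and gamma from the YAML above combine: each signal's
# raw rewards are scaled by its strength and discounted with its own gamma, and the agent
# is trained to maximize the sum over signals.
import numpy as np

config = {
    "extrinsic": {"strength": 1.0, "gamma": 0.99},
    "curiosity": {"strength": 0.02, "gamma": 0.99},
}

# Raw per-step rewards produced by each signal over one short episode (invented numbers).
raw_rewards = {
    "extrinsic": np.array([0.0, 0.0, 1.0]),
    "curiosity": np.array([0.3, 0.1, 0.05]),
}


def discounted_return(rewards: np.ndarray, gamma: float) -> float:
    """Sum of gamma**t * r_t from the start of the episode."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))


total = sum(
    cfg["strength"] * discounted_return(raw_rewards[name], cfg["gamma"])
    for name, cfg in config.items()
)
print(f"weighted return: {total:.4f}")
```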

docs/Training-PPO.md

Lines changed: 14 additions & 31 deletions
@@ -7,6 +7,10 @@ observations to the best action an agent can take in a given state. The
 ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
 Python process (communicating with the running Unity application over a socket).

+To train an agent, you will need to provide the agent with one or more reward signals which
+the agent should attempt to maximize. See [Reward Signals](Training-RewardSignals.md)
+for the available reward signals and the corresponding hyperparameters.
+
 See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the
 training program, `learn.py`.

@@ -31,15 +35,18 @@ of performance you would like.

 ## Hyperparameters

-### Gamma
+### Reward Signals

-`gamma` corresponds to the discount factor for future rewards. This can be
-thought of as how far into the future the agent should care about possible
-rewards. In situations when the agent should be acting in the present in order
-to prepare for rewards in the distant future, this value should be large. In
-cases when rewards are more immediate, it can be smaller.
+In reinforcement learning, the goal is to learn a Policy that maximizes reward.
+At a base level, the reward is given by the environment. However, we could imagine
+rewarding the agent for various behaviors. For instance, we could reward
+the agent for exploring new states, rather than just when an explicit reward is given.
+Furthermore, we could mix reward signals to help the learning process.

-Typical Range: `0.8` - `0.995`
+`reward_signals` provides a section to define [reward signals](Training-RewardSignals.md).
+ML-Agents provides two reward signals by default: the Extrinsic (environment) reward and the
+Curiosity reward, which can be used to encourage exploration in sparse extrinsic reward
+environments.

 ### Lambda

@@ -184,30 +191,6 @@ the agent will need to remember in order to successfully complete the task.

 Typical Range: `64` - `512`

-## (Optional) Intrinsic Curiosity Module Hyperparameters
-
-The below hyperparameters are only used when `use_curiosity` is set to true.
-
-### Curiosity Encoding Size
-
-`curiosity_enc_size` corresponds to the size of the hidden layer used to encode
-the observations within the intrinsic curiosity module. This value should be
-small enough to encourage the curiosity module to compress the original
-observation, but also not too small to prevent it from learning the dynamics of
-the environment.
-
-Typical Range: `64` - `256`
-
-### Curiosity Strength
-
-`curiosity_strength` corresponds to the magnitude of the intrinsic reward
-generated by the intrinsic curiosity module. This should be scaled in order to
-ensure it is large enough to not be overwhelmed by extrinsic reward signals in
-the environment. Likewise it should not be too large to overwhelm the extrinsic
-reward signal.
-
-Typical Range: `0.1` - `0.001`
-
 ## Training Statistics

 To view training statistics, use TensorBoard. For information on launching and

docs/Training-RewardSignals.md

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Reward Signals

In reinforcement learning, the end goal for the Agent is to discover a behavior (a Policy)
that maximizes a reward. Typically, a reward is defined by your environment, and corresponds
to reaching some goal. These are what we refer to as "extrinsic" rewards, as they are defined
outside of the learning algorithm.

Rewards, however, can also be defined outside of the environment, to encourage the agent to
behave in certain ways, or to aid the learning of the true extrinsic reward. We refer to these
rewards as "intrinsic" reward signals. The total reward that the agent will learn to maximize can
be a mix of extrinsic and intrinsic reward signals.

ML-Agents allows reward signals to be defined in a modular way, and we provide reward
signals that can be mixed and matched to help shape your agent's behavior. The `extrinsic` Reward
Signal represents the rewards defined in your environment, and is enabled by default.
The `curiosity` reward signal helps your agent explore when extrinsic rewards are sparse.

## Enabling Reward Signals

Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. An
example is provided in `config/trainer_config.yaml`. To enable a reward signal, add it to the
`reward_signals:` section under the brain name. For instance, to enable the extrinsic signal
in addition to a small curiosity reward, you would define your `reward_signals` as follows:

```yaml
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    curiosity:
        strength: 0.01
        gamma: 0.99
        encoding_size: 128
```

Each reward signal should define at least two parameters, `strength` and `gamma`, in addition
to any class-specific hyperparameters. Note that to remove a reward signal, you should delete
its entry entirely from `reward_signals`. At least one reward signal should be left defined
at all times.
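As a sketch of how such a section might be read and checked against the rules above (at least one signal, each with `strength` and `gamma`), here is a hypothetical helper using PyYAML; it is not part of ml-agents, and the function and variable names are invented for illustration:

```python
# Hypothetical config loader: parses the reward_signals section for one brain and
# enforces the two rules stated above.
import yaml  # PyYAML

TRAINER_CONFIG = """
PyramidsLearning:
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
        curiosity:
            strength: 0.02
            gamma: 0.99
            encoding_size: 256
"""


def load_reward_signals(config_text: str, brain_name: str) -> dict:
    config = yaml.safe_load(config_text)
    signals = config[brain_name].get("reward_signals", {})
    if not signals:
        raise ValueError("At least one reward signal must be defined.")
    for name, params in signals.items():
        missing = {"strength", "gamma"} - set(params)
        if missing:
            raise ValueError(f"Reward signal '{name}' is missing {sorted(missing)}.")
    return signals


print(load_reward_signals(TRAINER_CONFIG, "PyramidsLearning"))
```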
## Reward Signal Types

### The Extrinsic Reward Signal

The `extrinsic` reward signal is simply the reward given by the
[environment](Learning-Environment-Design.md). Remove it to force the agent
to ignore the environment reward.

#### Strength

`strength` is the factor by which to multiply the raw
reward. Typical ranges will vary depending on the reward signal.

Typical Range: `1.0`

#### Gamma

`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.

Typical Range: `0.8` - `0.995`

### The Curiosity Reward Signal

The `curiosity` Reward Signal enables the Intrinsic Curiosity Module. This is an implementation
of the approach described in "Curiosity-driven Exploration by Self-supervised Prediction"
by Pathak, et al. It trains two networks:
* an inverse model, which takes the current and next observation of the agent, encodes them, and
  uses the encoding to predict the action that was taken between the observations
* a forward model, which takes the encoded current observation and action, and predicts the
  next encoded observation.

The loss of the forward model (the difference between the predicted and actual encoded observations) is used as the intrinsic reward, so the more surprised the model is, the larger the reward will be. A minimal sketch of this computation follows the reference links below.

For more information, see
* https://arxiv.org/abs/1705.05363
* https://pathak22.github.io/noreward-rl/
* https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/

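Below is a toy sketch of the reward computation just described, with a fixed random encoder and forward model standing in for the trained networks (the inverse model, which shapes the encoder during training, is omitted). None of this is the ml-agents implementation; all names and sizes are assumptions:

```python
# Toy curiosity reward: the forward model predicts the next encoded observation, and the
# squared prediction error, scaled by `strength`, becomes the intrinsic reward.
import numpy as np

rng = np.random.default_rng(0)
encoding_size, action_size, strength = 16, 4, 0.01

# Stand-ins for learned networks: a fixed random encoder and forward model.
W_enc = rng.normal(size=(8, encoding_size))                      # 8-dim observation -> encoding
W_fwd = rng.normal(size=(encoding_size + action_size, encoding_size))


def encode(obs: np.ndarray) -> np.ndarray:
    return np.tanh(obs @ W_enc)


def curiosity_reward(obs: np.ndarray, action: np.ndarray, next_obs: np.ndarray) -> float:
    phi, phi_next = encode(obs), encode(next_obs)
    predicted_next = np.tanh(np.concatenate([phi, action]) @ W_fwd)
    forward_loss = 0.5 * np.sum((predicted_next - phi_next) ** 2)
    return strength * forward_loss  # bigger surprise -> bigger intrinsic reward


obs, next_obs = rng.normal(size=8), rng.normal(size=8)
action = np.eye(action_size)[1]  # one-hot discrete action
print(f"intrinsic reward: {curiosity_reward(obs, action, next_obs):.4f}")
```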
#### Strength

In this case, `strength` corresponds to the magnitude of the curiosity reward generated
by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough
to not be overwhelmed by extrinsic reward signals in the environment.
Likewise it should not be so large that it overwhelms the extrinsic reward signal.

Typical Range: `0.001` - `0.1`

#### Gamma

`gamma` corresponds to the discount factor for future rewards.

Typical Range: `0.8` - `0.995`

#### Encoding Size

`encoding_size` corresponds to the size of the encoding used by the intrinsic curiosity model.
This value should be small enough to encourage the ICM to compress the original
observation, but not so small that it cannot learn the dynamics of the environment.

Default Value: `64`
Typical Range: `64` - `256`

#### Learning Rate

`learning_rate` is the learning rate used to update the intrinsic curiosity module.
It should typically be decreased if training is unstable and the curiosity loss is unstable.

Default Value: `3e-4`
Typical Range: `1e-5` - `1e-3`

ml-agents/mlagents/trainers/bc/policy.py

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ def evaluate(self, brain_info):
             self.model.sequence_length: 1,
         }

-        feed_dict = self._fill_eval_dict(feed_dict, brain_info)
+        feed_dict = self.fill_eval_dict(feed_dict, brain_info)
         if self.use_recurrent:
             if brain_info.memories.shape[1] == 0:
                 brain_info.memories = self.make_empty_memory(len(brain_info.agents))

ml-agents/mlagents/trainers/components/__init__.py

Whitespace-only changes.

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+from .reward_signal import *

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+from .signal import CuriosityRewardSignal
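
The new `components` package imports above suggest a modular layout with one sub-package per reward signal. One plausible way to wire such classes to the YAML keys is a simple name-to-class registry; the sketch below is illustrative only, not the repository's actual factory code, and every name in it is invented:

```python
# Illustrative registry: map the YAML key of each reward signal to the class that
# implements it, so new signal packages can be added without touching the trainer.
from typing import Dict

REGISTRY: Dict[str, type] = {}


def register(name: str):
    def wrapper(cls: type) -> type:
        REGISTRY[name] = cls
        return cls
    return wrapper


@register("extrinsic")
class ExtrinsicRewardSignal:
    def __init__(self, strength: float, gamma: float):
        self.strength, self.gamma = strength, gamma


def create_reward_signals(config: Dict[str, dict]) -> dict:
    """Instantiate one signal object per entry in the `reward_signals` config section."""
    return {name: REGISTRY[name](**params) for name, params in config.items()}


signals = create_reward_signals({"extrinsic": {"strength": 1.0, "gamma": 0.99}})
print(signals)
```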
