Refactor reward signals into separate class #2144
Conversation
# Conflicts:
#	ml-agents/mlagents/trainers/models.py
Overall looks pretty good to me. I'd feel better if someone more familiar with the specific models gave it a once-over though.
@@ -174,11 +190,14 @@ WalkerLearning:
    time_horizon: 1000
    batch_size: 2048
    buffer_size: 20480
-   gamma: 0.995
If gamma is removed from here, does this mean that old versions of the config will no longer be compatible? Is this going to break people's stuff (I am okay with it), or is there a fallback?
Yeah, the old versions of the config aren't compatible. Having `gamma` here won't break anything, but it will end up using the default gamma from `default_config`. We could auto-assign the `gamma` here to the extrinsic gamma, but that would break the abstraction. I guess we'll just have to be careful in the migration guide.
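For reference, a rough sketch of what the migrated config could look like, with gamma nested under each reward signal rather than at the trainer level (the exact keys and default values here are assumptions; the thread does not show the final layout):

```yaml
WalkerLearning:
    time_horizon: 1000
    batch_size: 2048
    buffer_size: 20480
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.995
```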
@@ -7,6 +7,10 @@ observations to the best action an agent can take in a given state. The
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
Python process (communicating with the running Unity application over a socket).

+To train an agent, you will need to provide the agent one or more reward signals which
+the agent should attempt to maximize. See [Reward Signals](Training-RewardSignals.md)
"the agent will
attempt the maximize." I know RL does not work but try to act like it does.
Will replace with "will learn to maximize"
docs/Training-RewardSignals.md (outdated)

### The Curiosity Reward Signal

@chriselion
What is this line for?
I'm supposed to write it at some point. @ervteng do you want to leave this empty for now and I'll do it in another PR?
That works - let me add a one-liner so this section isn't completely empty in this PR.
I would like this removed / addressed before merge
""" | ||
self.loss = 10 * (0.2 * self.forward_loss + 0.8 * self.inverse_loss) | ||
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate) | ||
self.update_batch = optimizer.minimize(self.loss) |
New line?
Hey @vincentpierre, where would you like the new line?
Ah, I am sorry I was not clear. I think there needs to be an empty line at the end of the document. It is possible that the line is there and just not appearing on GitHub, in which case, ignore this comment.
Black should catch this in CircleCI, but let me double-check. Sometimes VSCode does weird things with Black.
self.policy = policy
self.strength = strength

def evaluate(self, current_info, next_info):
Can you specify that the return type is RewardSignalResult, for clarity?
Also, I am wondering whether the scaling of the reward signal should be handled by the RewardSignal or by the Trainer.
Added the RewardSignalResult to the comment.
Hmm, we could go either way.
Pros of the current way: the Trainer doesn't have to be aware of the strength, and "strength" could be defined differently for different reward signals.
Pros of doing it in the Trainer: it's more generic, and it would be much easier to add normalization of rewards in the future if we want to.
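To illustrate the two options being weighed here, a minimal sketch (the RewardSignalResult container, its fields, and the class name below are assumptions based on this thread, not the PR's actual code):

```python
from collections import namedtuple

import numpy as np

# Assumed container; the real RewardSignalResult may differ.
RewardSignalResult = namedtuple("RewardSignalResult", ["scaled_reward", "unscaled_reward"])


class ExtrinsicSignalSketch:
    def __init__(self, strength):
        self.strength = strength

    def evaluate(self, current_info, next_info) -> RewardSignalResult:
        # Option A (the current approach in this PR, per the thread): the
        # reward signal applies its own strength before returning.
        unscaled_reward = np.array(next_info.rewards, dtype=np.float32)
        return RewardSignalResult(self.strength * unscaled_reward, unscaled_reward)


# Option B would return only the unscaled reward and let the Trainer multiply
# by strength (and possibly normalize) when it combines the signals.
```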
return {}

@classmethod
def check_config(cls, config_dict, param_keys=None):
Should this be static? Also, I think this logic is not only used in reward signals.
It's not static so I can get the `cls` name.
But yeah, there is similar logic in the Trainer (though slightly different), and it's weird to call a Trainer function in RewardSignals. I wonder if it's time to make a `utils.py` file with all of these common functions? Some common logic for minibatching and sequence computation might go in this file as well.
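A rough sketch of why the classmethod form is handy here: the error message can name the concrete reward signal via `cls.__name__`. The class name, exception type, and message below are illustrative assumptions, not the PR's implementation:

```python
class RewardSignalSketch:
    @classmethod
    def check_config(cls, config_dict, param_keys=None):
        """Verify that all required keys are present, naming the concrete class."""
        param_keys = param_keys or []
        for key in param_keys:
            if key not in config_dict:
                # cls.__name__ is the reason this is a classmethod rather than a
                # staticmethod: the error names the concrete reward signal.
                raise KeyError(
                    f"Hyperparameter '{key}' is missing for {cls.__name__}."
                )
```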
# Make sure we have at least one reward_signal
if not self.trainer_parameters["reward_signals"]:
    raise UnityTrainerException(
        "No reward signals were defined. At least one must be used with the PPO trainer."
Use the class name, and do not hard-code PPO here?
Fixed.
BTW, I plan to make this even more generic, e.g. for printing the trainer_parameters I'd move it to the base Trainer class and use the class name from there. That will probably come when we add SAC - with this PR I'm trying not to modify too much.
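A hedged sketch of the kind of change being described, using the class name instead of the hard-coded trainer name (the surrounding class and constructor are hypothetical; only the excerpt above is from the PR):

```python
class UnityTrainerException(Exception):
    pass


class PPOTrainerSketch:
    def __init__(self, trainer_parameters):
        self.trainer_parameters = trainer_parameters
        # Make sure we have at least one reward signal, and report the
        # concrete trainer class instead of hard-coding "PPO".
        if not self.trainer_parameters.get("reward_signals"):
            raise UnityTrainerException(
                "No reward signals were defined. At least one must be used "
                f"with the {self.__class__.__name__}."
            )
```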
docs/Training-RewardSignals.md (outdated)

@@ -64,7 +65,7 @@ Typical Range: `0.8` - `0.995`

### The Curiosity Reward Signal

@chriselion
The `curiosity` Reward Signal enables the Intrinsic Curiosity Module.
Link to the paper?
which correspond to the agents in a provided next_info.
:BrainInfo next_info: A t+1 BrainInfo.
:return: curr_info: Reconstructed BrainInfo to match agents of next_info.
"""
-    visual_observations = [[]]
+    visual_observations = []
Is this correct? We do `visual_observations[i].append(...)` below, but never resize this directly. (I think it may have been broken before for >1 visual observations too.)
This was changed in the last commit (with types) - just marking for reference
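For context, a hypothetical sketch of the indexing concern: if the list starts empty (or with a single inner list), `visual_observations[i].append(...)` fails for more than one visual observation unless an inner list exists per observation. One way to make the reconstruction safe, with the surrounding function and argument names assumed rather than taken from the PR:

```python
def reconstruct_visual_observations(next_info, agent_indices):
    """Rebuild per-observation lists so that index i exists before appending."""
    # One inner list per visual observation, so visual_observations[i] is
    # valid for any number of camera observations (not just one).
    visual_observations = [[] for _ in next_info.visual_observations]
    for i, obs_batch in enumerate(next_info.visual_observations):
        for agent_index in agent_indices:
            visual_observations[i].append(obs_batch[agent_index])
    return visual_observations
```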
…nologies/ml-agents into develop-rewardsignalsrefactor
Refactors the reward signals (extrinsic and curiosity) into separate classes that inherit from RewardSignal. The Trainer is now reward-signal agnostic and doesn't check the config for whether each type exists - it just instantiates all of the reward signal classes declared there.
This is in preparation for additional reward signals (e.g. GAIL) as well as reuse across different trainers.
It is also equivalent to the IRL PR but without the new features (GAIL and PreTraining). We're breaking that PR into two to make it easier to review.
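A minimal sketch of what "instantiates all of the classes declared" could look like, assuming a simple name-to-class registry; the registry, class names, and constructor signature are illustrative assumptions, not the PR's exact code:

```python
class RewardSignal:
    def __init__(self, policy, strength, gamma):
        self.policy = policy
        self.strength = strength
        self.gamma = gamma


class ExtrinsicRewardSignal(RewardSignal):
    pass


class CuriosityRewardSignal(RewardSignal):
    pass


# Hypothetical registry mapping config keys to reward signal classes.
NAME_TO_CLASS = {
    "extrinsic": ExtrinsicRewardSignal,
    "curiosity": CuriosityRewardSignal,
}


def create_reward_signals(policy, reward_signal_configs):
    """Instantiate every reward signal named in the config, with no
    trainer-side knowledge of which specific types exist."""
    signals = {}
    for name, settings in reward_signal_configs.items():
        signal_class = NAME_TO_CLASS.get(name)
        if signal_class is None:
            raise ValueError(f"Unknown reward signal type: {name}")
        signals[name] = signal_class(
            policy,
            strength=settings.get("strength", 1.0),
            gamma=settings.get("gamma", 0.99),
        )
    return signals
```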