Refactor reward signals into separate class #2144

Merged
merged 144 commits into develop from develop-rewardsignalsrefactor on Jul 3, 2019

Conversation

@ervteng ervteng commented Jun 17, 2019

Refactors the reward signals (extrinsic and curiosity) into separate classes that inherit from RewardSignal. The Trainer is now reward-signal agnostic: it no longer checks the config for whether each type exists, it just instantiates all of the reward signal classes declared there.

This is in preparation for additional reward signals (e.g. GAIL) as well as reuse across different trainers.

It is equivalent to the IRL PR but without the new features (GAIL and PreTraining); we're breaking that work into two PRs to make it easier to review.
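
For reference, a minimal sketch of the shape this refactor implies. The names below (the RewardSignal constructor arguments, the REWARD_SIGNAL_CLASSES registry, and create_reward_signals) are illustrative assumptions based on the description above, not code copied from this PR:

from typing import Any, Dict


class RewardSignal:
    """Base class that concrete reward signals (extrinsic, curiosity, later GAIL) inherit from."""

    def __init__(self, policy: Any, strength: float, gamma: float):
        self.policy = policy
        self.strength = strength
        self.gamma = gamma

    def evaluate(self, current_info, next_info):
        """Return the reward for each agent; overridden by each subclass."""
        raise NotImplementedError


class ExtrinsicRewardSignal(RewardSignal):
    """Passes through the environment reward."""

    def evaluate(self, current_info, next_info):
        return list(next_info.rewards)


# Hypothetical registry: new signal types get registered here instead of
# being special-cased inside the Trainer.
REWARD_SIGNAL_CLASSES = {"extrinsic": ExtrinsicRewardSignal}


def create_reward_signals(policy: Any, config: Dict[str, Dict[str, Any]]) -> Dict[str, RewardSignal]:
    """Instantiate every reward signal declared in the trainer config."""
    return {
        name: REWARD_SIGNAL_CLASSES[name](policy=policy, **settings)
        for name, settings in config.items()
    }

Under this sketch the Trainer would simply iterate over whatever create_reward_signals returns, without knowing which signal types were configured.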

@chriselion
Contributor

Overall looks pretty good to me. I'd feel better if someone more familiar with the specific models gave it a once-over though.

@@ -174,11 +190,14 @@ WalkerLearning:
time_horizon: 1000
batch_size: 2048
buffer_size: 20480
gamma: 0.995
Contributor

If gamma is removed from here, does this mean that old versions of the config will no longer be compatible? Is this going to break people's stuff (I am okay with it), or is there a fallback?

Contributor Author

Yeah, the old versions of the config aren't compatible. Leaving gamma in won't break anything, but the trainer will end up using the default gamma from default_config instead. We could auto-assign the top-level gamma to the extrinsic reward signal's gamma, but that would break the abstraction. I guess we'll just have to be careful in the migration guide.
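
For the migration guide, the change being discussed would look roughly like this. This is a hedged sketch; the exact key names under reward_signals are assumptions based on this thread rather than the final documentation:

WalkerLearning:
    time_horizon: 1000
    batch_size: 2048
    buffer_size: 20480
    # old: a trainer-level "gamma: 0.995" line lived here
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.995    # gamma is now configured per reward signal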

@@ -7,6 +7,10 @@ observations to the best action an agent can take in a given state. The
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
Python process (communicating with the running Unity application over a socket).

To train an agent, you will need to provide the agent one or more reward signals which
the agent should attempt to maximize. See [Reward Signals](Training-RewardSignals.md)
Contributor

"the agent will attempt the maximize." I know RL does not work but try to act like it does.

Contributor Author

Will replace with "will learn to maximize"


### The Curiosity Reward Signal

@chriselion
Contributor

What is this line for?

Contributor

I'm supposed to write it at some point. @ervteng, do you want to leave this empty for now, and I'll do it in another PR?

Contributor Author

That works - let me add a one-liner so the section isn't completely empty in this PR.

Contributor

I would like this removed or addressed before merge.

"""
self.loss = 10 * (0.2 * self.forward_loss + 0.8 * self.inverse_loss)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
self.update_batch = optimizer.minimize(self.loss)
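
For context on the constants above: the weighting appears to follow the ICM-style split (a beta of 0.2 on the forward-model loss and 1 - beta on the inverse-model loss, times an overall scale). A hedged rewrite with named constants, continuing the snippet above with illustrative names, is one way the review comment could be addressed:

beta = 0.2              # forward-vs-inverse loss weighting (illustrative name)
loss_multiplier = 10.0  # overall scale on the curiosity model loss (illustrative name)
self.loss = loss_multiplier * (
    beta * self.forward_loss + (1.0 - beta) * self.inverse_loss
)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
self.update_batch = optimizer.minimize(self.loss)
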
Contributor

New line?

Contributor Author

Hey @vincentpierre, where would you like the new line?

Contributor

Ah, I am sorry I was not clear. I think there needs to be an empty line at the end of the document. It is possible that the line is there and just not appearing on GitHub, in which case, ignore this comment.

Contributor Author

Black should catch this in CircleCI, but let me double-check. Sometimes VSCode does weird things with Black.

        self.policy = policy
        self.strength = strength

    def evaluate(self, current_info, next_info):
Contributor

Can you specify that the return type is RewardSignalResult, for clarity?
Also, I am wondering whether the scaling of the reward signal should be handled by the RewardSignal or by the Trainer.

Contributor Author

Added the RewardSignalResult to the comment.

Hmm, we could go either way.
Pros of the current way: the Trainer doesn't have to be aware of the strength, and "strength" could be defined differently for different reward signals.
Pros of doing it in the Trainer: it's more generic, and it would be much easier to add normalization of rewards in the future if we want to.
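
As a sketch of the typed interface being discussed, assuming a two-field RewardSignalResult (the field names and class layout below are assumptions, not necessarily the definitions in this PR):

from typing import List, NamedTuple


class RewardSignalResult(NamedTuple):
    scaled_reward: List[float]    # reward after multiplying by strength
    unscaled_reward: List[float]  # raw reward, useful for logging


class ExtrinsicRewardSignal:
    """Standalone sketch; in the PR this would inherit from RewardSignal."""

    def __init__(self, strength: float):
        self.strength = strength

    def evaluate(self, current_info, next_info) -> RewardSignalResult:
        # Scaling by strength happens inside the signal here, which is the
        # "current way" described above: the Trainer never sees strength.
        unscaled = list(next_info.rewards)
        return RewardSignalResult(
            scaled_reward=[self.strength * r for r in unscaled],
            unscaled_reward=unscaled,
        )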

        return {}

    @classmethod
    def check_config(cls, config_dict, param_keys=None):
Contributor

Should this be static? Also, I think it's not only used in reward signals.

Contributor Author

It's not static so that I can get the cls name.

But yeah, there is similar (though slightly different) logic in the Trainer, and it's weird to call a Trainer function from a RewardSignal. I wonder if it's time to make a utils.py file with all of these common functions? Some common logic for minibatching and sequence computation could go in that file as well.
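
A self-contained sketch of why the classmethod matters here (the exception class is stubbed in so the snippet stands alone, and the message wording is an assumption):

from typing import Any, Dict, List, Optional


class UnityTrainerException(Exception):
    """Stand-in for the project's exception type, so this sketch runs on its own."""


class RewardSignal:
    @classmethod
    def check_config(
        cls, config_dict: Dict[str, Any], param_keys: Optional[List[str]] = None
    ) -> None:
        # Using a classmethod rather than a staticmethod lets the error message
        # name the concrete subclass (e.g. CuriosityRewardSignal) via cls.__name__.
        param_keys = param_keys or []
        for key in param_keys:
            if key not in config_dict:
                raise UnityTrainerException(
                    "The hyperparameter {0} could not be found for {1}.".format(
                        key, cls.__name__
                    )
                )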

# Make sure we have at least one reward_signal
if not self.trainer_parameters["reward_signals"]:
    raise UnityTrainerException(
        "No reward signals were defined. At least one must be used with the PPO trainer."
Contributor

Use the class name rather than hard-coding PPO here?

Contributor Author

Fixed.

BTW, I plan to make this even more generic: e.g. for printing the trainer_parameters, I'd move it to the base Trainer class and use the class name from there. That will probably come when we add SAC - with this PR I'm trying not to modify too much.
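
Continuing the snippet above, a hedged sketch of the fix being described (self.__class__.__name__ would yield e.g. "PPOTrainer"; the exact message text is an assumption):

# Make sure we have at least one reward_signal
if not self.trainer_parameters["reward_signals"]:
    raise UnityTrainerException(
        "No reward signals were defined. At least one must be used with the "
        "{}.".format(self.__class__.__name__)
    )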

@@ -64,7 +65,7 @@ Typical Range: `0.8` - `0.995`

### The Curiosity Reward Signal

@chriselion
The `curiosity` Reward Signal enables the Intrinsic Curiosity Module.
Contributor

Link to the paper?

@CLAassistant

CLAassistant commented Jul 1, 2019

CLA assistant check
All committers have signed the CLA.

which correspond to the agents in a provided next_info.
:BrainInfo next_info: A t+1 BrainInfo.
:return: curr_info: Reconstructed BrainInfo to match agents of next_info.
"""
visual_observations = [[]]
visual_observations = []
Contributor

Is this correct? We do visual_observations[i].append(...) below, but never resize this directly. (I think it may have been broken for more than one visual observation before, too.)

Contributor Author

@ervteng ervteng Jul 3, 2019

This was changed in the last commit (with types) - just marking for reference
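
For reference, a hedged sketch of the initialization fix under discussion (variable names follow the snippet above; the merged commit may differ):

# One inner list per visual observation stream, so the later
# visual_observations[i].append(...) calls have a list to append to.
visual_observations = [[] for _ in range(len(next_info.visual_observations))]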

@ervteng ervteng merged commit f4bc8ef into develop Jul 3, 2019
@chriselion chriselion deleted the develop-rewardsignalsrefactor branch July 11, 2019 22:26
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 18, 2021