Refactor reward signals into separate class #2144

Merged
merged 144 commits into develop from develop-rewardsignalsrefactor on Jul 3, 2019

Conversation

@ervteng ervteng commented Jun 17, 2019

Refactors the reward signals (extrinsic and curiosity) into separate classes that inherit from RewardSignal. The Trainer is now reward-signal agnostic: it no longer checks the config for whether each type exists, it just instantiates all of the reward signal classes declared there.

This is in preparation for additional reward signals (e.g. GAIL) as well as reuse across different trainers.

It is equivalent to the IRL PR but without the new features (GAIL and PreTraining); we're breaking that work into two PRs to make it easier to review.
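
For reference, a minimal sketch of the shape this refactor implies. The names below (the RewardSignal constructor arguments, the REWARD_SIGNAL_CLASSES registry, and create_reward_signals) are illustrative assumptions based on the description above, not code copied from this PR:

from typing import Any, Dict


class RewardSignal:
    """Base class that concrete reward signals (extrinsic, curiosity, later GAIL) inherit from."""

    def __init__(self, policy: Any, strength: float, gamma: float):
        self.policy = policy
        self.strength = strength
        self.gamma = gamma

    def evaluate(self, current_info, next_info):
        """Return the reward for each agent; overridden by each subclass."""
        raise NotImplementedError


class ExtrinsicRewardSignal(RewardSignal):
    """Passes through the environment reward."""

    def evaluate(self, current_info, next_info):
        return list(next_info.rewards)


# Hypothetical registry: new signal types get registered here instead of
# being special-cased inside the Trainer.
REWARD_SIGNAL_CLASSES = {"extrinsic": ExtrinsicRewardSignal}


def create_reward_signals(policy: Any, config: Dict[str, Dict[str, Any]]) -> Dict[str, RewardSignal]:
    """Instantiate every reward signal declared in the trainer config."""
    return {
        name: REWARD_SIGNAL_CLASSES[name](policy=policy, **settings)
        for name, settings in config.items()
    }

Under this sketch the Trainer would simply iterate over whatever create_reward_signals returns, without knowing which signal types were configured.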

@chriselion
Contributor

Overall looks pretty good to me. I'd feel better if someone more familiar with the specific models gave it a once-over though.

@@ -174,11 +190,14 @@ WalkerLearning:
time_horizon: 1000
batch_size: 2048
buffer_size: 20480
gamma: 0.995
Contributor

If gamma is removed from here, does this mean that old versions of the config will no longer be compatible? Is this going to break people's stuff (I am okay with it), or is there a fallback?

Contributor Author

Yeah, the old versions of the config aren't compatible. Leaving gamma in won't break anything, but the trainer will end up using the default gamma from default_config instead. We could auto-assign the top-level gamma to the extrinsic reward signal's gamma, but that would break the abstraction. I guess we'll just have to be careful in the migration guide.
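
For the migration guide, the change being discussed would look roughly like this. This is a hedged sketch; the exact key names under reward_signals are assumptions based on this thread rather than the final documentation:

WalkerLearning:
    time_horizon: 1000
    batch_size: 2048
    buffer_size: 20480
    # old: a trainer-level "gamma: 0.995" line lived here
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.995    # gamma is now configured per reward signal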

@@ -7,6 +7,10 @@ observations to the best action an agent can take in a given state. The
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
Python process (communicating with the running Unity application over a socket).

To train an agent, you will need to provide the agent one or more reward signals which
the agent should attempt to maximize. See [Reward Signals](Training-RewardSignals.md)
Contributor

"the agent will attempt the maximize." I know RL does not work but try to act like it does.

Contributor Author

Will replace with "will learn to maximize"


### The Curiosity Reward Signal

@chriselion
Contributor

What is this line for?

Contributor

I'm supposed to write it at some point. @ervteng, do you want to leave this empty for now, and I'll do it in another PR?

Contributor Author

That works - let me add a one-liner so the section isn't completely empty in this PR.

Contributor

I would like this removed or addressed before merge.

"""
self.loss = 10 * (0.2 * self.forward_loss + 0.8 * self.inverse_loss)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
self.update_batch = optimizer.minimize(self.loss)
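
For context on the constants above: the weighting appears to follow the ICM-style split (a beta of 0.2 on the forward-model loss and 1 - beta on the inverse-model loss, times an overall scale). A hedged rewrite with named constants, continuing the snippet above with illustrative names, is one way the review comment could be addressed:

beta = 0.2              # forward-vs-inverse loss weighting (illustrative name)
loss_multiplier = 10.0  # overall scale on the curiosity model loss (illustrative name)
self.loss = loss_multiplier * (
    beta * self.forward_loss + (1.0 - beta) * self.inverse_loss
)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
self.update_batch = optimizer.minimize(self.loss)
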
Contributor

New line?

Contributor Author

Hey @vincentpierre, where would you like the new line?

Contributor

Ah, I am sorry I was not clear. I think there needs to be an empty line at the end of the document. It is possible that the line is there and just not appearing on GitHub, in which case, ignore this comment.

Contributor Author

Black should catch this in CircleCI, but let me double-check. Sometimes VSCode does weird things with Black.

        self.policy = policy
        self.strength = strength

    def evaluate(self, current_info, next_info):
Contributor

Can you specify that the return type is RewardSignalResult, for clarity?
Also, I am wondering whether the scaling of the reward signal should be handled by the RewardSignal or by the Trainer.

Contributor Author

Added the RewardSignalResult to the comment.

Hmm, we could go either way.
Pros of the current way: the Trainer doesn't have to be aware of the strength, and "strength" could be defined differently for different reward signals.
Pros of doing it in the Trainer: it's more generic, and it would be much easier to add normalization of rewards in the future if we want to.
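
As a sketch of the typed interface being discussed, assuming a two-field RewardSignalResult (the field names and class layout below are assumptions, not necessarily the definitions in this PR):

from typing import List, NamedTuple


class RewardSignalResult(NamedTuple):
    scaled_reward: List[float]    # reward after multiplying by strength
    unscaled_reward: List[float]  # raw reward, useful for logging


class ExtrinsicRewardSignal:
    """Standalone sketch; in the PR this would inherit from RewardSignal."""

    def __init__(self, strength: float):
        self.strength = strength

    def evaluate(self, current_info, next_info) -> RewardSignalResult:
        # Scaling by strength happens inside the signal here, which is the
        # "current way" described above: the Trainer never sees strength.
        unscaled = list(next_info.rewards)
        return RewardSignalResult(
            scaled_reward=[self.strength * r for r in unscaled],
            unscaled_reward=unscaled,
        )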

        return {}

    @classmethod
    def check_config(cls, config_dict, param_keys=None):
Contributor

Should this be static? Also, I think it's not only used in reward signals.

Contributor Author

It's not static so that I can get the cls name.

But yeah, there is similar (though slightly different) logic in the Trainer, and it's weird to call a Trainer function from a RewardSignal. I wonder if it's time to make a utils.py file with all of these common functions? Some common logic for minibatching and sequence computation could go in that file as well.
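
A self-contained sketch of why the classmethod matters here (the exception class is stubbed in so the snippet stands alone, and the message wording is an assumption):

from typing import Any, Dict, List, Optional


class UnityTrainerException(Exception):
    """Stand-in for the project's exception type, so this sketch runs on its own."""


class RewardSignal:
    @classmethod
    def check_config(
        cls, config_dict: Dict[str, Any], param_keys: Optional[List[str]] = None
    ) -> None:
        # Using a classmethod rather than a staticmethod lets the error message
        # name the concrete subclass (e.g. CuriosityRewardSignal) via cls.__name__.
        param_keys = param_keys or []
        for key in param_keys:
            if key not in config_dict:
                raise UnityTrainerException(
                    "The hyperparameter {0} could not be found for {1}.".format(
                        key, cls.__name__
                    )
                )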

# Make sure we have at least one reward_signal
if not self.trainer_parameters["reward_signals"]:
    raise UnityTrainerException(
        "No reward signals were defined. At least one must be used with the PPO trainer."
Contributor

Use the class name rather than hard-coding PPO here?

Contributor Author

Fixed.

BTW, I plan to make this even more generic: e.g. for printing the trainer_parameters, I'd move it to the base Trainer class and use the class name from there. That will probably come when we add SAC - with this PR I'm trying not to modify too much.
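
Continuing the snippet above, a hedged sketch of the fix being described (self.__class__.__name__ would yield e.g. "PPOTrainer"; the exact message text is an assumption):

# Make sure we have at least one reward_signal
if not self.trainer_parameters["reward_signals"]:
    raise UnityTrainerException(
        "No reward signals were defined. At least one must be used with the "
        "{}.".format(self.__class__.__name__)
    )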

@@ -64,7 +65,7 @@ Typical Range: `0.8` - `0.995`

### The Curiosity Reward Signal

@chriselion
The `curiosity` Reward Signal enables the Intrinsic Curiosity Module.
Contributor

Link to the paper?

@CLAassistant

CLAassistant commented Jul 1, 2019

CLA assistant check
All committers have signed the CLA.

which correspond to the agents in a provided next_info.
:BrainInfo next_info: A t+1 BrainInfo.
:return: curr_info: Reconstructed BrainInfo to match agents of next_info.
"""
visual_observations = [[]]
visual_observations = []
Contributor

Is this correct? We do visual_observations[i].append(...) below, but never resize this directly. (I think it may have been broken for more than one visual observation before, too.)

Contributor Author

@ervteng ervteng Jul 3, 2019

This was changed in the last commit (with types) - just marking for reference
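
For reference, a hedged sketch of the initialization fix under discussion (variable names follow the snippet above; the merged commit may differ):

# One inner list per visual observation stream, so the later
# visual_observations[i].append(...) calls have a list to append to.
visual_observations = [[] for _ in range(len(next_info.visual_observations))]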

@ervteng ervteng merged commit f4bc8ef into develop Jul 3, 2019
@chriselion chriselion deleted the develop-rewardsignalsrefactor branch July 11, 2019 22:26
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 18, 2021