Modification of reward signals and rl_trainer for SAC #2433

Merged: 269 commits, merged Aug 15, 2019
Changes from 1 commit
Commits (269)
9fa51c1
All reward signals use strength to scale output
awjuliani Oct 17, 2018
7f24677
produce scaled and unscaled reward
awjuliani Oct 18, 2018
4a714d0
Remove unused dictionary
awjuliani Oct 18, 2018
3e2671d
Current trainer config
awjuliani Oct 18, 2018
77211d8
Add discrete control and pyramid experimentation
awjuliani Oct 19, 2018
2334de8
Minor changes to GAIL
awjuliani Oct 20, 2018
439387e
Add relevant strength parameters
awjuliani Oct 21, 2018
ba793a3
Replace string
awjuliani Oct 21, 2018
a52ba0b
Add support for visual observations w/ GAIL
awjuliani Oct 31, 2018
5b2ef22
Finish implementing visual obs for GAIL
awjuliani Nov 1, 2018
13542b4
Include demo files
awjuliani Nov 1, 2018
ae7a8b0
Fix for RNN w/ GAIL
awjuliani Nov 1, 2018
bf89082
Keep track of reward streams separately
awjuliani Nov 2, 2018
360482b
Bootstrap value estimates separately
awjuliani Nov 2, 2018
c78639d
Add value head
awjuliani Nov 14, 2018
3b2485d
Use separate value streams for each reward
awjuliani Nov 15, 2018
40bc9ba
Add VAIL
awjuliani Nov 15, 2018
c6e1504
Use adaptive B
awjuliani Nov 16, 2018
60d9ff7
Comments improvements
vincentpierre Jan 10, 2019
49ec682
Added comments and refactored a piece of the code
vincentpierre Jan 10, 2019
d9847e0
Added Comments
vincentpierre Jan 10, 2019
dc7620b
Fix on Curiosity
vincentpierre Jan 11, 2019
28e0bd5
Fixed typo
vincentpierre Jan 11, 2019
0257d2b
Added a forgotten comment
vincentpierre Jan 11, 2019
fd55c00
Stabilized Vail learning. Still no learning for Walker
vincentpierre Jan 14, 2019
2343b3f
Fixing typo on curiosity when using visual input
vincentpierre Jan 17, 2019
c74ad19
Added some comments
vincentpierre Jan 17, 2019
2dd7c61
modified the hyperparameters
vincentpierre Jan 17, 2019
42429a5
Fixed some of the tests, will need to refactor the reward signals in …
vincentpierre Jan 19, 2019
ec0e106
Putting the has_updated flags inside each reward signal
vincentpierre Jan 22, 2019
6ae1c2f
Added comments for the GAIL update method
vincentpierre Jan 22, 2019
ef65bc2
initial commit
vincentpierre Jan 24, 2019
8cbdbf4
No more normalization after pre-training
vincentpierre Jan 24, 2019
3f35d45
Fixed large bug in Vail
vincentpierre Jan 30, 2019
3be9be7
BUG FIX VAIL : The noise dimension was wrong and the discriminator sc…
vincentpierre Feb 1, 2019
9e9b4ff
implemented discrete control pretraining
vincentpierre Feb 2, 2019
d537a6b
bug fixing
vincentpierre Feb 3, 2019
713263c
Bug fix, still not tested for recurrent
vincentpierre Feb 6, 2019
ca5b948
Fixing beta in GAIL so it will change properly
vincentpierre Mar 6, 2019
671629e
Allow for not specifying an extrinsic reward
Apr 19, 2019
a31c8a5
Rough implementation of annealed BC
Apr 24, 2019
93cb4ff
Fixes for rebase onto v0.8
Apr 24, 2019
6534291
Moved BC trainer out of reward_signals and code cleanup
Apr 25, 2019
700b478
Rename folder to "components"
Apr 25, 2019
71eedf5
Fix renaming in Curiosity
Apr 25, 2019
83b4603
Remove demo_aided as a required param
May 2, 2019
9e4b4e2
Make old BC compatible
May 2, 2019
f814432
Fix visual obs for curiosity
May 3, 2019
e10194f
Tweaks all around
May 9, 2019
fdcfb30
Add reward normalization and bug fix
May 9, 2019
984e602
Initial commit (NaN problem)
May 11, 2019
5e60587
Add to initial commit
May 11, 2019
c24031e
No more NaNs - somewhat working
May 13, 2019
b371ff7
Fixed normalization and a bunch of other bugs
May 14, 2019
63f78ba
Fix buffer truncate and remove debug graph
May 14, 2019
199879b
Stats and config for trainer
May 14, 2019
908fb17
Black format and properly scope visual encoders
May 14, 2019
f4c4fb5
Switch back to swish
May 14, 2019
33aecec
Fix for models.py
May 14, 2019
bfb787d
Add broken discrete action support
May 15, 2019
03568f7
Still broken
May 15, 2019
ce54bd4
Discrete still broken but somewhat less
May 15, 2019
03b6a9c
Fix issue with PPO - discrete still broken
May 16, 2019
e4cad3a
Discrete still broken but less so?
May 17, 2019
8f6798d
Much less broken discrete
May 18, 2019
694ce36
Better support for branched actions
May 21, 2019
d370528
Lots of fixes
May 21, 2019
20df6b4
Truncate buffer properly
May 21, 2019
2c51640
Fix major bug with buffer and expose more hyperparams
May 21, 2019
c443f82
Clean up hyperparameters
May 21, 2019
61882d0
Fix some errors with hyperparams
May 21, 2019
aa523a0
Fixes for visual obs
May 23, 2019
ec70b3c
Add flag to share CNN actor/critic
May 29, 2019
de3064d
Add more SAC default params
May 29, 2019
cb5e927
Load multiple .demo files. Fix bug with csv nans
May 30, 2019
2c5c853
Remove reward normalization
May 30, 2019
e66a343
Rename demo_aided to pretraining
May 30, 2019
0a98289
Fix bc configs
May 30, 2019
cd6e498
Increase small val to prevent NaNs
May 30, 2019
d23f6f3
Fix init in components
May 31, 2019
d93e36e
Merge remote-tracking branch 'origin/develop' into develop-irl-ervin
May 31, 2019
1bf68c7
Fix PPO tests
May 31, 2019
9da6e6c
Refactor components into common location
May 31, 2019
10a175b
Merge remote-tracking branch 'origin/develop-irl-ervin' into develop-sac
May 31, 2019
f91b490
Clean up SAC trainer
May 31, 2019
4a57a32
Minor code cleanup
Jun 3, 2019
b920427
NaN removal
Jun 3, 2019
e41c1ac
Refactor components and add basic tests for SAC
Jun 4, 2019
85c8ab5
GAIL SAC working
Jun 4, 2019
3598f39
Properly set GAIL batch size from outside
Jun 4, 2019
7c0ef58
Minor tweaks
Jun 4, 2019
11cc6f9
Preliminary RNN support
Jun 5, 2019
e66a6f7
Revert regression with NaNs for LSTMs
Jun 6, 2019
bea2bc7
Better LSTM support for BC
Jun 6, 2019
6302a55
Code cleanup and black reformat
Jun 6, 2019
d1cded9
Remove demo_helper and reformat signal
Jun 6, 2019
2b98f3b
Tests for GAIL and curiosity
Jun 6, 2019
440146b
Fix Black again...
Jun 6, 2019
98f9160
Tests for BCModule and visual tests for RewardSignals
Jun 6, 2019
5c923cb
Refactor to new structure and use class generator
Jun 7, 2019
e7ce888
Generalize reward_signal interface and stats
Jun 8, 2019
858194f
Fix incorrect environment reward reporting
Jun 10, 2019
28bceba
Rename reward signals for consistency. clean up comments
Jun 10, 2019
248cae4
Default trainer config (for cloud testing)
Jun 10, 2019
744df94
Remove "curiosity_enc_size" from the regular params
Jun 10, 2019
31dabfc
Fix PushBlock config
Jun 10, 2019
a557f84
Revert Pyramids environment
Jun 10, 2019
e43b506
Add trainer_config for debug
Jun 11, 2019
76cfe88
Fix broken discrete after GAIL changes
Jun 11, 2019
d4dbddb
Fix indexing issue with add_experiences
Jun 11, 2019
ddb673b
Fix tests
Jun 11, 2019
975e05b
Change to BCModule
Jun 11, 2019
a83fd5d
Merge branch 'develop' into develop-irl-ervin
Jun 12, 2019
0f9ef6b
Merge latest develop-irl-ervin into develop-sac
Jun 12, 2019
fae7646
Remove the bools for reward signals
Jun 12, 2019
f907b78
Merge commit 'fae764636ec43ad41d000f91e0154b5c422fdfde' into develop-sac
Jun 12, 2019
1fcfba7
Adapt new reward system to SAC
Jun 12, 2019
37552b6
GAIL SAC actually working
Jun 12, 2019
8f451b6
Fix visual obs for SAC GAIL
Jun 12, 2019
f1baec4
Fix reporting for SAC reward signals
Jun 12, 2019
5cf98ac
Make update take in a mini buffer rather than the
Jun 13, 2019
d1afc9b
Always reference reward signals name and not index
Jun 13, 2019
80f2c75
More code cleanup
Jun 13, 2019
394b25a
Clean up reward_signal abstract class
Jun 13, 2019
a9724a3
Fix issue with recording values
Jun 13, 2019
30040b5
Merge branch 'develop-irl-ervin' into develop-sac
Jun 13, 2019
a4606ef
Fix SAC value head reporting
Jun 13, 2019
bf6b540
SAC LSTM support
Jun 13, 2019
58887c0
Fix the NaN bug with LSTM AGAIN
Jun 14, 2019
6d6a9c6
Fix lots of small LSTM bugs
Jun 14, 2019
739583b
Fix non-recurrent SAC
Jun 14, 2019
855b3d4
Fix sac trainer_config
Jun 14, 2019
0700c89
Fix masking and continuous LSTM
Jun 14, 2019
51f45a0
Fix issue with sequence length (just for SAC) and masks for entcoef
Jun 14, 2019
229e1e3
Fix masking for discrete
Jun 15, 2019
15dbd66
Better hyperparams for Hallway
Jun 15, 2019
e1c6201
Better hyperparams for other recurrent
Jun 15, 2019
66fef61
Add use_actions to GAIL
Jun 17, 2019
0e3be1d
Add documentation for Reward Signals
Jun 17, 2019
015f50d
Add documentation for GAIL
Jun 17, 2019
7c3059b
Remove unused variables in BCModel
Jun 17, 2019
16c3c06
Remove Entropy Reward Signal
Jun 17, 2019
1fbfa5d
Change tests to use safe_load
Jun 17, 2019
f9a3808
Don't use mutable default
Jun 17, 2019
ce551bf
Set defaults in parent __init__ (Reward Signals)
Jun 17, 2019
3e7ea5b
Remove unnecessary lines
Jun 17, 2019
848b1d6
Fix issue with branched actions and entropy
Jun 22, 2019
8b4aa74
black format
Jun 22, 2019
eda6993
Merge branch 'develop' into develop-irl-ervin
Jul 3, 2019
cace2e6
Make some files same as develop
Jul 3, 2019
3f161fc
Add demos for example envs
Jul 4, 2019
2794c75
Update docs
Jul 4, 2019
48b7b43
Fix tests, imports, cleanup code
Jul 8, 2019
f47b173
Make pretrainer stats similar to reward signal
Jul 9, 2019
d054da6
Fix for non-branched actions
Jul 9, 2019
5511dc6
Some cleanup on trainer
Jul 9, 2019
1e257d4
Merge branch 'develop' of github.com:Unity-Technologies/ml-agents int…
Jul 9, 2019
54ed027
Merge branch 'develop' of github.com:Unity-Technologies/ml-agents int…
Jul 10, 2019
a8b5d09
Fixes after merge develop
Jul 10, 2019
3bb7f44
Merge branch 'develop-irl-ervin' of github.com:Unity-Technologies/ml-…
Jul 10, 2019
34f5053
Fix policy naming
Jul 10, 2019
39b9738
Minor trainer refactor
Jul 10, 2019
759b974
More bug fixes
Jul 10, 2019
fb3d5ae
Additional tests, bugfix for LSTM+BC+Visual
Jul 10, 2019
7e0a677
GAIL code cleanup
Jul 10, 2019
1953233
Add types to BCModel
Jul 10, 2019
593f819
Fix bugs with incorrect return values
Jul 11, 2019
98b7732
Change tests to use RewardSignalResult
Jul 11, 2019
6ee0c63
Add docs for pretraining and plot for all three
Jul 11, 2019
6d37be2
Fix bug with demo loading directories, add test
Jul 11, 2019
c672ad9
Add typing to BCModule, GAIL, and demo loader
Jul 11, 2019
61e84c6
Fix black
Jul 11, 2019
9d43336
Fix mypy issues
Jul 11, 2019
99a2a3c
Codacy cleanup
Jul 12, 2019
dac9f53
Fix entropy and add LSTM tests
Jul 12, 2019
3cc2aeb
Merge latest develop-irl
Jul 12, 2019
d100f7f
Merge branch 'develop' of github.com:Unity-Technologies/ml-agents int…
Jul 16, 2019
da2cca6
Break up reward signal update and policy update
Jul 17, 2019
ba7c2fe
Fix evaluate_batch for GAIL
Jul 17, 2019
b530d8a
Rename configs to make it less confusing
Jul 18, 2019
464163d
Add gradient penalty to GAIL (GAIL-GP)
Jul 18, 2019
d2d4255
Remove duplicate parameters() from base trainer
Jul 18, 2019
ea2c954
Reset metrics rewards in SAC
Jul 18, 2019
6c84a60
Fix NaNs in gradient penalty
Jul 19, 2019
300669b
Fixed trainer_config
Jul 22, 2019
3098618
Merge branch 'develop' of github.com:Unity-Technologies/ml-agents int…
Jul 22, 2019
fac83e0
Fix issues with latest merge
Jul 22, 2019
eb3e92f
Fix issue with stream_scope parameter
Jul 22, 2019
81fe5bd
black reformat
Jul 22, 2019
dce9a64
merge latest GAIL changes
Jul 22, 2019
d6e63d0
Clean up linting
Jul 22, 2019
f91b44e
Fix policy import
Jul 22, 2019
6477510
More complex model for crawler
Jul 23, 2019
51800f9
Fix issues with Walljump
Jul 24, 2019
0535858
Add optional replay buffer saving
Jul 24, 2019
d5cddaa
Better scoping - fixes Barracuda issue
Jul 24, 2019
62aff9a
Fix Barracuda outputs for discrete
Jul 24, 2019
f1cf155
Fix Barracuda discrete output
Jul 25, 2019
33de3c7
Remove files that don't do anything
Jul 25, 2019
3030529
Fix broken continuous
Jul 25, 2019
78495b2
Fix flake stuff
Jul 25, 2019
ea5816c
Update trainer configs
Jul 25, 2019
1426607
Update trainer_config
Jul 26, 2019
51a8167
Remove unnecessary feed_dicts from reward signals
Jul 26, 2019
48e9502
Add proper typing to evaluate_batch
Jul 26, 2019
8549155
Add curiosity evaluate_batch
Jul 27, 2019
85554d7
Fix SAC action_holder
Jul 27, 2019
468c014
More curiosity and GAIL cleanup for LSTM
Jul 27, 2019
a1a0745
Re-enable output_pre in SAC for symmetry
Jul 27, 2019
906e8c5
More Curiosity cleanup
Jul 27, 2019
05f8f08
Fix broken discrete and add test for discrete
Jul 27, 2019
020ef94
Merge latest Release changes into SAC
Jul 29, 2019
ec5808d
Fix for ordering of params in Trainer
Jul 29, 2019
af5c725
Better scoping
Jul 29, 2019
7120ed1
Name LSTMs properly
Jul 29, 2019
203ab2e
Use scaled init for SAC
Jul 29, 2019
8366a55
Buffer cleanup
Jul 29, 2019
eaa75c3
Separate Policy LSTM from the others
Jul 31, 2019
eea94e2
Remove some unused methods
Aug 2, 2019
4bc5dd4
Bad trainer config!
Aug 2, 2019
212b38f
Remove some unused code
Aug 2, 2019
5d83a42
Fix SAC test
Aug 3, 2019
7ec8d14
Remove reward signal train interval
Aug 5, 2019
a1eb146
Enable pretraining
Aug 6, 2019
6a39d5c
Fix scoping for visual observations
Aug 8, 2019
f72d1ee
Initial merge of flattened buffer
Aug 8, 2019
6d32fd7
Merge latest trainer refactor (develop)
Aug 8, 2019
304898e
Fixes for SAC + new buffer
Aug 9, 2019
6b6a0df
Finish refactor of SAC
Aug 9, 2019
1592b64
Minor refactor of rl_trainer add_experiences for SAC
Aug 9, 2019
3bafd3c
Add SAC environment test
Aug 9, 2019
725b3e5
Refactor reward signals to remove duplicate code
Aug 9, 2019
7e0456d
Some reward signal cleanup
Aug 9, 2019
b2768e7
Move mock_brain creation to common file
Aug 12, 2019
206ba89
Improve SAC policy testing
Aug 12, 2019
3a7d872
Fix loading replay buffer and add tests
Aug 12, 2019
a61e3e9
Move end_episode
Aug 12, 2019
a3ca93c
Revert trainer_config for PPO
Aug 12, 2019
656a3b4
Merge reward signal parallel update
Aug 13, 2019
8459106
Fix issue with dones in GAIL
Aug 13, 2019
aba9a92
Fix other reward signals not reported to TB
Aug 13, 2019
1a31b96
Remove SAC changes
Aug 14, 2019
1873f9b
Remove SAC trainer config
Aug 14, 2019
19e1607
Revert __init__
Aug 14, 2019
4de66a1
Revert tests
Aug 14, 2019
795730b
Revert trainer util
Aug 14, 2019
c227397
Use sample method for GAIL
Aug 14, 2019
45fa433
Add policy estimate and expert estimate for debug
Aug 14, 2019
5f595ee
Address comments
Aug 14, 2019
47f5f84
Fix bug with BCModule and LSTM
Aug 15, 2019
Minor changes to GAIL
awjuliani authored and Ervin Teng committed Apr 24, 2019
commit 2334de823f42da9dcf51faa069287a8c49f4020a
@@ -41,7 +41,7 @@ def create_curiosity_encoders(self):
                 self.policy_model.curiosity_enc_size,
                 LearningModel.swish, 1,
                 "stream_{}_visual_obs_encoder"
-                .format(i), False)
+                .format(i), False)

             encoded_next_visual = self.policy_model.create_visual_obs_encoder(
                 self.next_visual_in[i],
14 changes: 9 additions & 5 deletions ml-agents/mlagents/trainers/ppo/reward_signals/gail/model.py
@@ -1,5 +1,4 @@
 import tensorflow as tf
-import numpy as np


 class GAILModel(object):
@@ -14,6 +13,11 @@ def __init__(self, policy_model, h_size, lr):
     def make_inputs(self):
         self.obs_in_expert = tf.placeholder(
             shape=[None, self.policy_model.vec_obs_size], dtype=tf.float32)
+        self.done_expert = tf.placeholder(
+            shape=[None, 1], dtype=tf.float32)
+        self.done_policy = tf.placeholder(
+            shape=[None, 1], dtype=tf.float32)
+
         if self.policy_model.brain.vector_action_space_type == 'continuous':
             action_length = self.policy_model.act_size[0]
             self.action_in_expert = tf.placeholder(
@@ -27,9 +31,9 @@ def __init__(self, policy_model, h_size, lr):
                 tf.one_hot(self.action_in_expert[:, i], self.policy_model.act_size[i]) for i in
                 range(len(self.policy_model.act_size))], axis=1)

-    def create_encoder(self, state_in, action_in, reuse):
+    def create_encoder(self, state_in, action_in, done_in, reuse):
         with tf.variable_scope("model"):
-            concat_input = tf.concat([state_in, action_in], axis=1)
+            concat_input = tf.concat([state_in, action_in, done_in], axis=1)

             hidden_1 = tf.layers.dense(
                 concat_input, self.h_size, activation=tf.nn.elu,
@@ -45,9 +49,9 @@ def create_encoder(self, state_in, action_in, reuse):

     def create_network(self):
         self.expert_estimate = self.create_encoder(
-            self.obs_in_expert, self.expert_action, False)
+            self.obs_in_expert, self.expert_action, self.done_expert, False)
         self.policy_estimate = self.create_encoder(
-            self.policy_model.vector_in, self.policy_model.selected_actions, True)
+            self.policy_model.vector_in, self.policy_model.selected_actions, self.done_policy, True)
         self.discriminator_score = tf.reshape(self.policy_estimate, [-1], name="GAIL_reward")
         self.intrinsic_reward = -tf.log(1.0 - self.discriminator_score + 1e-7)

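For context, here is a minimal, self-contained sketch (not the repository's exact code) of what the discriminator computes after this change: the done flag is concatenated with the observation and action before the hidden layers, and the policy-side estimate is turned into the intrinsic reward shown above. Names such as h_size and the two-layer ELU network mirror the diff; everything else is illustrative.

import tensorflow as tf

def discriminator(state_in, action_in, done_in, h_size, reuse=False):
    # Concatenate observation, action, and done flag, as in create_encoder above.
    with tf.variable_scope("model", reuse=reuse):
        concat_input = tf.concat([state_in, action_in, done_in], axis=1)
        hidden_1 = tf.layers.dense(concat_input, h_size, activation=tf.nn.elu, name="d_hidden_1")
        hidden_2 = tf.layers.dense(hidden_1, h_size, activation=tf.nn.elu, name="d_hidden_2")
        # Sigmoid output in (0, 1): estimated probability the (s, a, done) tuple came from the expert.
        return tf.layers.dense(hidden_2, 1, activation=tf.nn.sigmoid, name="d_estimate")

# The policy estimate then becomes the GAIL reward, as in create_network() above:
#   intrinsic_reward = -log(1 - D(s, a, done) + 1e-7)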
4 changes: 4 additions & 0 deletions ml-agents/mlagents/trainers/ppo/reward_signals/gail/signal.py
@@ -21,6 +21,7 @@ def evaluate(self, current_info, next_info):
         feed_dict = {self.policy.model.batch_size: len(next_info.vector_observations),
                      self.policy.model.sequence_length: 1}
         feed_dict = self.policy.fill_eval_dict(feed_dict, brain_info=current_info)
+        feed_dict[self.model.done_policy] = np.reshape(next_info.local_done, [-1, 1])
         if self.policy.use_continuous_act:
             feed_dict[self.policy.model.selected_actions] = next_info.previous_vector_actions
         else:
@@ -57,6 +58,9 @@ def update(self, policy_buffer, n_sequences, max_batches):

     def _update_batch(self, mini_batch_demo, mini_batch_policy):
         feed_dict = {}
+        feed_dict[self.model.done_expert] = mini_batch_demo['done'].reshape([-1, 1])
+        feed_dict[self.model.done_policy] = mini_batch_policy['done'].reshape([-1, 1])
+
         if self.policy.use_continuous_act:
             feed_dict[self.policy.model.selected_actions] = mini_batch_policy['actions'].reshape(
                 [-1, self.policy.model.act_size[0]])
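As a small, runnable illustration of the reshaping these new lines perform (made-up data, not from the PR), the done flags are turned into column vectors of shape [-1, 1] before being fed to the new placeholders:

import numpy as np

local_done = [False, True, False]            # per-agent done flags, as in next_info.local_done
done_policy = np.reshape(local_done, [-1, 1]).astype(np.float32)

demo_done = np.array([0.0, 0.0, 1.0])        # 'done' column of a demonstration mini-batch
done_expert = demo_done.reshape([-1, 1])

print(done_policy.shape, done_expert.shape)  # (3, 1) (3, 1)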
1 change: 1 addition & 0 deletions ml-agents/mlagents/trainers/ppo/trainer.py
@@ -308,6 +308,7 @@ def add_experiences(
                 self.training_buffer[agent_id]['prev_action'].append(
                     stored_info.previous_vector_actions[idx])
                 self.training_buffer[agent_id]['masks'].append(1.0)
+                self.training_buffer[agent_id]['done'].append(next_info.local_done[idx])

                 agent_rewards = None
                 for (scaled_reward, reward) in tmp_rewards_list:
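Below is a hedged sketch of the trainer-side bookkeeping this last change introduces: each agent's entry in the training buffer gains a 'done' list that is appended alongside the existing masks. The defaultdict stands in for the real training_buffer; the agent ID and values are illustrative only.

from collections import defaultdict

training_buffer = defaultdict(lambda: defaultdict(list))
agent_id, idx = "agent_0", 0
local_done = [True]                      # stands in for next_info.local_done at this step

training_buffer[agent_id]['masks'].append(1.0)
training_buffer[agent_id]['done'].append(local_done[idx])

print(dict(training_buffer[agent_id]))   # {'masks': [1.0], 'done': [True]}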