Improvements for GAIL #2296


Merged
160 commits merged on Jul 22, 2019

Changes from 1 commit
Commits (160 total)
eb4abf2
New version of GAIL
awjuliani Oct 9, 2018
d0852ac
Move Curiosity to separate class
awjuliani Oct 12, 2018
4b15b80
Curiosity fully working under new system
awjuliani Oct 12, 2018
ad9381b
Begin implementing GAIL
awjuliani Oct 12, 2018
8bf8302
fix discrete curiosity
vincentpierre Oct 12, 2018
d3e244e
Add expert demonstration
awjuliani Oct 13, 2018
a5b95f7
Remove notebook
awjuliani Oct 13, 2018
dc2fcaa
Record intrinsic rewards properly
awjuliani Oct 13, 2018
49cff40
Add gail model updating
awjuliani Oct 13, 2018
48d3769
Code cleanup
awjuliani Oct 15, 2018
6eeb565
Nested structure for intrinsic rewards
awjuliani Oct 15, 2018
8ca7728
Rename files
awjuliani Oct 15, 2018
226b5c7
Update models so files
awjuliani Oct 15, 2018
3386aa7
fix typo
awjuliani Oct 15, 2018
6799756
Add reward strength parameter
awjuliani Oct 15, 2018
468c407
Use dictionary of reward signals
awjuliani Oct 17, 2018
519e2d3
Remove reward manager
awjuliani Oct 17, 2018
7df1a69
Extrinsic reward just another type
awjuliani Oct 17, 2018
99237cd
Clean up imports
awjuliani Oct 17, 2018
9fa51c1
All reward signals use strength to scale output
awjuliani Oct 17, 2018
7f24677
produce scaled and unscaled reward
awjuliani Oct 18, 2018
4a714d0
Remove unused dictionary
awjuliani Oct 18, 2018
3e2671d
Current trainer config
awjuliani Oct 18, 2018
77211d8
Add discrete control and pyramid experimentation
awjuliani Oct 19, 2018
2334de8
Minor changes to GAIL
awjuliani Oct 20, 2018
439387e
Add relevant strength parameters
awjuliani Oct 21, 2018
ba793a3
Replace string
awjuliani Oct 21, 2018
a52ba0b
Add support for visual observations w/ GAIL
awjuliani Oct 31, 2018
5b2ef22
Finish implementing visual obs for GAIL
awjuliani Nov 1, 2018
13542b4
Include demo files
awjuliani Nov 1, 2018
ae7a8b0
Fix for RNN w/ GAIL
awjuliani Nov 1, 2018
bf89082
Keep track of reward streams separately
awjuliani Nov 2, 2018
360482b
Bootstrap value estimates separately
awjuliani Nov 2, 2018
c78639d
Add value head
awjuliani Nov 14, 2018
3b2485d
Use separate value streams for each reward
awjuliani Nov 15, 2018
40bc9ba
Add VAIL
awjuliani Nov 15, 2018
c6e1504
Use adaptive B
awjuliani Nov 16, 2018
60d9ff7
Comments improvements
vincentpierre Jan 10, 2019
49ec682
Added comments and refactored a piece of the code
vincentpierre Jan 10, 2019
d9847e0
Added Comments
vincentpierre Jan 10, 2019
dc7620b
Fix on Curiosity
vincentpierre Jan 11, 2019
28e0bd5
Fixed typo
vincentpierre Jan 11, 2019
0257d2b
Added a forgotten comment
vincentpierre Jan 11, 2019
fd55c00
Stabilized Vail learning. Still no learning for Walker
vincentpierre Jan 14, 2019
2343b3f
Fixing typo on curiosity when using visual input
vincentpierre Jan 17, 2019
c74ad19
Added some comments
vincentpierre Jan 17, 2019
2dd7c61
modified the hyperparameters
vincentpierre Jan 17, 2019
42429a5
Fixed some of the tests, will need to refactor the reward signals in …
vincentpierre Jan 19, 2019
ec0e106
Putting the has_updated flags inside each reward signal
vincentpierre Jan 22, 2019
6ae1c2f
Added comments for the GAIL update method
vincentpierre Jan 22, 2019
ef65bc2
initial commit
vincentpierre Jan 24, 2019
8cbdbf4
No more normalization after pre-training
vincentpierre Jan 24, 2019
3f35d45
Fixed large bug in Vail
vincentpierre Jan 30, 2019
3be9be7
BUG FIX VAIL: The noise dimension was wrong and the discriminator sc…
vincentpierre Feb 1, 2019
9e9b4ff
implemented discrete control pretraining
vincentpierre Feb 2, 2019
d537a6b
bug fixing
vincentpierre Feb 3, 2019
713263c
Bug fix, still not tested for recurrent
vincentpierre Feb 6, 2019
ca5b948
Fixing beta in GAIL so it will change properly
vincentpierre Mar 6, 2019
671629e
Allow for not specifying an extrinsic reward
Apr 19, 2019
a31c8a5
Rough implementation of annealed BC
Apr 24, 2019
93cb4ff
Fixes for rebase onto v0.8
Apr 24, 2019
6534291
Moved BC trainer out of reward_signals and code cleanup
Apr 25, 2019
700b478
Rename folder to "components"
Apr 25, 2019
71eedf5
Fix renaming in Curiosity
Apr 25, 2019
83b4603
Remove demo_aided as a required param
May 2, 2019
9e4b4e2
Make old BC compatible
May 2, 2019
f814432
Fix visual obs for curiosity
May 3, 2019
e10194f
Tweaks all around
May 9, 2019
fdcfb30
Add reward normalization and bug fix
May 9, 2019
cb5e927
Load multiple .demo files. Fix bug with csv nans
May 30, 2019
2c5c853
Remove reward normalization
May 30, 2019
e66a343
Rename demo_aided to pretraining
May 30, 2019
0a98289
Fix bc configs
May 30, 2019
cd6e498
Increase small val to prevent NaNs
May 30, 2019
d23f6f3
Fix init in components
May 31, 2019
d93e36e
Merge remote-tracking branch 'origin/develop' into develop-irl-ervin
May 31, 2019
1bf68c7
Fix PPO tests
May 31, 2019
9da6e6c
Refactor components into common location
May 31, 2019
4a57a32
Minor code cleanup
Jun 3, 2019
11cc6f9
Preliminary RNN support
Jun 5, 2019
e66a6f7
Revert regression with NaNs for LSTMs
Jun 6, 2019
bea2bc7
Better LSTM support for BC
Jun 6, 2019
6302a55
Code cleanup and black reformat
Jun 6, 2019
d1cded9
Remove demo_helper and reformat signal
Jun 6, 2019
2b98f3b
Tests for GAIL and curiosity
Jun 6, 2019
440146b
Fix Black again...
Jun 6, 2019
98f9160
Tests for BCModule and visual tests for RewardSignals
Jun 6, 2019
5c923cb
Refactor to new structure and use class generator
Jun 7, 2019
e7ce888
Generalize reward_signal interface and stats
Jun 8, 2019
858194f
Fix incorrect environment reward reporting
Jun 10, 2019
28bceba
Rename reward signals for consistency. clean up comments
Jun 10, 2019
248cae4
Default trainer config (for cloud testing)
Jun 10, 2019
744df94
Remove "curiosity_enc_size" from the regular params
Jun 10, 2019
31dabfc
Fix PushBlock config
Jun 10, 2019
a557f84
Revert Pyramids environment
Jun 10, 2019
d4dbddb
Fix indexing issue with add_experiences
Jun 11, 2019
ddb673b
Fix tests
Jun 11, 2019
975e05b
Change to BCModule
Jun 11, 2019
a83fd5d
Merge branch 'develop' into develop-irl-ervin
Jun 12, 2019
fae7646
Remove the bools for reward signals
Jun 12, 2019
5cf98ac
Make update take in a mini buffer rather than the
Jun 13, 2019
d1afc9b
Always reference reward signals name and not index
Jun 13, 2019
80f2c75
More code cleanup
Jun 13, 2019
394b25a
Clean up reward_signal abstract class
Jun 13, 2019
a9724a3
Fix issue with recording values
Jun 13, 2019
66fef61
Add use_actions to GAIL
Jun 17, 2019
0e3be1d
Add documentation for Reward Signals
Jun 17, 2019
015f50d
Add documentation for GAIL
Jun 17, 2019
7c3059b
Remove unused variables in BCModel
Jun 17, 2019
16c3c06
Remove Entropy Reward Signal
Jun 17, 2019
1fbfa5d
Change tests to use safe_load
Jun 17, 2019
f9a3808
Don't use mutable default
Jun 17, 2019
ce551bf
Set defaults in parent __init__ (Reward Signals)
Jun 17, 2019
3e7ea5b
Remove unnecessary lines
Jun 17, 2019
eda6993
Merge branch 'develop' into develop-irl-ervin
Jul 3, 2019
cace2e6
Make some files same as develop
Jul 3, 2019
3f161fc
Add demos for example envs
Jul 4, 2019
2794c75
Update docs
Jul 4, 2019
48b7b43
Fix tests, imports, cleanup code
Jul 8, 2019
f47b173
Make pretrainer stats similar to reward signal
Jul 9, 2019
1e257d4
Merge branch 'develop' of github.com:Unity-Technologies/ml-agents int…
Jul 9, 2019
a8b5d09
Fixes after merge develop
Jul 10, 2019
fb3d5ae
Additional tests, bugfix for LSTM+BC+Visual
Jul 10, 2019
7e0a677
GAIL code cleanup
Jul 10, 2019
1953233
Add types to BCModel
Jul 10, 2019
593f819
Fix bugs with incorrect return values
Jul 11, 2019
98b7732
Change tests to use RewardSignalResult
Jul 11, 2019
6ee0c63
Add docs for pretraining and plot for all three
Jul 11, 2019
6d37be2
Fix bug with demo loading directories, add test
Jul 11, 2019
c672ad9
Add typing to BCModule, GAIL, and demo loader
Jul 11, 2019
61e84c6
Fix black
Jul 11, 2019
9d43336
Fix mypy issues
Jul 11, 2019
99a2a3c
Codacy cleanup
Jul 12, 2019
cbb1af3
Doc fixes
Jul 12, 2019
736c807
More sophisticated tests for reward signals
Jul 13, 2019
04e22fd
Fix bug in GAIL when num_sequences is 1
Jul 13, 2019
8ead02e
Clean up use_vail and feed_dicts
Jul 15, 2019
71f85e1
Change to swish from learningmodel
Jul 15, 2019
5537e60
Make variables more readable
Jul 15, 2019
73d20cb
Code and comment cleanup
Jul 15, 2019
f4950b4
Not all should be swish
Jul 15, 2019
6784ee6
Remove prints
Jul 15, 2019
2704e62
Doc updates
Jul 15, 2019
1206a89
Make VAIL default false, improve logging
Jul 15, 2019
2407a5a
Fix tests for sequences
Jul 16, 2019
4aa033b
Change max_batches and set VAIL to default to false
Jul 16, 2019
f0d7368
Minor code refactor
Jul 16, 2019
cfb88a1
Merge branch 'develop' of github.com:Unity-Technologies/ml-agents int…
Jul 19, 2019
0887859
Add gradient penalty to GAIL (GAIL-GP)
Jul 18, 2019
e346e6d
Fix NaNs in gradient penalty
Jul 19, 2019
43f8602
Only terminate value estimate for extrinsic signal
Jul 20, 2019
5eaaa76
Update imitation learning docs
Jul 20, 2019
e1ef0ed
Update docstring
Jul 20, 2019
576fbc4
Update GAIL config
Jul 20, 2019
4bbeb91
Update comments and variable/method names
Jul 22, 2019
14ef4b1
Flip flag to use_terminal_states
Jul 22, 2019
410eb00
Don't create gradient magnitude if not necessary
Jul 22, 2019
32c0e63
Remove gamma where not needed
Jul 22, 2019
4c7b547
Fix GridWorld gamma
Jul 22, 2019
481d8e4
Merge branch 'develop' into develop-irl-ervin
Jul 22, 2019
Fix NaNs in gradient penalty
Ervin Teng committed Jul 19, 2019
commit e346e6db5671b60c1c5ad3b9746d4320196bda10
@@ -3,6 +3,7 @@
import tensorflow as tf
from mlagents.trainers.models import LearningModel

EPSILON = 1e-7

class GAILModel(object):
def __init__(
@@ -53,7 +54,7 @@ def make_beta(self) -> None:
)
self.kl_div_input = tf.placeholder(shape=[], dtype=tf.float32)
new_beta = tf.maximum(
self.beta + self.alpha * (self.kl_div_input - self.mutual_information), 1e-7
self.beta + self.alpha * (self.kl_div_input - self.mutual_information), EPSILON
)
self.update_beta = tf.assign(self.beta, new_beta)

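The new_beta expression above is VAIL's dual-ascent update on the information-bottleneck weight: beta is pushed up while the measured KL divergence exceeds the mutual-information target and pushed down otherwise, and the result is now clamped at EPSILON instead of a bare 1e-7 so it stays strictly positive. A minimal standalone sketch of the same rule (plain Python with illustrative names, not code from this PR):

EPSILON = 1e-7

def update_beta(beta, alpha, kl_div, mutual_information):
    # Dual-ascent step on the bottleneck weight, clamped at EPSILON.
    return max(beta + alpha * (kl_div - mutual_information), EPSILON)

# Hypothetical trace: beta rises while the measured KL exceeds the target of 0.5.
beta = EPSILON
for kl in (0.9, 0.8, 0.6, 0.4):
    beta = update_beta(beta, alpha=0.01, kl_div=kl, mutual_information=0.5)
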
@@ -212,7 +213,7 @@ def create_network(self) -> None:
initializer=tf.ones_initializer(),
)
self.z_sigma_sq = self.z_sigma * self.z_sigma
self.z_log_sigma_sq = tf.log(self.z_sigma_sq + 1e-7)
self.z_log_sigma_sq = tf.log(self.z_sigma_sq + EPSILON)
self.use_noise = tf.placeholder(
shape=[1], dtype=tf.float32, name="NoiseLevel"
)
@@ -228,11 +229,11 @@ def create_network(self) -> None:
self.discriminator_score = tf.reshape(
self.policy_estimate, [-1], name="GAIL_reward"
)
self.intrinsic_reward = -tf.log(1.0 - self.discriminator_score + 1e-7)
self.intrinsic_reward = -tf.log(1.0 - self.discriminator_score + EPSILON)

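The intrinsic_reward line above is the usual GAIL surrogate reward, -log(1 - D), where D is the discriminator's estimate that a sample came from the expert; EPSILON keeps the log argument strictly positive when D saturates at 1. A quick numeric illustration (NumPy only, not code from this PR):

import numpy as np

EPSILON = 1e-7
discriminator_score = np.array([0.1, 0.5, 0.999999, 1.0])

naive_reward = -np.log(1.0 - discriminator_score)            # inf (with a warning) when D == 1
safe_reward = -np.log(1.0 - discriminator_score + EPSILON)    # finite; caps near -log(EPSILON), about 16.1

print(naive_reward)
print(safe_reward)
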
def compute_gradient_penalty(self) -> tf.Tensor:
Contributor: What is the performance improvement from this? Faster/more stable convergence?

Contributor (Author): Faster convergence, especially for Crawler. I'm seeing about 25% fewer steps required with PPO. A large motivation for this is SAC, where without GP the discriminator overfits very quickly: GAIL + SAC doesn't work at all without it.

Contributor (Author): Plots for Crawler + PPO

Contributor: The squiggly line does not lie.

"""
Gradient penalty WGAN-GP. Adds stability esp. for off-policy.
Compute gradients w.r.t randomly interpolated input.
"""
expert = [self.encoded_expert, self.expert_action, self.done_expert]
@@ -252,8 +253,14 @@ def compute_gradient_penalty(self) -> tf.Tensor:
interp[2],
reuse=True,
)


grad = tf.gradients(grad_estimate, [grad_input])[0]
gradient_mag = tf.reduce_mean(tf.pow(tf.norm(grad, axis=-1) - 1, 2))

# Norm, like log, can return NaN. Use our own safe_norm
safe_norm = tf.sqrt(tf.reduce_sum(grad ** 2, axis=-1) + EPSILON)
Contributor: How does norm result in NaN? I could see that happening if there was overflow, but in that case adding an epsilon isn't going to help.

Contributor (Author): Not the norm itself, it's the gradient of the norm. At 0 the gradient of sqrt() is undefined (it blows up), so backpropagating through a zero-length vector produces NaNs. I'll update the comment to reflect this.

gradient_mag = tf.reduce_mean(tf.pow(safe_norm - 1, 2))

return gradient_mag
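
To make the exchange above concrete: the value of tf.norm at an all-zero vector is fine, but its gradient is 0 / 0, which backpropagates as NaN, while the epsilon-padded norm has a well-defined gradient. A small standalone check, assuming TensorFlow 1.x as used by ml-agents at the time (this is not code from the PR):

import numpy as np
import tensorflow as tf

EPSILON = 1e-7
x = tf.constant(np.zeros((2, 3), dtype=np.float32))

plain_norm = tf.norm(x, axis=-1)                               # sqrt(sum(x ** 2))
safe_norm = tf.sqrt(tf.reduce_sum(x ** 2, axis=-1) + EPSILON)

grad_plain = tf.gradients(plain_norm, [x])[0]                  # 0 / 0 at x == 0, so NaN
grad_safe = tf.gradients(safe_norm, [x])[0]                    # 0 / sqrt(EPSILON), so 0

with tf.Session() as sess:
    print(sess.run(grad_plain))   # expected: NaNs
    print(sess.run(grad_safe))    # expected: zeros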

def create_loss(self, learning_rate: float) -> None:
@@ -265,8 +272,8 @@
self.mean_policy_estimate = tf.reduce_mean(self.policy_estimate)

self.discriminator_loss = -tf.reduce_mean(
tf.log(self.expert_estimate + 1e-7)
+ tf.log(1.0 - self.policy_estimate + 1e-7)
tf.log(self.expert_estimate + EPSILON)
+ tf.log(1.0 - self.policy_estimate + EPSILON)
)

if self.use_vail:
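The last hunk is the standard GAIL discriminator cross-entropy: the discriminator is pushed toward 1 on expert samples and toward 0 on policy samples, with EPSILON guarding both logarithms against log(0) when either estimate saturates. A standalone NumPy restatement of that loss (illustrative only, not the PR's TensorFlow code):

import numpy as np

EPSILON = 1e-7

def discriminator_loss(expert_estimate, policy_estimate):
    # Mean negative log-likelihood of the discriminator on expert vs. policy samples.
    return -np.mean(
        np.log(expert_estimate + EPSILON)
        + np.log(1.0 - policy_estimate + EPSILON)
    )

# Hypothetical scores; without EPSILON the saturated entries would give log(0) = -inf.
print(discriminator_loss(np.array([0.9, 1.0]), np.array([0.2, 1.0])))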