
Commit cc84d52

README updates
1 parent ea5724c commit cc84d52

3 files changed: +47 -8 lines changed


PolicyGradient/Continuous MountainCar Actor Critic Solution.ipynb

Lines changed: 25 additions & 7 deletions
@@ -118,7 +118,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 33,
+"execution_count": 47,
 "metadata": {
 "collapsed": false
 },
@@ -158,7 +158,7 @@
 "        # Loss and train op\n",
 "        self.loss = -self.normal_dist.log_prob(self.action) * self.target\n",
 "        # Add cross entropy cost to encourage exploration\n",
-"        self.loss -= 1e-4 * self.normal_dist.entropy()\n",
+"        self.loss -= 1e-2 * self.normal_dist.entropy()\n",
 "        \n",
 "        self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)\n",
 "        self.train_op = self.optimizer.minimize(\n",
@@ -179,7 +179,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 34,
+"execution_count": 48,
 "metadata": {
 "collapsed": false
 },
@@ -224,7 +224,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 35,
+"execution_count": 49,
 "metadata": {
 "collapsed": true
 },
@@ -303,7 +303,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 50,
 "metadata": {
 "collapsed": false
 },
@@ -312,7 +312,25 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"Step 16840 @ Episode 1/100 (0.0)"
+"Step 7291 @ Episode 1/50 (0.0)"
+]
+},
+{
+"ename": "KeyboardInterrupt",
+"evalue": "",
+"output_type": "error",
+"traceback": [
+"---------------------------------------------------------------------------",
+"KeyboardInterrupt                         Traceback (most recent call last)",
+"<ipython-input-50-37680e81248d> in <module>() ---> 11 stats = actor_critic(env, policy_estimator, value_estimator, 50, discount_factor=0.95)",
+"<ipython-input-49-47f45a7578ed> in actor_critic(env, estimator_policy, estimator_value, num_episodes, discount_factor) ---> 37 action = estimator_policy.predict(state)",
+"<ipython-input-47-78ba8ef0f2bb> in predict(self, state, sess) ---> 44 return sess.run(self.action, { self.state: state })",
+"/Users/dennybritz/venvs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata) ---> 382 run_metadata_ptr)",
+"/Users/dennybritz/venvs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata) ---> 655 feed_dict_string, options, run_metadata)",
+"/Users/dennybritz/venvs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata) ---> 723 target_list, options, run_metadata)",
+"/Users/dennybritz/venvs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args) ---> 730 return fn(*args)",
+"/Users/dennybritz/venvs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata) ---> 712 status, run_metadata)",
+"KeyboardInterrupt: "
 ]
 }
 ],
@@ -327,7 +345,7 @@
 "    sess.run(tf.initialize_all_variables())\n",
 "    # Note, due to randomness in the policy the number of episodes you need to learn a good\n",
 "    # TODO: Sometimes the algorithm gets stuck, I'm not sure what exactly is happening there.\n",
-"    stats = actor_critic(env, policy_estimator, value_estimator, 100, discount_factor=0.95)"
+"    stats = actor_critic(env, policy_estimator, value_estimator, 50, discount_factor=0.95)"
 ]
 },
 {
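
For context on the call being changed here (100 episodes down to 50): the traceback added above shows the interfaces involved, namely `estimator_policy.predict(state)`, `env.step(action)`, and an `update(self, state, target, action)` method. Below is a hedged sketch of the kind of online TD(0) actor-critic loop such a function runs; the estimator objects are duck-typed assumptions and the exact update rules are not taken from the notebook.

```python
import numpy as np

def actor_critic(env, estimator_policy, estimator_value, num_episodes, discount_factor=0.95):
    """Online actor-critic sketch: one TD(0) update per environment step."""
    episode_rewards = np.zeros(num_episodes)
    for i_episode in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Sample an action from the current stochastic policy and step the env.
            action = estimator_policy.predict(state)
            next_state, reward, done, _ = env.step(action)
            episode_rewards[i_episode] += reward

            # TD target and TD error (the advantage estimate used by the actor).
            td_target = reward + (0.0 if done else discount_factor * estimator_value.predict(next_state))
            td_error = td_target - estimator_value.predict(state)

            # Critic regresses toward the TD target; actor follows log pi * advantage.
            estimator_value.update(state, td_target)
            estimator_policy.update(state, td_error, action)

            state = next_state
    return episode_rewards

# Usage would mirror the notebook's call:
#   stats = actor_critic(env, policy_estimator, value_estimator, 50, discount_factor=0.95)
```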

PolicyGradient/README.md

Lines changed: 1 addition & 1 deletion
@@ -53,4 +53,4 @@
 - Implement Actor Critic with Baseline for Continuous Action Space (Exercise, [Solution](Continuous MountainCar Actor Critic Solution.ipynb))
 - Implement Deterministic Policy Gradients for Continuous Action Spaces (WIP)
 - Implement Deep Deterministic Policy Gradients (WIP)
-- Implement synchronous Advantage Actor Critic (A3C) (WIP)
+- Implement Asynchronous Advantage Actor Critic (A3C) (WIP)

README.md

Lines changed: 21 additions & 0 deletions
@@ -24,6 +24,27 @@ All code is written in Python 3 and use RL environments from [OpenAI Gym](https:
 - Exploration and Exploitation (WIP)
 
 
+### List of Implemented Algorithms
+
+- [Dynamic Programming Policy Evaluation](DP/Policy Evaluation Solution.ipynb)
+- [Dynamic Programming Policy Iteration](DP/Policy Iteration Solution.ipynb)
+- [Dynamic Programming Value Iteration](DP/Value Iteration Solution.ipynb)
+- [Monte Carlo Prediction](MC/MC Prediction (Solution).ipynb)
+- [Monte Carlo Control with Epsilon-Greedy Policies](MC/MC Control with Epsilon-Greedy Policies (Solution).ipynb)
+- [Monte Carlo Off-Policy Control with Importance Sampling](MC/Off-Policy MC Control with Weighted Importance Sampling (Solution).ipynb)
+- [SARSA (On Policy TD Learning)](TD/SARSA Solution.ipynb)
+- [Q-Learning (Off Policy TD Learning)](TD/Q-Learning Solution.ipynb)
+- [Deep Q-Learning for Atari Games](DQN/Deep Q Learning Solution.ipynb)
+- [Double Deep-Q Learning for Atari Games](DQN/Double DQN Solution.ipynb)
+- Deep Q-Learning with Prioritized Experience Replay (WIP)
+- [Policy Gradient: REINFORCE with Baseline](PolicyGradient/CliffWalk REINFORCE with Baseline Solution.ipynb)
+- [Policy Gradient: Actor Critic with Baseline](PolicyGradient/CliffWalk Actor Critic Solution.ipynb)
+- [Policy Gradient: Actor Critic with Baseline for Continuous Action Spaces](PolicyGradient/Continuous MountainCar Actor Critic Solution.ipynb)
+- Deterministic Policy Gradients for Continuous Action Spaces (WIP)
+- Deep Deterministic Policy Gradients (DDPG) (WIP)
+- Asynchronous Advantage Actor Critic (A3C) (WIP)
+
+
 ### Resources
 
 Textbooks:
