Commit dfef331

Merge pull request dennybritz#164 from byorxyz/ex.4.3
updated an old link, resolve issue dennybritz#89
2 parents: 167525b + fe3edfc


4 files changed: +19 -11 lines


DP/Gamblers Problem Solution.ipynb

Lines changed: 9 additions & 5 deletions
@@ -18,7 +18,9 @@
 "\n",
 "The state is the gambler’s capital, s ∈ {1, 2, . . . , 99}.\n",
 "The actions are stakes, a ∈ {0, 1, . . . , min(s, 100 − s)}. \n",
-"The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n"
+"The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n",
+"\n",
+"The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal. Let p_h denote the probability of the coin coming up heads. If p_h is known, then the entire problem is known and it can be solved, for instance, by value iteration.\n"
 ]
 },
 {
@@ -61,7 +63,8 @@
 " Args:\n",
 " p_h: Probability of the coin coming up heads\n",
 " \"\"\"\n",
-" # The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n",
+" # The reward is zero on all transitions except those on which the gambler reaches his goal,\n",
+" # when it is +1.\n",
 " rewards = np.zeros(101)\n",
 " rewards[100] = 1 \n",
 " \n",
@@ -78,15 +81,16 @@
 " rewards: The reward vector.\n",
 " \n",
 " Returns:\n",
-" A vector containing the expected value of each action. Its length equals to the number of actions.\n",
+" A vector containing the expected value of each action. \n",
+" Its length equals to the number of actions.\n",
 " \"\"\"\n",
 " A = np.zeros(101)\n",
 " stakes = range(1, min(s, 100-s)+1) # Your minimum bet is 1, maximum bet is min(s, 100-s).\n",
 " for a in stakes:\n",
 " # rewards[s+a], rewards[s-a] are immediate rewards.\n",
 " # V[s+a], V[s-a] are values of the next states.\n",
-" # This is the core of the Bellman equation: \n",
-" # The expected value of your action is the sum of immediate rewards and the value of the next state.\n",
+" # This is the core of the Bellman equation: The expected value of your action is \n",
+" # the sum of immediate rewards and the value of the next state.\n",
 " A[a] = p_h * (rewards[s+a] + V[s+a]*discount_factor) + (1-p_h) * (rewards[s-a] + V[s-a]*discount_factor)\n",
 " return A\n",
 " \n",

DP/Gamblers Problem.ipynb

Lines changed: 8 additions & 4 deletions
@@ -18,7 +18,9 @@
 "\n",
 "The state is the gambler’s capital, s ∈ {1, 2, . . . , 99}.\n",
 "The actions are stakes, a ∈ {0, 1, . . . , min(s, 100 − s)}. \n",
-"The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n"
+"The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n",
+"\n",
+"The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal. Let p_h denote the probability of the coin coming up heads. If p_h is known, then the entire problem is known and it can be solved, for instance, by value iteration.\n"
 ]
 },
 {
@@ -45,12 +47,13 @@
 "\n",
 "### Exercise 4.9 (programming)\n",
 "\n",
-"Implement value iteration for the gambler’s problem and solve it for p_h = 0.25 and p_h = 0.55."
+"Implement value iteration for the gambler’s problem and solve it for p_h = 0.25 and p_h = 0.55.\n",
+"\n"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": 1,
 "metadata": {
 "collapsed": true
 },
@@ -72,7 +75,8 @@
 " rewards: The reward vector.\n",
 " \n",
 " Returns:\n",
-" A vector containing the expected value of each action. Its length equals to the number of actions.\n",
+" A vector containing the expected value of each action. \n",
+" Its length equals to the number of actions.\n",
 " \"\"\"\n",
 " \n",
 " # Implement!\n",

DQN/Breakout Playground.ipynb

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@
 ],
 "source": [
 "print(\"Action space size: {}\".format(env.action_space.n))\n",
-"print(env.get_action_meanings())\n",
+"print(env.get_action_meanings()) # env.unwrapped.get_action_meanings() for gym 0.8.0 or later\n",
 "\n",
 "observation = env.reset()\n",
 "print(\"Observation space shape: {}\".format(observation.shape))\n",

DQN/README.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@
 **Required:**
 
 - [Human-Level Control through Deep Reinforcement Learning](http://www.readcube.com/articles/10.1038/nature14236)
-- [Demystifying Deep Reinforcement Learning](https://www.nervanasys.com/demystifying-deep-reinforcement-learning/)
+- [Demystifying Deep Reinforcement Learning](https://ai.intel.com/demystifying-deep-reinforcement-learning/)
 - David Silver's RL Course Lecture 6 - Value Function Approximation ([video](https://www.youtube.com/watch?v=UoPei5o4fps), [slides](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf))
 
 **Optional:**
