|
18 | 18 | "\n", |
19 | 19 | "The state is the gambler’s capital, s ∈ {1, 2, . . . , 99}.\n", |
20 | 20 | "The actions are stakes, a ∈ {0, 1, . . . , min(s, 100 − s)}. \n", |
21 | | - "The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n" |
| 21 | + "The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n", |
| 22 | + "\n", |
| 23 | + "The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal. Let p_h denote the probability of the coin coming up heads. If p_h is known, then the entire problem is known and it can be solved, for instance, by value iteration.\n" |
22 | 24 | ] |
23 | 25 | }, |
24 | 26 | { |
|
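The added cell notes that, when p_h is known, the whole problem is known and can be solved by value iteration. A minimal, self-contained sketch of that procedure is shown below; the function name `gamblers_value_iteration` and the parameters `theta` and `discount_factor` are illustrative assumptions, not names taken from this notebook.

```python
import numpy as np

def gamblers_value_iteration(p_h, theta=1e-9, discount_factor=1.0):
    """Value-iteration sketch for the gambler's problem (illustrative, not the notebook's code).

    States 0 and 100 are terminal; the reward is +1 only on reaching 100,
    so V[s] converges to the probability of winning from capital s.
    """
    rewards = np.zeros(101)
    rewards[100] = 1
    V = np.zeros(101)

    def lookahead(s, V):
        # Expected return of each stake a from state s (the Bellman backup).
        A = np.zeros(101)
        for a in range(1, min(s, 100 - s) + 1):
            A[a] = p_h * (rewards[s + a] + discount_factor * V[s + a]) \
                 + (1 - p_h) * (rewards[s - a] + discount_factor * V[s - a])
        return A

    # Sweep all non-terminal states until the largest update falls below theta.
    while True:
        delta = 0.0
        for s in range(1, 100):
            best = lookahead(s, V).max()
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    # Greedy policy: for each capital level, the stake that maximizes the lookahead.
    policy = np.zeros(101, dtype=int)
    for s in range(1, 100):
        policy[s] = int(np.argmax(lookahead(s, V)))
    return policy, V
```

After convergence, `V[s]` is exactly the quantity described above, the probability of reaching the goal from capital `s`, and `policy` maps each level of capital to a stake.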
61 | 63 | " Args:\n", |
62 | 64 | " p_h: Probability of the coin coming up heads\n", |
63 | 65 | " \"\"\"\n", |
64 | | - " # The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n", |
| 66 | + " # The reward is zero on all transitions except those on which the gambler reaches his goal,\n", |
| 67 | + " # when it is +1.\n", |
65 | 68 | " rewards = np.zeros(101)\n", |
66 | 69 | " rewards[100] = 1 \n", |
67 | 70 | " \n", |
|
78 | 81 | " rewards: The reward vector.\n", |
79 | 82 | " \n", |
80 | 83 | " Returns:\n", |
81 | | - " A vector containing the expected value of each action. Its length equals to the number of actions.\n", |
| 84 | + " A vector containing the expected value of each action. \n", |
| 85 | + "        Its length equals the number of actions.\n", |
82 | 86 | " \"\"\"\n", |
83 | 87 | " A = np.zeros(101)\n", |
84 | 88 | " stakes = range(1, min(s, 100-s)+1) # Your minimum bet is 1, maximum bet is min(s, 100-s).\n", |
85 | 89 | " for a in stakes:\n", |
86 | 90 | " # rewards[s+a], rewards[s-a] are immediate rewards.\n", |
87 | 91 | " # V[s+a], V[s-a] are values of the next states.\n", |
88 | | - " # This is the core of the Bellman equation: \n", |
89 | | - " # The expected value of your action is the sum of immediate rewards and the value of the next state.\n", |
| 92 | + "        # This is the core of the Bellman equation: the expected value of your action\n", |
| 93 | + "        # is the expected immediate reward plus the discounted value of the next state.\n", |
90 | 94 | " A[a] = p_h * (rewards[s+a] + V[s+a]*discount_factor) + (1-p_h) * (rewards[s-a] + V[s-a]*discount_factor)\n", |
91 | 95 | " return A\n", |
92 | 96 | " \n", |
|
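As a quick sanity check on the lookahead in this hunk, the snippet below reproduces the same backup inline for a single state with `V` still all zeros (the coin bias `p_h = 0.4` is an arbitrary illustrative choice, not a value from the commit): only a stake that reaches the goal in one flip has a nonzero expected value, and that value equals p_h.

```python
import numpy as np

p_h = 0.4                 # assumed coin bias, for illustration only
discount_factor = 1.0
rewards = np.zeros(101)
rewards[100] = 1          # +1 only on reaching the goal, as in the diff
V = np.zeros(101)         # value estimates before any sweep

s = 50
A = np.zeros(101)
for a in range(1, min(s, 100 - s) + 1):
    # Same backup as above: expectation over heads (capital s+a) and tails (capital s-a).
    A[a] = p_h * (rewards[s + a] + V[s + a] * discount_factor) \
         + (1 - p_h) * (rewards[s - a] + V[s - a] * discount_factor)

print(A[50])  # 0.4 -- staking all 50 reaches the goal with probability p_h
print(A[25])  # 0.0 -- no immediate reward, and V is still all zeros
```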