|
18 | 18 | "\n", |
19 | 19 | "The state is the gambler’s capital, s ∈ {1, 2, . . . , 99}.\n", |
20 | 20 | "The actions are stakes, a ∈ {0, 1, . . . , min(s, 100 − s)}. \n", |
21 | | - "The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n" |
| 21 | + "The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n", |
| 22 | + "\n", |
| 23 | + "The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal. Let p_h denote the probability of the coin coming up heads. If p_h is known, then the entire problem is known and it can be solved, for instance, by value iteration.\n" |
22 | 24 | ] |
23 | 25 | }, |
24 | 26 | { |
|
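The added cell notes that, when p_h is known, the whole problem is known and can be solved by value iteration. A minimal, self-contained sketch of that procedure is shown below; the function name `gamblers_value_iteration` and the parameters `theta` and `discount_factor` are illustrative assumptions, not names taken from this notebook.

```python
import numpy as np

def gamblers_value_iteration(p_h, theta=1e-9, discount_factor=1.0):
    """Value-iteration sketch for the gambler's problem (illustrative, not the notebook's code).

    States 0 and 100 are terminal; the reward is +1 only on reaching 100,
    so V[s] converges to the probability of winning from capital s.
    """
    rewards = np.zeros(101)
    rewards[100] = 1
    V = np.zeros(101)

    def lookahead(s, V):
        # Expected return of each stake a from state s (the Bellman backup).
        A = np.zeros(101)
        for a in range(1, min(s, 100 - s) + 1):
            A[a] = p_h * (rewards[s + a] + discount_factor * V[s + a]) \
                 + (1 - p_h) * (rewards[s - a] + discount_factor * V[s - a])
        return A

    # Sweep all non-terminal states until the largest update falls below theta.
    while True:
        delta = 0.0
        for s in range(1, 100):
            best = lookahead(s, V).max()
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    # Greedy policy: for each capital level, the stake that maximizes the lookahead.
    policy = np.zeros(101, dtype=int)
    for s in range(1, 100):
        policy[s] = int(np.argmax(lookahead(s, V)))
    return policy, V
```

After convergence, `V[s]` is exactly the quantity described above, the probability of reaching the goal from capital `s`, and `policy` maps each level of capital to a stake.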
61 | 63 | " Args:\n", |
62 | 64 | " p_h: Probability of the coin coming up heads\n", |
63 | 65 | " \"\"\"\n", |
64 | | - " # The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n", |
| 66 | + " # The reward is zero on all transitions except those on which the gambler reaches his goal,\n", |
| 67 | + " # when it is +1.\n", |
65 | 68 | " rewards = np.zeros(101)\n", |
66 | 69 | " rewards[100] = 1 \n", |
67 | 70 | " \n", |
|
78 | 81 | " rewards: The reward vector.\n", |
79 | 82 | " \n", |
80 | 83 | " Returns:\n", |
81 | | - " A vector containing the expected value of each action. Its length equals to the number of actions.\n", |
| 84 | + " A vector containing the expected value of each action. \n", |
| 85 | + "        Its length equals the number of actions.\n", |
82 | 86 | " \"\"\"\n", |
83 | 87 | " A = np.zeros(101)\n", |
84 | 88 | " stakes = range(1, min(s, 100-s)+1) # Your minimum bet is 1, maximum bet is min(s, 100-s).\n", |
85 | 89 | " for a in stakes:\n", |
86 | 90 | " # rewards[s+a], rewards[s-a] are immediate rewards.\n", |
87 | 91 | " # V[s+a], V[s-a] are values of the next states.\n", |
88 | | - " # This is the core of the Bellman equation: \n", |
89 | | - " # The expected value of your action is the sum of immediate rewards and the value of the next state.\n", |
| 92 | + "        # This is the core of the Bellman equation: the expected value of your action\n", |
| 93 | + "        # is the expected immediate reward plus the discounted value of the next state.\n", |
90 | 94 | " A[a] = p_h * (rewards[s+a] + V[s+a]*discount_factor) + (1-p_h) * (rewards[s-a] + V[s-a]*discount_factor)\n", |
91 | 95 | " return A\n", |
92 | 96 | " \n", |
|
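As a quick sanity check on the lookahead in this hunk, the snippet below reproduces the same backup inline for a single state with `V` still all zeros (the coin bias `p_h = 0.4` is an arbitrary illustrative choice, not a value from the commit): only a stake that reaches the goal in one flip has a nonzero expected value, and that value equals p_h.

```python
import numpy as np

p_h = 0.4                 # assumed coin bias, for illustration only
discount_factor = 1.0
rewards = np.zeros(101)
rewards[100] = 1          # +1 only on reaching the goal, as in the diff
V = np.zeros(101)         # value estimates before any sweep

s = 50
A = np.zeros(101)
for a in range(1, min(s, 100 - s) + 1):
    # Same backup as above: expectation over heads (capital s+a) and tails (capital s-a).
    A[a] = p_h * (rewards[s + a] + V[s + a] * discount_factor) \
         + (1 - p_h) * (rewards[s - a] + V[s - a] * discount_factor)

print(A[50])  # 0.4 -- staking all 50 reaches the goal with probability p_h
print(A[25])  # 0.0 -- no immediate reward, and V is still all zeros
```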