|
60 | 60 | "source": [ |
61 | 61 | "Let's have a look at some important concepts before proceeding further: \n", |
62 | 62 | "\n", |
63 | | - "* **Reward (R)**: A reward is the feedback by which we measure the success or failure of an agent’s actions. From any given state, an agent sends output in the form of actions to the environment, and the environment returns the agent’s new state (which resulted from acting on the previous state) as well as rewards, if there are any. They effectively evaluate the agent’s action.\n", |
64 | | - "* **Policy ($\\pi$)**: The policy is the strategy that the agent employs to determine the next action based on the current state. It maps states to actions, the actions that promise the highest reward. The policy that yields the highest expected utility is known as **optimal policy**. We use $\\pi^*$ to denote an optimal policy.\n", |
65 | | - "* **Discount factor ($\\gamma$)**: The discount factor is multiplied by future rewards as discovered by the agent in order to dampen thse rewards’ effect on the agent’s choice of action. Why? It is designed to make future rewards worth less than immediate rewards. If $\\gamma$ is 0.8, and there’s a reward of 10 points after 3 time steps, the present value of that reward is 0.8³ x 10. A discount factor of 1 would make future rewards worth just as much as immediate rewards.\n", |
66 | | - "* **Transition model**: The transition model describes the outcome of each action in each state. If the outcomes are stochastic, we write $P(s'|s,a)$ to denote the probability of reaching state $s'$ if the action $a$ is done in state $s$. We'll assume the transitions are **Markovian** i.e. the probability of reaching $s'$ from $s$ depends only on $s$ and not on the history of earlier states. " |
| 63 | + "* **Reward** ($R$): A reward is the feedback by which we measure the success or failure of an agent’s actions. From any given state, an agent sends output in the form of actions to the environment, and the environment returns the agent’s new state (which resulted from acting on the previous state) as well as rewards, if there are any. They effectively evaluate the agent’s action.\n", |
| 64 | + "* **Policy** ($\\pi$): The policy is the strategy that the agent employs to determine the next action based on the current state. It maps states to actions, the actions that promise the highest reward. The policy that yields the highest expected utility is known as **optimal policy**. We use $\\pi^*$ to denote an optimal policy.\n", |
| 65 | + "* **Discount factor** ($\\gamma$): The discount factor is multiplied by future rewards as discovered by the agent in order to dampen thse rewards’ effect on the agent’s choice of action. Why? It is designed to make future rewards worth less than immediate rewards. If $\\gamma$ is 0.8, and there’s a reward of 10 points after 3 time steps, the present value of that reward is 0.8³ x 10. A discount factor of 1 would make future rewards worth just as much as immediate rewards.\n", |
| 66 | + "* **Transition model**: The transition model describes the outcome of each action in each state. If the outcomes are stochastic, we write $P(s'|s,a)$ to denote the probability of reaching state $s'$ if the action $a$ is done in state $s$. We'll assume the transitions are **Markovian** i.e. the probability of reaching $s'$ from $s$ depends only on $s$ and not on the history of earlier states. \n", |
| 67 | + "* **Utility** ($U(s)$): The utility is defined to be the expected sum of discounted rewards if the policy $\\pi$ is followed from that state onward." |
| 68 | + ] |
| 69 | + }, |
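| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As a quick illustration of how the reward, discount factor and utility fit together, here is a minimal sketch in plain Java (no aima-java classes; the method name `discountedReturn` is purely illustrative) that computes the discounted sum of a reward sequence:\n", |
| | + "\n", |
| | + "```java\n", |
| | + "// Present value of a reward sequence r_0, r_1, r_2, ... under discount factor gamma.\n", |
| | + "static double discountedReturn(double[] rewards, double gamma) {\n", |
| | + "    double sum = 0.0, factor = 1.0;   // factor holds gamma^t\n", |
| | + "    for (double r : rewards) {\n", |
| | + "        sum += factor * r;\n", |
| | + "        factor *= gamma;\n", |
| | + "    }\n", |
| | + "    return sum;\n", |
| | + "}\n", |
| | + "\n", |
| | + "// A single reward of 10 arriving after 3 time steps with gamma = 0.8:\n", |
| | + "// discountedReturn(new double[]{0, 0, 0, 10}, 0.8)  ==  0.8 * 0.8 * 0.8 * 10  =  5.12\n", |
| | + "```" |
| | + ] |
| | + }, |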
| 70 | + { |
| 71 | + "cell_type": "markdown", |
| 72 | + "metadata": {}, |
| 73 | + "source": [ |
| 74 | + "## Passive Reinforcement Learning" |
| 75 | + ] |
| 76 | + }, |
| 77 | + { |
| 78 | + "cell_type": "markdown", |
| 79 | + "metadata": {}, |
| 80 | + "source": [ |
| 81 | + "In passive learning, the agent's policy $\\pi$ is fixed: in state $s$, it always executes the action $\\pi(s)$. It's goal is to learn how good a policy is - that is to learn a utility function $U^{\\pi}(s)$. Note that the passive learning agent does not know the transition model $P(s'|s,a)$, which specifies the probability of reaching state $s'$, from state $s$ after doing action $a$; nor does it know the reward function $R(s)$, which specifies the reward for each state. The agent executes a set of trials in the environment using its policy $\\pi$. In each trial, agent begins from the start-position and experience a sequence of state transition until it reaches one of the terminal states. Its percept supply both the current state and the reward receied in that state. The objective is to use the information about the rewards to learn the expected utility $U^{\\pi}(s)$ associated with each non-terminal state $s$. \n", |
| 82 | + "\n", |
| 83 | + "Since, the utility values obey the Bellman equation for a fixed policy $\\pi$, i.e. _the utility for each state equals its own reward plus the expected utility of its successors states_,\n", |
| 84 | + "\n", |
| 85 | + "$U^{\\pi}(s) = R(s) + \\gamma\\sum_{s'}P(s' | s,\\pi(s))U^\\pi(s')$" |
| 86 | + ] |
| 87 | + }, |
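| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "If the transition model and rewards were already known, this fixed-policy Bellman equation could be solved by simply iterating the update until the values settle. The fragment below is only a rough sketch of that idea in plain Java (the method name `evaluatePolicy` and the string-keyed maps are illustrative assumptions, not part of aima-java); the cells further down use the library's own `ModifiedPolicyEvaluation` for this step.\n", |
| | + "\n", |
| | + "```java\n", |
| | + "import java.util.*;\n", |
| | + "\n", |
| | + "// Repeatedly apply U(s) <- R(s) + gamma * sum_{s'} P(s'|s, pi(s)) * U(s').\n", |
| | + "// P.get(s).get(sPrime) is assumed to hold P(s'|s, pi(s)) for the fixed policy pi;\n", |
| | + "// terminal states simply have no outgoing entries in P.\n", |
| | + "static Map<String, Double> evaluatePolicy(Set<String> states,\n", |
| | + "                                          Map<String, Double> R,\n", |
| | + "                                          Map<String, Map<String, Double>> P,\n", |
| | + "                                          double gamma, int iterations) {\n", |
| | + "    Map<String, Double> U = new HashMap<>();\n", |
| | + "    for (String s : states) U.put(s, 0.0);\n", |
| | + "    for (int k = 0; k < iterations; k++) {\n", |
| | + "        Map<String, Double> next = new HashMap<>();\n", |
| | + "        for (String s : states) {\n", |
| | + "            double expected = 0.0;\n", |
| | + "            for (Map.Entry<String, Double> e : P.getOrDefault(s, Collections.emptyMap()).entrySet())\n", |
| | + "                expected += e.getValue() * U.get(e.getKey());\n", |
| | + "            next.put(s, R.get(s) + gamma * expected);\n", |
| | + "        }\n", |
| | + "        U = next;\n", |
| | + "    }\n", |
| | + "    return U;\n", |
| | + "}\n", |
| | + "```" |
| | + ] |
| | + }, |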
| 88 | + { |
| 89 | + "cell_type": "markdown", |
| 90 | + "metadata": {}, |
| 91 | + "source": [ |
| 92 | + "### Adaptive Dynamic Programming\n", |
| 93 | + "\n", |
| 94 | + "An adaptive dynamic programming agent takes advantage of the constraints among the utilities of states by learning the transition model that connects them and solving the corresponding Markov decision process using a dynamic programming method. For a passive learning agent, this means plugging a learned transition model $P(s'|s,\\pi(s))$ and the observed reward $R(s)$ into the Bellman equation to calculate the utilities of states." |
| 95 | + ] |
| 96 | + }, |
| 97 | + { |
| 98 | + "cell_type": "markdown", |
| 99 | + "metadata": {}, |
| 100 | + "source": [ |
| 101 | + "Let's have a look at the pseudo code of Passive ADP agent: " |
| 102 | + ] |
| 103 | + }, |
| 104 | + { |
| 105 | + "cell_type": "code", |
| 106 | + "execution_count": 17, |
| 107 | + "metadata": {}, |
| 108 | + "outputs": [ |
| 109 | + { |
| 110 | + "data": { |
| 111 | + "text/markdown": [ |
| 112 | + "### AIMA3e\n", |
| 113 | + "__function__ Passive-ADP-Agent(_percept_) __returns__ and action \n", |
| 114 | + " __inputs__: _percept_, a percept indication the current state _s'_ and reward signal _r'_ \n", |
| 115 | + " __persistent__: _π_, a fixed policy \n", |
| 116 | + "       _mdp_, an MDP with model _P_, rewards _R_, discount γ \n", |
| 117 | + "       _U_, a table of utilities, initially empty \n", |
| 118 | + "       _N<sub>sa</sub>_, a table of frequencies for state-action pairs, initially zero \n", |
| 119 | + "       _N<sub>s'|sa</sub>_, a table of outcome frequencies given state-action pairs, initially zero \n", |
| 120 | + "       _s_, _a_, the previous state and action, initially null \n", |
| 121 | + " __if__ _s'_ is new __then__ _U_[_s'_] ← _r'_; _R_[_s'_] ← _r'_ \n", |
| 122 | + " __if__ _s_ is not null __then__ \n", |
| 123 | + "   increment _N<sub>sa</sub>_[_s_, _a_] and _N<sub>s'|sa</sub>_[_s'_, _s_, _a_] \n", |
| 124 | + "   __for each__ _t_ such that _N<sub>s'|sa</sub>_[_t_, _s_, _a_] is nonzero __do__ \n", |
| 125 | + "     _P_(_t_ | _s_, _a_) ← _N<sub>s'|sa</sub>_[_t_, _s_, _a_] / _N<sub>sa</sub>_[_s_, _a_] \n", |
| 126 | + " _U_ ← Policy-Evaluation(_π_, _U_, _mdp_) \n", |
| 127 | + " __if__ _s'_.Terminal? __then__ _s_, _a_ ← null __else__ _s_, _a_ ← _s'_, _π_[_s'_] \n", |
| 128 | + "\n", |
| 129 | + "---\n", |
| 130 | + "__Figure ??__ A passive reinforcement learning agent based on adaptive dynamic programming. The Policy-Evaluation function solves the fixed-policy Bellman equations, as described on page ??." |
| 131 | + ], |
| 132 | + "text/plain": [ |
| 133 | + "<IPython.core.display.Markdown object>" |
| 134 | + ] |
| 135 | + }, |
| 136 | + "execution_count": 15, |
| 137 | + "metadata": {}, |
| 138 | + "output_type": "execute_result" |
| 139 | + } |
| 140 | + ], |
| 141 | + "source": [ |
| 142 | + "%%python\n", |
| 143 | + "from notebookUtils import *\n", |
| 144 | + "pseudocode('Passive ADP Agent')" |
| 145 | + ] |
| 146 | + }, |
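| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "The heart of the update above is the maximum-likelihood estimate of the transition model from the counts _N<sub>sa</sub>_ and _N<sub>s'|sa</sub>_. The fragment below sketches just that bookkeeping in plain Java (the class name `TransitionCounts` is made up for illustration; the `PassiveADPAgent` used in the next cells keeps equivalent tables internally):\n", |
| | + "\n", |
| | + "```java\n", |
| | + "import java.util.*;\n", |
| | + "\n", |
| | + "// Counts observed (s, a) pairs and (s', s, a) outcomes and turns them into the\n", |
| | + "// maximum-likelihood estimate P(s'|s,a) = N_s'|sa[s', s, a] / N_sa[s, a].\n", |
| | + "class TransitionCounts<S, A> {\n", |
| | + "    private final Map<List<Object>, Integer> nSA = new HashMap<>();        // N_sa[s, a]\n", |
| | + "    private final Map<List<Object>, Integer> nSPrimeSA = new HashMap<>();  // N_s'|sa[s', s, a]\n", |
| | + "\n", |
| | + "    void record(S s, A a, S sPrime) {\n", |
| | + "        nSA.merge(List.of(s, a), 1, Integer::sum);\n", |
| | + "        nSPrimeSA.merge(List.of(sPrime, s, a), 1, Integer::sum);\n", |
| | + "    }\n", |
| | + "\n", |
| | + "    double probability(S sPrime, S s, A a) {\n", |
| | + "        int denom = nSA.getOrDefault(List.of(s, a), 0);\n", |
| | + "        return denom == 0 ? 0.0 : nSPrimeSA.getOrDefault(List.of(sPrime, s, a), 0) / (double) denom;\n", |
| | + "    }\n", |
| | + "}\n", |
| | + "```" |
| | + ] |
| | + }, |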
| 147 | + { |
| 148 | + "cell_type": "markdown", |
| 149 | + "metadata": {}, |
| 150 | + "source": [ |
| 151 | + "Let's see our Passive ADP agent in action! Consider a $4*3$ cell world with $[1,1]$ as the starting position. The policy $\\pi$ for the $4*3$ world is shown in the figure below. This policy happens to be optimal with rewards of $R(s)=-0.04$ in the non-terminal states and no discounting. " |
| 152 | + ] |
| 153 | + }, |
| 154 | + { |
| 155 | + "cell_type": "markdown", |
| 156 | + "metadata": {}, |
| 157 | + "source": [ |
| 158 | + "[![Optimal Policy][1]][1]\n", |
| 159 | + "\n", |
| 160 | + "[1]: assets/optimal-policy.png" |
| 161 | + ] |
| 162 | + }, |
| 163 | + { |
| 164 | + "cell_type": "code", |
| 165 | + "execution_count": 22, |
| 166 | + "metadata": {}, |
| 167 | + "outputs": [ |
| 168 | + { |
| 169 | + "name": "stdout", |
| 170 | + "output_type": "stream", |
| 171 | + "text": [ |
| 172 | + "[1,1] \t:\t0.7128593117885544\n", |
| 173 | + "[1,2] \t:\t0.7680398391451688\n", |
| 174 | + "[1,3] \t:\t0.8178806550835265\n", |
| 175 | + "[2,1] \t:\t0.6628583416987663\n", |
| 176 | + "[2,3] \t:\t0.8746799974574001\n", |
| 177 | + "[3,1] \t:\tnull\n", |
| 178 | + "[3,2] \t:\t0.6938189410949245\n", |
| 179 | + "[3,3] \t:\t0.9241799994408929\n", |
| 180 | + "[4,1] \t:\tnull\n", |
| 181 | + "[4,2] \t:\t-1.0\n", |
| 182 | + "[4,3] \t:\t1.0\n" |
| 183 | + ] |
| 184 | + }, |
| 185 | + { |
| 186 | + "data": { |
| 187 | + "text/plain": [ |
| 188 | + "null" |
| 189 | + ] |
| 190 | + }, |
| 191 | + "execution_count": 22, |
| 192 | + "metadata": {}, |
| 193 | + "output_type": "execute_result" |
| 194 | + } |
| 195 | + ], |
| 196 | + "source": [ |
| 197 | + "import aima.core.environment.cellworld.*;\n", |
| 198 | + "import aima.core.learning.reinforcement.agent.PassiveADPAgent;\n", |
| 199 | + "import aima.core.learning.reinforcement.example.CellWorldEnvironment;\n", |
| 200 | + "import aima.core.probability.example.MDPFactory;\n", |
| 201 | + "import aima.core.probability.mdp.impl.ModifiedPolicyEvaluation;\n", |
| 202 | + "import aima.core.util.JavaRandomizer;\n", |
| 203 | + "\n", |
| 204 | + "import java.util.*;;\n", |
| 205 | + "\n", |
| 206 | + "CellWorld<Double> cw = CellWorldFactory.createCellWorldForFig17_1();;\n", |
| 207 | + "CellWorldEnvironment cwe = new CellWorldEnvironment(\n", |
| 208 | + " cw.getCellAt(1, 1),\n", |
| 209 | + " cw.getCells(),\n", |
| 210 | + " MDPFactory.createTransitionProbabilityFunctionForFigure17_1(cw),\n", |
| 211 | + " new JavaRandomizer());\n", |
| 212 | + "Map<Cell<Double>, CellWorldAction> fixedPolicy = new HashMap<Cell<Double>, CellWorldAction>();\n", |
| 213 | + "fixedPolicy.put(cw.getCellAt(1, 1), CellWorldAction.Up);\n", |
| 214 | + "fixedPolicy.put(cw.getCellAt(1, 2), CellWorldAction.Up);\n", |
| 215 | + "fixedPolicy.put(cw.getCellAt(1, 3), CellWorldAction.Right);\n", |
| 216 | + "fixedPolicy.put(cw.getCellAt(2, 1), CellWorldAction.Left);\n", |
| 217 | + "fixedPolicy.put(cw.getCellAt(2, 3), CellWorldAction.Right);\n", |
| 218 | + "fixedPolicy.put(cw.getCellAt(3, 1), CellWorldAction.Left);\n", |
| 219 | + "fixedPolicy.put(cw.getCellAt(3, 2), CellWorldAction.Up);\n", |
| 220 | + "fixedPolicy.put(cw.getCellAt(3, 3), CellWorldAction.Right);\n", |
| 221 | + "fixedPolicy.put(cw.getCellAt(4, 1), CellWorldAction.Left);\n", |
| 222 | + "PassiveADPAgent<Cell<Double>, CellWorldAction> padpa = new PassiveADPAgent<Cell<Double>, CellWorldAction>(\n", |
| 223 | + " fixedPolicy,\n", |
| 224 | + " cw.getCells(), \n", |
| 225 | + " cw.getCellAt(1, 1), \n", |
| 226 | + " MDPFactory.createActionsFunctionForFigure17_1(cw),\n", |
| 227 | + " new ModifiedPolicyEvaluation<Cell<Double>, CellWorldAction>(10,1.0));\n", |
| 228 | + "cwe.addAgent(padpa);\n", |
| 229 | + "padpa.reset();\n", |
| 230 | + "cwe.executeTrials(2000);\n", |
| 231 | + "\n", |
| 232 | + "Map<Cell<Double>, Double> U = padpa.getUtility();\n", |
| 233 | + "for(int i = 1; i<=4; i++){\n", |
| 234 | + " for(int j = 1; j<=3; j++){\n", |
| 235 | + " if(i==2 && j==2) continue; //Ignore wall\n", |
| 236 | + " System.out.println(\"[\" + i + \",\" + j + \"]\" + \" \\t:\\t\" + U.get(cw.getCellAt(i,j)));\n", |
| 237 | + " }\n", |
| 238 | + "}" |
| 239 | + ] |
| 240 | + }, |
| 241 | + { |
| 242 | + "cell_type": "markdown", |
| 243 | + "metadata": {}, |
| 244 | + "source": [ |
| 245 | + "Note that the cells $[3,1]$ and $[4,1]$ are not reachable when starting at $[1,1]$ using the policy and the default transition model i.e. 80% intended and 10% each right angle from intended." |
| 246 | + ] |
| 247 | + }, |
| 248 | + { |
| 249 | + "cell_type": "markdown", |
| 250 | + "metadata": {}, |
| 251 | + "source": [ |
| 252 | + "The learning curves of the Passive ADP agent for the $4*3$ world (given the optimal policy) are shown below." |
| 253 | + ] |
| 254 | + }, |
| 255 | + { |
| 256 | + "cell_type": "code", |
| 257 | + "execution_count": 25, |
| 258 | + "metadata": {}, |
| 259 | + "outputs": [ |
| 260 | + { |
| 261 | + "data": { |
| 262 | + "text/plain": [ |
| 263 | + "null" |
| 264 | + ] |
| 265 | + }, |
| 266 | + "execution_count": 25, |
| 267 | + "metadata": {}, |
| 268 | + "output_type": "execute_result" |
| 269 | + } |
| 270 | + ], |
| 271 | + "source": [ |
| 272 | + "import aima.core.environment.cellworld.*;\n", |
| 273 | + "import aima.core.learning.reinforcement.agent.PassiveADPAgent;\n", |
| 274 | + "import aima.core.learning.reinforcement.example.CellWorldEnvironment;\n", |
| 275 | + "import aima.core.probability.example.MDPFactory;\n", |
| 276 | + "import aima.core.probability.mdp.impl.ModifiedPolicyEvaluation;\n", |
| 277 | + "import aima.core.util.JavaRandomizer;\n", |
| 278 | + "\n", |
| 279 | + "import java.util.*;\n", |
| 280 | + "\n", |
| 281 | + "int numRuns = 20;\n", |
| 282 | + "int numTrialsPerRun = 100;\n", |
| 283 | + "int rmseTrialsToReport = 100;\n", |
| 284 | + "int reportEveryN = 1;\n", |
| 285 | + "\n", |
| 286 | + "CellWorld<Double> cw = CellWorldFactory.createCellWorldForFig17_1();;\n", |
| 287 | + "CellWorldEnvironment cwe = new CellWorldEnvironment(\n", |
| 288 | + " cw.getCellAt(1, 1),\n", |
| 289 | + " cw.getCells(),\n", |
| 290 | + " MDPFactory.createTransitionProbabilityFunctionForFigure17_1(cw),\n", |
| 291 | + " new JavaRandomizer());\n", |
| 292 | + "Map<Cell<Double>, CellWorldAction> fixedPolicy = new HashMap<Cell<Double>, CellWorldAction>();\n", |
| 293 | + "fixedPolicy.put(cw.getCellAt(1, 1), CellWorldAction.Up);\n", |
| 294 | + "fixedPolicy.put(cw.getCellAt(1, 2), CellWorldAction.Up);\n", |
| 295 | + "fixedPolicy.put(cw.getCellAt(1, 3), CellWorldAction.Right);\n", |
| 296 | + "fixedPolicy.put(cw.getCellAt(2, 1), CellWorldAction.Left);\n", |
| 297 | + "fixedPolicy.put(cw.getCellAt(2, 3), CellWorldAction.Right);\n", |
| 298 | + "fixedPolicy.put(cw.getCellAt(3, 1), CellWorldAction.Left);\n", |
| 299 | + "fixedPolicy.put(cw.getCellAt(3, 2), CellWorldAction.Up);\n", |
| 300 | + "fixedPolicy.put(cw.getCellAt(3, 3), CellWorldAction.Right);\n", |
| 301 | + "fixedPolicy.put(cw.getCellAt(4, 1), CellWorldAction.Left);\n", |
| 302 | + "PassiveADPAgent<Cell<Double>, CellWorldAction> padpa = new PassiveADPAgent<Cell<Double>, CellWorldAction>(\n", |
| 303 | + " fixedPolicy,\n", |
| 304 | + " cw.getCells(), \n", |
| 305 | + " cw.getCellAt(1, 1), \n", |
| 306 | + " MDPFactory.createActionsFunctionForFigure17_1(cw),\n", |
| 307 | + " new ModifiedPolicyEvaluation<Cell<Double>, CellWorldAction>(10,1.0));\n", |
| 308 | + "cwe.addAgent(padpa);\n", |
| 309 | + "Map<Integer, List<Map<Cell<Double>, Double>>> runs = new HashMap<Integer, List<Map<Cell<Double>, Double>>>();\n", |
| 310 | + "for (int r = 0; r < numRuns; r++) {\n", |
| 311 | + " padpa.reset();\n", |
| 312 | + " List<Map<Cell<Double>, Double>> trials = new ArrayList<Map<Cell<Double>, Double>>();\n", |
| 313 | + " for (int t = 0; t < numTrialsPerRun; t++) {\n", |
| 314 | + " cwe.executeTrial();\n", |
| 315 | + " if (0 == t % reportEveryN) {\n", |
| 316 | + " Map<Cell<Double>, Double> u = padpa.getUtility();\n", |
| 317 | + " trials.add(u);\n", |
| 318 | + " }\n", |
| 319 | + " }\n", |
| 320 | + " runs.put(r, trials);\n", |
| 321 | + "}" |
67 | 322 | ] |
68 | 323 | }, |
69 | 324 | { |