|
22 | 22 | { |
23 | 23 | "data": { |
24 | 24 | "application/vnd.jupyter.widget-view+json": { |
25 | | - "model_id": "4478f207-b147-4b3e-aca2-34a8cfe6db14", |
| 25 | + "model_id": "92cdc6bc-eca3-4683-bfc1-8efc04e93c6f", |
26 | 26 | "version_major": 2, |
27 | 27 | "version_minor": 0 |
28 | 28 | }, |
|
680 | 680 | "cell_type": "markdown", |
681 | 681 | "metadata": {}, |
682 | 682 | "source": [ |
683 | | - "The passive TD agent does not learn quite as fast as the ADP agent and shows much higher variability, but it is much simpler and requires much lesser computation per observation. Notice that, the TD does not need the transition model to perform its updates. The environment supplies the connection between neighboring states int the form of observed transitions.\n", |
| 683 | + "The passive TD agent does not learn quite as fast as the ADP agent and shows much higher variability, but it is much simpler and requires much lesser computation per observation. Notice that, the TD does not need the transition model to perform its updates. The environment supplies the connection between neighboring states in the form of observed transitions.\n", |
684 | 684 | "\n", |
685 | 685 | "The ADP approach and the TD approach are closely related. Both try to make local adjustments to the utility estimates in order to make each state \"agree\" with its successors. One difference is TD adjusts a state to agree with its _observed_ successor, whereas ADP adjusts the state to agree with _all_ of the successors that might occur, weighted by their probabilities. " |
686 | 686 | ] |
687 | 687 | }, |
| 688 | + { |
| 689 | + "cell_type": "markdown", |
| 690 | + "metadata": {}, |
| 691 | + "source": [ |
| 692 | + "## Active Reinforcement Learning" |
| 693 | + ] |
| 694 | + }, |
| 695 | + { |
| 696 | + "cell_type": "markdown", |
| 697 | + "metadata": {}, |
| 698 | + "source": [ |
| 699 | + "A passive learning agent has a fixed policy that determines its behaviour, whereas an active agent must decide what actions to take. For this, first, it needs to learn a complete model with outcome probabilities for all actions, rather than just the model for the fixed policy. Next, we need to take into account is the fact that the agent has a choice of actions. The utilities it needs to learn are those defined by the optimal policy. Since they obey the Bellman equation:\n", |
| 700 | + "\n", |
| 701 | + "$U(s) = R(s) + \\gamma. max_{a}\\sum_{s'}P(s'|s, a)U(s')$\n", |
| 702 | + "\n", |
| 703 | + "The final issue is to learn what to do at each step. Having obtained the utility function $U$, that is optimal for the learned model, the agent can extract an optimal action by one-step look-ahead to maximize the expected utility; alternatively if it uses policy iteration, the optimal policy is already available, so it should simply execute the action the optimal policy recommends." |
| 704 | + ] |
| 705 | + }, |
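| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "To make the one-step look-ahead concrete, the cell below gives a minimal Python sketch (run with the same `%%python` magic used for the pseudocode cells). The names are assumptions for illustration: `P[s][a]` is taken to be a dictionary mapping each successor state to its estimated probability, `U` maps states to utility estimates, and `actions(s)` lists the legal actions in `s`." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "%%python\n", |
| | + "# Illustrative sketch: one-step look-ahead action selection under a\n", |
| | + "# learned model. P[s][a] maps successor states to probabilities and\n", |
| | + "# U maps states to utility estimates (assumed data layout).\n", |
| | + "\n", |
| | + "def expected_utility(s, a, U, P):\n", |
| | + "    # Expected utility of doing action a in state s under model P\n", |
| | + "    return sum(p * U[s1] for s1, p in P[s][a].items())\n", |
| | + "\n", |
| | + "def best_action(s, U, P, actions):\n", |
| | + "    # One-step look-ahead: pick the action with the highest expected utility\n", |
| | + "    return max(actions(s), key=lambda a: expected_utility(s, a, U, P))\n", |
| | + "\n", |
| | + "# Tiny usage example on a hand-made two-state model\n", |
| | + "P = {'A': {'right': {'B': 0.8, 'A': 0.2}, 'stay': {'A': 1.0}}}\n", |
| | + "U = {'A': 0.5, 'B': 1.0}\n", |
| | + "print(best_action('A', U, P, lambda s: ['right', 'stay']))  # -> right" |
| | + ] |
| | + }, |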
| 706 | + { |
| 707 | + "cell_type": "markdown", |
| 708 | + "metadata": {}, |
| 709 | + "source": [ |
| 710 | + "But the agent that follows the recommendation of the optimal policy for the learned model at each step, **fails to learn the true utilities or the true optimal policy!** What happens instead is that, the agent greedily follows these recommendations and converges to a policy that proceeds to the terminal states through the greedy approach. It never learns the utilities of other states and thus never finds a more optimal path (if possible). We call such agent, the **greedy agent**." |
| 711 | + ] |
| 712 | + }, |
| 713 | + { |
| 714 | + "cell_type": "markdown", |
| 715 | + "metadata": {}, |
| 716 | + "source": [ |
| 717 | + "An agent therefore must make a tradeoff between **exploitation** to maximize its reward and **exploration** to maximize its long-term well being. Technically, any such scheme that will eventually lead to the optimal behaviour by the agent, need to be greedy in the limit of infinite exploration, or **GLIE**. A GLIE scheme must try each action in each state an unbounded number of times to avoid having a finite probability that an optimal action is missed because of an unusually bad series of outcomes. An agent using such an scheme will eventually learn the true environment model. There are several GLIE schemes, one of the simplest is to have the agent choose a random action a fraction $1/t$ of the time and to follow the greedy policy otherwise. While this does converge to an optimal policy, it can be extremely slow. A more sensible approach would give some weights to the action that the agent has not tried very often, while tending to avoid actions that are believed to be of low utility. This can be implemented by altering the above equation so that it assigns a higher utility estimate to relatively unexplored state-action pairs:\n", |
| 718 | + "\n", |
| 719 | + "$U^{+}(s) \\leftarrow R(s) + \\gamma. max_{a}f(\\sum_{s'}P(s'|s, a)U^{+}(s'), N(s,a))$\n", |
| 720 | + "\n", |
| 721 | + "Here,\n", |
| 722 | + "* $U^{+}(s)$ is used to denote the optimistic estimate of the utility of the state $s$.\n", |
| 723 | + "* $N(s,a)$ be the number of times action $a$ has been tried in the state $s$.\n", |
| 724 | + "* $f(u, n)$ is called the **exploration function**. It determines how greed (preference for high values of $u$) is traded off against curiosity (preference for the actions that have niot been tried often have low $n$). This function should be increasing in $u$ and decreasing in $n$." |
| 725 | + ] |
| 726 | + }, |
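| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As a concrete illustration, here is one common, simple choice of exploration function, sketched in Python below. The constants `R_plus` (an optimistic estimate of the best possible reward) and `N_e` (how many tries we want for every state-action pair) are assumed values for illustration." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "%%python\n", |
| | + "# Illustrative sketch of a simple exploration function f(u, n):\n", |
| | + "# be optimistic about any state-action pair tried fewer than N_e times,\n", |
| | + "# otherwise fall back to the current utility estimate u.\n", |
| | + "\n", |
| | + "R_plus = 2.0   # optimistic estimate of the best possible reward (assumed)\n", |
| | + "N_e = 5        # desired minimum number of tries per state-action pair\n", |
| | + "\n", |
| | + "def f(u, n):\n", |
| | + "    # Non-decreasing in u (greed), non-increasing in n (curiosity)\n", |
| | + "    return R_plus if n < N_e else u\n", |
| | + "\n", |
| | + "print(f(0.3, 2))   # under-explored pair -> optimistic value 2.0\n", |
| | + "print(f(0.3, 10))  # well-explored pair -> its estimated utility 0.3" |
| | + ] |
| | + }, |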
| 727 | + { |
| 728 | + "cell_type": "markdown", |
| 729 | + "metadata": {}, |
| 730 | + "source": [ |
| 731 | + "Now that we have an active ADP agent, let's discuss how to construct an active temporal difference learning agent. There's a method called **Q-learning**, which learns an action-utility representation. We'll use the notation $Q(s,a)$ to denote the value of doing action $a$ in state $s$. Q-values are directly related to utility values as follows:\n", |
| 732 | + "\n", |
| 733 | + "$U(s) = max_{a}Q(s,a)$\n", |
| 734 | + "\n", |
| 735 | + "Note that, a TD agent that learns a Q-function does not need a model of the form $P(s'|s,a)$, either for learning or for action selection. For this reason, Q-learning is called a **model-free** method. Therefore, the update equation for TD Q-learning is \n", |
| 736 | + "\n", |
| 737 | + "$Q(s,a) \\leftarrow Q(s,a) + \\alpha(R(s) + \\gamma.max_{a'}Q(s',a')-Q(s,a))$\n", |
| 738 | + "\n", |
| 739 | + "which is calculated whenever action $a$ is executed in state $s$ leading to state $s'$." |
| 740 | + ] |
| 741 | + }, |
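| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "Written in code, this update is only a couple of lines per observed transition. Below is a minimal sketch; storing `Q` as a dictionary keyed by `(state, action)` pairs and the default parameter values are assumptions for illustration." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "%%python\n", |
| | + "# Illustrative sketch of the tabular TD Q-learning update for a single\n", |
| | + "# observed transition from s to s1 via action a, with reward r = R(s).\n", |
| | + "from collections import defaultdict\n", |
| | + "\n", |
| | + "def q_update(Q, s, a, r, s1, actions_in_s1, alpha=0.1, gamma=0.9):\n", |
| | + "    # TD target: R(s) + gamma * max over a' of Q(s', a')\n", |
| | + "    best_next = max((Q[(s1, a1)] for a1 in actions_in_s1), default=0.0)\n", |
| | + "    # Move Q(s, a) a fraction alpha of the way toward the target\n", |
| | + "    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])\n", |
| | + "\n", |
| | + "# Tiny usage example\n", |
| | + "Q = defaultdict(float)\n", |
| | + "q_update(Q, s='A', a='right', r=-0.04, s1='B', actions_in_s1=['left', 'right'])\n", |
| | + "print(Q[('A', 'right')])  # small negative value after one update" |
| | + ] |
| | + }, |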
| 742 | + { |
| 743 | + "cell_type": "markdown", |
| 744 | + "metadata": {}, |
| 745 | + "source": [ |
| 746 | + "Let's have look at the pseudo code of Q-Learning agent:" |
| 747 | + ] |
| 748 | + }, |
| 749 | + { |
| 750 | + "cell_type": "code", |
| 751 | + "execution_count": 3, |
| 752 | + "metadata": {}, |
| 753 | + "outputs": [ |
| 754 | + { |
| 755 | + "data": { |
| 756 | + "text/markdown": [ |
| 757 | + "##AIMA3e\n", |
| 758 | + "__function__ Q-Learning_Agent(_percept_) __returns__ an action \n", |
| 759 | + " __inputs__: _percept_, a percept indicating the current state _s'_ and reward signal _r'_ \n", |
| 760 | + " __persistent__: _Q_, a table of action values indexed by state and action, initially zero \n", |
| 761 | + "       _N<sub>sa</sub>_, a table of frequencies for state-action pairs, initially zero \n", |
| 762 | + "       _s_, _a_, _r_, the previous state, action, and reward, initially null \n", |
| 763 | + "\n", |
| 764 | + " __if__ Terminal?(_s_) then _Q_[_s_, None] ← _r'_ \n", |
| 765 | + " __if__ _s_ is not null __then__ \n", |
| 766 | + "   increment _N<sub>sa</sub>_[_s_, _a_] \n", |
| 767 | + "   _Q_[_s_, _a_] ← _Q_[_s_, _a_] + _α_(_N<sub>sa</sub>_[_s_, _a_])(_r_ + _γ_ max<sub>a'</sub> _Q_[_s'_, _a'_] - _Q_[_s_, _a_]) \n", |
| 768 | + " _s_, _a_, _r_ ← _s'_, argmax<sub>a'</sub> _f_(_Q_[_s'_, _a'_], _N<sub>sa</sub>_[_s'_, _a'_]), _r'_ \n", |
| 769 | + " __return__ _a_ \n", |
| 770 | + "\n", |
| 771 | + "---\n", |
| 772 | + "__Figure ??__ An exploratory Q-learning agent. It is an active learner that learns the value _Q_(_s_, _a_) of each action in each situation. It uses the same exploration function _f_ as the exploratory ADP agent, but avoids having to learn the transition model because the Q-value of a state can be related directly to those of its neighbors." |
| 773 | + ], |
| 774 | + "text/plain": [ |
| 775 | + "<IPython.core.display.Markdown object>" |
| 776 | + ] |
| 777 | + }, |
| 778 | + "execution_count": 2, |
| 779 | + "metadata": {}, |
| 780 | + "output_type": "execute_result" |
| 781 | + } |
| 782 | + ], |
| 783 | + "source": [ |
| 784 | + "%%python\n", |
| 785 | + "from notebookUtils import *\n", |
| 786 | + "pseudocode('Q Learning Agent')" |
| 787 | + ] |
| 788 | + }, |
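| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "The pseudocode above translates into a fairly short Python class. The cell below is one possible sketch (again via `%%python`); the interface, the decaying learning rate and the exploration constants are assumptions made for illustration rather than the notebook's library implementation." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "%%python\n", |
| | + "# Illustrative sketch of an exploratory Q-learning agent following the\n", |
| | + "# pseudocode above. all_actions(s) should return the actions available in\n", |
| | + "# state s (and [None] for terminal states); terminals is the set of\n", |
| | + "# terminal states; R_plus and N_e parameterise the exploration function f.\n", |
| | + "from collections import defaultdict\n", |
| | + "\n", |
| | + "class QLearningAgentSketch:\n", |
| | + "    def __init__(self, all_actions, terminals, gamma=0.9, R_plus=2.0, N_e=5):\n", |
| | + "        self.Q = defaultdict(float)        # Q[(s, a)], initially zero\n", |
| | + "        self.Nsa = defaultdict(int)        # N[(s, a)] visit counts, initially zero\n", |
| | + "        self.all_actions = all_actions\n", |
| | + "        self.terminals = terminals\n", |
| | + "        self.gamma, self.R_plus, self.N_e = gamma, R_plus, N_e\n", |
| | + "        self.s = self.a = self.r = None    # previous state, action and reward\n", |
| | + "\n", |
| | + "    def alpha(self, n):\n", |
| | + "        # Decaying learning rate so that the Q-values can converge\n", |
| | + "        return 60.0 / (59 + n)\n", |
| | + "\n", |
| | + "    def f(self, u, n):\n", |
| | + "        # Exploration function: optimistic until (s, a) has been tried N_e times\n", |
| | + "        return self.R_plus if n < self.N_e else u\n", |
| | + "\n", |
| | + "    def __call__(self, percept):\n", |
| | + "        s1, r1 = percept                   # current state s' and reward r'\n", |
| | + "        if s1 in self.terminals:\n", |
| | + "            self.Q[(s1, None)] = r1\n", |
| | + "        if self.s is not None:\n", |
| | + "            s, a = self.s, self.a\n", |
| | + "            self.Nsa[(s, a)] += 1\n", |
| | + "            best = max(self.Q[(s1, a1)] for a1 in self.all_actions(s1))\n", |
| | + "            self.Q[(s, a)] += self.alpha(self.Nsa[(s, a)]) * (\n", |
| | + "                self.r + self.gamma * best - self.Q[(s, a)])\n", |
| | + "        if s1 in self.terminals:\n", |
| | + "            self.s = self.a = self.r = None    # episode finished\n", |
| | + "        else:\n", |
| | + "            self.s, self.r = s1, r1\n", |
| | + "            self.a = max(self.all_actions(s1),\n", |
| | + "                         key=lambda a1: self.f(self.Q[(s1, a1)], self.Nsa[(s1, a1)]))\n", |
| | + "        return self.a" |
| | + ] |
| | + }, |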
688 | 789 | { |
689 | 790 | "cell_type": "code", |
690 | 791 | "execution_count": null, |
|