Here is a table of how the Q estimates evolve, to make the argument easier to follow:
| time step | \(Q(1)\) | \(Q(2)\) | \(Q(3)\) | \(Q(4)\) | action | reward |
|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 1 | -1 |
| 2 | -1 | 0 | 0 | 0 | 2 | 1 |
| 3 | -1 | 1 | 0 | 0 | 2 | -2 |
| 4 | -1 | -0.5 | 0 | 0 | 2 | 2 |
| 5 | -1 | 1/3 | 0 | 0 | 3 | 0 |
Just check whether the action chosen at each time step has the largest Q estimate; if it does not, the selection was definitely random. By this criterion, the actions at time steps 4 and 5 were definitely random; at the other time steps the action may have been random or may have been a \(greedy\) action (a random selection can still coincide with the greedy one).
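A minimal sketch of this check, assuming the sample-average update \(Q_{n+1}=Q_n+\frac{1}{n}[R_n-Q_n]\) with \(Q_1(a)=0\) as in Exercise 2.2 (the variable names are mine, not the book's):

```python
# Reproduce the table above for the 4-armed bandit of Exercise 2.2
# and flag the time steps where the epsilon (random) case must have occurred.
import numpy as np

actions = [1, 2, 2, 2, 3]     # A_1 ... A_5 (1-indexed as in the book)
rewards = [-1, 1, -2, 2, 0]   # R_1 ... R_5

Q = np.zeros(4)               # initial estimates Q_1(a) = 0
N = np.zeros(4)               # action counts

for t, (a, r) in enumerate(zip(actions, rewards), start=1):
    greedy_set = np.flatnonzero(Q == Q.max()) + 1   # actions tied for the max estimate
    verdict = ("definitely epsilon case" if a not in greedy_set
               else "possibly greedy or epsilon case")
    print(f"t={t}: Q={Q}, A_t={a}, greedy actions={greedy_set}, {verdict}")
    # sample-average update: Q_{n+1} = Q_n + (1/n)(R_n - Q_n)
    N[a - 1] += 1
    Q[a - 1] += (r - Q[a - 1]) / N[a - 1]
```

Running this prints "definitely epsilon case" exactly at t = 4 and t = 5, matching the argument above.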
2.3 In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.
First rule out the greedy method: finding the globally optimal action requires that every action be tried infinitely often, which the greedy method cannot guarantee. Both \(\epsilon\)-greedy methods do guarantee this, and in the long run the smaller the exploration rate, the higher the probability of selecting the optimal action. In theory, with \(\epsilon=0.01\) the upper bound on the probability of selecting the optimal action is \(1-\epsilon+\epsilon /10 = 99.1\%\) (10-armed testbed), while for \(\epsilon=0.1\) the bound is \(1-\epsilon+\epsilon /10 = 91\%\). So \(\epsilon=0.01\) is better in the long run, with a probability of selecting the best action that is 8.1 percentage points higher, and correspondingly a higher cumulative reward.
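To make the arithmetic explicit, here is a tiny sketch of the asymptotic calculation (it assumes the value estimates have converged so that the greedy action is the true best one; the function name is just for illustration):

```python
# Asymptotic probability that epsilon-greedy picks the best action on a k-armed testbed:
# greedy pick with probability 1 - epsilon, plus the 1/k chance that the random pick
# also happens to be the best action.
def asymptotic_best_action_prob(epsilon: float, k: int = 10) -> float:
    return 1 - epsilon + epsilon / k

p_small = asymptotic_best_action_prob(0.01)   # 0.991
p_large = asymptotic_best_action_prob(0.1)    # 0.91
print(p_small, p_large, p_small - p_large)    # difference = 0.081
```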
2.4 If the step-size parameters, \(\alpha_n\), are not constant, then the estimate \(Q_n\) is a weighted average of previously received rewards with a weighting different from that given by (2.6). What is the weighting on each prior reward for the general case, analogous to (2.6), in terms of the sequence of step-size parameters?
Unrolling (2.6) for a general step-size sequence:
\[\begin{array}{l} Q_{n+1} = Q_n+\alpha_n[R_n-Q_n]\\ \qquad \ \ = \alpha_n R_n+(1-\alpha_n)Q_n \\ \qquad \ \ = \alpha_n R_n+(1-\alpha_n)[\alpha_{n-1}R_{n-1}+(1-\alpha_{n-1})Q_{n-1}] \\ \qquad \ \ = \alpha_n R_n + (1-\alpha_n)\alpha_{n-1}R_{n-1}+(1-\alpha_n)(1-\alpha_{n-1})\alpha_{n-2}R_{n-2}+ \\ \qquad \qquad ... +(1-\alpha_n)(1-\alpha_{n-1})...(1-\alpha_2)\alpha_1R_1+(1-\alpha_n)(1-\alpha_{n-1})...(1-\alpha_1)Q_1 \\ \qquad \ \ = \prod_{i=1}^{n}(1-\alpha_i)\, Q_1+\sum_{i=1}^{n}\alpha_iR_i\prod_{j=i+1}^n(1-\alpha_j) \end{array} \]
So the weighting on each prior reward \(R_i\) is \(\alpha_i\prod_{j=i+1}^{n}(1-\alpha_j)\).
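As a sanity check on the closed form, here is a short sketch of my own (not from the book) that compares the incremental update against the unrolled weighted sum for an arbitrary step-size sequence:

```python
# Numerically verify that prod_{i=1}^n (1-alpha_i) * Q_1 + sum_i alpha_i R_i prod_{j>i} (1-alpha_j)
# equals the result of the incremental update Q_{n+1} = Q_n + alpha_n (R_n - Q_n).
import numpy as np

rng = np.random.default_rng(0)
n = 8
Q1 = 0.5
alphas = rng.uniform(0.05, 0.9, size=n)   # arbitrary step-size sequence alpha_1..alpha_n
rewards = rng.normal(size=n)              # arbitrary rewards R_1..R_n

# incremental form
Q = Q1
for a, r in zip(alphas, rewards):
    Q += a * (r - Q)

# closed (weighted-average) form
closed = np.prod(1 - alphas) * Q1 + sum(
    alphas[i] * rewards[i] * np.prod(1 - alphas[i + 1:]) for i in range(n)
)

print(Q, closed)                # the two agree up to floating-point error
assert np.isclose(Q, closed)
```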