Reinforcement Learning: An Introduction - Exercise Solutions, Chapters 1 and 2

Reinforcement Learning: An Introduction (second edition) - Chapters 1 and 2

Chapter 1

1.1
Self-Play Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?

In a two-player perfect-information game such as tic-tac-toe, a self-play algorithm tends to converge to a Nash-equilibrium strategy (see the literature for the formal definition and convergence analysis). The policy learned this way differs from the one learned against a random opponent: when playing a fixed random opponent, the learner ends up exploiting that opponent's weaknesses (its exploitability) and converges to a policy specialized for beating that particular opponent.
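As a concrete illustration, here is a minimal self-contained sketch of self-play with a tabular value method in the spirit of Section 1.5 (an assumed implementation in Python, not the book's program): both sides choose moves \(\epsilon\)-greedily over afterstate values stored in a shared table and back each afterstate's value up toward the value of the next afterstate, roughly following the book's rule of not learning from exploratory moves.

```python
# Minimal self-play sketch for tic-tac-toe with a tabular TD-style value update.
# Assumed implementation for illustration only; hyperparameters are arbitrary.
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]                      # +1 or -1
    return 0

def empty_cells(board):
    return [i for i, v in enumerate(board) if v == 0]

def value(V, board, player):
    """Estimated probability that `player` wins from this position."""
    w = winner(board)
    if w != 0:
        return 1.0 if w == player else 0.0
    if not empty_cells(board):
        return 0.5                               # draw
    return V.get((board, player), 0.5)           # unseen states start at 0.5

def epsilon_greedy_move(V, board, player, epsilon):
    """Pick a move by evaluating afterstates; returns (move, was_exploratory)."""
    if random.random() < epsilon:
        return random.choice(empty_cells(board)), True
    best = max(empty_cells(board),
               key=lambda m: value(V, board[:m] + (player,) + board[m + 1:], player))
    return best, False

def self_play_episode(V, epsilon=0.1, alpha=0.1):
    board, player = (0,) * 9, 1
    prev = {1: None, -1: None}                   # each side's previous afterstate
    while winner(board) == 0 and empty_cells(board):
        move, exploratory = epsilon_greedy_move(V, board, player, epsilon)
        board = board[:move] + (player,) + board[move + 1:]
        if prev[player] is not None and not exploratory:
            s = prev[player]                     # TD backup toward the new afterstate
            V[s] = value(V, *s) + alpha * (value(V, board, player) - value(V, *s))
        prev[player] = (board, player)
        player = -player
    if prev[player] is not None:                 # final backup for the losing/drawing side
        s = prev[player]
        V[s] = value(V, *s) + alpha * (value(V, board, player) - value(V, *s))

V = {}
for _ in range(20000):                           # learn by playing against itself
    self_play_episode(V)
```

After enough episodes, greedy play with respect to the shared table approximates equilibrium play against itself; training the same table against a fixed random opponent instead would drive it toward a best response to that opponent, as described above.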

1.2
Symmetries Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the learning process described above to take advantage of this? In what ways would this change improve the learning process? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

We can rotate and mirror states and their corresponding actions (a form of data augmentation); the symmetric states can then share the same targets, such as the value label or the move-probability label. This reduces the number of environment interactions needed and makes exploration more efficient, since a single interaction yields several training samples. However, if the opponent's play does not respect the symmetry, this augmentation no longer helps: symmetrically equivalent positions then call for different responses and no longer share the same value or probability labels. In that case each state has to be treated separately, so that the opponent's exploitability can be used to learn a better policy against that specific opponent.
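A small sketch (a hypothetical helper, not from the book) that generates the eight symmetric variants of a 3x3 board, i.e. four rotations each optionally mirrored:

```python
def symmetries(board):
    """board: tuple of 9 cells in row-major order; returns the 8 equivalent boards
    (4 rotations, each with and without a left-right mirror)."""
    def rotate(b):   # 90 degrees clockwise: new[r][c] = old[2-c][r]
        return tuple(b[6 - 3 * c + r] for r in range(3) for c in range(3))
    def mirror(b):   # left-right flip: new[r][c] = old[r][2-c]
        return tuple(b[3 * r + 2 - c] for r in range(3) for c in range(3))
    variants = []
    b = board
    for _ in range(4):
        variants.append(b)
        variants.append(mirror(b))
        b = rotate(b)
    return variants
```

Applying the same function to `tuple(range(9))` gives the corresponding permutation of move indices, so value or move-probability targets can be permuted consistently with the board.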

1.3
Greedy Play Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a nongreedy player? What problems might occur?

It would learn a worse policy. A greedy agent cannot explore unvisited states effectively; it always picks the action that looks best under its current, inaccurate estimates, and is therefore likely to converge to a local optimum.
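A small bandit-style illustration of this failure mode (a hypothetical two-armed setup, not the tic-tac-toe task): with zero initial estimates and ties broken toward arm 0, the purely greedy learner never even samples the better arm, while an \(\epsilon\)-greedy learner discovers it.

```python
# Hypothetical two-armed bandit: arm 0 always pays 0.1; arm 1 pays 1.0 with
# probability 0.9 and -1.0 otherwise (mean 0.8, clearly better).
import random

def run(epsilon, steps=1000, seed=0):
    rng = random.Random(seed)
    def pull(a):
        return 0.1 if a == 0 else (1.0 if rng.random() < 0.9 else -1.0)
    Q, N, total = [0.0, 0.0], [0, 0], 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(2)                 # exploratory move
        else:
            a = 0 if Q[0] >= Q[1] else 1         # greedy move (ties go to arm 0)
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                # sample-average update
        total += r
    return total / steps

print("greedy average reward        :", run(epsilon=0.0))   # stays near 0.1
print("epsilon-greedy average reward:", run(epsilon=0.1))   # roughly 0.7-0.8
```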

1.4
Learning from Exploration Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a different set of probabilities. What (conceptually) are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

If "not learning from exploration" means only that states reached by exploratory moves are not updated, while the other states on the trajectory, reached by the greedy policy, are still updated, then the probabilities learned without exploratory updates converge to the optimal values: the trajectory as a whole still contains exploration, so in principle every state can be visited infinitely often. Learning from exploratory moves, on the other hand, may converge to a suboptimal set of values that has the exploration probability folded in. If exploration happens only with a small probability, as in \(\epsilon\)-greedy selection, both ways of learning may still lead to the optimal decisions: the values learned with exploratory updates may be slightly off, yet the resulting decisions can be exactly the same as those made without learning from exploration. If exploration continues forever, the player that does not learn from exploratory moves learns better and wins more.
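Stated roughly in the notation the book introduces later (using these symbols here is an assumption, not something Chapter 1 defines):

\[
\text{learn from exploratory moves: } V(s) \to v_{\pi_\epsilon}(s),
\qquad
\text{skip exploratory backups: } V(s) \to v_{*}(s),
\]

that is, the probability of winning while continuing to explore versus the probability of winning under greedy (optimal) play from \(s\).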

1.5
Other Improvements Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the tic-tac-toe problem as posed?

States: the state space can be given a better representation (features) to simplify the problem. Rewards: if the reward magnitude varies over orders of magnitude, it can be normalized. Exploration: design an exploration scheme suited to the problem; for example, in sparse-reward settings a curiosity mechanism, or an intrinsic reward based on information gain, can make exploration more efficient. Parameters: tune quantities such as the discount rate and the step size for the problem at hand. Algorithm: different problems call for different algorithm families, e.g. model-based vs. model-free, value-based vs. policy-based. For a two-player perfect-information game like tic-tac-toe, a heuristic search method such as Monte Carlo tree search (MCTS), or an AlphaZero-style algorithm, is probably the best available choice.

Chapter 2

2.1
In \(\epsilon\)-greedy action selection, for the case of two actions and \(\epsilon = 0.5\), what is the probability that the greedy action is selected?

With \(\epsilon = 0.5\), the greedy action is selected by exploiting with probability \(1-\epsilon = 0.5\); since there are only two actions, it is also selected through exploring with probability \(\epsilon/2 = 0.25\). In total, the greedy action is selected with probability 0.75.
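As a single expression (the generic \(\epsilon\)-greedy selection probability for the greedy action among \(k\) actions):

\[
P(A_t = a^{*}) = (1-\epsilon) + \frac{\epsilon}{k} = 0.5 + \frac{0.5}{2} = 0.75 .
\]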

2.2
Bandit example Consider a k-armed bandit problem with \(k = 4\) actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using \(\epsilon\)-greedy action selection, sample-average action-value estimates, and initial estimates of \(Q_1(a) = 0\), for all \(a\). Suppose the initial sequence of actions and rewards is \(A_1 = 1,R_1=-1,A_2=2,R_2=1,A_3=2,R_3=-2,A_4=2,R_4=2,A_5=3,R_5=0\). On some of these time steps the \(\epsilon\) case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?
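A minimal sketch (an assumed helper script, not from the book) that replays the given sequence with sample-average estimates shows which choices could not have been greedy: the \(\epsilon\) case definitely occurred at time steps 4 and 5, where the selected action was not greedy under the estimates at that moment, and it could possibly have occurred at every time step, since a random selection can also happen to land on a greedy action.

```python
# Replay the exercise's action-reward sequence with sample-average estimates
# (Q1(a) = 0) and check at each step whether the chosen action was greedy.
history = [(1, -1), (2, 1), (2, -2), (2, 2), (3, 0)]   # (A_t, R_t) as given

Q = {a: 0.0 for a in (1, 2, 3, 4)}   # current action-value estimates
N = {a: 0 for a in (1, 2, 3, 4)}     # visit counts for the sample average

for t, (a, r) in enumerate(history, start=1):
    greedy = {b for b in Q if Q[b] == max(Q.values())}
    status = "definitely" if a not in greedy else "possibly"
    print(f"t={t}: A={a}, greedy actions={sorted(greedy)}, epsilon case {status} occurred")
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]        # incremental sample-average update
```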
