Notes on "Playing hard exploration games by watching YouTube" (Part 3)

To generate training data, we sample input pairs \((v_i,w_i)\) (where \(v_i\) and \(w_i\) are sampled from the
same domain) as follows. First, we sample a demonstration sequence from our three training videos.
Next, we sample both an interval, \(d_k \in \{[0],[1],[2],[3-4],[5-20],[21-200]\}\), and a distance,
\(\Delta t \in d_k\). Finally, we randomly select a pair of frames from the sequence with temporal distance \(\Delta t\).
The model is trained with Adam using a learning rate of \(10^{-4}\) and batch size of 32 for 200,000 steps.
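As a rough illustration of this sampling scheme, here is a minimal Python sketch; `videos` (a list of frame sequences) and the details of bucket handling are assumptions for illustration, not code from the paper, and each video is assumed to be longer than 200 frames:

```python
import random

# Temporal-distance buckets from the excerpt above; each entry is an inclusive [lo, hi] range of offsets.
BUCKETS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, 20), (21, 200)]

def sample_pair(videos):
    """Sample a frame pair (v_i, w_i) plus the bucket index used as the TDC classification label."""
    frames = random.choice(videos)               # one demonstration sequence
    bucket = random.randrange(len(BUCKETS))      # sample an interval d_k
    lo, hi = BUCKETS[bucket]
    delta_t = random.randint(lo, hi)             # sample a distance Δt ∈ d_k
    i = random.randrange(len(frames) - delta_t)  # random start frame, leaving room for Δt
    return frames[i], frames[i + delta_t], bucket
```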

Agent's network

As described in Section 4, our imitation loss is constructed by generating checkpoints every N = 16
frames along the \(\phi\)-embedded observation sequence of a single, held-aside YouTube video. We train
an agent using the sum of imitation and (optionally) environment rewards. We use the distributed
A3C RL agent IMPALA [14] with 100 actors for our experiments. The only modification we make to
the published network is to calculate the distance (as per Equation (2)) between the agent and the next
two checkpoints and concatenate this 2-vector with the flattened output of the last convolutional layer.
We also tried re-starting our agent from checkpoints recorded along its trajectory, similar to Hosu et
al. [23], but found that it provided minimal improvement given even our very long demonstrations.
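A minimal sketch of the described input augmentation, assuming the agent keeps the φ-embeddings of the demo checkpoints and the index of the next unreached one; the Euclidean distance in the embedding space is used here as a stand-in for the paper's Equation (2), and all names are illustrative:

```python
import numpy as np

def augment_policy_input(conv_features, phi_obs, checkpoints, next_idx):
    """Concatenate the flattened conv output with the distances to the next two checkpoints.

    conv_features : flattened output of the last convolutional layer, shape (F,)
    phi_obs       : φ-embedding of the agent's current observation, shape (D,)
    checkpoints   : φ-embeddings of the demo checkpoints (every N=16 frames), shape (K, D)
    next_idx      : index of the next checkpoint the agent is expected to reach
    """
    dists = []
    for k in (next_idx, next_idx + 1):
        k = min(k, len(checkpoints) - 1)                        # clamp at the final checkpoint
        dists.append(np.linalg.norm(phi_obs - checkpoints[k]))  # embedding-space distance (stand-in for Eq. (2))
    return np.concatenate([conv_features, np.asarray(dists)])   # (F + 2,) vector fed to the policy head
```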

Evaluating the embedding

  Once training is finished, we need a way to evaluate how well the embedding network \(\phi\) extracts features. Inspired by CycleGAN, the authors propose a cycle-consistency evaluation.
  Suppose we have two sequences of length \(N\), \(V=\{v_1,v_2,...,v_N\}\) and \(W=\{w_1,w_2,...,w_N\}\). In the embedding space, define the Euclidean distance \(d_{\phi}(v_i,w_j)=||\phi(v_i)-\phi(w_j)||_2\).
  The evaluation proceeds as follows (a code sketch is given after this list):
    1. Pick a frame \(v_i\) from \(V\) and find the frame in \(W\) closest to it: \(w_j=\arg\min\limits_{w \in W}d_{\phi}(v_i,w)\).
    2. Find the frame in \(V\) closest to \(w_j\): \(v_k=\arg\min\limits_{v \in V}d_{\phi}(v,w_j)\).
    3. If \(v_i=v_k\), we say that \(v_i\) is cycle-consistent.
    The one-to-one matching quality of the embedding is then measured by \(P_{\phi}\), the percentage of frames \(v \in V\) that are cycle-consistent in the embedding space \(\phi\).
  Analogously, one can define 3-cycle-consistency and the corresponding metric \(P_{\phi}^3\), which requires \(v_i\) to remain cycle-consistent along both \(V \rightarrow W \rightarrow U \rightarrow V\) and \(V \rightarrow U \rightarrow W \rightarrow V\) across three sequences \(V\), \(W\), \(U\).
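A minimal NumPy sketch of computing \(P_{\phi}\) for two embedded sequences (the function and argument names are illustrative):

```python
import numpy as np

def cycle_consistency(phi_V, phi_W):
    """Return P_phi: the fraction of frames in V that are cycle-consistent with W.

    phi_V, phi_W : embedded sequences phi(v_1..v_N) and phi(w_1..w_N), each of shape (N, D).
    """
    # Pairwise Euclidean distances d_phi(v_i, w_j), shape (N, N).
    d = np.linalg.norm(phi_V[:, None, :] - phi_W[None, :, :], axis=-1)
    consistent = 0
    for i in range(len(phi_V)):
        j = int(np.argmin(d[i, :]))  # nearest frame in W to v_i
        k = int(np.argmin(d[:, j]))  # nearest frame in V to that w_j
        consistent += (k == i)       # v_i is cycle-consistent if we return to it
    return consistent / len(phi_V)
```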
  The experimental results are shown below:

[Figure: cycle-consistency evaluation results for the different embedding methods]



  As can be seen, TDC+CMC performs best. A t-SNE visualization of the learned embedding is shown below:

[Figure: t-SNE visualization of the embedding]
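A plot like this can be reproduced by running t-SNE over the φ-embeddings of frames drawn from the different videos, e.g. with scikit-learn (a sketch; how the embeddings and video labels are obtained is left out, and the names are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embedding_tsne(embeddings, video_ids):
    """Project φ-embeddings to 2-D with t-SNE and colour points by source video.

    embeddings : array of φ(frame) vectors, shape (M, D)
    video_ids  : array of length M identifying which video each frame came from
    """
    xy = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
    for vid in np.unique(video_ids):
        mask = video_ids == vid
        plt.scatter(xy[mask, 0], xy[mask, 1], s=4, label=f"video {vid}")
    plt.legend()
    plt.show()
```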

2. One-shot imitation from YouTube footage

Problem analysis

  With the domain gap closed, the next question is how to use the data for training. The gameplay videos have now been aligned with the agent's environment observations, but there is still no action or reward information. The authors pick a single YouTube gameplay video and let the agent learn from its sequence of states. In other words, the embedding network above was trained on several videos, while imitation learning uses only one. According to the authors, there are four videos in total: three are used to train the feature-extraction network, and the remaining one is the demonstration the agent learns from.

Method

  The idea is to let the agent explore actions in the environment; if the resulting state matches a state from the video, the imitation is on track and the agent receives a positive reward, otherwise it receives neither reward nor penalty. Concretely:
  For the chosen video, a checkpoint is placed every \(N\) frames (\(N=16\)) along the sequence, and the imitation reward is defined as
\[ r_{imitation}=\begin{cases}0.5 & \text{if } \bar{\phi}(v_{agent})\cdot \bar{\phi}(v_{checkpoint})>\gamma \\ 0.0 & \text{otherwise}\end{cases}\]
  Here \(\gamma\) is a threshold on the degree of match, and \(\bar{\phi}(v)\) is the zero-centered and \(l_2\)-normalized embedding, so the match can be measured with a simple dot product.
  Note that the imitated behavior need not be identical; the agent is allowed some exploration of its own, so the paper makes the checkpoints soft-ordered. Concretely, if the previously matched checkpoint was \(v^{(n)}\), the next matched checkpoint does not have to be \(v^{(n+1)}\); a reward is given as long as the agent matches any checkpoint within \(\Delta t\) steps, i.e. \(v_{checkpoint} \in \{v^{(n+1)},...,v^{(n+1+\Delta t)}\}\). The paper uses \(\Delta t = 1\) and \(\gamma = 0.5\); with visual features only (no audio), \(\gamma = 0.92\) works better. A sketch of this reward computation follows below.
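A sketch of this reward with soft-ordered checkpoints; `normalize` stands in for the zero-centering and \(l_2\)-normalization \(\bar{\phi}\), and all names are illustrative rather than taken from the paper's code:

```python
import numpy as np

def normalize(phi_v):
    """Zero-center and l2-normalize an embedding (the bar-phi of the text)."""
    centered = phi_v - phi_v.mean()
    return centered / (np.linalg.norm(centered) + 1e-8)

def imitation_reward(phi_agent, checkpoints, next_idx, delta_t=1, gamma=0.5):
    """Return (reward, updated next_idx) for the agent's current embedding.

    checkpoints : φ-embeddings taken every N=16 frames of the demo video
    next_idx    : index of the next expected checkpoint (n+1 in the text)
    delta_t     : soft-order window; any of checkpoints next_idx .. next_idx+delta_t may match
    gamma       : match threshold (0.5 with audio features, 0.92 for vision only)
    """
    v_bar = normalize(phi_agent)
    for k in range(next_idx, min(next_idx + delta_t + 1, len(checkpoints))):
        if np.dot(v_bar, normalize(checkpoints[k])) > gamma:
            return 0.5, k + 1  # checkpoint k matched: give reward and advance past it
    return 0.0, next_idx       # no match: no reward, no penalty
```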
  For training, a standard reinforcement-learning algorithm is used directly: the distributed A3C RL agent IMPALA with 100 actors. The reward is either the imitation reward alone, \(r_{imitation}\), or the sum of the imitation and game rewards, \(r_{imitation}+r_{game}\). This is what produces the gameplay shown in the video at the beginning of the post.
  This concludes the description of the method.

4. Experimental results

  The paper reports several experimental results, mainly an evaluation of the feature-extraction quality, the imitation-learning curves, and the game scores.
