Cross-modal temporal distance classification
This method is in the same spirit as the previous one: the method above predicts the temporal distance between different frames of a video, while this one brings in audio, pairing a video frame with an audio snippet and predicting the temporal distance between the two, the so-called cross-modal temporal distance classification (CMC). The authors explain that game audio usually correlates strongly with action events such as jumping or picking up items, so this task helps the network understand important game events ("As the audio of Atari games tends to correspond with salient events such as jumping, obtaining items or collecting points, a network that learns to correlate audio and visual observations should learn an abstraction that emphasizes important game events."). Although the reinforcement-learning game environment provides no audio, incorporating audio when building the embedding network still helps it learn abstract features.
The construction is as follows:
A video frame is denoted \(v \in I\) and an audio snippet \(a \in A\). A second feature-extraction network is introduced for the audio, the audio embedding function \(\psi:A \rightarrow \mathbb{R}^N\).
The loss function is again a cross-entropy loss:
\(L_{cm}(v^i,a^i,y^i)=-\sum_{j=1}^K y_j^i \log(\hat{y}_j^i) \quad \text{with} \quad \hat{y}^i=\tau_{cm}(\phi(v^i),\psi(a^i))\)
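A minimal sketch of this loss in PyTorch, assuming hypothetical stand-ins `phi`, `psi`, and `tau_cm` for the networks above (note that `F.cross_entropy` expects integer class labels rather than one-hot vectors):

```python
import torch.nn.functional as F

def cmc_loss(phi, psi, tau_cm, v, a, y):
    """Cross-modal classification loss L_cm.

    v: batch of video frames, a: batch of audio snippets,
    y: integer labels in [0, K) giving the temporal-distance bin.
    phi, psi and tau_cm are the embedding and classifier networks.
    """
    logits = tau_cm(phi(v), psi(a))  # shape (batch, K)
    # cross_entropy combines log-softmax and the negative log-likelihood,
    # i.e. -sum_j y_j * log(y_hat_j) when y is written as a one-hot vector.
    return F.cross_entropy(logits, y)
```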
Architecture design
The final method combines the TDC and CMC tasks, using a weight \(\lambda\) to form the combined loss:
\(L=L_{td}+\lambda L_{cm}\)
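In code this is a one-liner; \(\lambda = 1.0\) below is an assumed default, not a value taken from the paper:

```python
def combined_loss(loss_td, loss_cm, lam=1.0):
    """L = L_td + lambda * L_cm; a single backward pass through this
    sum trains phi, psi, tau_td and tau_cm jointly."""
    return loss_td + lam * loss_cm
```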
The overall network architecture is shown in the figure below:
Panel (a) shows the raw data to be embedded: video frames on top, audio snippets below. Each passes through its encoder network in panel (b), \(\phi(v)\) for video and \(\psi(a)\) for audio, which extracts the salient features while ignoring irrelevant variation. The embeddings are then fed into the classification networks in panel (c) to compute the \(\tau_{td}\) and \(\tau_{cm}\) losses, which are back-propagated to train the networks. Note that \(\phi(v)\) and \(\psi(a)\) are two separate, different networks, and that \(\tau_{td}\) and \(\tau_{cm}\) share the same architecture but have different parameters.
The exact network architectures and the training-data generation procedure are not the focus here, so the original text is quoted directly:
The visual embedding function \(\phi\)
The visual embedding function, \(\phi\), is composed of three spatial, padded, 3x3 convolutional layers
with (32, 64, 64) channels and 2x2 max-pooling, followed by three residual-connected blocks with
64 channels and no down-sampling. Each layer is ReLU-activated and batch-normalized, and the
output fed into a 2-layer 1024-wide MLP. The network input is a 128x128x3x4 tensor constructed
by random spatial cropping of a stack of four consecutive 140x140 RGB images, sampled from our
dataset. The final embedding vector is \(l_2\)-normalized.
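A PyTorch sketch of \(\phi\) under two assumptions not spelled out above: the four frames are stacked along the channel axis (giving 12 input channels), and batch normalization follows each convolution:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """64-channel residual block, no down-sampling, as described above."""
    def __init__(self, ch=64):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        h = F.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return F.relu(x + h)

class VisualEmbedding(nn.Module):
    """Sketch of phi. Input: (batch, 12, 128, 128), i.e. four RGB frames
    stacked along the channel axis (an assumption about how the paper's
    128x128x3x4 tensor is laid out)."""
    def __init__(self, embed_dim=1024):
        super().__init__()
        chans = [12, 32, 64, 64]
        convs = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            convs += [nn.Conv2d(c_in, c_out, 3, padding=1),
                      nn.BatchNorm2d(c_out), nn.ReLU(),
                      nn.MaxPool2d(2)]
        self.convs = nn.Sequential(*convs)
        self.res = nn.Sequential(*[ResidualBlock(64) for _ in range(3)])
        # Three 2x2 poolings shrink 128 -> 64 -> 32 -> 16,
        # so the conv features are 64 x 16 x 16.
        self.mlp = nn.Sequential(nn.Flatten(),
                                 nn.Linear(64 * 16 * 16, 1024), nn.ReLU(),
                                 nn.Linear(1024, embed_dim))

    def forward(self, x):
        h = self.res(self.convs(x))
        return F.normalize(self.mlp(h), p=2, dim=1)  # l2-normalized embedding
```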
The audio embedding function \(\psi\)
The audio embedding function, \(\psi\), is as per \(\phi\) except that it has four, width-8, 1D convolutional layers
with (32, 64, 128, 256) channels and 2x max-pooling, and a single width-1024 linear layer. The input
is a width-137 (6ms) sample of 256 frequency channels, calculated using STFT. ReLU-activation and
batch-normalization are applied throughout and the embedding vector is \(l_2\)-normalized.
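A matching sketch of \(\psi\), assuming the 256 STFT frequency bins are treated as input channels over the width-137 time axis, and using `nn.LazyLinear` to avoid hand-computing the flattened size; the `padding=4` choice is also an assumption, since the paper only gives the kernel width:

```python
import torch.nn as nn
import torch.nn.functional as F

class AudioEmbedding(nn.Module):
    """Sketch of psi. Input: (batch, 256, 137) STFT slice, treating the
    256 frequency bins as channels (an assumption about the layout)."""
    def __init__(self, embed_dim=1024):
        super().__init__()
        chans = [256, 32, 64, 128, 256]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv1d(c_in, c_out, kernel_size=8, padding=4),
                       nn.BatchNorm1d(c_out), nn.ReLU(),
                       nn.MaxPool1d(2)]
        self.convs = nn.Sequential(*layers)
        # LazyLinear infers the flattened conv output size on first call.
        self.fc = nn.LazyLinear(embed_dim)

    def forward(self, x):
        h = self.convs(x).flatten(1)
        return F.normalize(self.fc(h), p=2, dim=1)  # l2-normalized embedding
```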
The classification network \(\tau\)
The same shallow network architecture, \(\tau\), is used for both temporal and cross-modal classification.
Both input vectors are combined by element-wise multiplication, with the result fed into a 2-layer
MLP with widths (1024, 6) and ReLU non-linearity in between. A visualization of these networks and
their interaction is provided in Figure 3. Note that although \(\tau_{td}\) and \(\tau_{cm}\) share the same architecture,
they are operating on two different problems and therefore maintain separate sets of weights.
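A sketch of \(\tau\); the same class is instantiated twice, once as \(\tau_{td}\) and once as \(\tau_{cm}\), so the two classifiers keep separate weights:

```python
import torch.nn as nn

class Classifier(nn.Module):
    """tau: fuse two embeddings by element-wise multiplication,
    then a 2-layer MLP with widths (1024, 6)."""
    def __init__(self, embed_dim=1024, num_classes=6):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(embed_dim, 1024),
                                 nn.ReLU(),
                                 nn.Linear(1024, num_classes))

    def forward(self, z1, z2):
        return self.mlp(z1 * z2)  # element-wise product of the embeddings

tau_td, tau_cm = Classifier(), Classifier()  # same architecture, separate weights
```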
Generating training data