The context vector $c_i$ is a weighted sum of all the encoder hidden states: $c_i=\sum^n_{t=1}{\alpha_{i,t}h_t}$, where $\sum_{t=1}^n \alpha_{i,t} = 1$ and $\alpha_{i,t} \geq 0$.
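To make the formula concrete, here is a minimal NumPy sketch with made-up numbers (all values are hypothetical): the context vector is just a convex combination of the encoder hidden states.

```python
import numpy as np

# Toy encoder hidden states h_1..h_n (n=3 positions, hidden size 4), hypothetical values.
H = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.1, 0.0, 0.2],
              [0.3, 0.3, 0.3, 0.3]])   # shape (n, d)

# Attention weights for one decoder step i: non-negative and summing to 1.
alpha = np.array([0.2, 0.5, 0.3])      # shape (n,)
assert np.isclose(alpha.sum(), 1.0) and (alpha >= 0).all()

# Context vector c_i = sum_t alpha_{i,t} * h_t : a convex combination of the h_t.
c = alpha @ H                          # shape (d,)
print(c)                               # [0.36 0.18 0.15 0.27]
```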
Additive Attention
(also known as Bahdanau Attention)
The encoder hidden state at position $i$ is the concatenation of the forward and backward states of a bidirectional RNN:
$$\boldsymbol{h}_i = [\overrightarrow{\boldsymbol{h}}_i^\top; \overleftarrow{\boldsymbol{h}}_i^\top]^\top, \quad i=1,\dots,n$$
(1) score function
$$\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \mathbf{v}_a^\top \tanh(\mathbf{W}_a[\boldsymbol{s}_t; \boldsymbol{h}_i])$$
(2) alignment function
$$\alpha_{t,i} = \text{align}(y_t, x_i) = \frac{\exp(\text{score}(\boldsymbol{s}_{t-1}, \boldsymbol{h}_i))}{\sum_{i'=1}^n \exp(\text{score}(\boldsymbol{s}_{t-1}, \boldsymbol{h}_{i'}))}$$
(3) context vector
$$\mathbf{c}_t = \sum_{i=1}^n \alpha_{t,i} \boldsymbol{h}_i$$
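Putting the three steps together, below is a minimal NumPy sketch of one decoder step of additive attention. The helper name `additive_attention`, the shapes, and the random toy inputs are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def additive_attention(s_prev, H, W_a, v_a):
    """Bahdanau-style additive attention for one decoder step.

    s_prev : previous decoder state s_{t-1}, shape (d_s,)
    H      : encoder hidden states h_1..h_n as rows, shape (n, d_h)
    W_a    : weight matrix, shape (d_a, d_s + d_h)
    v_a    : weight vector, shape (d_a,)
    """
    # (1) score(s_{t-1}, h_i) = v_a^T tanh(W_a [s_{t-1}; h_i])
    scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h_i])) for h_i in H])
    # (2) alignment weights via softmax over the source positions
    alpha = softmax(scores)
    # (3) context vector c_t = sum_i alpha_{t,i} h_i
    c = alpha @ H
    return c, alpha

# Toy usage with hypothetical dimensions: n=5 source positions, d_h=8, d_s=6, d_a=10.
rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 5, 8, 6, 10
H = rng.normal(size=(n, d_h))
s_prev = rng.normal(size=d_s)
W_a = rng.normal(size=(d_a, d_s + d_h))
v_a = rng.normal(size=d_a)
c, alpha = additive_attention(s_prev, H, W_a, v_a)
print(c.shape, alpha.sum())   # (8,) 1.0
```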
Content-based Attention $$\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \text{cosine}[\boldsymbol{s}_t,\boldsymbol{h}_i]$$
Location-based $$\alpha_{t,i} = \text{softmax}(\mathbf{W}_a \boldsymbol{s}_t)$$ (the weights depend only on the target state, not on the individual $\boldsymbol{h}_i$)
General $$\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \boldsymbol{s}^T_t \mathbf{W}_a \boldsymbol{h}_i$$
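For comparison, here are small NumPy sketches of the three alternative score functions listed above; the function names and shapes are hypothetical and only meant to spell out the formulas.

```python
import numpy as np

def content_based_score(s_t, h_i):
    # cosine similarity between decoder state and encoder state
    # (assumes both vectors have the same dimensionality)
    return (s_t @ h_i) / (np.linalg.norm(s_t) * np.linalg.norm(h_i))

def location_based_weights(s_t, W_a):
    # weights depend only on the target state: softmax(W_a s_t)
    # W_a has one row per source position, so the result is one weight per position
    e = W_a @ s_t
    e = e - e.max()
    return np.exp(e) / np.exp(e).sum()

def general_score(s_t, h_i, W_a):
    # bilinear form s_t^T W_a h_i
    return s_t @ W_a @ h_i
```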
Dot-Product Attention
(1) score function
Measure similarity between the query and each key in memory: $e_i=a(q,k_i)$; for dot-product attention, $a(q,k_i)=q^\top k_i$.
(2) alignment function
Compute the attention weights, usually normalized with a softmax: $\alpha_i=\text{softmax}(e_i)$.
(3) context vector
Use the attention weights to combine the values into the output vector: $c=\sum_i \alpha_i v_i$.
Scaled Dot-Product $$\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \frac{\boldsymbol{s}^T_t \boldsymbol{h}_i}{\sqrt{n}}$$ where $n$ here denotes the dimension of the hidden states; the scaling keeps the dot products from growing too large in high dimensions.
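A minimal NumPy sketch of the dot-product variant, following the same score/align/context steps, with an optional flag for the scaled version. The helper `dot_product_attention` and all dimensions are assumptions for illustration.

```python
import numpy as np

def dot_product_attention(q, K, V, scale=False):
    """Query/key/value attention for a single query (hypothetical helper).

    q : query, shape (d,)
    K : keys   k_1..k_n, shape (n, d)
    V : values v_1..v_n, shape (n, d_v)
    """
    # (1) score: e_i = q . k_i  (divided by sqrt(d) in the scaled variant)
    e = K @ q
    if scale:
        e = e / np.sqrt(q.shape[0])
    # (2) alignment: alpha_i = softmax(e_i)
    e = e - e.max()
    alpha = np.exp(e) / np.exp(e).sum()
    # (3) context: c = sum_i alpha_i v_i
    return alpha @ V, alpha

# Toy usage with hypothetical dimensions.
rng = np.random.default_rng(1)
q = rng.normal(size=16)
K = rng.normal(size=(5, 16))
V = rng.normal(size=(5, 32))
c, alpha = dot_product_attention(q, K, V, scale=True)
print(c.shape, alpha.sum())   # (32,) 1.0
```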
The figure below compares Bahdanau Attention and Luong Attention; these two mechanisms are the foundational works of the attention literature.
(2) Categorized by alignment function
Within soft attention, a further distinction is made between global and local attention, proposed in the paper "Effective Approaches to Attention-based Neural Machine Translation".
Global Attention
Global attention uses all of the encoder hidden states as the weighted set and applies softmax as the alignment function.
Local Attention (a blend of soft and hard attention)